Allen's REINFORCE notes

From Humanoid Robots Wiki

=== Motivation ===

Recall that the objective of Reinforcement Learning is to find an optimal policy <math> \pi^* </math>, which we encode in a neural network with parameters <math>\theta^*</math>. These optimal parameters are defined as
<math>\theta^* = \text{argmax}_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] </math>. Let's unpack what this means. In plain English: the optimal parameters are those for which the expected total reward along a trajectory (<math> \tau </math>) generated by following the policy is highest over all choices of parameters.
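
Here <math> \tau = (s_1, a_1, \ldots, s_T, a_T) </math> is a trajectory and <math> p_\theta(\tau) </math> is its probability under the policy. Under the usual Markov assumptions this distribution factorizes as

<math>p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)</math>,

so the parameters <math>\theta</math> enter only through the action choices; the dynamics <math> p(s_{t+1} \mid s_t, a_t) </math> belong to the environment and are unknown to the learner.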
  
=== Overview ===
  
Learning involves the agent taking actions and the environment returning a new state and reward.

* Input: <math>s_t</math>: states at each time step
* Output: <math>a_t</math>: actions at each time step
* Data: <math>(s_1, a_1, r_1, \ldots, s_T, a_T, r_T)</math>
* Learn <math>\pi_\theta : s_t \rightarrow a_t</math> to maximize <math>\sum_t r_t</math>

The REINFORCE training loop is:

# Initialize a neural network with input dimensions equal to the observation dimensions and output dimensions equal to the action dimensions. Remember that a policy is a mapping from observations to actions. If the action space is continuous, it may make more sense for the output to be one mean and one standard deviation for each component of the action (see the sketch after this list).
# Repeat: (a standard loop body is sketched below)
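
As a concrete sketch of step 1 for a continuous action space, here is a minimal Gaussian policy head in PyTorch (an illustrative assumption; the notes do not specify a framework, and the names <code>GaussianPolicy</code>, <code>obs_dim</code>, and <code>act_dim</code> are placeholders):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Maps an observation to a diagonal Gaussian distribution over actions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, act_dim)             # one mean per action component
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # one (log) std per component

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        h = self.body(obs)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())
</syntaxhighlight>

Sampling from the returned distribution gives an action, and its log-probability is exactly the quantity the REINFORCE gradient needs. Learning the standard deviation as a free parameter, rather than predicting it from the observation, is a common simplification, not a requirement.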
 
  
 
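The notes leave the body of the "Repeat" step unspecified. A standard way to fill it in is the vanilla REINFORCE update: sample a trajectory with the current policy, weight the summed log-probabilities by the total reward, and take a gradient step. The sketch below assumes the <code>GaussianPolicy</code> above and a Gymnasium-style environment API (<code>env.reset()</code>/<code>env.step()</code>); it is one illustrative implementation, not the notes' own:

<syntaxhighlight lang="python">
import torch

def reinforce_update(policy, optimizer, env):
    """Sample one trajectory, then follow the policy-gradient estimate
    grad E[R] ~= (sum_t grad log pi_theta(a_t | s_t)) * R."""
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action).sum())  # sum over action components
        obs, reward, terminated, truncated, _ = env.step(action.numpy())
        rewards.append(float(reward))
        done = terminated or truncated

    total_return = sum(rewards)  # the objective's sum_t r(s_t, a_t)
    loss = -torch.stack(log_probs).sum() * total_return  # minimize -E[R] estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return total_return
</syntaxhighlight>

Weighting every log-probability by the full-episode return is the highest-variance form of the estimator; using the reward-to-go from step <math>t</math> or subtracting a baseline are the standard refinements.
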
=== State vs. Observation ===
