Allen's REINFORCE notes

Recall that the objective of Reinforcement Learning is to find an optimal policy <math>\pi^*</math>, which we encode in a neural network with parameters <math>\theta^*</math>. These optimal parameters are defined as
<math>\theta^* = \text{argmax}_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] </math>. Let's unpack what this means. In plain English, this says that the optimal policy is the one for which the expected total reward accumulated along a trajectory (<math>\tau</math>) generated by following the policy is the highest over all policies.
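The expectation above is usually approximated by sampling. As a minimal sketch (not part of the original notes): roll out the current policy several times, sum the rewards of each rollout, and average. The helper <code>sample_trajectory</code> below is hypothetical and stands in for running <math>\pi_\theta</math> in some environment.

<syntaxhighlight lang="python">
import numpy as np

def estimate_objective(sample_trajectory, num_samples=100):
    """Monte Carlo estimate of J(theta) = E_tau[ sum_t r(s_t, a_t) ].

    `sample_trajectory` is a hypothetical callable that performs one rollout
    of the current policy pi_theta and returns the list of per-step rewards.
    """
    returns = []
    for _ in range(num_samples):
        rewards = sample_trajectory()      # one trajectory tau ~ p_theta(tau)
        returns.append(sum(rewards))       # total reward of that trajectory
    return np.mean(returns)                # average approximates the expectation
</syntaxhighlight>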
=== Learning Overview ===
Learning involves the agent taking actions and the environment returning a new state and reward.
* Input: <math>s_t</math>: States at each time step
* Output: <math>a_t</math>: Actions at each time step
* Data: <math>(s_1, a_1, r_1, ..., s_T, a_T, r_T)</math>
* Learn <math>\pi_\theta: s_t \rightarrow a_t</math> to maximize <math>\sum_t r_t</math>

In practice:
# Initialize a neural network with input dimensions = observation dimensions and output dimensions = action dimensions. Remember that a policy is a mapping from observations to actions. If the action space is continuous, it may make more sense for the output to be one mean and one standard deviation for each component of the action (see the sketch below).
# Repeat: run the network to map <math>s_t \rightarrow a_t</math>.
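A minimal sketch of such a policy network in PyTorch, assuming a Gaussian policy over a continuous action space. The class name <code>GaussianPolicy</code>, the hidden-layer size, and the Gym-style <code>env</code> in the usage comment are illustrative assumptions, not part of the original notes.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Policy network: input dim = observation dim, output parametrizes actions.

    For a continuous action space, the network outputs one mean per action
    component, and a learned (log) standard deviation per component defines
    a Gaussian from which actions are sampled.
    """
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),                     # one mean per action component
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # one std per action component

    def forward(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Repeated interaction, mapping s_t -> a_t by sampling from the policy.
# `env` is a hypothetical Gym-style environment (reset/step API assumed):
# policy = GaussianPolicy(obs_dim=env.observation_space.shape[0],
#                         act_dim=env.action_space.shape[0])
# obs, _ = env.reset()
# action = policy(torch.as_tensor(obs, dtype=torch.float32)).sample()
</syntaxhighlight>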
=== State vs. Observation ===