Allen's REINFORCE notes
Links
Motivation
Recall that the objective of Reinforcement Learning is to find an optimal policy <math>\pi^*</math>, which we encode in a neural network with parameters <math>\theta^*</math>. The policy <math>\pi_\theta</math> is a mapping from observations to actions. The optimal parameters are defined as <math>\theta^* = \text{argmax}_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]</math>. Let's unpack what this means. In plain English, this says that the optimal policy is the one for which the expected total reward along a trajectory <math>\tau</math> generated by following the policy is highest over all policies.
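To make the expectation concrete, write a trajectory as <math>\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T)</math>. Assuming a Markovian environment with an initial-state distribution <math>p(s_1)</math> and transition dynamics <math>p(s_{t+1} \mid s_t, a_t)</math> (standard assumptions that these notes do not spell out explicitly), the trajectory distribution factors as

<math>p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t).</math>

The parameters <math>\theta</math> appear only in the policy terms <math>\pi_\theta(a_t \mid s_t)</math>, not in the (unknown) dynamics, which is what makes the gradient of this objective tractable to estimate from sampled trajectories.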
Overview
<syntaxhighlight lang="text">
Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
For # of episodes:
    While not terminated:
        Get observation from environment
        Use policy network to map observation to action distribution
        Randomly sample one action from action distribution
        Compute logarithmic probability of that action occurring
        Step environment using action and store reward
    Calculate loss over entire trajectory as function of probabilities and rewards
</syntaxhighlight>
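As a concrete sketch of the loop above, here is a minimal PyTorch implementation. It is an illustration rather than the exact code behind these notes: the environment (Gymnasium's CartPole-v1), the network size, learning rate, episode count, and the simple "negative sum of log-probabilities times total reward" loss are all assumptions made for the example; the loss is discussed more carefully in the Loss Function section.

<syntaxhighlight lang="python">
import gymnasium as gym
import torch
import torch.nn as nn

# Example environment; any Gymnasium env with a discrete action space works here.
env = gym.make("CartPole-v1")

# Policy network: input dims = observation dims, output dims = number of actions.
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.n
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):                                # For # of episodes
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:                                       # While not terminated
        obs_t = torch.as_tensor(obs, dtype=torch.float32) # observation from environment
        dist = torch.distributions.Categorical(logits=policy(obs_t))  # action distribution
        action = dist.sample()                            # randomly sample one action
        log_probs.append(dist.log_prob(action))           # log-probability of that action
        obs, reward, terminated, truncated, _ = env.step(action.item())  # step environment
        rewards.append(reward)                            # store reward
        done = terminated or truncated

    # Loss over the entire trajectory as a function of log-probabilities and rewards:
    # minimizing it increases the probability of high-return trajectories.
    loss = -torch.stack(log_probs).sum() * sum(rewards)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
</syntaxhighlight>

Note that the pseudocode stops at computing the loss; in practice each episode ends with a gradient step on the policy parameters, which is the optimizer.zero_grad() / loss.backward() / optimizer.step() sequence at the bottom of the sketch.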