Allen's REINFORCE notes
Links
Motivation
Recall that the objective of Reinforcement Learning is to find an optimal policy <math>\pi^*</math>, which we encode in a neural network with parameters <math>\theta^*</math>. The policy <math>\pi_\theta</math> is a mapping from observations to actions. The optimal parameters are defined as <math>\theta^* = \text{argmax}_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]</math>. Let's unpack what this means. In plain English, this says that the optimal policy is the one for which the expected total reward along a trajectory <math>\tau</math> generated by following the policy is highest over all policies.
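To make the expectation concrete, write a trajectory as <math>\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T)</math>. Assuming a Markovian environment with an initial-state distribution <math>p(s_1)</math> and transition dynamics <math>p(s_{t+1} \mid s_t, a_t)</math> (standard assumptions that these notes do not spell out explicitly), the trajectory distribution factors as

<math>p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t).</math>

The parameters <math>\theta</math> appear only in the policy terms <math>\pi_\theta(a_t \mid s_t)</math>, not in the (unknown) dynamics, which is what makes the gradient of this objective tractable to estimate from sampled trajectories.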
Overview
<syntaxhighlight lang="text">
Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
For # of episodes:
    While not terminated:
        Get observation from environment
        Use policy network to map observation to action distribution
        Randomly sample one action from action distribution
        Compute logarithmic probability of that action occurring
        Step environment using action and store reward
    Calculate loss over entire trajectory as function of probabilities and rewards
</syntaxhighlight>
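As a concrete sketch of the loop above, here is a minimal PyTorch implementation. It is an illustration rather than the exact code behind these notes: the environment (Gymnasium's CartPole-v1), the network size, learning rate, episode count, and the simple "negative sum of log-probabilities times total reward" loss are all assumptions made for the example; the loss is discussed more carefully in the Loss Function section.

<syntaxhighlight lang="python">
import gymnasium as gym
import torch
import torch.nn as nn

# Example environment; any Gymnasium env with a discrete action space works here.
env = gym.make("CartPole-v1")

# Policy network: input dims = observation dims, output dims = number of actions.
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.n
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):                                # For # of episodes
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:                                       # While not terminated
        obs_t = torch.as_tensor(obs, dtype=torch.float32) # observation from environment
        dist = torch.distributions.Categorical(logits=policy(obs_t))  # action distribution
        action = dist.sample()                            # randomly sample one action
        log_probs.append(dist.log_prob(action))           # log-probability of that action
        obs, reward, terminated, truncated, _ = env.step(action.item())  # step environment
        rewards.append(reward)                            # store reward
        done = terminated or truncated

    # Loss over the entire trajectory as a function of log-probabilities and rewards:
    # minimizing it increases the probability of high-return trajectories.
    loss = -torch.stack(log_probs).sum() * sum(rewards)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
</syntaxhighlight>

Note that the pseudocode stops at computing the loss; in practice each episode ends with a gradient step on the policy parameters, which is the optimizer.zero_grad() / loss.backward() / optimizer.step() sequence at the bottom of the sketch.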