Allen's REINFORCE notes

# Initialize a neural network whose input dimension equals the observation dimension and whose output dimension equals the action dimension. Remember that a policy is a mapping from observations to actions. If the action space is continuous, it may make more sense to have the network output one mean and one standard deviation for each component of the action (see the policy-network sketch after this list).
# Repeat for the desired number of episodes:
## While the episode has not terminated:
### Get an observation from the environment
### Use the policy network to map the observation to an action distribution
### Randomly sample one action from the action distribution
### Compute the logarithmic probability of that action occurring
### Step the environment using the action and store the reward
## Calculate the loss over the entire trajectory as a function of the log probabilities and rewards (see the training-loop sketch below)
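The following is a minimal sketch of the initialization step above, assuming PyTorch (the notes do not name a framework): a small network whose head outputs one mean per action component, with a learned log standard deviation so the distribution covers a continuous action space. The class name <code>GaussianPolicy</code> and the hidden sizes are illustrative.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Maps an observation to a diagonal Gaussian over continuous actions:
    one mean and one standard deviation per action component."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, act_dim)
        # Learn the log of the standard deviation so std = exp(log_std) > 0.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Normal(self.mean_head(h), self.log_std.exp())
</syntaxhighlight>

For a discrete action space, the head would instead output one logit per action and the forward pass would return a <code>torch.distributions.Categorical</code>.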
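And a minimal sketch of the episode loop above, assuming a Gymnasium-style environment and the <code>GaussianPolicy</code> sketch; the environment name, learning rate, and discount factor are illustrative choices, not from the notes.

<syntaxhighlight lang="python">
import gymnasium as gym
import torch

env = gym.make("Pendulum-v1")
policy = GaussianPolicy(env.observation_space.shape[0], env.action_space.shape[0])
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99  # discount factor (assumed; the notes only say "function of probabilities and rewards")

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    terminated = truncated = False
    while not (terminated or truncated):
        # Map the observation to an action distribution.
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        # Randomly sample one action and record its log probability.
        action = dist.sample()
        log_probs.append(dist.log_prob(action).sum())
        # Step the environment using the action and store the reward.
        obs, reward, terminated, truncated, _ = env.step(action.numpy())
        rewards.append(reward)

    # Discounted return from each step to the end of the trajectory.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    # Loss over the entire trajectory: -sum_t log pi(a_t | o_t) * G_t.
    loss = -(torch.stack(log_probs) * torch.as_tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
</syntaxhighlight>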
=== State vs. Observation Loss Function ===