<syntaxhighlight lang="bash" line>
Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
For each episode:
    While not terminated:
        Get observation from environment
        Pass observation through network to get action probabilities, then sample an action
        Step environment using action and store reward
    Calculate loss over entire trajectory as a function of probabilities and rewards
    Recall loss functions are differentiable with respect to each parameter - thus, calculate how changes in each parameter change the loss
    Based on these gradients, use gradient descent to update the weights
</syntaxhighlight>
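The loop above can be sketched concretely. This is a minimal illustration, not a full implementation: it assumes a hypothetical one-step environment (a two-armed bandit, where the episode terminates after a single action) so the example stays self-contained, and it replaces the neural network with a single softmax layer over two logits so that the gradient of the log-probability can be written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    # Numerically stable softmax: convert logits to action probabilities
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy environment: action 1 pays 1.0, action 0 pays 0.2,
# and the episode terminates after one step.
def step(action):
    return 1.0 if action == 1 else 0.2

theta = np.zeros(2)   # "network weights": one logit per action
alpha = 0.1           # learning rate

for episode in range(500):
    probs = softmax(theta)                  # forward pass: action probabilities
    action = rng.choice(2, p=probs)         # sample an action from the policy
    reward = step(action)                   # step environment and store reward
    # For a softmax policy, the gradient of log pi(action) with respect
    # to the logits is one_hot(action) - probs.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    # Weighting by the reward and ascending this gradient is equivalent to
    # descending the loss -reward * log pi(action).
    theta += alpha * reward * grad_log_pi

final_probs = softmax(theta)
print(final_probs)  # probability mass concentrates on the higher-reward action
```

Because action 1 earns the larger reward, its updates are weighted more heavily, so over many episodes the policy shifts its probability mass toward it.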
=== Loss Function ===