</syntaxhighlight>
=== Objective Function ===
The goal of reinforcement learning is to maximize the expected reward over the entire episode. We use <math>R(\tau)</math> to denote the total reward along a trajectory <math>\tau</math> sampled by following our policy <math>\pi_\theta</math>. Thus we want to maximize <math>E_{\tau \sim \pi_\theta}[R(\tau)]</math>.
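
The expectation above generally cannot be computed exactly, so it is usually approximated by averaging <math>R(\tau)</math> over sampled episodes. The following is a minimal sketch of that idea, not part of the implementation on this page. It assumes an undiscounted sum of per-step rewards, and the names <code>total_reward</code>, <code>estimate_expected_return</code>, and <code>toy_sample_trajectory</code> are hypothetical, with the last one standing in for an actual policy rollout.

<syntaxhighlight lang="python">
import random


def total_reward(rewards):
    """R(tau): sum of the per-step rewards along one trajectory (assumed undiscounted)."""
    return sum(rewards)


def estimate_expected_return(sample_trajectory, num_episodes=1000):
    """Monte Carlo estimate of E_{tau ~ pi_theta}[R(tau)].

    sample_trajectory is any zero-argument callable that runs one episode
    and returns its list of per-step rewards.
    """
    returns = [total_reward(sample_trajectory()) for _ in range(num_episodes)]
    return sum(returns) / len(returns)


def toy_sample_trajectory(p_success=0.7, horizon=5, rng=random):
    """Hypothetical stand-in for a policy rollout.

    Each of the `horizon` steps yields reward 1.0 with probability p_success.
    """
    return [1.0 if rng.random() < p_success else 0.0 for _ in range(horizon)]
</syntaxhighlight>

Calling <code>estimate_expected_return(toy_sample_trajectory)</code> averages the return over many toy episodes. With a real environment, the sampler would instead step the environment using actions drawn from <math>\pi_\theta</math>.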
=== Loss Function ===
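The goal of REINFORCE is to maximize the expected cumulative reward. We do so using gradient descent on the negative of this objective, which is equivalent to gradient ascent on the objective itself.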
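
For reference, the standard policy-gradient identity behind REINFORCE is <math>\nabla_\theta E_{\tau \sim \pi_\theta}[R(\tau)] = E_{\tau \sim \pi_\theta}\left[R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]</math>, where <math>s_t</math> and <math>a_t</math> (notation assumed here) are the state and action at step <math>t</math>. A common way to use it is to minimize a surrogate loss whose gradient matches the negated sample estimate of this expression. The sketch below only illustrates the arithmetic of that loss with plain floats; the function name <code>reinforce_loss</code> is hypothetical, and in practice the log-probabilities would come from an automatic-differentiation framework so that the gradient can be backpropagated.

<syntaxhighlight lang="python">
import math


def reinforce_loss(log_probs_per_episode, returns):
    """Surrogate loss: -(1/N) * sum_i R(tau_i) * sum_t log pi_theta(a_t | s_t).

    log_probs_per_episode: one list per episode holding the log-probability
        of each action that was actually taken.
    returns: the total reward R(tau_i) of each episode, in the same order.
    """
    if len(log_probs_per_episode) != len(returns):
        raise ValueError("need exactly one return per episode")
    total = 0.0
    for log_probs, episode_return in zip(log_probs_per_episode, returns):
        # Weight the episode's summed log-probability by its total reward.
        total += episode_return * sum(log_probs)
    # Negate so that minimizing the loss maximizes the expected reward.
    return -total / len(returns)


# Two toy episodes: actions taken with probability 0.5 each, returns 2.0 and 4.0.
example_loss = reinforce_loss(
    [[math.log(0.5), math.log(0.5)], [math.log(0.5)]],
    [2.0, 4.0],
)
</syntaxhighlight>

Episodes with a higher return contribute more to the loss, so a descent step raises the probability of their actions more strongly.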