=== Objective Function ===
The goal of reinforcement learning is to maximize the expected reward over the entire episode. We use <math>R(\tau)</math> to denote the total reward over a trajectory <math>\tau</math> generated by following our policy <math>\pi_\theta</math>. Thus we want to maximize <math>E_{\tau \sim \pi_\theta}[R(\tau)]</math>. By the definition of expected value, this expands to <math>\sum_\tau P(\tau | \theta) R(\tau)</math>, where the probability of a given trajectory is <math>P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t)</math>.
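The expansion above can be checked numerically on a very small problem. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP whose initial distribution, transition table, policy table, and rewards are all made-up illustrative values. It enumerates every trajectory, computes <math>P(\tau | \theta)</math> with the product formula, and sums <math>P(\tau | \theta) R(\tau)</math>. The names <code>trajectory_probability</code>, <code>trajectory_return</code>, and <code>expected_return</code> are illustrative only.

```python
import itertools

import numpy as np

# Hypothetical toy MDP; every number below is an illustrative assumption.
n_states, n_actions = 2, 2
T = 2  # actions are taken at t = 0..T, so a trajectory visits s_0..s_{T+1}

p_s0 = np.array([0.6, 0.4])  # P(s_0)

# transition[s, a, s_next] = P(s_next | s, a)
transition = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.3, 0.7]],
])

# policy[s, a] = pi_theta(a | s), written as a fixed table for simplicity
policy = np.array([
    [0.7, 0.3],
    [0.4, 0.6],
])

# reward[s, a] = reward received for taking action a in state s
reward = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])


def trajectory_probability(states, actions):
    """P(tau | theta) = P(s_0) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)."""
    prob = p_s0[states[0]]
    for t, a in enumerate(actions):
        prob *= policy[states[t], a] * transition[states[t], a, states[t + 1]]
    return prob


def trajectory_return(states, actions):
    """R(tau): the sum of rewards collected along the trajectory."""
    return sum(reward[states[t], a] for t, a in enumerate(actions))


def expected_return():
    """Sum P(tau | theta) * R(tau) over every possible trajectory.

    Also returns the total probability mass, which should be 1.
    """
    total_value, total_prob = 0.0, 0.0
    for states in itertools.product(range(n_states), repeat=T + 2):
        for actions in itertools.product(range(n_actions), repeat=T + 1):
            p = trajectory_probability(states, actions)
            total_prob += p
            total_value += p * trajectory_return(states, actions)
    return total_value, total_prob
```

Exhaustive enumeration is only feasible for toy problems like this one, since the number of trajectories grows exponentially with the episode length. In practice the expectation is estimated from sampled trajectories.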
=== Loss Function ===
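The goal of REINFORCE is to maximize the expected cumulative reward. We do so using gradient descent on a loss defined as the negative of this objective, which is equivalent to gradient ascent on the expected reward.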
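As a rough illustration, the loss for a single sampled trajectory is commonly written as <math>-R(\tau) \sum_t \log \pi_\theta(a_t | s_t)</math>, whose gradient is a sample estimate of the negative gradient of the expected reward. The sketch below assumes a tabular softmax policy, where <code>theta[s, a]</code> is a logit; this parameterization, the function names, and the example trajectory are assumptions made for illustration, not part of the text above.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def reinforce_loss(theta, states, actions, total_reward):
    """Single-trajectory loss: -R(tau) * sum_t log pi_theta(a_t | s_t)."""
    probs = softmax(theta)  # probs[s, a] = pi_theta(a | s)
    log_probs = np.log(probs[states, actions])
    return -total_reward * log_probs.sum()


def reinforce_loss_grad(theta, states, actions, total_reward):
    """Analytic gradient of reinforce_loss with respect to the logits.

    For a softmax policy, d/d theta[s] of log pi(a | s) is
    one_hot(a) - pi(. | s).
    """
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        one_hot = np.zeros(theta.shape[1])
        one_hot[a] = 1.0
        grad[s] += -total_reward * (one_hot - probs[s])
    return grad


# Hypothetical sampled trajectory: states visited, actions taken, and R(tau).
states = np.array([0, 1, 1])
actions = np.array([1, 1, 0])
total_reward = 3.0

theta = np.zeros((2, 2))  # uniform initial policy
learning_rate = 0.1       # assumed step size

# One gradient descent step on the loss.
theta_new = theta - learning_rate * reinforce_loss_grad(
    theta, states, actions, total_reward
)
```

Because the trajectory has a positive total reward, a descent step on this loss raises the probability of the actions that were taken. Real implementations typically average the loss over many sampled trajectories and rely on automatic differentiation rather than a hand-written gradient.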