53
edits
Changes
→Objective Function
The goal of reinforcement learning is to maximize the expected reward over the entire episode. We use <math>R(\tau)</math> to denote the total reward over some trajectory <math>\tau</math> defined by our policy. Thus we want to maximize <math>E_{\tau \sim \pi_\theta}[R(\tau)]</math>. We can use the definition of expected value to expand this as <math>\sum_\tau P(\tau | \theta) R (\tau)</math>, where the probability of a given trajectory occurring can further be expressed as <math> P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) </math>.
Now we want to find the gradient of <math> J (\theta) </math>, namely<math>\nabla_\theta \sum_\tau P(\tau | \theta) R(\tau) </math>. The important step here is called the Log Derivative Trick. ==Log Derivative Trick==<math>\nabla
=== Loss Function ===
The goal of REINFORCE is to optimize the expected cumulative reward. We do so using gradient descent