Allen's REINFORCE notes

=== Loss Computation ===
It is tricky to give our policy a notion of "total" reward and "total" probability. Thus, we want to change these values, parameterized by <math> \tau </math>, to instead be parameterized by <math> t </math>. That is, instead of examining the behavior of the entire episode at once, we want to work with a summation over timesteps. We know that <math> R(\tau) </math> is the total reward over all timesteps, so we can rewrite the <math> R(\tau) </math> contribution at some timestep <math> t </math> as <math> \gamma^{T - t}r_t </math>, where <math> \gamma </math> is our discount factor. Further, we recall that the probability of the trajectory occurring under the policy is <math> P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) </math>. Since <math> P(s_0) </math> and <math> P(s_{t+1} | s_t, a_t) </math> are determined by the environment and are independent of the policy, their gradients with respect to <math> \theta </math> are zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, we get our final gradient expression <math> \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) </math>.
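
For completeness, the step from the trajectory probability to this gradient expression can be spelled out with the log-derivative trick. This assumes the objective <math> J(\theta) = \sum_\tau P(\tau | \theta) R(\tau) </math>, which the expression above implies but which is defined outside this section: <math> \nabla_\theta J(\theta) = \sum_\tau R(\tau) \nabla_\theta P(\tau | \theta) = \sum_\tau P(\tau | \theta) R(\tau) \nabla_\theta \log P(\tau | \theta) </math>. Since <math> \log P(\tau | \theta) = \log P(s_0) + \sum_{t = 0}^T \left[ \log \pi_\theta(a_t | s_t) + \log P(s_{t+1} | s_t, a_t) \right] </math>, the environment terms vanish under <math> \nabla_\theta </math>, and only <math> \sum_{t = 0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) </math> remains.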
Rewriting this into an expectation, we have <math> \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] </math>. Using the formula for discounted reward, we have our final expression <math> E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] </math>. This is why our loss is <math> -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t </math>: taking its derivative with respect to <math> \theta </math> via the chain rule recovers (the negative of) the per-trajectory gradient term above for our backwards pass (see Dennis' Optimization Notes).
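
As a concrete illustration, a minimal PyTorch-style sketch of this loss for a single episode might look like the following; the function <code>reinforce_loss</code> and its inputs <code>log_probs</code> and <code>rewards</code> are illustrative names, not from these notes.

<syntaxhighlight lang="python">
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """REINFORCE loss for one sampled episode, using the notes' gamma^(T - t) * r_t weighting.

    log_probs: shape (T,), log pi_theta(a_t | s_t) for each step, with gradients attached.
    rewards:   shape (T,), the rewards r_t observed at each step.
    """
    T = rewards.shape[0]
    # Discount factors gamma^(T - 1 - t) for t = 0, ..., T - 1 (zero-indexed steps),
    # so the final reward is undiscounted and earlier rewards are discounted more.
    discounts = gamma ** torch.arange(T - 1, -1, -1, dtype=log_probs.dtype)
    # Negative sum over timesteps; calling .backward() on this loss produces
    # the policy-gradient terms derived above for this single trajectory.
    return -(log_probs * discounts * rewards).sum()
</syntaxhighlight>

Averaging this loss over a batch of sampled episodes gives a Monte Carlo estimate of the expectation over <math> \tau \sim \pi_\theta </math> before the backwards pass.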