Allen's REINFORCE notes

Thus, using this idea, we can rewrite our gradient as <math> \sum_\tau R(\tau) P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) </math>. Finally, using the definition of expectation again, we have <math> \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ R(\tau) \sum_{t=0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \right] </math>.
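
Below is a minimal sketch of how this gradient can be estimated from a sampled trajectory with automatic differentiation. The policy network, its layer sizes, and the <code>surrogate_loss</code> helper are illustrative assumptions, not part of these notes.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Hypothetical policy network; the state size (4) and action count (2)
# are placeholder assumptions, not values from the notes.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

def surrogate_loss(trajectory):
    """trajectory: list of (state, action, reward) tuples from one episode.

    Minimizing this loss with gradient descent performs ascent on
    E[ R(tau) * sum_t log pi_theta(a_t | s_t) ], matching the gradient above.
    """
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32)
                          for s, _, _ in trajectory])
    actions = torch.tensor([a for _, a, _ in trajectory])
    total_return = sum(r for _, _, r in trajectory)  # R(tau)
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    return -(total_return * log_probs.sum())  # negate because optimizers minimize
</syntaxhighlight>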
=== Loss Function Computation ===
It is tricky to give our policy a notion of "total" reward and "total" probability. Thus, we want to change these values, currently parameterized by <math> \tau </math>, to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps.

We know that <math> R(\tau) </math> is the cumulative total reward over all timesteps. Thus, we can rewrite the <math> R(\tau) </math> component at some timestep t as <math> \gamma^{T - t}r_t </math>, where <math> \gamma </math> is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is <math> P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) </math>. Since <math> P(s_0) </math> and <math> P(s_{t+1} | s_t, a_t) </math> are determined by the environment and independent of the policy, their gradient with respect to <math> \theta </math> is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of those probabilities, we get our final gradient expression: <math> \nabla_\theta J(\theta) = \sum_\tau P(\tau | \theta) R(\tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) </math>.
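
As a rough sketch of the resulting per-timestep loss, the hypothetical <code>reinforce_loss</code> helper below weights each log-probability by <math> \gamma^{T - t} r_t </math> as described above; the function name and the default discount of 0.99 are assumptions for illustration.

<syntaxhighlight lang="python">
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Per-timestep REINFORCE loss using the weight gamma^(T - t) * r_t.

    log_probs: 1-D tensor of log pi_theta(a_t | s_t) for t = 0..T.
    rewards:   1-D tensor of r_t for t = 0..T.
    """
    T = rewards.shape[0] - 1
    t = torch.arange(rewards.shape[0], dtype=rewards.dtype)
    weights = (gamma ** (T - t)) * rewards  # gamma^(T - t) * r_t
    # Negating lets a standard optimizer (which minimizes) ascend J(theta).
    return -(weights * log_probs).sum()
</syntaxhighlight>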