Allen's PPO Notes

1,108 bytes added, 26 May
=== Links ===
Hugging Face Deep RL Course
 
=== Advantage Function ===
<math> A(s, a) = Q(s, a) - V(s) </math>. Intuitively, this is the extra reward we get from taking action <math>a</math> in state <math>s</math> compared to the mean reward at that state. The advantage function tells us how good the action is: if it's positive, the action is better than average at that state, so we want to move in that direction; if it's negative, the action is worse than average at that state, so we move in the opposite direction. Since it's often difficult and expensive to compute the Q value for all state-action pairs, we replace Q(s, a) with our sampled reward from the action. Using the advantage in the policy gradient objective instead of the raw reward improves stability.
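A minimal sketch of this estimate, assuming that "sampled reward" means the discounted return observed from each step onward and that some value estimate <math>V(s_t)</math> is already available for each visited state. The function names, the discount factor <code>gamma</code>, and the example numbers are all hypothetical and only serve as illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for one finished episode."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each step reuses the return after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, gamma=0.99):
    """Estimate A(s_t, a_t) as the sampled return minus the baseline V(s_t).

    Assumption: the sampled return stands in for Q(s_t, a_t).
    """
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


# Hypothetical three-step episode with made-up value estimates.
rewards = [1.0, 0.0, 2.0]
values = [2.0, 2.0, 1.0]
adv = advantages(rewards, values, gamma=0.5)
print(adv)
```

In this toy episode the first two advantages come out negative (the sampled return fell short of the baseline) and the last is positive, which matches the "move away from" and "move toward" reading above.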
=== Motivation ===
=== Ratio Function ===
Intuitively, to measure how far our current policy has diverged from the old one, we need a way to compare how likely each policy makes the same state-action pair. We denote this as <math> r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t|s_t)}</math>. A ratio greater than one indicates the action is more likely under the current policy than under the old policy; a ratio between 0 and 1 indicates the opposite. This ratio replaces the log probability in the policy objective function as the way of accounting for the change in parameters.

Let's step back for a moment and think about why we might want to do this. In standard policy gradients, after we use a trajectory to update our policy, the experience gained in that trajectory no longer reflects our current policy, because it was sampled from the old one. We resolve this using importance sampling: each sample is reweighted by the ratio, so if the actions of the old trajectory have become unlikely, the influence of that experience is reduced. Thus, prior to clipping, our new objective can be written in expectation form as <math> E \left[r_t(\theta)A_t\right] </math>.

=== Clipping ===
It's easier to understand the clipping when we break it down based on why we are clipping. Let's consider some possible cases:
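As a starting point for those cases, here is a minimal sketch of the ratio and the unclipped objective <math> E \left[r_t(\theta)A_t\right] </math>, with the expectation replaced by a sample mean. It assumes the log probabilities of each taken action under both policies are already stored. The function names and example numbers are hypothetical, and computing the ratio as an exponentiated difference of log probabilities is just one common choice, not something these notes prescribe.

```python
import math


def probability_ratio(new_log_prob, old_log_prob):
    """r_t = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), via log probabilities."""
    return math.exp(new_log_prob - old_log_prob)


def unclipped_objective(new_log_probs, old_log_probs, advantages):
    """Sample mean of r_t * A_t over a batch (no clipping applied)."""
    terms = [
        probability_ratio(n, o) * a
        for n, o, a in zip(new_log_probs, old_log_probs, advantages)
    ]
    return sum(terms) / len(terms)


# Hypothetical batch of two steps: the first action became more likely
# under the current policy (0.5 -> 0.75), the second less likely (0.5 -> 0.25).
old_log_probs = [math.log(0.5), math.log(0.5)]
new_log_probs = [math.log(0.75), math.log(0.25)]
advantages = [1.0, -1.0]

ratios = [probability_ratio(n, o) for n, o in zip(new_log_probs, old_log_probs)]
objective = unclipped_objective(new_log_probs, old_log_probs, advantages)
print(ratios, objective)
```

Here the first ratio is above one and the second is between 0 and 1, so the two steps land on opposite sides of the cases to consider. The step whose action became less likely contributes with a smaller weight, which is the importance-sampling effect described above.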