Allen's PPO Notes

Advantage Function

<math>A(s_t, a_t) = Q(s_t, a_t) - V(s_t)</math>. Intuitively: the extra reward we get if we take action <math>a_t</math> at state <math>s_t</math> compared to the mean reward at that state. We use the advantage function to tell us how good the action is: if it's positive, the action is better than the average at that state, so we want to move in that direction; if it's negative, the action is worse than the average at that state, so we move in the opposite direction.
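As a concrete illustration (a minimal sketch, not from the original notes; the function and variable names are placeholders), the advantage at each step can be estimated by subtracting a learned value baseline <math>V(s_t)</math> from an empirical discounted return standing in for <math>Q(s_t, a_t)</math>:

<syntaxhighlight lang="python">
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, values, gamma=0.99):
    """A_t ~ G_t - V(s_t): how much better the taken action was than the baseline."""
    return discounted_returns(rewards, gamma) - values

# One short rollout: per-step rewards and the critic's value predictions.
rewards = np.array([1.0, 0.0, 0.5, 0.0, 1.0])
values = np.array([1.8, 1.0, 1.2, 0.9, 0.8])
print(advantages(rewards, values))  # positive -> better than expected, negative -> worse
</syntaxhighlight>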

Motivation

Intuition: we want to avoid too large a policy update, for two reasons:

  1. Smaller policy updates more likely to converge to optimal
  2. Falling "off the cliff" might mean it's impossible to recover

How we solve this: measure how much the policy changes with respect to the previous one, and clip the ratio to <math>[1-\varepsilon, 1 + \varepsilon]</math>, removing the incentive to go too far.
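A minimal sketch of how the clipping works in the PPO objective (illustrative Python, not from the notes; `clipped_surrogate`, `ratio`, and `advantage` are placeholder names): once the ratio leaves <math>[1-\varepsilon, 1+\varepsilon]</math>, the clipped term is constant, so the objective stops rewarding further movement in that direction.

<syntaxhighlight lang="python">
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-step PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage, pushing the ratio beyond 1 + eps earns nothing extra,
# so there is no incentive for the update to move the policy that far.
for r in [0.8, 1.0, 1.2, 1.5, 2.0]:
    print(r, clipped_surrogate(r, advantage=1.0))
</syntaxhighlight>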

Ratio Function

Intuitively, if we want to measure the divergence between our old and current policies, we need some way of comparing the probability of each state-action pair under the old and new policies. We denote this as <math>r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t|s_t)}</math>. A ratio greater than one indicates the action is more likely under the current policy than the old policy, and a ratio between 0 and 1 indicates the opposite.
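In practice the ratio is usually computed from stored log-probabilities rather than raw probabilities, since <math>\exp(\log \pi_\theta - \log \pi_{\theta_{old}})</math> is numerically safer. A small sketch (illustrative, not from the notes; names are placeholders):

<syntaxhighlight lang="python">
import numpy as np

def probability_ratio(new_log_probs, old_log_probs):
    """r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), via exp of log-prob difference."""
    return np.exp(new_log_probs - old_log_probs)

# Log-probabilities of the same sampled actions under the old and current policies.
old_log_probs = np.log(np.array([0.25, 0.10, 0.60]))
new_log_probs = np.log(np.array([0.30, 0.08, 0.60]))

print(probability_ratio(new_log_probs, old_log_probs))
# ~[1.2, 0.8, 1.0]: > 1 means the action got more likely, < 1 less likely, 1 unchanged.
</syntaxhighlight>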