Allen's PPO Notes

# Falling "off the cliff": an overly large policy update can collapse performance, and the policy might never recover.
How we solve this: measure how much the policy changes with respect to the previous one and clip the ratio to <math>[1-\varepsilon, 1 + \varepsilon]</math>, removing the incentive to move too far in a single update.
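A minimal sketch of what this clipping might look like in code (PyTorch; the function and argument names are illustrative, not from these notes). It takes the probability ratio <math>r_t(\theta)</math> defined in the next section and per-timestep advantage estimates:

<syntaxhighlight lang="python">
import torch

def clipped_surrogate_loss(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized.

    ratio:     r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), per timestep
    advantage: advantage estimate A_t, per timestep
    epsilon:   clip range, giving [1 - eps, 1 + eps]
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the element-wise minimum means pushing the ratio outside the
    # clip range never improves the objective, so there is no incentive
    # to move the policy too far in a single update.
    return -torch.mean(torch.min(unclipped, clipped))
</syntaxhighlight>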
 
=== Ratio Function ===
Intuitively, to measure how far the current policy has diverged from the old one, we compare the probability each policy assigns to the same state-action pair. We denote this ratio as <math>r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}</math>. A ratio greater than one indicates the action is more likely under the current policy than under the old policy, and a ratio between 0 and 1 indicates the opposite.
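In practice the ratio is usually computed from log-probabilities for numerical stability; a small sketch of how that might look (PyTorch; names are illustrative, not from these notes):

<syntaxhighlight lang="python">
import torch

def probability_ratio(log_prob_new, log_prob_old):
    """r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t).

    Computed as exp(log pi_new - log pi_old). A value greater than 1 means
    the action became more likely under the current policy; a value between
    0 and 1 means it became less likely.
    """
    return torch.exp(log_prob_new - log_prob_old)
</syntaxhighlight>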