Allen's PPO Notes

# Falling "off the cliff": an overly large policy update can collapse performance, and the policy might never recover.
How we solve this: measure how much the policy changes with respect to the previous one and clip the ratio to <math>[1-\varepsilon, 1 + \varepsilon]</math>, removing the incentive to move too far in a single update.
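A minimal sketch of what this clipping might look like in code (PyTorch; the function and argument names are illustrative, not from these notes). It takes the probability ratio <math>r_t(\theta)</math> defined in the next section and per-timestep advantage estimates:

<syntaxhighlight lang="python">
import torch

def clipped_surrogate_loss(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized.

    ratio:     r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), per timestep
    advantage: advantage estimate A_t, per timestep
    epsilon:   clip range, giving [1 - eps, 1 + eps]
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the element-wise minimum means pushing the ratio outside the
    # clip range never improves the objective, so there is no incentive
    # to move the policy too far in a single update.
    return -torch.mean(torch.min(unclipped, clipped))
</syntaxhighlight>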
 
=== Ratio Function ===
Intuitively, to measure how far the current policy has diverged from the old one, we compare the probability each policy assigns to the same state-action pair. We denote this ratio as <math>r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}</math>. A ratio greater than one indicates the action is more likely under the current policy than under the old policy, and a ratio between 0 and 1 indicates the opposite.
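In practice the ratio is usually computed from log-probabilities for numerical stability; a small sketch of how that might look (PyTorch; names are illustrative, not from these notes):

<syntaxhighlight lang="python">
import torch

def probability_ratio(log_prob_new, log_prob_old):
    """r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t).

    Computed as exp(log pi_new - log pi_old). A value greater than 1 means
    the action became more likely under the current policy; a value between
    0 and 1 means it became less likely.
    """
    return torch.exp(log_prob_new - log_prob_old)
</syntaxhighlight>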