Changes

Allen's PPO Notes

7 bytes added, 19:28, 26 May 2024

no edit summary

#Smaller policy updates are more likely to converge to an optimal policy

#Falling "off the cliff" after an overly large update might make it impossible to recover

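How we solve this: measure how much the policy changes w.r.t. the previous policy (the probability ratio between the new and old policies), and clip that ratio to <math>[1-\varepsilon, 1+\varepsilon]</math>, removing the incentive to move too far in a single update.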
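The clipping idea above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard PPO clipped surrogate, not code from these notes: the function name <code>clipped_surrogate</code>, its arguments, and the default <code>epsilon=0.2</code> are all assumptions chosen for the example.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    ratio     : pi_new(a|s) / pi_old(a|s), how far the policy has moved
    advantage : advantage estimate for the sampled action
    epsilon   : clip range (0.2 is an assumed default, not from the notes)
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)

    # Objective without any restriction on how far the policy moves.
    unclipped = ratio * advantage
    # Same objective with the ratio confined to [1 - epsilon, 1 + epsilon].
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage

    # Taking the elementwise minimum gives a pessimistic bound: once the
    # ratio leaves the clip range in the direction that would improve the
    # objective, the value stops growing, so there is no gain from going further.
    return np.minimum(unclipped, clipped)
```

With these assumptions, a ratio of 1.5 and an advantage of 1.0 gives about 1.2 rather than 1.5, so pushing the ratio past <math>1+\varepsilon</math> earns nothing extra. With a negative advantage and the same ratio, the minimum keeps the unclipped value of -1.5, so a move in the wrong direction is still fully penalized.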

Allen12


Humanoid Robots Wiki β
