Allen's PPO Notes

From Humanoid Robots Wiki

Revision as of 19:28, 26 May 2024 by Allen12 (talk | contribs)

Intuition: we want to avoid policy updates that are too large.

Smaller policy updates are more likely to converge to an optimal policy.
Falling "off the cliff" with one oversized update can leave the policy somewhere it may be impossible to recover from.

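How we solve this: measure how much the policy has changed with respect to the previous policy, using the ratio of the new policy's action probability to the old policy's, and clip that ratio to $[1-\varepsilon ,1+\varepsilon ]$. Clipping removes the incentive to move too far in a single update.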
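
The clipping step can be sketched in a few lines of plain Python. This is only an illustration of the standard PPO clipped surrogate, $\min(r A, \mathrm{clip}(r, 1-\varepsilon, 1+\varepsilon) A)$, not code from these notes: the function name `clipped_surrogate`, its argument names, and the default `epsilon=0.2` are all assumptions made for the example, and a real implementation would work on tensors inside an autodiff framework.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    new and previous policy. advantages: advantage estimates per sample.
    All names and the default epsilon are illustrative assumptions.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - epsilon, 1 + epsilon].
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # One sample where the new policy is 1.5x as likely to take the action.
    print(clipped_surrogate([math.log(1.5)], [0.0], [1.0]))
```

With a positive advantage and a ratio of 1.5, the objective is capped at the value it would have at a ratio of 1.2, so pushing the ratio further earns nothing. With a negative advantage and the same ratio, the unclipped term is the smaller one and is kept, so a move in the wrong direction is still fully penalized.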
