Difference between revisions of "Allen's PPO Notes"

Revision as of 19:28, 26 May 2024

Intuition: We want to avoid policy updates that are too large.

How we solve this: Measure how much the policy changes relative to the previous policy (as a probability ratio), then clip that ratio to $[1-\varepsilon ,1+\varepsilon ]$, which removes the incentive to move too far.
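
A minimal sketch of the clipping step for a single sample, written as an illustration and not taken from these notes. It assumes the standard PPO clipped surrogate, which takes the minimum of the unclipped and clipped terms. The names `clipped_surrogate`, `logp_new`, `logp_old`, `advantage`, and the default `eps=0.2` are assumptions chosen for the example.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample clipped surrogate objective (a quantity to maximize).

    logp_new / logp_old: log-probability of the taken action under the
    current and previous policy (hypothetical inputs for this sketch).
    advantage: advantage estimate for that action.
    eps: clip range, so the ratio is limited to [1 - eps, 1 + eps].
    """
    # How much the policy changed w.r.t. the previous one.
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the smaller term means there is no extra reward for
    # pushing the ratio beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Ratio 1.5 with a positive advantage is capped at 1 + eps.
    print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
    # Ratio 0.5 with a negative advantage is capped at 1 - eps.
    print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```

With a positive advantage the objective stops growing once the ratio exceeds $1+\varepsilon$, and with a negative advantage it stops growing once the ratio drops below $1-\varepsilon$, so in both cases the gradient through the ratio vanishes outside the clip range.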

@@ Line 2: / Line 2: @@
 #Smaller policy updates more likely to converge to optimal
 #Falling "off the cliff" might mean it's impossible to recover
-How we solve this: Measure how much policy changes w.r.t. previous, clip ratio to <math>[1-\varepislon, 1 + \varepsilon]</math> removing incentive to go too far.
+How we solve this: Measure how much policy changes w.r.t. previous, clip ratio to <math>[1-\varepsilon, 1 + \varepsilon]</math> removing incentive to go too far.