Allen's PPO Notes
Revision as of 19:28, 26 May 2024
Intuition: We want to avoid policy updates that are too large.
- Smaller policy updates are more likely to converge to the optimal policy
- Falling "off the cliff" with one oversized update might make it impossible to recover
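How we solve this: Measure how much the policy changes w.r.t. the previous policy (the probability ratio between the two), and clip that ratio to <math>[1-\varepsilon, 1+\varepsilon]</math>, removing the incentive to go too far.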
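
As a rough illustration of the clipping idea, here is a minimal per-sample sketch in plain Python. The function name `clipped_surrogate`, its arguments, and the default `epsilon=0.2` are assumptions made for this example and are not taken from these notes. It also assumes the usual formulation in which the objective is the minimum of the unclipped and clipped terms, each weighted by an advantage estimate.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, epsilon=0.2):
    """Per-sample clipped surrogate objective (a value to maximize).

    All names and the default epsilon are illustrative assumptions.
    """
    # Probability ratio of the new policy to the previous policy for the taken action.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum means there is no extra gain from pushing
    # the ratio beyond the clip range in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For example, with a positive advantage of 1.0 and a ratio of 1.5, the objective is capped at 1.2, so moving the ratio even further from 1 earns nothing more. With a negative advantage the minimum keeps the more pessimistic of the two terms, so a move in the harmful direction is still penalized in full.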