Difference between revisions of "Allen's PPO Notes"

From Humanoid Robots Wiki

Revision as of 19:28, 26 May 2024

Intuition: We want to avoid policy updates that are too large.

  1. Smaller policy updates are more likely to converge to the optimal policy
  2. Falling "off the cliff" (one overly large update that lands on a bad policy) might make it impossible to recover

How we solve this: Measure how much the policy changes with respect to the previous policy (the ratio of new to old action probabilities), and clip that ratio to <math>[1-\varepsilon, 1+\varepsilon]</math>, removing the incentive to go too far.
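
The clipping step can be sketched in a few lines. This is a minimal illustration, not code from these notes: it assumes the standard PPO clipped surrogate, min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio and A is an advantage estimate. The function name `clipped_surrogate`, its arguments, and the default `epsilon=0.2` are hypothetical choices made for the example.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, epsilon=0.2):
    """Per-sample clipped surrogate objective (to be maximized).

    Assumed inputs (hypothetical names, for illustration only):
      logp_new  -- log-probability of the action under the updated policy
      logp_old  -- log-probability of the same action under the previous policy
      advantage -- advantage estimate for that action
      epsilon   -- clip range; 0.2 is just an example value
    """
    # How much the policy changed for this action: r = pi_new / pi_old.
    ratio = math.exp(logp_new - logp_old)

    # Clip the ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))

    # Take the smaller of the unclipped and clipped terms, so moving the
    # ratio beyond the clip range never increases the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # The ratio is 1.5 here, well outside the clip range for epsilon = 0.2.
    print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
```

With a ratio of 1.5, a positive advantage of 1.0, and epsilon 0.2, the objective is capped at 1.2 times the advantage, so pushing the ratio further gains nothing. With a negative advantage the cap applies on the other side, once the ratio drops below 1 - epsilon. This is the "removing incentive" part of the intuition above.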