Allen's PPO Notes - Revision history

Allen12 at 01:33, 27 May 2024

2024-05-27T01:33:43Z

Allen12 at 01:28, 27 May 2024

2024-05-27T01:28:27Z

Allen12 at 20:17, 26 May 2024

2024-05-26T20:17:22Z

Allen12 at 19:47, 26 May 2024

2024-05-26T19:47:23Z

Allen12 at 19:38, 26 May 2024

2024-05-26T19:38:10Z

Allen12 at 19:28, 26 May 2024

2024-05-26T19:28:16Z

Allen12 at 19:28, 26 May 2024

2024-05-26T19:28:00Z

Allen12: Created page with "Intuition: Want to avoid too large of a policy update #Smaller policy updates more likely to converge to optimal #Falling "off the cliff" might mean it's impossible to recover..."

2024-05-26T19:27:49Z

New page

Intuition: Want to avoid too large of a policy update
#Smaller policy updates more likely to converge to optimal
#Falling "off the cliff" might mean it's impossible to recover
How we solve this: Measure how much policy changes w.r.t. previous, clip ratio to <math>[1-\varepsilon, 1 + \varepsilon]</math> removing incentive to go too far.

@@ Line 17: / Line 17: @@
 === Clipping ===
-Our clipped objective function is <math> E_t
+Our clipped objective function is <math> E_t </math>
 It's easier to understand this clipping when we break it down based on why we are clipping. Let's consider some possible cases:
 # The ratio is in the range <math> [1 - \epsilon, 1 + \epsilon] </math>. We have no reason to clip: if the advantage is positive, we should encourage our policy to increase the probability of that action, and if it is negative, we should decrease the probability that the policy takes the action.
 # The ratio is lower than <math> 1 - \epsilon </math>. If the advantage is positive, we still want to increase the probability of taking that action, so nothing is clipped. If the advantage is negative, a policy update would further decrease the probability of taking that action, so the objective is clipped to <math> (1 - \epsilon) A_t </math>, its gradient is 0, and we don't update our weights: even though the reward here was worse, we still want to explore.
 # The ratio is greater than <math> 1 + \epsilon </math>. If the advantage is positive, we already have a higher probability of taking the action than in the previous policy, so the objective is clipped to <math> (1 + \epsilon) A_t </math> and we don't update further, which keeps us from getting too greedy. If the advantage is negative, the unclipped term is the smaller of the two, so no clipping applies and the update still decreases the probability of taking that action.
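
The cases above can be checked numerically. The sketch below assumes the standard PPO clipped surrogate, min(r * A, clip(r, 1 - eps, 1 + eps) * A), since the formula in these notes is cut off after <math> E_t </math>. The function names, the value eps = 0.2, and the finite-difference check are illustrative choices, not part of the original notes.

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def objective_slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference slope of the objective with respect to the ratio."""
    upper = clipped_objective(ratio + h, advantage, eps)
    lower = clipped_objective(ratio - h, advantage, eps)
    return (upper - lower) / (2.0 * h)


if __name__ == "__main__":
    # One ratio inside the range, one below 1 - eps, one above 1 + eps.
    for ratio in (1.0, 0.5, 1.5):
        for advantage in (1.0, -1.0):
            slope = objective_slope(ratio, advantage)
            print(f"ratio={ratio}, advantage={advantage:+}, slope={slope:+.2f}")
```

A slope of zero means the sample contributes no gradient. Under these assumptions that happens only for a low ratio with negative advantage and a high ratio with positive advantage.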

@@ Line 14: / Line 14: @@
 Intuitively, if we want to measure the divergence between our old and current policies, we want some way of figuring out the difference between action-state pairs in the old and new policies. We denote this as <math> r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t|s_t)} </math>. A ratio greater than one indicates the action is more likely in the current policy than the old policy, and if it's between 0 and 1, it indicates the opposite. This ratio function replaces the log probability in the policy objective function as the way of accounting for the change in parameters.
-Let's step back for a moment and think about why we might want to do this. In standard policy gradients, after we use a trajectory to update our policy, the experience gained in that trajectory is now incorrect with respect to our current policy. We resolve this using importance sampling. If the actions of the old trajectory have become unlikely, the influence of that experience will be reduced. Thus, prior to clipping, our new loss function can be written in expectation form as <math> E \left[r_t(\theta)A_t\right].
+Let's step back for a moment and think about why we might want to do this. In standard policy gradients, after we use a trajectory to update our policy, the experience gained in that trajectory is now incorrect with respect to our current policy. We resolve this using importance sampling. If the actions of the old trajectory have become unlikely, the influence of that experience will be reduced. Thus, prior to clipping, our new loss function can be written in expectation form as <math> E \left[r_t(\theta)A_t\right] </math>. If we take the gradient, it ends up nearly identical to the standard policy gradient, only with <math> \nabla_\theta \pi_\theta(a_t | s_t) </math> divided by <math> \pi_{\theta_{old}}(a_t | s_t) </math> instead of <math> \pi_\theta(a_t | s_t) </math>.
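
To spell out the gradient comparison, here is a short derivation. It is a sketch that treats <math> A_t </math> as a constant with respect to <math> \theta </math> and takes the expectation over samples from the old policy:

<math> \nabla_\theta E\left[r_t(\theta) A_t\right] = E\left[\frac{\nabla_\theta \pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} A_t\right] = E\left[r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t | s_t) \, A_t\right] </math>

At <math> \theta = \theta_{old} </math> the ratio is 1, so this reduces to the usual policy gradient <math> E\left[\nabla_\theta \log \pi_\theta(a_t | s_t) \, A_t\right] </math>.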
 === Clipping ===
 It's easier to understand this clipping when we break it down based on why we are clipping. Let's consider some possible cases:

@@ Line 1: / Line 1: @@
 === Advantage Function ===
-<math> A(s, a) = Q(s, a) - V(s) </math>. Intuitively: the extra reward we get if we take action <math> a </math> at state <math> s </math> compared to the mean reward at that state. We use this advantage function to tell us how good the action is - if it's positive, the action is better than others at that state so we want to move in that direction, and if it's negative, the action is worse than others at that state so we move in the opposite direction.
+<math> A(s, a) = Q(s, a) - V(s) </math>. Intuitively: the extra reward we get if we take action <math> a </math> at state <math> s </math> compared to the mean reward at that state. We use this advantage function to tell us how good the action is - if it's positive, the action is better than others at that state so we want to move in that direction, and if it's negative, the action is worse than others at that state so we move in the opposite direction. Since it's often difficult and expensive to compute the Q value for all state-action pairs, we replace <math> Q(s, a) </math> with the sampled reward from the action. Using this advantage in the policy gradient objective instead of the raw reward improves stability.
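
A minimal sketch of the sampled advantage follows. It assumes that "sampled reward" means the discounted return-to-go from one trajectory and that value estimates V(s_t) are already available. The function names, the discount factor gamma, and the numbers are all made up for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * G_{t+1}, used as a sampled stand-in for Q(s_t, a_t)."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def sampled_advantages(rewards, values, gamma=0.99):
    """Approximate A_t as G_t - V(s_t), with V(s_t) supplied by the caller."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]  # made-up rewards from one trajectory
    values = [2.0, 2.0, 1.0]   # made-up value estimates V(s_t)
    print(sampled_advantages(rewards, values, gamma=0.5))  # [-0.5, -1.0, 1.0]
```

Only the last step has a positive advantage here, so only that action would be pushed toward higher probability.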
 === Motivation ===
@@ Line 9: / Line 12: @@
 === Ratio Function ===
-Intuitively, if we want to measure the divergence between our old and current policies, we want some way of figuring out the difference between action-state pairs in the old and new policies. We denote this as <math> r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t|s_t)}. A ratio greater than one indicates the action is more likely in the current policy than the old policy, and if it's between 0 and 1, it indicates the opposite.
+Intuitively, if we want to measure the divergence between our old and current policies, we want some way of figuring out the difference between action-state pairs in the old and new policies. We denote this as <math> r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t|s_t)} </math>. A ratio greater than one indicates the action is more likely in the current policy than the old policy, and if it's between 0 and 1, it indicates the opposite. This ratio function replaces the log probability in the policy objective function as the way of accounting for the change in parameters.
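
As a small illustration of the ratio, the sketch below computes it from log probabilities, which is one common way to avoid dividing two very small numbers. The function name and the example probabilities are hypothetical and not taken from the notes.

```python
import math


def probability_ratio(logp_new, logp_old):
    """r_t = pi_theta(a|s) / pi_theta_old(a|s), computed as exp(log pi_theta - log pi_theta_old)."""
    return math.exp(logp_new - logp_old)


if __name__ == "__main__":
    # The action became more likely: probability 0.25 under the old policy, 0.5 under the new one.
    print(probability_ratio(math.log(0.5), math.log(0.25)))  # approximately 2.0
    # The action became less likely: 0.5 before, 0.25 now.
    print(probability_ratio(math.log(0.25), math.log(0.5)))  # approximately 0.5
```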

← Older revision		Revision as of 19:47, 26 May 2024
@@ Line 7: / Line 7: @@
 #Falling "off the cliff" might mean it's impossible to recover
 How we solve this: Measure how much policy changes w.r.t. previous, clip ratio to <math>[1-\varepsilon, 1 + \varepsilon]</math> removing incentive to go too far.
+
+=== Ratio Function ===
+Intuitively, if we want to measure the divergence between our old and current policies, we want some way of figuring out the difference between action-state pairs in the old and new policies. We denote this as <math> r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t|s_t)}. A ratio greater than one indicates the action is more likely in the current policy than the old policy, and if it's between 0 and 1, it indicates the opposite.

@@ Line 2: / Line 2: @@
 #Smaller policy updates more likely to converge to optimal
 #Falling "off the cliff" might mean it's impossible to recover
-How we solve this: Measure how much policy changes w.r.t. previous, clip ratio to <math>[1-\varepislon, 1 + \varepsilon]</math> removing incentive to go too far.
+How we solve this: Measure how much policy changes w.r.t. previous, clip ratio to <math>[1-\varepsilon, 1 + \varepsilon]</math> removing incentive to go too far.