53
edits
Changes
no edit summary
=== Clipping ===
Our clipped objective function is <math> E_t</math>
It's easier to understand this clipping when we break it down based on why we are clipping. Let's consider some possible cases:
# The ratio is in the range. If the ratio is in the range, we have no reason to clip - if advantage is positive, we should encourage our policy to increase the probability of that action, and if negative, we should decrease the probability that the policy takes the action.
# The ratio is lower than <math> 1 - \epsilon </math>. If the advantage is positive, we still want to increase the probability of taking that action. If the advantage is negative, then doing a policy update will decrease further the probability of taking that action, so we instead clip the gradient to 0 and don't update our weights - even though the reward here was worse, we still want to explore.
# The ratio is greater than <math> 1 + \epsilon </math>. If the advantage is positive, we already have a higher probability of taking the action than in the previous policy. Thus, we don't want to update further, and get to greedy. If the advantage is negative, we clip it to <math> 1 - \epsilon </math> as usual.