Allen's REINFORCE notes

Recall that the objective of reinforcement learning is to find an optimal policy $\pi ^{*}$ , which we encode in a neural network with parameters $\theta ^{*}$ . The policy $\pi _{\theta }$ is a mapping from observations to actions. The optimal parameters are defined as $\theta ^{*}={\text{argmax}}_{\theta }E_{\tau \sim p_{\theta }(\tau )}\left[\sum _{t}r(s_{t},a_{t})\right]$ . Let's unpack what this means. In plain English, the optimal policy is the one whose expected total reward, taken over trajectories ( $\tau$ ) generated by following that policy, is the highest among all policies.

Overview

Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
For each episode:
  While not terminated:
    Get observation from environment
    Use policy network to map observation to action distribution
    Randomly sample one action from action distribution
    Compute the log-probability of the sampled action
    Step environment using action and store reward
  Calculate the loss over the entire trajectory as a function of the log-probabilities and rewards
  The loss is differentiable with respect to each parameter, so compute its gradient (how a small change in each parameter changes the loss)
  Use gradient descent on the loss to update the weights
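
The steps above can be sketched in plain Python. This is a minimal illustration under several assumptions that are not part of these notes: a made-up four-cell corridor stands in for the environment, a tabular softmax policy (one row of logits per state) stands in for the neural network, and the gradient of the log-probability is written out by hand instead of using automatic differentiation. The names `step`, `run_episode`, `reinforce_update`, and `train` are hypothetical.

```python
import math
import random

N_STATES = 4     # corridor cells 0..3; cell 3 is the goal (made-up toy task)
N_ACTIONS = 2    # 0 = left, 1 = right
MAX_STEPS = 50   # cap on episode length


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Toy environment: move left or right, reward -1 per step, done at the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, -1.0, done


def run_episode(theta, rng):
    """Roll out one episode; return visited states, actions, log-probs, rewards."""
    state, done = 0, False
    states, actions, log_probs, rewards = [], [], [], []
    while not done and len(rewards) < MAX_STEPS:
        probs = softmax(theta[state])                 # observation -> action distribution
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]  # sample one action
        states.append(state)
        actions.append(action)
        log_probs.append(math.log(probs[action]))     # log-probability of that action
        state, reward, done = step(state, action)     # step environment, store reward
        rewards.append(reward)
    return states, actions, log_probs, rewards


def reinforce_update(theta, states, actions, rewards, lr):
    """One gradient-descent step on the loss -R(tau) * sum_t log pi(a_t | s_t)."""
    total_return = sum(rewards)
    grad = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for s, a in zip(states, actions):
        probs = softmax(theta[s])
        for b in range(N_ACTIONS):
            # d log pi(a | s) / d theta[s][b] for a softmax policy
            dlogp = (1.0 if b == a else 0.0) - probs[b]
            grad[s][b] += -total_return * dlogp       # gradient of the loss
    for s in range(N_STATES):
        for b in range(N_ACTIONS):
            theta[s][b] -= lr * grad[s][b]            # gradient descent on the loss


def train(n_episodes=2000, lr=0.005, seed=0):
    """Run REINFORCE on the toy corridor; return final logits and episode returns."""
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    returns = []
    for _ in range(n_episodes):
        states, actions, log_probs, rewards = run_episode(theta, rng)
        # With an autodiff library the loss would be built from log_probs and rewards;
        # here the same gradient is computed by hand in reinforce_update.
        reinforce_update(theta, states, actions, rewards, lr)
        returns.append(sum(rewards))
    return theta, returns
```

Calling `theta, returns = train()` gives the learned logits and the return of every episode. In this toy corridor the best achievable return is -3, since the goal is three steps to the right of the start.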

Objective Function

The goal of reinforcement learning is to maximize the expected reward over the entire episode. We use $R(\tau )$ to denote the total reward over some trajectory $\tau$ defined by our policy. Thus we want to maximize $E_{\tau \sim \pi _{\theta }}[R(\tau )]$ . We can use the definition of expected value to expand this as $\sum _{\tau }P(\tau |\theta )R(\tau )$ , where the probability of a given trajectory occurring can further be expressed as $P(\tau |\theta )=P(s_{0})\prod _{t=0}^{T}\pi _{\theta }(a_{t}|s_{t})P(s_{t+1}|s_{t},a_{t})$ .
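
To make the sum over trajectories concrete, the sketch below enumerates every trajectory of a tiny two-state, two-action problem and evaluates $\sum _{\tau }P(\tau |\theta )R(\tau )$ directly. All numbers (initial distribution, transition probabilities, policy, reward) are made up for illustration, and the function names are hypothetical. Enumeration like this is only feasible for very small problems.

```python
from itertools import product

STATES = (0, 1)
ACTIONS = (0, 1)
HORIZON = 2  # number of actions per trajectory (toy value)

# All numbers below are invented for illustration.
P_INIT = {0: 0.6, 1: 0.4}                            # P(s_0)
P_TRANS = {                                          # P(s' | s, a)
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.3, 1: 0.7},
}
POLICY = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.8, 1: 0.2}}  # pi_theta(a | s)


def reward(state, action):
    """r(s, a): action 1 earns 1, action 0 earns nothing."""
    return 1.0 if action == 1 else 0.0


def trajectory_prob(states, actions, policy):
    """P(tau | theta) = P(s_0) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)."""
    prob = P_INIT[states[0]]
    for t, a in enumerate(actions):
        prob *= policy[states[t]][a] * P_TRANS[(states[t], a)][states[t + 1]]
    return prob


def trajectory_return(states, actions):
    """R(tau): total reward along the trajectory."""
    return sum(reward(s, a) for s, a in zip(states, actions))


def enumerate_trajectories():
    """Yield every (states, actions) pair with HORIZON actions."""
    for states in product(STATES, repeat=HORIZON + 1):
        for actions in product(ACTIONS, repeat=HORIZON):
            yield states, actions


def expected_return(policy):
    """J(theta) = sum over tau of P(tau | theta) * R(tau)."""
    return sum(
        trajectory_prob(s, a, policy) * trajectory_return(s, a)
        for s, a in enumerate_trajectories()
    )
```

As a sanity check, the trajectory probabilities sum to 1, and a policy that always picks action 1 has an expected return equal to `HORIZON`.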

Now we want to find the gradient of the objective $J(\theta )=E_{\tau \sim \pi _{\theta }}[R(\tau )]$ , namely $\nabla _{\theta }\sum _{\tau }P(\tau |\theta )R(\tau )$ . Since the reward $R(\tau )$ does not depend on the parameters, we can move the gradient inside the sum: $\sum _{\tau }R(\tau )\nabla _{\theta }P(\tau |\theta )$ . The next step uses what's called the Log Derivative Trick.

Suppose we'd like to find $\nabla _{x_{1}}\log(f(x_{1},x_{2},x_{3},...))$ for some positive function $f$ . By the chain rule this is equal to ${\frac {\nabla _{x_{1}}f(x_{1},x_{2},x_{3},...)}{f(x_{1},x_{2},x_{3},...)}}$ . Thus, by rearranging, we can write the gradient of any such function with respect to some variable as $\nabla _{x_{1}}f(x_{1},x_{2},x_{3},...)=f(x_{1},x_{2},x_{3},...)\nabla _{x_{1}}\log(f(x_{1},x_{2},x_{3},...))$ .
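
The identity can be checked numerically. The sketch below compares both sides using central finite differences. The function `f` is an arbitrary positive example chosen for illustration, and `partial_x1` and `both_sides` are hypothetical helper names.

```python
import math


def f(x1, x2):
    """Arbitrary example function; it is positive at the point tested below."""
    return x1 ** 2 * x2 + 1.0


def log_f(x1, x2):
    return math.log(f(x1, x2))


def partial_x1(func, x1, x2, eps=1e-6):
    """Central finite-difference estimate of d func / d x1."""
    return (func(x1 + eps, x2) - func(x1 - eps, x2)) / (2 * eps)


def both_sides(x1, x2):
    """Return (grad f, f * grad log f), both taken with respect to x1."""
    lhs = partial_x1(f, x1, x2)
    rhs = f(x1, x2) * partial_x1(log_f, x1, x2)
    return lhs, rhs
```

At `(x1, x2) = (1.5, 2.0)` both values should agree with the analytic derivative `2 * x1 * x2` up to finite-difference error.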

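Using this identity with $f=P(\tau |\theta )$ , we can rewrite our gradient as $\sum _{\tau }R(\tau )P(\tau |\theta )\nabla _{\theta }\log P(\tau |\theta )$ . This is again an expectation over trajectories, $E_{\tau \sim \pi _{\theta }}[R(\tau )\nabla _{\theta }\log P(\tau |\theta )]$ , so it can be estimated by sampling trajectories from the policy.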
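
As a rough illustration of why this form is useful, the sketch below takes the simplest possible case, where a trajectory is a single action drawn from a softmax policy, so $P(\tau |\theta )$ is just the action probability. It evaluates the sum exactly, estimates it from samples, and compares both with a finite-difference gradient of $J(\theta )$ . The rewards and all function names are made up for this example.

```python
import math
import random

REWARDS = [1.0, 2.0, 5.0]  # R(tau) for each of three one-step trajectories (made up)


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(theta, action):
    """Gradient of log pi_theta(action) with respect to each logit."""
    probs = softmax(theta)
    return [(1.0 if i == action else 0.0) - probs[i] for i in range(len(theta))]


def exact_gradient(theta):
    """sum_tau R(tau) * P(tau | theta) * grad log P(tau | theta)."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, r in enumerate(REWARDS):
        g = grad_log_prob(theta, a)
        for i in range(len(theta)):
            grad[i] += r * probs[a] * g[i]
    return grad


def sampled_gradient(theta, n_samples, seed=0):
    """Monte Carlo estimate: average of R(tau) * grad log P(tau | theta)."""
    rng = random.Random(seed)
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        a = rng.choices(range(len(theta)), weights=probs)[0]
        g = grad_log_prob(theta, a)
        for i in range(len(theta)):
            grad[i] += REWARDS[a] * g[i] / n_samples
    return grad


def finite_difference_gradient(theta, eps=1e-6):
    """Direct numerical gradient of J(theta) = sum_a pi(a) R(a), for comparison."""
    def objective(t):
        return sum(p * r for p, r in zip(softmax(t), REWARDS))

    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up) - objective(down)) / (2 * eps))
    return grad
```

The exact sum and the finite-difference gradient should match closely, while the sampled estimate is noisy and only approaches them as `n_samples` grows.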

Loss Function

The goal of REINFORCE is to maximize the expected cumulative reward. We do so using gradient descent on the negative of this objective (equivalently, gradient ascent on the objective itself).
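
The notes do not spell out the loss itself, so the following is only a sketch of one common choice, stated here as an assumption: for a single sampled trajectory, take the negative of the total reward times the sum of the log-probabilities of the actions taken. The transition probabilities inside $P(\tau |\theta )$ do not depend on $\theta$ , so only the policy terms contribute to the gradient of this loss. The function name `reinforce_loss` is hypothetical.

```python
import math


def reinforce_loss(log_probs, rewards):
    """Loss for one trajectory: -R(tau) * sum_t log pi(a_t | s_t)."""
    total_return = sum(rewards)
    return -total_return * sum(log_probs)
```

For example, `reinforce_loss([math.log(0.5), math.log(0.25)], [1.0, 2.0])` multiplies the total reward of 3 by the negative summed log-probability of the two actions.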

Humanoid Robots Wiki ^β
