Difference between revisions of "Allen's REINFORCE notes"

From Humanoid Robots Wiki
Jump to: navigation, search
(Created page with "Allen's REINFORCE notes === Links === * [http://www.incompleteideas.net/book/RLbook2020.pdf] Category:Reinforcement Learning === Motivation === Consider a problem wh...")
 
 
(37 intermediate revisions by the same user not shown)
Line 3: Line 3:
 
=== Links ===
 
=== Links ===
  
* [http://www.incompleteideas.net/book/RLbook2020.pdf]
+
* [http://www.incompleteideas.net/book/RLbook2020.pdf RLbook2020]
 +
* [https://samuelebolotta.medium.com/2-deep-reinforcement-learning-policy-gradients-5a416a99700a Deep RL: Policy Gradients]
  
 
[[Category:Reinforcement Learning]]
 
[[Category:Reinforcement Learning]]
  
=== Motivation ===  
+
=== Motivation ===
  
Consider a problem where we have to train a robot to pick up some object. A traditional ML algorithm might try to learn some function f(x) = y, where given some position x observed via the camera we output some behavior y. The trouble is that in the real world, the correct grab location is some function of the object and the physical environment, which is hard to intuitively ascertain by observation.  
+
Recall that the objective of Reinforcement Learning is to find an optimal policy <math> \pi^* </math> which we encode in a neural network with parameters <math>\theta^*</math>. <math> \pi_\theta </math> is a mapping from observations to actions. These optimal parameters are defined as
 +
<math>\theta^* = \text{argmax}_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] </math>. Let's unpack what this means. To phrase it in english, this is basically saying that the optimal policy is one such that the expected value of the total reward over following a trajectory (<math> \tau </math>) determined by the policy is the highest over all policies.  
  
The motivation behind reinforcement learning is to repeatedly take observations, then sample the effects of actions on those observations (reward and new observation/state). Ultimately, we hope to create a policy <math>pi</math> that maps states or observations to optimal actions.
+
=== Overview ===
  
=== Learning ===
+
‎<syntaxhighlight lang="bash" >
 +
Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
 +
For each episode:
 +
  While not terminated:
 +
    Get observation from environment
 +
    Use policy network to map observation to action distribution
 +
    Randomly sample one action from action distribution
 +
    Compute logarithmic probability of that action occurring
 +
    Step environment using action and store reward
 +
  Calculate loss over entire trajectory as function of probabilities and rewards
 +
  Recall loss functions are differentiable with respect to each parameter - thus, calculate how changes in parameters correlate with changes in the loss
 +
  Based on the loss, use a gradient descent policy to update weights
  
Learning involves the agent taking actions and the environment returning a new state and reward.
+
</syntaxhighlight>
* Input: <math>s_t</math>: States at each time step
 
* Output: <math>a_t</math>: Actions at each time step
 
* Data: <math>(s_1, a_1, r_1, ... , s_T, a_T, r_T)</math>
 
* Learn <math>\pi_\theta : s_t -> a_t </math> to maximize <math> \sum_t r_t </math>
 
  
=== State vs. Observation ===  
+
=== Objective Function ===
  
A state is a complete representation of the physical world while the observation is some subset or representation of s. They are not necessarily the same in that we can't always infer s_t from o_t, but o_t is inferable from s_t. To think of it as a network of conditional probability, we have
+
The goal of reinforcement learning is to maximize the expected reward over the entire episode. We use <math>R(\tau)</math> to denote the total reward over some trajectory <math>\tau</math> defined by our policy. Thus we want to maximize <math>E_{\tau \sim \pi_\theta}[R(\tau)]</math>. We can use the definition of expected value to expand this as <math>\sum_\tau P(\tau | \theta) R (\tau)</math>, where the probability of a given trajectory occurring can further be expressed as <math> P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) </math>.  
  
* <math> s_1 -> o_1 - (\pi_\theta) -> a_1 </math> (policy)
+
Now we want to find the gradient of <math> J (\theta) </math>, namely
* <math> s_1, a_1 - (p(s_{t+1} | s_t, a_t) -> s_2 </math> (dynamics)
+
<math>\nabla_\theta \sum_\tau P(\tau | \theta) R(\tau) </math>. Since the reward function isn't a dependent on the parameters. We can rearrange: <math> \sum_\tau R(\tau) \nabla_\theta  P(\tau | \theta) </math>. The next step here is what's called the Log Derivative Trick.
  
Note that theta represents the parameters of the policy (for example, the parameters of a neural network). Assumption: Markov Property - Future states are independent of past states given present states. This is the fundamental difference between states and observations.  
+
Suppose we'd like to find <math>\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))</math>. By the chain rule this is equal to <math>\frac{\nabla_{x_1}f(x_1, x_2, x_3 ...)}{f(x_1, x_2, x_3 ...)}</math>. Thus, by rearranging, we can take the gradient of any function with respect to some variable as <math>\nabla_{x_1}f(x_1, x_2, x_3, ...)= f(x_1, x_2, x_3,...)\nabla_{x_1}\log(f(x_1, x_2, x_3, ...)</math>.
  
=== Problem Representation ===
+
Thus, using this idea, we can rewrite our gradient as <math> \sum_\tau R(\tau) p(\tau | \theta) \nabla_\theta \log P(\tau | \theta) </math>.
  
States and actions are typically continuous - thus, we often want to model our output policy as a density function, which tells us the distribution of probabilities of actions at some given state.
+
=== Loss Computation ===
  
The reward is a function of the state and action r(s, a) -> int, which tells us what states and actions are better. We often use and tune hyperparameters for reward functions to make model training faster
+
It is tricky for us to give our policy the notion of "total" reward and "total" probability. Thus, we desire to change these values parameterized by <math> \tau </math> to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that <math> R(\tau) </math> is the total reward over all timesteps. Thus, we can rewrite the <math> R(\tau) </math> component at some timestep t as <math> \gamma^{T - t}r_t </math>, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is <math> P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) </math>. Since the probabilities of <math> P(s_0) </math> and <math> P(s_{t+a} | s_t, a,t) </math> are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities, we get our final gradient expression <math> \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) </math>.
  
=== Markov Chain & Decision Process===
+
Rewriting this into an expectation, we have <math> \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] </math>. Using the formula for discounted reward, we have our final formula <math> E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] </math>. This is why our loss is equal to <math> -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t </math>, since using the chain rule to take its derivative gives us the formula for the gradient for our backwards pass (see Dennis' Optimization Notes).
 
 
Markov Chain: <math> M = {S, T} </math>, where S - state space, T- transition operator. The state space is the set of all states, and can be discrete or continuous. The transition probabilities is represented in a matrix, where the i,j'th entry is the probability of going into state i at state j, and we can express the next time step by multiplying the current time step with the transition operator.
 
 
 
Markov Decision Process: <math> M = {S, A, T, r} </math>, where A - action space. T is now a tensor, containing the current state, current action, and next state. We let <math> T_{i, j, k} = p(s_t + 1 = i | s_t = j, a_t = k) </math>. r is the reward function.
 
 
 
=== Reinforcement Learning Algorithms - High-level ===
 
 
 
# Generate Samples (run policy)
 
# Fit a model/estimate something about how well policy is performing
 
# Improve policy
 
# Repeat
 
 
 
Policy Gradients - Directly differentiate objective with respect to the optimal theta and then perform gradient descent
 
 
 
Value-based: Estimate value function or q-function of optimal policy (policy is often represented implicitly)
 
 
 
Actor-Critic: Estimate value function or q-function of current policy, and find a better policy gradient
 
 
 
Model-based: Estimate some transition model, and then use it to improve a policy
 
 
 
=== REINFORCE ===
 
 
 
-
 
 
 
=== Temporal Difference Learning ===
 
 
 
Temporal Difference (TD) is a model for estimating the utility of states given some state-action-outcome information. Suppose we have some initial value <math>V_0(s) </math>, and we get some information <math> (s, a, s', r(s, a) </math>. We can then use the update equation <math>V_{t+1}(s) = (1- \alpha)V_{t}(s)+\alpha(R(s, a, s') + \gamma V_i(s')) </math>. Here <math>\alpha</math> represents the learning rate, which is how much new information is weighted relative to old information, while <math>\gamma</math> represents the discount factor, which can be thought of how much getting a reward in the future factors into our current reward.
 
 
 
=== Q Learning ===
 
 
 
Q Learning gives us a way to extract the optimal policy after learning. Instead of keeping track of the values of individual states, we keep track of Q values for state-action pairs, representing the utility of taking action a at state s.
 
 
 
How do we use this Q value? Two main ideas.
 
 
 
Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^pi (s, a) </math>, we can improve the policy, by deterministically setting the action at each state be the argmax of all possible actions at the state.
 
 
 
<math> Q_{i+1} (s,a) = (1 - \alpha) Q_i (s,a) + \alpha (r(s,a) + \gamma V_i(s'))</math>
 
 
 
Idea 2: Gradient update - If <math> Q^pi(s, a) > V^pi(s) </math>, then a is better than average. We will then modify the policy to increase the probability of a.
 

Latest revision as of 01:23, 26 May 2024

Allen's REINFORCE notes

Links[edit]

Motivation[edit]

Recall that the objective of Reinforcement Learning is to find an optimal policy which we encode in a neural network with parameters . is a mapping from observations to actions. These optimal parameters are defined as . Let's unpack what this means. To phrase it in english, this is basically saying that the optimal policy is one such that the expected value of the total reward over following a trajectory () determined by the policy is the highest over all policies.

Overview[edit]

Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
For each episode:
  While not terminated:
    Get observation from environment
    Use policy network to map observation to action distribution
    Randomly sample one action from action distribution
    Compute logarithmic probability of that action occurring
    Step environment using action and store reward
  Calculate loss over entire trajectory as function of probabilities and rewards
  Recall loss functions are differentiable with respect to each parameter - thus, calculate how changes in parameters correlate with changes in the loss
  Based on the loss, use a gradient descent policy to update weights

Objective Function[edit]

The goal of reinforcement learning is to maximize the expected reward over the entire episode. We use to denote the total reward over some trajectory defined by our policy. Thus we want to maximize . We can use the definition of expected value to expand this as , where the probability of a given trajectory occurring can further be expressed as .

Now we want to find the gradient of , namely . Since the reward function isn't a dependent on the parameters. We can rearrange: . The next step here is what's called the Log Derivative Trick.

Suppose we'd like to find . By the chain rule this is equal to . Thus, by rearranging, we can take the gradient of any function with respect to some variable as .

Thus, using this idea, we can rewrite our gradient as .

Loss Computation[edit]

It is tricky for us to give our policy the notion of "total" reward and "total" probability. Thus, we desire to change these values parameterized by to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that is the total reward over all timesteps. Thus, we can rewrite the component at some timestep t as , where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is . Since the probabilities of and are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities, we get our final gradient expression .

Rewriting this into an expectation, we have . Using the formula for discounted reward, we have our final formula . This is why our loss is equal to , since using the chain rule to take its derivative gives us the formula for the gradient for our backwards pass (see Dennis' Optimization Notes).