=== Motivation ===
=== Learning ===
* Learn <math>\pi_\theta : s_t \to a_t</math> to maximize <math>\sum_t r_t</math>
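As a concrete (hypothetical) illustration of this objective, the sketch below samples actions from a linear-softmax policy <math>\pi_\theta</math> and accumulates <math>\sum_t r_t</math> over one rollout. The <code>env.reset()</code>/<code>env.step(a)</code> interface and the particular policy form are assumptions for illustration, not part of the lecture.

<syntaxhighlight lang="python">
import numpy as np

def softmax_policy(theta, s, rng):
    """Sample a_t ~ pi_theta(a | s_t) for a discrete-action, linear-softmax policy."""
    logits = theta @ s                     # one logit per action
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def rollout_return(env, theta, horizon=200, seed=0):
    """Run pi_theta for one episode and return the total reward sum_t r_t."""
    rng = np.random.default_rng(seed)
    s = env.reset()                        # assumed interface: returns a state vector
    total = 0.0
    for _ in range(horizon):
        a = softmax_policy(theta, s, rng)
        s, r, done = env.step(a)           # assumed: (next state, reward, done)
        total += r
        if done:
            break
    return total
</syntaxhighlight>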
=== State vs. Observation ===
A state is a complete representation of the physical world, while an observation is some subset or representation of the state. They are not necessarily the same, in that we cannot always infer s_t from o_t, but o_t is inferable from s_t. Viewed as a network of conditional probabilities, we have:
* <math> s_1 \to o_1 \xrightarrow{\pi_\theta} a_1 </math> (policy)
* <math> s_1, a_1 \xrightarrow{p(s_{t+1} \mid s_t, a_t)} s_2 </math> (dynamics)
Note that <math>\theta</math> represents the parameters of the policy (for example, the parameters of a neural network).

Assumption (Markov property): future states are independent of past states given the present state. This is the fundamental difference between states and observations.

=== Problem Representation ===
States and actions are typically continuous; thus, we often model the policy as a density function, which gives the distribution over actions at a given state. The reward is a real-valued function of the state and action, <math> r(s, a) \to \mathbb{R} </math>, which tells us which states and actions are better. We often tune reward functions and their hyperparameters to make training faster.

=== Markov Chain & Decision Process ===
Markov chain: <math> M = \{S, T\} </math>, where <math>S</math> is the state space and <math>T</math> is the transition operator. The state space is the set of all states and can be discrete or continuous. The transition probabilities are represented in a matrix whose <math>(i, j)</math> entry is the probability of moving to state <math>i</math> from state <math>j</math>, so we can express the state distribution at the next time step by multiplying the current distribution by the transition operator.

Markov decision process: <math> M = \{S, A, T, r\} </math>, where <math>A</math> is the action space. <math>T</math> is now a tensor indexed by the next state, current state, and current action, with <math> T_{i, j, k} = p(s_{t+1} = i \mid s_t = j, a_t = k) </math>. <math>r</math> is the reward function.

=== Reinforcement Learning Algorithms - High-level ===
# Generate samples (run the policy)
# Fit a model / estimate something about how well the policy is performing
# Improve the policy
# Repeat

* Policy gradient: directly differentiate the objective with respect to the policy parameters <math>\theta</math> and perform gradient ascent.
* Value-based: estimate the value function or Q-function of the optimal policy (the policy is often represented implicitly).
* Actor-critic: estimate the value function or Q-function of the current policy, and use it to compute a better policy gradient.
* Model-based: estimate a transition model, then use it to improve the policy.

=== REINFORCE ===

=== Temporal Difference Learning ===
Temporal difference (TD) learning is a method for estimating the utility of states from state-action-outcome samples. Suppose we have some initial value estimate <math>V_0(s)</math> and we observe a transition <math>(s, a, s', r(s, a, s'))</math>. We then apply the update <math>V_{t+1}(s) = (1 - \alpha)V_t(s) + \alpha\,(r(s, a, s') + \gamma V_t(s'))</math>. Here <math>\alpha</math> is the learning rate, which controls how much new information is weighted relative to old information, while <math>\gamma</math> is the discount factor, which can be thought of as how much a reward received in the future counts toward our current value estimate.
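The TD update can be illustrated with a short tabular sketch. This is not from the lecture: the three-state chain, its transitions, and the hyperparameter values below are made up purely for illustration.

<syntaxhighlight lang="python">
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: V(s) <- (1 - alpha) V(s) + alpha (r + gamma V(s'))."""
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])

# Toy example: a 3-state chain where only reaching the last state pays off.
V = np.zeros(3)
samples = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 2)]  # (s, r, s') transitions
for _ in range(1000):
    for s, r, s_next in samples:
        td0_update(V, s, r, s_next)
print(V)  # estimated values grow toward the rewarding end of the chain
</syntaxhighlight>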
=== Q Learning ===
Q-learning gives us a way to extract the optimal policy after learning. Instead of keeping track of the values of individual states, we keep track of Q-values for state-action pairs, representing the utility of taking action <math>a</math> in state <math>s</math>. How do we use this Q-value? There are two main ideas.

Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^\pi(s, a) </math>, we can improve the policy by deterministically setting the action at each state to be the argmax over all possible actions at that state. The Q-values are updated with <math> Q_{i+1}(s, a) = (1 - \alpha) Q_i(s, a) + \alpha\,(r(s, a) + \gamma V_i(s')) </math>.

Idea 2: Gradient update - if <math> Q^\pi(s, a) > V^\pi(s) </math>, then <math>a</math> is better than average. We then modify the policy to increase the probability of <math>a</math>.
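To make the tabular Q-value update concrete, here is a minimal Q-learning sketch in Python. It is a sketch under assumptions, not lecture code: the <code>env.reset()</code>/<code>env.step(a)</code> interface, the epsilon-greedy exploration rule, and all hyperparameters are assumed. In Q-learning the bootstrap term <math>V_i(s')</math> is taken to be <math>\max_{a'} Q_i(s', a')</math>.

<syntaxhighlight lang="python">
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with the update
    Q(s, a) <- (1 - alpha) Q(s, a) + alpha (r + gamma max_a' Q(s', a'))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                    # assumed interface: returns an integer state
        done = False
        while not done:
            # Epsilon-greedy exploration over the current Q estimates.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)  # assumed: (next state, reward, done)
            bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + bootstrap)
            s = s_next
    # Idea 1 in action: the improved policy is greedy with respect to Q.
    greedy_policy = np.argmax(Q, axis=1)
    return Q, greedy_policy
</syntaxhighlight>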