Open main menu

Humanoid Robots Wiki β

Changes

no edit summary
* Output: <math>a_t</math>: Actions at each time step
* Data: <math>(s_1, a_1, r_1, ... , s_T, a_T, r_T)</math>
* Learn <math>\pi_\theta : s_t -> a_t <\/math> to maximize <math> \sum_t r_t <\/math>
=== State vs. Observation ===
A state is a complete representation of the physical world while the observation is some subset or representation of s. They are not necessarily the same in that we can't always infer s_t from o_t, but o_t is inferable from s_t. To think of it as a bayes netnetwork of conditional probability, we have
* <math> s_1 -> o_1 - (pi_theta) -> a_1 </math> (policy)* <math> s_1, a_1 - (p(s_{t+1} | s_t, a_t) -> s_2 </math> (dynamics)
Note that theta represents the parameters of the policy (for example, the parameters of a neural network). Assumption: Markov Property - Future states are independent of past states given present states. This is the fundamental difference between states and observations.
States and actions are typically continuous - thus, we often want to model our output policy as a density function, which tells us the distribution of probabilities of actions at some given state.
The reward is a function of the state and action r(s, a) -> int, which tells us what states and actions are better. When choosing We often use and tune hyperparameters we need to be careful for reward functions to make sure that we go for completing long term goals instead of always looking for immediate reward. model training faster
=== Markov Chain & Decision Process===
Markov Chain: <math> M = {S, T} <\/math>, where S - state space, T- transition operator. The state space is the set of all states, and can be discrete or continuous. The transition probabilities is represented in a matrix, where the i,j'th entry is the probability of going into state i at state j, and we can express the next time step by multiplying the current time step with the transition operator.
Markov Decision Process: <math> M = {S, A, T, r} <\/math>, where A - action space. T is now a tensor, containing the current state, current action, and next state. We let T_{i, j, k} = p(s_t + 1 = i | s_t = j, a_t = k). r is the reward function.
=== Reinforcement Learning Algorithms - High-level ===
# Improve policy
# Repeat
 
Policy Gradients - Directly differentiate objective with respect to the optimal theta and then perform gradient descent
 
Value-based: Estimate value function or q-function of optimal policy (policy is often represented implicitly)
 
Actor-Critic: Estimate value function or q-function of current policy, and find a better policy gradient
 
Model-based: Estimate some transition model, and then use it to improve a policy
 
=== REINFORCE ===
 
-
=== Temporal Difference Learning ===
Temporal Difference (TD) is a model for estimating the utility of states given some state-action-outcome information. Suppose we have some initial value <math>V_0(s) </math>, and we get some information <math> (s, a, s', r(s, a) </math>. We can then use the update equation <math>V_{t+1}(s) = (1- \alpha)V_{t}(s)+\alpha(R(s, a, s') + \gamma V_i(s')) </math>. Here <math>\alpha </math> represents the learning rate, which is how much new information is weighted relative to old information, while <math>\gamma </math> represents the discount factor, which can be thought of how much getting a reward in the future factors into our current reward.
=== Q Learning ===
Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^pi (s, a) </math>, we can improve the policy, by deterministically setting the action at each state be the argmax of all possible actions at the state.
 
<math> Q_i+1(s,a)=(1−\alpha)Q_i(s,a)+\alpha(r(s, a)+\gammaV_i(s')) </math>
Idea 2: Gradient update - If <math> Q^pi(s, a) > V^pi(s) </math>, then a is better than average. We will then modify the policy to increase the probability of a.
Anonymous user