Allen's Reinforcement Learning Notes - Revision history

Allen12 at 23:55, 24 May 2024

2024-05-24T23:55:35Z

Allen12 at 20:09, 24 May 2024

2024-05-24T20:09:45Z

108.211.178.220 at 20:07, 24 May 2024

2024-05-24T20:07:08Z

Ben: /* Q Learning */

2024-05-21T18:18:08Z

‎Q Learning

Ben: /* Markov Chain & Decision Process */

2024-05-21T18:15:02Z

‎Markov Chain & Decision Process

108.211.178.220 at 15:47, 21 May 2024

2024-05-21T15:47:13Z

Allen12 at 05:56, 21 May 2024

2024-05-21T05:56:23Z

Allen12: /* Markov Chain & Decision Process */

2024-05-21T05:44:08Z

‎Markov Chain & Decision Process

Allen12 at 05:43, 21 May 2024

2024-05-21T05:43:47Z

Ben at 03:11, 21 May 2024

2024-05-21T03:11:49Z

@@ Line 24: / Line 24: @@
 === State vs. Observation ===
-A state is a complete representation of the physical world while the observation is some subset or representation of s. They are not necessarily the same in that we can't always infer s_t from o_t, but o_t is inferable from s_t. To think of it as a network of conditional probability, we have
+A state is a complete representation of the physical world while the observation is some subset or representation of s. They are not necessarily the same in that we can't always know what s_t is from o_t, but o_t is inferable from s_t. To think of it as a network of conditional probability, we have
 * <math> s_1 -> o_1 - (\pi_\theta) -> a_1 </math> (policy)
 * <math> s_1, a_1 - (p(s_{t+1} | s_t, a_t)) -> s_2 </math> (dynamics)
 Note that <math>\theta</math> represents the parameters of the policy (for example, the parameters of a neural network). Assumption: Markov property - future states are independent of past states given the present state. This is the fundamental difference between states and observations: states satisfy the Markov property, while observations in general do not.

@@ Line 12: / Line 12: @@
 Consider a problem where we have to train a robot to pick up some object. A traditional ML algorithm might try to learn some function f(x) = y, where given some position x observed via the camera we output some behavior y. The trouble is that in the real world, the correct grab location is some function of the object and the physical environment, which is hard to intuitively ascertain by observation.
-The motivation behind reinforcement learning is to repeatedly take observations, then sample the effects of actions on those observations (reward and new observation/state). Ultimately, we hope to create a policy pi that maps states or observations to actions.
+The motivation behind reinforcement learning is to repeatedly take observations, then sample the effects of actions on those observations (reward and new observation/state). Ultimately, we hope to create a policy <math>\pi</math> that maps states or observations to optimal actions.
 === Learning ===

@@ Line 26: / Line 26: @@
 A state is a complete representation of the physical world while the observation is some subset or representation of s. They are not necessarily the same in that we can't always infer s_t from o_t, but o_t is inferable from s_t. To think of it as a network of conditional probability, we have
-* <math> s_1 -> o_1 - (pi_theta) -> a_1 </math> (policy)
+* <math> s_1 -> o_1 - (\pi_\theta) -> a_1 </math> (policy)
 * <math> s_1, a_1 - (p(s_{t+1} | s_t, a_t)) -> s_2 </math> (dynamics)

@@ Line 74: / Line 74: @@
 Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^\pi (s, a) </math>, we can improve the policy by deterministically setting the action at each state to the argmax of <math> Q^\pi (s, a) </math> over all possible actions at that state.
-<math> Q_i+1(s,a)=(1−\alpha)Q_i(s,a)+\alpha(r(s, a)+\gammaV_i(s')) </math>
+<math> Q_{i+1} (s,a) = (1 - \alpha) Q_i (s,a) + \alpha (r(s,a) + \gamma V_i(s'))</math>
 Idea 2: Gradient update - if <math> Q^\pi(s, a) > V^\pi(s) </math>, then a is better than average, so we modify the policy to increase the probability of a.
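The tabular update for Q in this hunk, together with the argmax improvement step from Idea 1, can be sketched roughly as follows. This is a minimal illustration, not the notes' own implementation: the action names, the state labels, and the default values of alpha and gamma are hypothetical, and the sketch assumes <math>V_i(s') = \max_{a'} Q_i(s', a')</math>, which the notes do not state explicitly.

```python
from collections import defaultdict

# Hypothetical discrete action set, for illustration only.
ACTIONS = ["left", "right"]

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Apply Q_{i+1}(s,a) = (1 - alpha) Q_i(s,a) + alpha (r + gamma V_i(s')).

    Assumption: V_i(s') is taken to be max over a' of Q_i(s', a').
    """
    v_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * v_next)
    return Q[(s, a)]

def greedy_action(Q, s):
    """Idea 1: pick the action with the largest Q value at state s."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])

# Unseen state-action pairs default to a Q value of 0.0.
Q = defaultdict(float)
q_update(Q, "s0", "right", 1.0, "s1")
best = greedy_action(Q, "s0")
```

After the single update above, the only non-zero entry is the one for taking "right" in "s0", so the greedy policy selects that action there.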

@@ Line 41: / Line 41: @@
 Markov Chain: <math> M = \{S, T\} </math>, where S is the state space and T is the transition operator. The state space is the set of all states, and can be discrete or continuous. The transition probabilities are represented in a matrix whose (i, j)-th entry is the probability of moving to state i from state j, and we obtain the state distribution at the next time step by multiplying the current state distribution by the transition operator.
-Markov Decision Process: <math> M = {S, A, T, r} </math>, where A - action space. T is now a tensor, containing the current state, current action, and next state. We let T_{i, j, k} = p(s_t + 1 = i | s_t = j, a_t = k). r is the reward function.
+Markov Decision Process: <math> M = \{S, A, T, r\} </math>, where A is the action space. T is now a tensor indexed by the next state, current state, and current action. We let <math> T_{i, j, k} = p(s_{t+1} = i | s_t = j, a_t = k) </math>. r is the reward function.
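As a rough illustration of the two definitions above, the sketch below advances a state distribution by one step, first with a transition matrix and then with a transition tensor and a policy. All numbers, the two-state setup, and the helper names (`step`, `step_mdp`, `policy`) are hypothetical and chosen only to show the indexing convention T[i][j] = p(next = i | current = j).

```python
# Hypothetical 2-state Markov chain. With T[i][j] = p(next = i | current = j),
# every column of T sums to 1.
T = [[0.9, 0.5],
     [0.1, 0.5]]

def step(T, mu):
    """Advance a state distribution by one time step: mu'[i] = sum_j T[i][j] mu[j]."""
    n = len(T)
    return [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]

# Hypothetical MDP tensor: T_mdp[i][j][k] = p(s_{t+1} = i | s_t = j, a_t = k).
T_mdp = [[[0.9, 0.2], [0.5, 0.0]],
         [[0.1, 0.8], [0.5, 1.0]]]

def step_mdp(T_mdp, mu, policy):
    """Advance mu one step when policy[j][k] is the probability of action k in state j."""
    n_states = len(T_mdp)
    n_actions = len(T_mdp[0][0])
    return [
        sum(T_mdp[i][j][k] * policy[j][k] * mu[j]
            for j in range(n_states) for k in range(n_actions))
        for i in range(n_states)
    ]

mu0 = [1.0, 0.0]           # start in state 0 with certainty
mu1 = step(T, mu0)         # distribution after one step of the chain
mu2 = step(T, mu1)         # distribution after two steps

always_first = [[1.0, 0.0], [1.0, 0.0]]   # policy that always takes action 0
mu1_mdp = step_mdp(T_mdp, mu0, always_first)
```

Fixing the policy collapses the MDP to an ordinary Markov chain: with the always-action-0 policy above, the tensor slice for that action is exactly the matrix T, so both one-step results agree.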
 === Reinforcement Learning Algorithms - High-level ===

@@ Line 20: / Line 20: @@
 * Output: <math>a_t</math>: Actions at each time step
 * Data: <math>(s_1, a_1, r_1, ... , s_T, a_T, r_T)</math>
-* Learn <math>\pi_\theta : s_t -> a_t <\math> to maximize <math> \sum_t r_t <\math>
+* Learn <math>\pi_\theta : s_t -> a_t </math> to maximize <math> \sum_t r_t </math>
 === State vs. Observation ===
-A state is a complete representation of the physical world while the observation is some subset or representation of s. They are not necessarily the same in that we can't always infer s_t from o_t, but o_t is inferable from s_t. To think of it as a bayes net, we have
+A state is a complete representation of the physical world while the observation is some subset or representation of s. They are not necessarily the same in that we can't always infer s_t from o_t, but o_t is inferable from s_t. To think of it as a network of conditional probability, we have
-* s_1 -> o_1 - (pi_theta) -> a_1 (policy)
+* <math> s_1 -> o_1 - (pi_theta) -> a_1 </math> (policy)
-* s_1, a_1 - (p(s_{t+1} | s_t, a_t) -> s_2 (dynamics)
+* <math> s_1, a_1 - (p(s_{t+1} | s_t, a_t) -> s_2 </math> (dynamics)
 Note that theta represents the parameters of the policy (for example, the parameters of a neural network). Assumption: Markov Property - Future states are independent of past states given present states. This is the fundamental difference between states and observations.
@@ Line 35: / Line 35: @@
 States and actions are typically continuous - thus, we often want to model our output policy as a density function, which tells us the distribution of probabilities of actions at some given state.
-The reward is a function of the state and action r(s, a) -> int, which tells us what states and actions are better. When choosing hyperparameters we need to be careful to make sure that we go for completing long term goals instead of always looking for immediate reward.
+The reward is a function of the state and action, r(s, a), which returns a scalar that tells us which states and actions are better. We often introduce and tune hyperparameters in the reward function to make model training faster.
 === Markov Chain & Decision Process===
-Markov Chain: <math> M = {S, T} <\math>, where S - state space, T- transition operator. The state space is the set of all states, and can be discrete or continuous. The transition probabilities is represented in a matrix, where the i,j'th entry is the probability of going into state i at state j, and we can express the next time step by multiplying the current time step with the transition operator.
+Markov Chain: <math> M = {S, T} </math>, where S - state space, T- transition operator. The state space is the set of all states, and can be discrete or continuous. The transition probabilities is represented in a matrix, where the i,j'th entry is the probability of going into state i at state j, and we can express the next time step by multiplying the current time step with the transition operator.
-Markov Decision Process: <math> M = {S, A, T, r} <\math>, where A - action space. T is now a tensor, containing the current state, current action, and next state. We let T_{i, j, k} = p(s_t + 1 = i | s_t = j, a_t = k). r is the reward function.
+Markov Decision Process: <math> M = {S, A, T, r} </math>, where A - action space. T is now a tensor, containing the current state, current action, and next state. We let T_{i, j, k} = p(s_t + 1 = i | s_t = j, a_t = k). r is the reward function.
 === Reinforcement Learning Algorithms - High-level ===
@@ Line 49: / Line 49: @@
 # Improve policy
 # Repeat
 === Temporal Difference Learning ===
-Temporal Difference (TD) is a model for estimating the utility of states given some state-action-outcome information. Suppose we have some initial value <math>V_0(s) </math>, and we get some information <math> (s, a, s', r(s, a) </math>. We can then use the update equation <math>V_{t+1}(s) = (1- \alpha)V_{t}(s)+\alpha(R(s, a, s') + \gamma V_i(s')) </math>. Here \alpha represents the learning rate, which is how much new information is weighted relative to old information, while \gamma represents the discount factor, which can be thought of how much getting a reward in the future factors into our current reward.
+Temporal Difference (TD) learning is a method for estimating the utility of states given some state-action-outcome information. Suppose we have some initial value <math>V_0(s) </math>, and we observe a transition <math> (s, a, s', r(s, a)) </math>. We can then use the update equation <math>V_{t+1}(s) = (1- \alpha)V_{t}(s)+\alpha(r(s, a) + \gamma V_{t}(s')) </math>. Here <math>\alpha</math> represents the learning rate, which is how much new information is weighted relative to old information, while <math>\gamma</math> represents the discount factor, which can be thought of as how much a future reward counts toward the current value.
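The TD update above can be sketched in a few lines. This is a minimal illustration that reads the update with the same value estimate on both sides of the bootstrap term; the state labels, the rewards, and the default values of alpha and gamma are hypothetical.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply V(s) <- (1 - alpha) V(s) + alpha (r + gamma V(s')) in place."""
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
    return V[s]

# Hypothetical two-state problem with all values initialised to zero (V_0).
V = {"A": 0.0, "B": 0.0}

# Observed transition B -> A with reward 1: V[B] moves a fraction alpha toward the target.
td_update(V, "B", 1.0, "A")

# Observed transition A -> B with reward 0: V[A] picks up discounted value from V[B].
td_update(V, "A", 0.0, "B")
```

The second update shows the bootstrapping effect: state A receives no reward itself, yet its value becomes positive because it leads to B, whose estimate was raised by the first update.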
 === Q Learning ===
@@ Line 61: / Line 73: @@
 Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^\pi (s, a) </math>, we can improve the policy by deterministically setting the action at each state to the argmax of <math> Q^\pi (s, a) </math> over all possible actions at that state.
 Idea 2: Gradient update - if <math> Q^\pi(s, a) > V^\pi(s) </math>, then a is better than average, so we modify the policy to increase the probability of a.

← Older revision		Revision as of 05:56, 21 May 2024
Line 56:		Line 56:
	=== Q Learning ===		=== Q Learning ===

−	Q Learning gives us a way to extract the optimal policy after learning. --	+	Q Learning gives us a way to extract the optimal policy after learning. Instead of keeping track of the values of individual states, we keep track of Q values for state-action pairs, representing the utility of taking action a at state s.
		+
		+	How do we use this Q value? Two main ideas.
		+
		+	Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^pi (s, a) </math>, we can improve the policy, by deterministically setting the action at each state be the argmax of all possible actions at the state.
		+
		+	Idea 2: Gradient update - If <math> Q^pi(s, a) > V^pi(s) </math>, then a is better than average. We will then modify the policy to increase the probability of a.

← Older revision		Revision as of 05:43, 21 May 2024
Line 7:		Line 7:

	[[Category:Reinforcement Learning]]		[[Category:Reinforcement Learning]]
		+
		+	=== Motivation ===
		+
		+	Consider a problem where we have to train a robot to pick up some object. A traditional ML algorithm might try to learn some function f(x) = y, where given some position x observed via the camera we output some behavior y. The trouble is that in the real world, the correct grab location is some function of the object and the physical environment, which is hard to intuitively ascertain by observation.
		+
		+	The motivation behind reinforcement learning is to repeatedly take observations, then sample the effects of actions on those observations (reward and new observation/state). Ultimately, we hope to create a policy pi that maps states or observations to actions.
		+
		+	=== Learning ===
		+
		+	Learning involves the agent taking actions and the environment returning a new state and reward.
		+	* Input: <math>s_t</math>: States at each time step
		+	* Output: <math>a_t</math>: Actions at each time step
		+	* Data: <math>(s_1, a_1, r_1, ... , s_T, a_T, r_T)</math>
		+	* Learn <math>\pi_\theta : s_t -> a_t <\math> to maximize <math> \sum_t r_t <\math>
		+
		+	=== State vs. Observation ===
		+
		+	A state is a complete representation of the physical world while the observation is some subset or representation of s. They are not necessarily the same in that we can't always infer s_t from o_t, but o_t is inferable from s_t. To think of it as a bayes net, we have
		+
		+	* s_1 -> o_1 - (pi_theta) -> a_1 (policy)
		+	* s_1, a_1 - (p(s_{t+1} \| s_t, a_t) -> s_2 (dynamics)
		+
		+	Note that theta represents the parameters of the policy (for example, the parameters of a neural network). Assumption: Markov Property - Future states are independent of past states given present states. This is the fundamental difference between states and observations.
		+
		+	=== Problem Representation ===
		+
		+	States and actions are typically continuous - thus, we often want to model our output policy as a density function, which tells us the distribution of probabilities of actions at some given state.
		+
		+	The reward is a function of the state and action r(s, a) -> int, which tells us what states and actions are better. When choosing hyperparameters we need to be careful to make sure that we go for completing long term goals instead of always looking for immediate reward.
		+
		+	== Markov Chain & Decision Process==
		+
		+	Markov Chain: <math> M = {S, T} <\math>, where S - state space, T- transition operator. The state space is the set of all states, and can be discrete or continuous. The transition probabilities is represented in a matrix, where the i,j'th entry is the probability of going into state i at state j, and we can express the next time step by multiplying the current time step with the transition operator.
		+
		+	Markov Decision Process: <math> M = {S, A, T, r} <\math>, where A - action space. T is now a tensor, containing the current state, current action, and next state. We let T_{i, j, k} = p(s_t + 1 = i \| s_t = j, a_t = k). r is the reward function.
		+
		+	=== Reinforcement Learning Algorithms - High-level ===
		+
		+	# Generate Samples (run policy)
		+	# Fit a model/estimate something about how well policy is performing
		+	# Improve policy
		+	# Repeat
		+
		+	=== Temporal Difference Learning ===
		+
		+	Temporal Difference (TD) is a model for estimating the utility of states given some state-action-outcome information. Suppose we have some initial value <math>V_0(s) </math>, and we get some information <math> (s, a, s', r(s, a) </math>. We can then use the update equation <math>V_{t+1}(s) = (1- \alpha)V_{t}(s)+\alpha(R(s, a, s') + \gamma V_i(s')) </math>. Here \alpha represents the learning rate, which is how much new information is weighted relative to old information, while \gamma represents the discount factor, which can be thought of how much getting a reward in the future factors into our current reward.
		+
		+	=== Q Learning ===
		+
		+	Q Learning gives us a way to extract the optimal policy after learning. --

@@ Line 3: / Line 3: @@
 === Links ===
 * [https://www.youtube.com/watch?v=SupFHGbytvA&list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps Sergey Levine RL Lecture]
 [[Category:Reinforcement Learning]]