<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://54.204.126.50/index.php?action=history&amp;feed=atom&amp;title=Allen%27s_REINFORCE_notes</id>
	<title>Allen's REINFORCE notes - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://54.204.126.50/index.php?action=history&amp;feed=atom&amp;title=Allen%27s_REINFORCE_notes"/>
	<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;action=history"/>
	<updated>2026-04-06T06:40:23Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.31.0</generator>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1285&amp;oldid=prev</id>
		<title>Allen12 at 01:23, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1285&amp;oldid=prev"/>
		<updated>2024-05-26T01:23:18Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:23, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l43&quot; &gt;Line 43:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 43:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;. 
We &lt;/del&gt;get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. 
Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, we &lt;/ins&gt;get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, since using the chain rule to take its derivative gives us the formula for the gradient for our backwards pass (see Dennis' Optimization Notes).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. 
This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, since using the chain rule to take its derivative gives us the formula for the gradient for our backwards pass (see Dennis' Optimization Notes).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1284:rev-1285 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1284&amp;oldid=prev</id>
		<title>Allen12 at 01:22, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1284&amp;oldid=prev"/>
		<updated>2024-05-26T01:22:57Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:22, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. 
We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, since using the chain rule to take its derivative gives us the formula for the gradient&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. 
This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, since using the chain rule to take its derivative gives us the formula for the gradient &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;for our backwards pass (see Dennis' Optimization Notes)&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1283:rev-1284 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1283&amp;oldid=prev</id>
		<title>Allen12 at 01:19, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1283&amp;oldid=prev"/>
		<updated>2024-05-26T01:19:32Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:19, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. 
We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &amp;lt;math&amp;gt;-\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;\right&lt;/del&gt;&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, since using the chain rule to take its derivative gives us the formula for the gradient.&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1282:rev-1283 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1282&amp;oldid=prev</id>
		<title>Allen12 at 01:17, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1282&amp;oldid=prev"/>
		<updated>2024-05-26T01:17:53Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:17, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. 
We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;-\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1281:rev-1282 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1281&amp;oldid=prev</id>
		<title>Allen12 at 01:17, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1281&amp;oldid=prev"/>
		<updated>2024-05-26T01:17:05Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:17, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;This is why our loss is equal to -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1280:rev-1281 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1280&amp;oldid=prev</id>
		<title>Allen12 at 01:13, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1280&amp;oldid=prev"/>
		<updated>2024-05-26T01:13:05Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:13, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;nabla_theta &lt;/del&gt;J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;nabla_\theta &lt;/ins&gt;J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1279:rev-1280 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1279&amp;oldid=prev</id>
		<title>Allen12 at 00:53, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1279&amp;oldid=prev"/>
		<updated>2024-05-26T00:53:11Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 00:53, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l39&quot; &gt;Line 39:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 39:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Suppose we'd like to find &amp;lt;math&amp;gt;\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))&amp;lt;/math&amp;gt;. By the chain rule this is equal to &amp;lt;math&amp;gt;\frac{\nabla_{x_1}f(x_1, x_2, x_3 ...)}{f(x_1, x_2, x_3 ...)}&amp;lt;/math&amp;gt;. Thus, by rearranging, we can take the gradient of any function with respect to some variable as &amp;lt;math&amp;gt;\nabla_{x_1}f(x_1, x_2, x_3, ...)= f(x_1, x_2, x_3,...)\nabla_{x_1}\log(f(x_1, x_2, x_3, ...)&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Suppose we'd like to find &amp;lt;math&amp;gt;\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))&amp;lt;/math&amp;gt;. By the chain rule this is equal to &amp;lt;math&amp;gt;\frac{\nabla_{x_1}f(x_1, x_2, x_3 ...)}{f(x_1, x_2, x_3 ...)}&amp;lt;/math&amp;gt;. Thus, by rearranging, we can take the gradient of any function with respect to some variable as &amp;lt;math&amp;gt;\nabla_{x_1}f(x_1, x_2, x_3, ...)= f(x_1, x_2, x_3,...)\nabla_{x_1}\log(f(x_1, x_2, x_3, ...)&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Thus, using this idea, we can rewrite our gradient as &amp;lt;math&amp;gt; \sum_\tau R(\tau) P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) &amp;lt;/math&amp;gt;. &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Finally, using the definition of expectation again, we have &amp;lt;math&amp;gt; \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \right] &amp;lt;/math&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Thus, using this idea, we can rewrite our gradient as &amp;lt;math&amp;gt; \sum_\tau R(\tau) P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) &amp;lt;/math&amp;gt;. &amp;#160;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1278:rev-1279 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1278&amp;oldid=prev</id>
		<title>Allen12 at 00:52, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1278&amp;oldid=prev"/>
		<updated>2024-05-26T00:52:28Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 00:52, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have \nabla_theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right]. Using the formula for discounted reward, we have our final formula E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right]&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt; &lt;/ins&gt;\nabla_theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. Using the formula for discounted reward, we have our final formula &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt; &lt;/ins&gt;E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1277:rev-1278 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1277&amp;oldid=prev</id>
		<title>Allen12 at 00:52, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1277&amp;oldid=prev"/>
		<updated>2024-05-26T00:52:02Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 00:52, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l43&quot; &gt;Line 43:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 43:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;occuring &lt;/del&gt;given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;occurring &lt;/ins&gt;given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;( &lt;/ins&gt;\tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#160;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Rewriting this as an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for the discounted reward, we have our final formula &amp;lt;math&amp;gt; \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1276:rev-1277 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1276&amp;oldid=prev</id>
		<title>Allen12 at 00:46, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1276&amp;oldid=prev"/>
		<updated>2024-05-26T00:46:36Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 00:46, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l43&quot; &gt;Line 43:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 43:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occuring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression \sum_\tau P(\tau | \theta) R \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occuring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt; &lt;/ins&gt;\sum_\tau P(\tau | \theta) R \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1275:rev-1276 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
</feed>