<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://54.204.126.50/index.php?action=history&amp;feed=atom&amp;title=Allen%27s_REINFORCE_notes</id>
	<title>Allen's REINFORCE notes - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://54.204.126.50/index.php?action=history&amp;feed=atom&amp;title=Allen%27s_REINFORCE_notes"/>
	<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;action=history"/>
	<updated>2026-04-06T06:40:23Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.31.0</generator>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1285&amp;oldid=prev</id>
		<title>Allen12 at 01:23, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1285&amp;oldid=prev"/>
		<updated>2024-05-26T01:23:18Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:23, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l43&quot; &gt;Line 43:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 43:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;. 
We &lt;/del&gt;get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. 
Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, we &lt;/ins&gt;get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, since using the chain rule to take its derivative gives us the formula for the gradient for our backwards pass (see Dennis' Optimization Notes).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. 
This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, since using the chain rule to take its derivative gives us the formula for the gradient for our backwards pass (see Dennis' Optimization Notes).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1284:rev-1285 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1284&amp;oldid=prev</id>
		<title>Allen12 at 01:22, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1284&amp;oldid=prev"/>
		<updated>2024-05-26T01:22:57Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:22, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. 
We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, since using the chain rule to take its derivative gives us the formula for the gradient&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. 
This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, since using the chain rule to take its derivative gives us the formula for the gradient &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;for our backwards pass (see Dennis' Optimization Notes)&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1283:rev-1284 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1283&amp;oldid=prev</id>
		<title>Allen12 at 01:19, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1283&amp;oldid=prev"/>
		<updated>2024-05-26T01:19:32Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:19, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. 
We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &amp;lt;math&amp;gt;-\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;\right&lt;/del&gt;&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &amp;lt;math&amp;gt; -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t &amp;lt;/math&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, since using the chain rule to take its derivative gives us the formula for the gradient.&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1282:rev-1283 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1282&amp;oldid=prev</id>
		<title>Allen12 at 01:17, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1282&amp;oldid=prev"/>
		<updated>2024-05-26T01:17:53Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:17, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. 
We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities of &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that multiplication of probabilities in log space is equal to the sum of the logarithm of each of the probabilities. We get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. This is why our loss is equal to &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;-\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1281:rev-1282 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1281&amp;oldid=prev</id>
		<title>Allen12 at 01:17, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1281&amp;oldid=prev"/>
		<updated>2024-05-26T01:17:05Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:17, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;. &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;This is why our loss is equal to -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1280:rev-1281 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1280&amp;oldid=prev</id>
		<title>Allen12 at 01:13, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1280&amp;oldid=prev"/>
		<updated>2024-05-26T01:13:05Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 01:13, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;nabla_theta &lt;/del&gt;J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &amp;lt;math&amp;gt; \&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;nabla_\theta &lt;/ins&gt;J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for discounted reward, we have our final formula &amp;lt;math&amp;gt; E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1279:rev-1280 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1279&amp;oldid=prev</id>
		<title>Allen12 at 00:53, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1279&amp;oldid=prev"/>
		<updated>2024-05-26T00:53:11Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 00:53, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l39&quot; &gt;Line 39:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 39:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Suppose we'd like to find &amp;lt;math&amp;gt;\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))&amp;lt;/math&amp;gt;. By the chain rule this is equal to &amp;lt;math&amp;gt;\frac{\nabla_{x_1}f(x_1, x_2, x_3 ...)}{f(x_1, x_2, x_3 ...)}&amp;lt;/math&amp;gt;. Thus, by rearranging, we can take the gradient of any function with respect to some variable as &amp;lt;math&amp;gt;\nabla_{x_1}f(x_1, x_2, x_3, ...)= f(x_1, x_2, x_3,...)\nabla_{x_1}\log(f(x_1, x_2, x_3, ...)&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Suppose we'd like to find &amp;lt;math&amp;gt;\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))&amp;lt;/math&amp;gt;. By the chain rule this is equal to &amp;lt;math&amp;gt;\frac{\nabla_{x_1}f(x_1, x_2, x_3 ...)}{f(x_1, x_2, x_3 ...)}&amp;lt;/math&amp;gt;. Thus, by rearranging, we can take the gradient of any function with respect to some variable as &amp;lt;math&amp;gt;\nabla_{x_1}f(x_1, x_2, x_3, ...)= f(x_1, x_2, x_3,...)\nabla_{x_1}\log(f(x_1, x_2, x_3, ...)&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Thus, using this idea, we can rewrite our gradient as &amp;lt;math&amp;gt; \sum_\tau R(\tau) P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) &amp;lt;/math&amp;gt;. &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Finally, using the definition of expectation again, we have &amp;lt;math&amp;gt; \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \right] &amp;lt;/math&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Thus, using this idea, we can rewrite our gradient as &amp;lt;math&amp;gt; \sum_\tau R(\tau) P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) &amp;lt;/math&amp;gt;. &amp;#160;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1278:rev-1279 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1278&amp;oldid=prev</id>
		<title>Allen12 at 00:52, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1278&amp;oldid=prev"/>
		<updated>2024-05-26T00:52:28Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 00:52, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l45&quot; &gt;Line 45:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 45:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occurring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since the probabilities &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have \nabla_theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right]. Using the formula for discounted reward, we have our final formula E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right]&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Rewriting this into an expectation, we have &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt; &lt;/ins&gt;\nabla_theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. Using the formula for discounted reward, we have our final formula &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt; &lt;/ins&gt;E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1277:rev-1278 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1277&amp;oldid=prev</id>
		<title>Allen12 at 00:52, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1277&amp;oldid=prev"/>
		<updated>2024-05-26T00:52:02Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 00:52, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l43&quot; &gt;Line 43:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 43:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;occuring &lt;/del&gt;given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;occurring &lt;/ins&gt;given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &amp;lt;math&amp;gt; \sum_\tau P(\tau | \theta) R&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;( &lt;/ins&gt;\tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#160;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Rewriting this as an expectation, we have &amp;lt;math&amp;gt; \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] &amp;lt;/math&amp;gt;. Using the formula for the discounted reward, we have our final formula &amp;lt;math&amp;gt; \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] &amp;lt;/math&amp;gt;.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1276:rev-1277 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
	<entry>
		<id>http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1276&amp;oldid=prev</id>
		<title>Allen12 at 00:46, 26 May 2024</title>
		<link rel="alternate" type="text/html" href="http://54.204.126.50/index.php?title=Allen%27s_REINFORCE_notes&amp;diff=1276&amp;oldid=prev"/>
		<updated>2024-05-26T00:46:36Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 00:46, 26 May 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l43&quot; &gt;Line 43:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 43:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Loss Computation ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occuring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression \sum_\tau P(\tau | \theta) R \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is tricky for us to give our policy the notion of &amp;quot;total&amp;quot; reward and &amp;quot;total&amp;quot; probability. Thus, we desire to change these values parameterized by &amp;lt;math&amp;gt; \tau &amp;lt;/math&amp;gt; to instead be parameterized by t. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; is the total reward over all timesteps. Thus, we can rewrite the &amp;lt;math&amp;gt; R(\tau) &amp;lt;/math&amp;gt; component at some timestep t as &amp;lt;math&amp;gt; \gamma^{T - t}r_t &amp;lt;/math&amp;gt;, where gamma is our discount factor. Further, we recall that the probability of the trajectory occuring given the policy is &amp;lt;math&amp;gt; P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) &amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt; P(s_0) &amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt; P(s_{t+1} | s_t, a_t) &amp;lt;/math&amp;gt; are determined by the environment and independent of the policy, their gradient is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is the sum of the logarithms of the individual probabilities, 
we get our final gradient expression &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt; &lt;/ins&gt;\sum_\tau P(\tau | \theta) R \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-1275:rev-1276 --&gt;
&lt;/table&gt;</summary>
		<author><name>Allen12</name></author>
		
	</entry>
</feed>