Allen's REINFORCE notes
Links
Motivation
Recall that the objective of Reinforcement Learning is to find an optimal policy <math>\pi^*</math>, which we encode in a neural network with parameters <math>\theta^*</math>. These optimal parameters are defined as

<math>\theta^* = \text{argmax}_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]</math>.

Let's unpack what this means. In plain English, this says that the optimal policy is the one for which the expected total reward, taken over trajectories generated by following that policy, is highest among all policies.
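To make the expectation concrete, it helps to spell out the trajectory distribution <math>p_\theta(\tau)</math>. Under the usual MDP assumptions (an initial state distribution and transition dynamics, which the notes leave implicit), a trajectory <math>\tau = (s_1, a_1, \ldots, s_T, a_T)</math> is generated with probability

<math>p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)</math>,

so the expectation is over trajectories obtained by rolling out <math>\pi_\theta</math> in the environment, and in practice it is approximated by averaging the total reward over sampled rollouts.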
Learning
Learning involves the agent taking actions and the environment returning a new state and reward.
- Input: <math>s_t</math>: States at each time step
- Output: <math>a_t</math>: Actions at each time step
- Data: <math>(s_1, a_1, r_1, \ldots, s_T, a_T, r_T)</math>
- Learn <math>\pi_\theta : s_t \rightarrow a_t</math> to maximize <math>\sum_t r_t</math>
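As a concrete illustration of this setup, the sketch below gathers the data tuples listed above by rolling out a policy in an environment and then forms a Monte Carlo estimate of the objective from the Motivation section. This is only a sketch: the <code>env</code> interface (<code>reset()</code>/<code>step()</code>) and the <code>policy</code> callable are illustrative assumptions, not something defined in these notes.

<syntaxhighlight lang="python">
def rollout(policy, env, horizon=200):
    """Collect one trajectory (s_1, a_1, r_1, ..., s_T, a_T, r_T) by following `policy` in `env`.

    Assumed (illustrative) interfaces: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), and policy(state) -> action.
    """
    s = env.reset()
    trajectory = []
    for _ in range(horizon):
        a = policy(s)                  # a_t sampled from pi_theta(. | s_t)
        s_next, r, done = env.step(a)  # environment returns s_{t+1} and r_t
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory


def estimate_objective(policy, env, num_episodes=100):
    """Monte Carlo estimate of E_{tau ~ p_theta(tau)}[ sum_t r(s_t, a_t) ]:
    average the total reward sum_t r_t over sampled trajectories."""
    totals = []
    for _ in range(num_episodes):
        traj = rollout(policy, env)
        totals.append(sum(r for _, _, r in traj))
    return sum(totals) / len(totals)
</syntaxhighlight>

A full REINFORCE implementation would also record the log-probability of each chosen action along the trajectory, since those terms are what the policy-gradient update uses.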