Allen's REINFORCE notes
Links
Motivation
Recall that the objective of Reinforcement Learning is to find an optimal policy <math>\pi^*</math>, which we encode in a neural network with parameters <math>\theta^*</math>. These optimal parameters are defined as

<math>\theta^* = \text{argmax}_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]</math>.

Let's unpack what this means. In plain English, this says that the optimal policy is the one for which the expected total reward, taken over trajectories generated by following that policy, is highest among all policies.
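To make the expectation concrete, it helps to spell out the trajectory distribution <math>p_\theta(\tau)</math>. Under the usual MDP assumptions (an initial state distribution and transition dynamics, which the notes leave implicit), a trajectory <math>\tau = (s_1, a_1, \ldots, s_T, a_T)</math> is generated with probability

<math>p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)</math>,

so the expectation is over trajectories obtained by rolling out <math>\pi_\theta</math> in the environment, and in practice it is approximated by averaging the total reward over sampled rollouts.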
Learning
Learning involves the agent taking actions and the environment returning a new state and reward.
- Input: <math>s_t</math>: States at each time step
- Output: <math>a_t</math>: Actions at each time step
- Data: <math>(s_1, a_1, r_1, \ldots, s_T, a_T, r_T)</math>
- Learn <math>\pi_\theta : s_t \rightarrow a_t</math> to maximize <math>\sum_t r_t</math>
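As a concrete illustration of this setup, the sketch below gathers the data tuples listed above by rolling out a policy in an environment and then forms a Monte Carlo estimate of the objective from the Motivation section. This is only a sketch: the <code>env</code> interface (<code>reset()</code>/<code>step()</code>) and the <code>policy</code> callable are illustrative assumptions, not something defined in these notes.

<syntaxhighlight lang="python">
def rollout(policy, env, horizon=200):
    """Collect one trajectory (s_1, a_1, r_1, ..., s_T, a_T, r_T) by following `policy` in `env`.

    Assumed (illustrative) interfaces: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), and policy(state) -> action.
    """
    s = env.reset()
    trajectory = []
    for _ in range(horizon):
        a = policy(s)                  # a_t sampled from pi_theta(. | s_t)
        s_next, r, done = env.step(a)  # environment returns s_{t+1} and r_t
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory


def estimate_objective(policy, env, num_episodes=100):
    """Monte Carlo estimate of E_{tau ~ p_theta(tau)}[ sum_t r(s_t, a_t) ]:
    average the total reward sum_t r_t over sampled trajectories."""
    totals = []
    for _ in range(num_episodes):
        traj = rollout(policy, env)
        totals.append(sum(r for _, _, r in traj))
    return sum(totals) / len(totals)
</syntaxhighlight>

A full REINFORCE implementation would also record the log-probability of each chosen action along the trajectory, since those terms are what the policy-gradient update uses.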