Difference between revisions of "Allen's REINFORCE notes"

From Humanoid Robots Wiki
Line 10:
  Recall that the objective of Reinforcement Learning is to find an optimal policy <math> \pi^* </math> which we encode in a neural network with parameters <math>\theta^*</math>. These optimal parameters are defined as
- <math>\theta^* = \text<argmax>_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] </math>
+ <math>\theta^* = \text{argmax}_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] </math>
  === Learning ===

Revision as of 21:43, 24 May 2024

Allen's REINFORCE notes

Links

Motivation

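Recall that the objective of Reinforcement Learning is to find an optimal policy <math>\pi^*</math>, which we encode in a neural network with parameters <math>\theta^*</math>. These optimal parameters are defined as

<math>\theta^* = \text{argmax}_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]</math>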
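
The expectation in this objective is taken over trajectories <math>\tau</math> sampled by running the policy, so in practice it is usually estimated from sampled rollouts rather than computed exactly. As a sketch, with <math>N</math> rollouts indexed by <math>i</math> (notation assumed here, not taken from the notes):

<math>E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_t r(s_t^{(i)}, a_t^{(i)})</math>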

Learning

Learning involves the agent taking actions and the environment returning a new state and reward.

  • Input: <math>s_t</math>: States at each time step
  • Output: <math>a_t</math>: Actions at each time step
  • Data:
  • Learn to maximize
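
The interaction described above can be sketched as a simple rollout loop. This is a minimal illustration, not the setup used in these notes: `ToyEnv`, `greedy_policy`, `random_policy`, `rollout`, and `total_reward` are hypothetical names, and the one-dimensional environment is an assumption chosen only to keep the example self-contained.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: the state is an integer position, the goal is 0."""

    def __init__(self, start=3, horizon=10):
        self.start = start
        self.horizon = horizon

    def reset(self):
        self.state = self.start
        self.t = 0
        return self.state

    def step(self, action):
        # The action is -1 or +1; the reward is the negative distance to the goal.
        self.state += action
        self.t += 1
        reward = -abs(self.state)
        done = self.t >= self.horizon
        return self.state, reward, done


def greedy_policy(state):
    # Hand-written policy that always steps toward 0 (stands in for a learned network).
    return -1 if state > 0 else 1


def random_policy(state):
    # Baseline policy that ignores the state.
    return random.choice([-1, 1])


def rollout(env, policy):
    """Run one episode and return the trajectory as a list of (s_t, a_t, r_t) tuples."""
    trajectory = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                        # agent takes an action
        next_state, reward, done = env.step(action)   # environment returns new state and reward
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


def total_reward(trajectory):
    """Sum of rewards over the episode, the quantity learning tries to maximize."""
    return sum(reward for _, _, reward in trajectory)
```

Calling `total_reward(rollout(ToyEnv(), greedy_policy))` scores one episode; averaging that value over many rollouts of a stochastic policy gives a sampled estimate of the expected return.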

State vs. Observation