Changes

Allen's REINFORCE notes

248 bytes added, 21:46, 24 May 2024

→‎Motivation

Recall that the objective of Reinforcement Learning is to find an optimal policy <math> \pi^* </math> which we encode in a neural network with parameters <math>\theta^*</math>. These optimal parameters are defined as

<math>\theta^* = \text{argmax}_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] </math>. Let's unpack what this means. To phrase it in english, this is basically saying that the optimal policy is one such that the expected value of the total reward over following a trajectory determined by the policy is the highest over all policies.

=== Learning ===

Allen12

53

edits

Humanoid Robots Wiki β

Changes

Allen's REINFORCE notes

Humanoid Robots Wiki ^β