Allen's REINFORCE notes

283 bytes added, 23:12, 25 May 2024

=== Objective Function ===

The goal of reinforcement learning is to maximize the expected reward over the entire episode. We use <math>R(\tau)</math> to denote the total reward over a trajectory <math>\tau</math> generated by our policy. Thus we want to maximize <math>E_{\tau \sim \pi_\theta}[R(\tau)]</math>. Using the definition of expected value, we can expand this as <math>\sum_\tau P(\tau | \theta) R(\tau)</math>, where the probability of a given trajectory is <math>P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t)</math>.
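
To make the expansion concrete, here is a minimal sketch that enumerates every trajectory of a tiny MDP, computes <math>P(\tau | \theta)</math> from the product formula, and sums <math>P(\tau | \theta) R(\tau)</math>. The MDP itself (two states, two actions, the tables <code>P0</code>, <code>POLICY</code>, <code>TRANSITION</code>, <code>REWARD</code>, and the choice of <math>R(\tau)</math> as a sum of per-step rewards) is a made-up assumption for illustration and does not come from the notes.

```python
import itertools

# Hypothetical toy MDP (all numbers are illustrative assumptions).
STATES = [0, 1]
ACTIONS = [0, 1]
T = 1  # last time index, so actions are taken at t = 0..T

P0 = [0.5, 0.5]                      # P(s_0)
POLICY = [[0.7, 0.3], [0.4, 0.6]]    # POLICY[s][a] = pi_theta(a | s)
TRANSITION = [                       # TRANSITION[s][a][s_next] = P(s_next | s, a)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
REWARD = [[0.0, 1.0], [2.0, 0.0]]    # REWARD[s][a], assumed per-step reward


def trajectory_probability(states, actions):
    """P(tau | theta) = P(s_0) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)."""
    prob = P0[states[0]]
    for t in range(T + 1):
        s, a, s_next = states[t], actions[t], states[t + 1]
        prob *= POLICY[s][a] * TRANSITION[s][a][s_next]
    return prob


def trajectory_return(states, actions):
    """R(tau), taken here to be the sum of per-step rewards."""
    return sum(REWARD[states[t]][actions[t]] for t in range(T + 1))


def expected_return():
    """Sum P(tau | theta) * R(tau) over every possible trajectory."""
    total_prob = 0.0
    expected = 0.0
    # A trajectory has T + 2 states (s_0 .. s_{T+1}) and T + 1 actions.
    for states in itertools.product(STATES, repeat=T + 2):
        for actions in itertools.product(ACTIONS, repeat=T + 1):
            p = trajectory_probability(states, actions)
            total_prob += p
            expected += p * trajectory_return(states, actions)
    return total_prob, expected


total_prob, expected = expected_return()
print(total_prob, expected)
```

The trajectory probabilities should sum to 1, which is a quick sanity check on the product formula. Exhaustive enumeration only works for very small problems, since the number of trajectories grows exponentially with the horizon.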

=== Loss Function ===

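The goal of REINFORCE is to optimize the expected cumulative reward. Because we want to maximize this quantity, we perform gradient ascent on it, or equivalently gradient descent on its negative, which serves as the loss.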
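
As a rough illustration of descending a loss whose minimum corresponds to maximum expected reward, the sketch below trains a softmax policy on a two-armed bandit. It assumes the usual score-function form of the loss, <math>-R \log \pi_\theta(a)</math>, which is not derived in the notes above. The bandit, its reward values, and the names <code>softmax</code> and <code>train</code> are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def train(num_steps=2000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    rewards = [0.0, 1.0]   # hypothetical deterministic reward per arm
    theta = [0.0, 0.0]     # policy parameters: one logit per arm
    for _ in range(num_steps):
        probs = softmax(theta)
        action = rng.choices([0, 1], weights=probs)[0]
        reward = rewards[action]
        for k in range(len(theta)):
            # d/d theta_k of log pi(action) for a softmax policy.
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            # Assumed loss: -reward * log pi(action). Descending it is the
            # same as ascending the sampled estimate of expected reward.
            loss_grad = -reward * grad_log_prob
            theta[k] -= learning_rate * loss_grad
    return softmax(theta)


final_probs = train()
print(final_probs)
```

With these settings the policy should shift most of its probability onto the arm that pays a reward of 1. A one-step bandit is the simplest case; the episodic setting replaces the single reward with <math>R(\tau)</math> and the single log-probability with a sum over time steps.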

Allen12
