Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^\pi (s, a) </math>, we can improve the policy by deterministically setting the action at each state to be the argmax over all possible actions at that state.
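The greedy improvement step above can be sketched in tabular form; the array shapes and names here are assumptions, not from the original:

```python
import numpy as np

def improve_policy(Q):
    """Greedy policy improvement: at each state s, pick argmax_a Q(s, a).

    Q is a hypothetical tabular value function of shape
    (num_states, num_actions); the returned array gives one
    deterministic action per state.
    """
    return np.argmax(Q, axis=1)

# Toy example: 2 states, 2 actions.
Q = np.array([[0.1, 0.5],
              [0.7, 0.2]])
policy = improve_policy(Q)  # deterministic action index per state
```

The new policy is at least as good as the old one at every state, which is what makes policy iteration converge.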
<math> Q_{i+1}(s,a) = (1 - \alpha) Q_i(s,a) + \alpha \left( r(s, a) + \gamma V_i(s') \right) </math>
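As a minimal sketch of this update rule (the function and argument names here are illustrative assumptions):

```python
import numpy as np

def q_update(Q, V, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One step of the update Q_{i+1}(s,a) = (1-alpha) Q_i(s,a)
    + alpha * (r(s,a) + gamma * V_i(s')).

    Q: tabular Q-function, shape (num_states, num_actions)
    V: tabular value function, shape (num_states,)
    alpha is the learning rate, gamma the discount factor.
    """
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * V[s_next])
    return Q
```

With <math> \alpha = 1 </math> this reduces to the exact Bellman backup; smaller <math> \alpha </math> blends the new target with the old estimate.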
Idea 2: Gradient update - if <math> Q^\pi(s, a) > V^\pi(s) </math>, then <math> a </math> is better than average. We then modify the policy to increase the probability of <math> a </math>.
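One common way to realize this idea is an advantage-weighted update on softmax action preferences; the parameterization, step size, and names below are assumptions for illustration, not the original's prescription:

```python
import numpy as np

def gradient_update(theta, s, a, advantage, eta=0.1):
    """Nudge a softmax policy toward action a in state s when its
    advantage A(s,a) = Q(s,a) - V(s) is positive.

    theta: hypothetical preference table, shape (num_states, num_actions);
    pi(a|s) = softmax(theta[s]). eta is an assumed step size.
    """
    probs = np.exp(theta[s]) / np.sum(np.exp(theta[s]))
    grad = -probs
    grad[a] += 1.0  # gradient of log pi(a|s) w.r.t. theta[s]
    theta[s] += eta * advantage * grad
    return theta
```

A positive advantage raises the preference for <math> a </math> relative to the other actions, so its probability increases; a negative advantage pushes it down.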