Changes

Allen's Reinforcement Learning Notes

11 bytes added, 18:18, 21 May 2024

→‎Q Learning

Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^\pi (s, a) </math>, we can improve the policy by deterministically setting the action at each state to the argmax of <math> Q^\pi (s, a) </math> over all possible actions at that state.
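The improvement step in Idea 1 can be sketched for the tabular case. This is a minimal illustration, assuming a small discrete problem where Q^pi is stored as a NumPy array indexed by (state, action); the name `greedy_policy` and the values in `Q_pi` are hypothetical and not part of the notes.

```python
import numpy as np

def greedy_policy(Q):
    """Return a deterministic policy: for each state, the action with the largest Q value."""
    return np.argmax(Q, axis=1)

# Hypothetical Q^pi table: 3 states (rows), 2 actions (columns).
Q_pi = np.array([
    [1.0, 2.0],
    [0.5, 0.1],
    [0.0, 3.0],
])

improved_policy = greedy_policy(Q_pi)
print(improved_policy)  # one action index per state
```

Ties are broken toward the lowest action index here, which is simply how `np.argmax` behaves; the notes do not specify a tie-breaking rule.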

<math> Q_{i+1} (s,a)=(1 - \alpha)Q_i(s,a)+\alpha(r(s, a)+\gamma V_i(s')) </math>
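A single step of this update might look as follows in the tabular setting. This is a sketch under stated assumptions: the notes do not define <math> V_i </math> here, so the code assumes <math> V_i(s') = \max_{a'} Q_i(s', a') </math>, as in standard Q-learning; the function name `q_update` and the default values of `alpha` and `gamma` are hypothetical.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular step: Q_{i+1}(s,a) = (1 - alpha) Q_i(s,a) + alpha (r + gamma V_i(s'))."""
    # Assumption (not stated in the notes): V_i(s') is the max of Q_i(s', .) over actions.
    v_next = np.max(Q[s_next])
    Q_new = Q.copy()  # leave Q_i untouched and return Q_{i+1}
    Q_new[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * v_next)
    return Q_new

# Hypothetical 2-state, 2-action table.
Q = np.zeros((2, 2))
Q[1] = [1.0, 2.0]

Q_next = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q_next[0, 1])
```

With <math> \alpha = 1 </math> the old estimate is discarded entirely, and with <math> \alpha = 0 </math> nothing is learned; intermediate values blend the old estimate with the new target.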

Idea 2: Gradient update - If <math> Q^\pi(s, a) > V^\pi(s) </math>, then action <math> a </math> is better than average at state <math> s </math>. We then modify the policy to increase the probability of <math> a </math>.
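One way to picture Idea 2 is a tabular softmax policy whose logits are nudged in proportion to the advantage <math> Q^\pi(s, a) - V^\pi(s) </math>. The softmax parameterization, the learning rate, and the names `softmax` and `policy_gradient_step` are assumptions made for illustration; the notes do not commit to a particular policy class.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def policy_gradient_step(theta, s, a, advantage, lr=0.1):
    """Move the logits at state s along advantage * grad log pi(a|s)."""
    probs = softmax(theta[s])
    # For a softmax policy, grad of log pi(a|s) w.r.t. the logits is onehot(a) - pi(.|s).
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta_new = theta.copy()
    theta_new[s] += lr * advantage * grad_log_pi
    return theta_new

# Hypothetical setup: 1 state, 3 actions, uniform initial policy.
theta = np.zeros((1, 3))
p_before = softmax(theta[0])[2]

# Positive advantage: Q^pi(s, a) > V^pi(s), so pi(a|s) should go up.
theta_after = policy_gradient_step(theta, s=0, a=2, advantage=1.5)
p_after = softmax(theta_after[0])[2]
print(p_before, p_after)
```

A negative advantage flips the sign of the step, so the same function lowers the probability of actions that are worse than average.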

Ben


Humanoid Robots Wiki β
