=== Q Learning ===
Q-learning gives us a way to extract the optimal policy after learning. Instead of keeping track of the values of individual states, we keep track of Q-values <math>Q(s, a)</math> for state-action pairs, representing the utility of taking action <math>a</math> in state <math>s</math>. How do we use these Q-values? There are two main ideas, sketched in code below.

* Idea 1: Policy iteration. If we have a policy <math>\pi</math> and we know <math>Q^\pi(s, a)</math>, we can improve the policy by deterministically setting the action at each state to the argmax of <math>Q^\pi(s, a)</math> over the actions available at that state.
* Idea 2: Gradient update. Since <math>V^\pi(s)</math> is the average of <math>Q^\pi(s, a)</math> over actions drawn from <math>\pi</math>, if <math>Q^\pi(s, a) > V^\pi(s)</math> then <math>a</math> is better than average. We then modify the policy to increase the probability of <math>a</math>.
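Below is a minimal tabular Q-learning sketch in Python. The environment interface (env.reset() and env.step(a) returning (next_state, reward, done)), the table sizes n_states and n_actions, and the hyperparameters are assumptions made for illustration; after learning, the greedy policy of Idea 1 is extracted as the per-state argmax of the Q-table.

<syntaxhighlight lang="python">
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q[s, a] estimates the utility of taking action a in state s.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()          # assumed toy-environment interface
        done = False
        while not done:
            # Epsilon-greedy exploration over the current Q estimates.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Temporal-difference update toward the one-step target.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    # Idea 1: extract the improved policy by taking the argmax of the
    # learned Q-values at each state.
    policy = np.argmax(Q, axis=1)
    return Q, policy
</syntaxhighlight>

For Idea 2, here is a sketch of one soft policy update, assuming a stochastic tabular policy pi of shape (n_states, n_actions): since <math>V^\pi(s)</math> is the policy-weighted average of <math>Q^\pi(s, a)</math>, actions with positive advantage are better than average and have their probability increased. The exponentiated step and the step size eta are illustrative choices, not prescribed by the text above.

<syntaxhighlight lang="python">
import numpy as np

def advantage_policy_update(pi, Q, eta=0.5):
    # V^pi(s) = sum_a pi(a|s) Q^pi(s, a): the policy-weighted average.
    V = np.sum(pi * Q, axis=1, keepdims=True)
    advantage = Q - V                            # > 0 means better than average
    new_pi = pi * np.exp(eta * advantage)        # boost above-average actions
    new_pi /= new_pi.sum(axis=1, keepdims=True)  # renormalize each row
    return new_pi
</syntaxhighlight>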