Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^\pi (s, a) </math>, we can improve the policy by deterministically setting the action at each state to be the argmax over all possible actions at that state.
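The greedy improvement step above can be sketched in tabular form; the array shapes and names here are assumptions, not from the original:

```python
import numpy as np

def improve_policy(Q):
    """Greedy policy improvement: at each state s, pick argmax_a Q(s, a).

    Q is a hypothetical tabular value function of shape
    (num_states, num_actions); the returned array gives one
    deterministic action per state.
    """
    return np.argmax(Q, axis=1)

# Toy example: 2 states, 2 actions.
Q = np.array([[0.1, 0.5],
              [0.7, 0.2]])
policy = improve_policy(Q)  # deterministic action index per state
```

The new policy is at least as good as the old one at every state, which is what makes policy iteration converge.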
<math> Q_{i+1}(s,a) = (1 - \alpha) Q_i(s,a) + \alpha \left( r(s, a) + \gamma V_i(s') \right) </math>
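As a minimal sketch of this update rule (the function and argument names here are illustrative assumptions):

```python
import numpy as np

def q_update(Q, V, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One step of the update Q_{i+1}(s,a) = (1-alpha) Q_i(s,a)
    + alpha * (r(s,a) + gamma * V_i(s')).

    Q: tabular Q-function, shape (num_states, num_actions)
    V: tabular value function, shape (num_states,)
    alpha is the learning rate, gamma the discount factor.
    """
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * V[s_next])
    return Q
```

With <math> \alpha = 1 </math> this reduces to the exact Bellman backup; smaller <math> \alpha </math> blends the new target with the old estimate.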
Idea 2: Gradient update - if <math> Q^\pi(s, a) > V^\pi(s) </math>, then <math> a </math> is better than average. We then modify the policy to increase the probability of <math> a </math>.
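One common way to realize this idea is an advantage-weighted update on softmax action preferences; the parameterization, step size, and names below are assumptions for illustration, not the original's prescription:

```python
import numpy as np

def gradient_update(theta, s, a, advantage, eta=0.1):
    """Nudge a softmax policy toward action a in state s when its
    advantage A(s,a) = Q(s,a) - V(s) is positive.

    theta: hypothetical preference table, shape (num_states, num_actions);
    pi(a|s) = softmax(theta[s]). eta is an assumed step size.
    """
    probs = np.exp(theta[s]) / np.sum(np.exp(theta[s]))
    grad = -probs
    grad[a] += 1.0  # gradient of log pi(a|s) w.r.t. theta[s]
    theta[s] += eta * advantage * grad
    return theta
```

A positive advantage raises the preference for <math> a </math> relative to the other actions, so its probability increases; a negative advantage pushes it down.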