Humanoid Robots Wiki β

Allen's REINFORCE notes

108 bytes added, 26 May
Now we want to find the gradient of <math> J (\theta) </math>, namely
<math>\nabla_\theta \sum_\tau P(\tau | \theta) R(\tau) </math>. Since the reward function does not depend on the parameters, we can move the gradient inside the sum: <math>\sum_\tau R(\tau) \nabla_\theta P(\tau | \theta) </math>. The next step uses what is called the Log Derivative Trick.
====Log Derivative Trick====
Suppose we'd like to find <math>\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))</math>. By the chain rule this is equal to <math>\frac{\nabla_{x_1}f(x_1, x_2, x_3, ...)}{f(x_1, x_2, x_3, ...)}</math>. Rearranging, we can write the gradient of any positive function with respect to some variable as <math>\nabla_{x_1}f(x_1, x_2, x_3, ...)= f(x_1, x_2, x_3, ...)\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))</math>.
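The identity can be sanity-checked numerically. The sketch below is only an illustration: the test function <code>f</code>, the evaluation point, and the finite-difference helper <code>partial_x1</code> are all assumptions made up for this check, not part of the notes.

```python
import math

def f(x1, x2):
    # Hypothetical positive test function, chosen only for illustration.
    return x1 ** 2 * x2 + 1.0

def log_f(x1, x2):
    return math.log(f(x1, x2))

def partial_x1(g, x1, x2, h=1e-6):
    # Central finite-difference approximation of dg/dx1.
    return (g(x1 + h, x2) - g(x1 - h, x2)) / (2 * h)

x1, x2 = 1.5, 2.0

# Left-hand side: gradient of f with respect to x1, computed directly.
direct = partial_x1(f, x1, x2)

# Right-hand side: f times the gradient of log f with respect to x1.
via_log = f(x1, x2) * partial_x1(log_f, x1, x2)

print(direct, via_log)
```

Both values should agree up to finite-difference error, and for this particular <code>f</code> the analytic derivative is <math>2 x_1 x_2</math>.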
Applying this with <math> f = P(\tau | \theta) </math>, we can rewrite our gradient as <math> \sum_\tau R(\tau) P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) </math>.
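Because each term is weighted by <math> P(\tau | \theta) </math>, this sum is an expectation over trajectories, so it can be approximated by averaging <math> R(\tau) \nabla_\theta \log P(\tau | \theta) </math> over sampled trajectories. The sketch below compares the exact sum with such a sampled average on a toy problem. Everything in it is an assumption for illustration: a "trajectory" is a single choice between two actions, the policy is a sigmoid of one scalar parameter, and the rewards are made up.

```python
import math
import random

# Hypothetical rewards R(tau), where tau is just action 0 or action 1.
REWARDS = (1.0, 3.0)

def probs(theta):
    # Assumed policy: P(action 1 | theta) = sigmoid(theta).
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return (1.0 - p1, p1)

def grad_log_prob(theta, action):
    # d/dtheta of log P(action | theta) for the sigmoid policy above.
    p1 = probs(theta)[1]
    return (1.0 - p1) if action == 1 else -p1

def exact_gradient(theta):
    # Sum over all tau of R(tau) * P(tau | theta) * grad log P(tau | theta).
    p = probs(theta)
    return sum(REWARDS[a] * p[a] * grad_log_prob(theta, a) for a in (0, 1))

def sampled_gradient(theta, n_samples=100_000, seed=0):
    # Average of R(tau) * grad log P(tau | theta) over tau sampled from P.
    rng = random.Random(seed)
    p1 = probs(theta)[1]
    total = 0.0
    for _ in range(n_samples):
        a = 1 if rng.random() < p1 else 0
        total += REWARDS[a] * grad_log_prob(theta, a)
    return total / n_samples

theta = 0.3
exact = exact_gradient(theta)
estimate = sampled_gradient(theta)
print(exact, estimate)
```

The sampled value fluctuates around the exact one, and the gap shrinks as <code>n_samples</code> grows.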
=== Loss Function ===
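The goal of REINFORCE is to maximize the expected cumulative reward <math> J (\theta) </math>. We do so using gradient descent on the negated objective <math> -J (\theta) </math>, which is equivalent to gradient ascent on <math> J (\theta) </math>.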
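One common way to turn this into code is to minimize a surrogate loss <math> -\frac{1}{N} \sum_i R(\tau_i) \log P(\tau_i | \theta) </math> over sampled trajectories, whose gradient is the negative of the sampled policy gradient. The sketch below is a minimal illustration under assumptions that are not from the notes: a toy two-action problem with made-up rewards, a sigmoid policy with one scalar parameter, and arbitrary choices of learning rate, batch size, and step count.

```python
import math
import random

# Hypothetical rewards for actions 0 and 1.
REWARDS = (1.0, 3.0)

def prob_action1(theta):
    # Assumed policy: P(action 1 | theta) = sigmoid(theta).
    return 1.0 / (1.0 + math.exp(-theta))

def loss_gradient(theta, actions):
    # Gradient of the surrogate loss
    #   L(theta) = -(1/N) * sum_i R(a_i) * log P(a_i | theta)
    # with the sampled actions held fixed.
    p = prob_action1(theta)
    total = 0.0
    for a in actions:
        grad_log_p = (1.0 - p) if a == 1 else -p
        total += REWARDS[a] * grad_log_p
    return -total / len(actions)

def train(steps=500, batch_size=64, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        # Sample a batch of actions from the current policy.
        p = prob_action1(theta)
        actions = [1 if rng.random() < p else 0 for _ in range(batch_size)]
        # Gradient descent step on the surrogate loss.
        theta -= lr * loss_gradient(theta, actions)
    return theta

theta_final = train()
p_final = prob_action1(theta_final)
print(theta_final, p_final)
```

Since action 1 has the higher reward in this toy setup, descending the loss should push the probability of action 1 toward 1.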