Now we want to find the gradient of <math> J (\theta) </math>, namely
<math>\nabla_\theta \sum_\tau P(\tau | \theta) R(\tau) </math>. Since the reward function does not depend on the parameters <math>\theta</math>, we can move the gradient inside the sum: <math>\sum_\tau \nabla_\theta P(\tau | \theta) R(\tau) </math>. The important next step here is what's called the Log Derivative Trick.
====Log Derivative Trick====
Suppose we'd like to find <math>\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))</math>. By the chain rule this is equal to <math>\frac{\nabla_{x_1}f(x_1, x_2, x_3, ...)}{f(x_1, x_2, x_3, ...)}</math>. Thus, by rearranging, we can write the gradient of any positive function with respect to some variable as <math>\nabla_{x_1}f(x_1, x_2, x_3, ...)= f(x_1, x_2, x_3, ...)\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))</math>.
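The identity is easy to sanity-check numerically. The sketch below is illustrative only (the particular function <code>f</code> and the finite-difference step are arbitrary assumptions, not part of the article): it compares <math>\nabla_{x_1}f</math> computed directly against <math>f \cdot \nabla_{x_1}\log f</math>.

<syntaxhighlight lang="python">
import numpy as np

# A simple positive function of three variables (an arbitrary example).
def f(x1, x2, x3):
    return np.exp(x1 * x2) + x3**2

# Analytic gradient of f with respect to x1.
def grad_x1_f(x1, x2, x3):
    return x2 * np.exp(x1 * x2)

# Gradient of log(f) with respect to x1, via central finite differences.
def grad_x1_log_f(x1, x2, x3, eps=1e-6):
    return (np.log(f(x1 + eps, x2, x3)) - np.log(f(x1 - eps, x2, x3))) / (2 * eps)

x1, x2, x3 = 0.7, -0.3, 1.5
direct = grad_x1_f(x1, x2, x3)
via_trick = f(x1, x2, x3) * grad_x1_log_f(x1, x2, x3)
print(direct, via_trick)  # the two values agree up to finite-difference error
</syntaxhighlight>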
Thus, applying the trick with <math>f = P(\tau | \theta)</math>, we get <math>\nabla_\theta J(\theta) = \sum_\tau P(\tau | \theta)\, \nabla_\theta \log P(\tau | \theta)\, R(\tau) = \mathbb{E}_{\tau}\left[\nabla_\theta \log P(\tau | \theta)\, R(\tau)\right]</math>, an expectation that can be estimated by sampling trajectories from the current policy.
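To make the expectation concrete, here is a toy sketch (an illustration, not the article's setup; the per-arm rewards and the softmax parameterization are assumptions) in which each "trajectory" is a single pull of a 3-armed bandit. It compares the exact gradient <math>\sum_\tau \nabla_\theta P(\tau|\theta) R(\tau)</math> with the Monte Carlo estimate of <math>\mathbb{E}_\tau[\nabla_\theta \log P(\tau|\theta) R(\tau)]</math>.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

rewards = np.array([1.0, 2.0, 3.0])   # assumed reward R(tau) for each one-step trajectory (arm)
theta = np.array([0.2, -0.1, 0.4])    # policy parameters (softmax logits)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(theta)

# Exact gradient: sum_tau grad_theta P(tau|theta) R(tau),
# using dP_k/dtheta_j = P_k (delta_kj - P_j) for a softmax.
exact = np.zeros_like(theta)
for k, r in enumerate(rewards):
    exact += probs[k] * ((np.arange(3) == k) - probs) * r

# Monte Carlo estimate: average of grad_theta log P(tau|theta) R(tau) over samples,
# using grad_theta log P(k|theta) = onehot(k) - probs for a softmax.
n_samples = 100_000
samples = rng.choice(3, size=n_samples, p=probs)
grad_log_p = np.eye(3)[samples] - probs
estimate = (grad_log_p * rewards[samples, None]).mean(axis=0)

print(exact)
print(estimate)  # agrees with the exact gradient up to sampling error
</syntaxhighlight>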
=== Loss Function ===
The goal of REINFORCE is to optimize the expected cumulative reward <math>J(\theta)</math>. We do so using gradient ascent on <math>J(\theta)</math> (equivalently, gradient descent on the loss <math>-J(\theta)</math>), stepping the policy parameters in the direction of the gradient estimate derived above.
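As a minimal sketch of what this optimization can look like (again an illustrative toy example, not the article's implementation; the bandit rewards, learning rate, and softmax policy are assumptions), we repeatedly sample under the current policy, form the score-function gradient estimate, and take an ascent step:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
rewards = np.array([1.0, 2.0, 3.0])   # assumed per-arm rewards
theta = np.zeros(3)                   # softmax logits (policy parameters)
alpha = 0.1                           # learning rate (arbitrary choice)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    k = rng.choice(3, p=probs)                  # sample a "trajectory" (one arm pull)
    grad_log_p = (np.arange(3) == k) - probs    # grad_theta log P(k | theta)
    theta += alpha * grad_log_p * rewards[k]    # gradient ascent step on J(theta)

print(softmax(theta))  # probability mass concentrates on the highest-reward arm
</syntaxhighlight>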