53
edits
Changes
no edit summary
<math>\nabla_\theta \sum_\tau P(\tau | \theta) R(\tau) </math>. The important step here is called the Log Derivative Trick.
====Log Derivative Trick====Suppose we'd like to find <math>\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))</math>. By the chain rule this is equal to <math>\nablafrac{\nabla_{x_1}f(x_1, x_2, x_3 ...)}{f(x_1, x_2, x_3 ...)}. Thus, by rearranging, we can take the gradient of any function with respect to some variable as <math>\nabla_{x_1}f(x_1, x_2, x_3, ...)= f(x_1, x_2, x_3,...)\nabla_{x_1}\log(f(x_1, x_2, x_3, ...)</math>.
=== Loss Function ===
The goal of REINFORCE is to optimize the expected cumulative reward. We do so using gradient descent