Changes

Allen's REINFORCE notes

153 bytes added, 00:30, 26 May 2024

no edit summary

Now we want to find the gradient of <math> J (\theta) </math>, namely

<math>\nabla_\theta \sum_\tau P(\tau | \theta) R(\tau) </math>. Since the reward function doesn't depend on the parameters, we can move the gradient inside the sum: <math>\sum_\tau R(\tau) \nabla_\theta P(\tau | \theta) </math>. The next step here is what's called the Log Derivative Trick.

====Log Derivative Trick====

Suppose we'd like to find <math>\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))</math> for a positive function <math>f</math>. By the chain rule this is equal to <math>\frac{\nabla_{x_1}f(x_1, x_2, x_3, ...)}{f(x_1, x_2, x_3, ...)}</math>. Thus, by rearranging, we can write the gradient of any such function with respect to some variable as <math>\nabla_{x_1}f(x_1, x_2, x_3, ...)= f(x_1, x_2, x_3, ...)\nabla_{x_1}\log(f(x_1, x_2, x_3, ...))</math>.
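The identity can be checked numerically. The snippet below is a minimal sketch, not part of the original notes: the function <code>f</code>, the evaluation point, and the finite-difference helper <code>partial_x1</code> are all hypothetical choices made for illustration. It compares the gradient of <code>f</code> computed directly against <code>f</code> times the gradient of <code>log f</code>.

```python
import math


def f(x1, x2):
    # Hypothetical positive, differentiable function chosen for illustration.
    return math.exp(-x1 ** 2) * (1.0 + x2 ** 2)


def log_f(x1, x2):
    return math.log(f(x1, x2))


def partial_x1(g, x1, x2, h=1e-6):
    # Central finite difference with respect to x1 only.
    return (g(x1 + h, x2) - g(x1 - h, x2)) / (2.0 * h)


x1, x2 = 0.7, -1.3  # arbitrary evaluation point

direct = partial_x1(f, x1, x2)                    # grad_{x1} f
via_log = f(x1, x2) * partial_x1(log_f, x1, x2)   # f * grad_{x1} log f
analytic = -2.0 * x1 * f(x1, x2)                  # closed form for this particular f

print(direct, via_log, analytic)
```

All three values should agree up to finite-difference error, which is what the trick claims for any positive <code>f</code>.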

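Thus, applying the trick to <math>P(\tau | \theta)</math> gives <math>\nabla_\theta P(\tau | \theta) = P(\tau | \theta) \nabla_\theta \log P(\tau | \theta)</math>, so <math>\nabla_\theta J(\theta) = \sum_\tau P(\tau | \theta) R(\tau) \nabla_\theta \log P(\tau | \theta)</math>.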

=== Loss Function ===

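The goal of REINFORCE is to maximize the expected cumulative reward <math>J(\theta)</math>. We do so using gradient descent on the loss <math>-J(\theta)</math>, which is equivalent to gradient ascent on <math>J(\theta)</math>.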
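To make this concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration and does not come from the notes: there are three one-step "trajectories" with fixed rewards <code>REWARDS</code>, the policy is a softmax over three logits, and the learning rate, batch size, and step count are arbitrary. The gradient of the loss <math>-J(\theta)</math> is estimated from sampled trajectories as the negated average of <math>R(\tau) \nabla_\theta \log P(\tau | \theta)</math>, then used for plain gradient descent.

```python
import math
import random

# Hypothetical rewards R(tau) for three one-step trajectories.
REWARDS = [1.0, 0.0, 2.0]


def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta):
    # J(theta) = sum_tau P(tau | theta) R(tau)
    return sum(p * r for p, r in zip(softmax(theta), REWARDS))


def grad_log_prob(theta, tau):
    # For a softmax policy: d/d theta_k log P(tau | theta) = 1[k == tau] - P(k | theta).
    probs = softmax(theta)
    return [(1.0 if k == tau else 0.0) - probs[k] for k in range(len(theta))]


def sampled_loss_gradient(theta, rng, batch_size):
    # Monte Carlo estimate of grad(-J) = -E[R(tau) * grad log P(tau | theta)].
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(batch_size):
        tau = rng.choices(range(len(theta)), weights=probs)[0]
        score = grad_log_prob(theta, tau)
        for k in range(len(theta)):
            grad[k] -= REWARDS[tau] * score[k] / batch_size
    return grad


def train(steps=200, lr=0.5, batch_size=32, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]  # start from a uniform policy
    for _ in range(steps):
        grad = sampled_loss_gradient(theta, rng, batch_size)
        theta = [t - lr * g for t, g in zip(theta, grad)]  # gradient descent on -J
    return theta


theta = train()
print("J before:", expected_reward([0.0, 0.0, 0.0]))
print("J after: ", expected_reward(theta))
```

In this sketch the policy should shift probability toward the highest-reward trajectory, so <code>expected_reward(theta)</code> ends above its starting value. A full REINFORCE implementation would sample multi-step trajectories from an environment and typically subtract a baseline to reduce variance. Both are omitted here.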

Allen12

53

edits

Humanoid Robots Wiki β
