Policy Gradients
Let $\tau$ denote a state-action trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_H, a_H)$, and let the reward of the trajectory be $R(\tau) = \sum_{t=0}^{H} R(s_t, a_t)$. The expected utility of a policy $\pi_\theta$ is

$$U(\theta) = E\left[\sum_{t=0}^{H} R(s_t, a_t) \,\Big|\, \pi_\theta\right] = \sum_{\tau} P(\tau \mid \pi_\theta)\, R(\tau)$$

Our goal is to find the policy $\pi_\theta$ that maximizes the expected utility $U(\theta)$:

$$\max_{\theta} U(\theta) = \max_{\theta} \sum_{\tau} P(\tau \mid \pi_\theta)\, R(\tau)$$

Taking the gradient of the expected utility with respect to the policy parameters $\theta$, and multiplying and dividing by $P(\tau \mid \pi_\theta)$ (the likelihood-ratio trick), we get

$$
\begin{aligned}
\nabla_{\theta} U(\theta) &= \nabla_{\theta} \sum_{\tau} P(\tau \mid \pi_\theta)\, R(\tau) \\
&= \sum_{\tau} \nabla_{\theta} P(\tau \mid \pi_\theta)\, R(\tau) \\
&= \sum_{\tau} \frac{P(\tau \mid \pi_\theta)}{P(\tau \mid \pi_\theta)}\, \nabla_{\theta} P(\tau \mid \pi_\theta)\, R(\tau) \\
&= \sum_{\tau} P(\tau \mid \pi_\theta)\, \frac{\nabla_{\theta} P(\tau \mid \pi_\theta)}{P(\tau \mid \pi_\theta)}\, R(\tau) \\
&= \sum_{\tau} P(\tau \mid \pi_\theta)\, \nabla_{\theta} \log P(\tau \mid \pi_\theta)\, R(\tau) \\
&= E_{\tau \sim P(\tau \mid \pi_\theta)}\left[ \nabla_{\theta} \log P(\tau \mid \pi_\theta)\, R(\tau) \right]
\end{aligned}
$$

Because the gradient is now an expectation over trajectories drawn from the policy's own distribution, it can be estimated from rollouts: sample $m$ trajectories $\tau^{(1)}, \ldots, \tau^{(m)}$ under $\pi_\theta$ and average $\nabla_{\theta} \log P(\tau^{(i)} \mid \pi_\theta)\, R(\tau^{(i)})$.
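Below is a minimal sketch of this Monte Carlo gradient estimator, assuming discrete states and actions and a tabular softmax policy $\pi_\theta(a \mid s) \propto \exp(\theta[s, a])$; the trajectory format and the function names (`log_policy_grad`, `policy_gradient_estimate`) are illustrative, not from the text. It relies on the standard fact that the transition dynamics in $P(\tau \mid \pi_\theta)$ do not depend on $\theta$, so $\nabla_\theta \log P(\tau \mid \pi_\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.

```python
import numpy as np

def log_policy_grad(theta, s, a):
    """Gradient of log pi_theta(a|s) for a tabular softmax policy
    pi_theta(a|s) proportional to exp(theta[s, a])."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad = np.zeros_like(theta)
    grad[s] = -probs       # d/d theta[s, a'] log pi = -pi(a'|s) for every a' ...
    grad[s, a] += 1.0      # ... plus 1 for the action actually taken
    return grad

def policy_gradient_estimate(theta, trajectories):
    """Monte Carlo estimate of grad_theta U(theta):
    (1/m) * sum_i grad_theta log P(tau_i | pi_theta) * R(tau_i),
    where each trajectory is a list of (s_t, a_t, r_t) tuples."""
    grad = np.zeros_like(theta)
    for tau in trajectories:
        R_tau = sum(r for _, _, r in tau)                             # R(tau) = sum_t R(s_t, a_t)
        score = sum(log_policy_grad(theta, s, a) for s, a, _ in tau)  # grad log P(tau | pi_theta)
        grad += score * R_tau
    return grad / len(trajectories)

# Illustrative usage: one gradient-ascent step on theta.
theta = np.zeros((4, 2))                     # 4 states, 2 actions (arbitrary sizes)
trajectories = [[(0, 1, 1.0), (2, 0, 0.5)],  # two hand-made rollouts, for shape only
                [(1, 1, 0.0), (3, 0, 1.0)]]
theta += 0.1 * policy_gradient_estimate(theta, trajectories)
```

In practice the sampled trajectories would come from running $\pi_\theta$ in the environment, and $\theta$ is updated by repeatedly ascending in the direction of the estimate.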
