
Learning RL (Part 2): From Policy Gradients to PPO

Policy Gradients
Let $\tau$ denote a state-action trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_H, a_H)$. We denote the reward of the trajectory as $R(\tau) = \sum_{t=0}^{H} R(s_t, a_t)$. The expected utility given a policy $\pi_\theta$ is: $$U(\theta) = E [\sum_{t=0}^{H} R(s_t, a_t) \mid \pi_\theta] = \sum_{\tau} P(\tau \mid \pi_\theta) R(\tau)$$ Our goal is to find the policy $\pi_\theta$ that maximizes the expected utility $U(\theta)$: $$\max_{\theta} U(\theta) = \max_{\theta} \sum_{\tau} P(\tau \mid \pi_\theta) R(\tau)$$ Taking the gradient of the expected utility with respect to the policy parameters $\theta$, we get: \[ \begin{align*} \nabla_{\theta} U(\theta) &= \nabla_{\theta} \sum_{\tau} P(\tau \mid \pi_\theta) R(\tau) \\ &= \sum_{\tau} \nabla_{\theta} P(\tau \mid \pi_\theta) R(\tau) \\ &= \sum_{\tau} \frac{P(\tau \mid \pi_\theta)}{P(\tau \mid \pi_\theta)} \nabla_{\theta} P(\tau \mid \pi_\theta) R(\tau) \\ &= \sum_{\tau} P(\tau \mid \pi_\theta) \frac{\nabla_{\theta} P(\tau \mid \pi_\theta)}{P(\tau \mid \pi_\theta)} R(\tau) \\ &= \sum_{\tau} P(\tau \mid \pi_\theta) \nabla_{\theta} \log P(\tau \mid \pi_\theta) R(\tau) \\ &= E_{\tau \sim P(\tau \mid \pi_\theta)}\left[ \nabla_{\theta} \log P(\tau \mid \pi_\theta) R(\tau) \right] \end{align*} \] We can estimate the gradient by sampling trajectories from the policy: $$\nabla_{\theta} U(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P(\tau_i \mid \pi_\theta) R(\tau_i)$$ We can do temporal decomposition on $\nabla_{\theta} \log P(\tau \mid \pi_\theta)$: \[ \begin{align*} \nabla_{\theta} \log P(\tau \mid \pi_\theta) &= \nabla_{\theta} \log [ \prod_{t=0}^{H} P(s_{t+1} \mid s_t, a_t) \pi_\theta(a_t \mid s_t) ] \\ &= \nabla_{\theta} [\sum_{t=0}^{H} \log P(s_{t+1} \mid s_t, a_t) + \sum_{t=0}^{H} \log \pi_\theta(a_t \mid s_t) ]\\ &= \nabla_{\theta} \sum_{t=0}^{H} \log \pi_\theta(a_t \mid s_t) \\ &= \sum_{t=0}^{H} \nabla_{\theta} \log \pi_\theta(a_t \mid s_t) \\ \end{align*} \] The key insight here is that the environment dynamics are independent of the policy, so their gradient with respect to $\theta$ is zero and those terms drop out.
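To make the likelihood-ratio trick concrete, here is a small numerical check (an illustrative sketch, not from the derivation above): for a one-state "bandit" MDP with a softmax policy, the sampled estimate $\frac{1}{N}\sum_i \nabla_{\theta} \log \pi_\theta(a_i) R(a_i)$ should match the exact gradient $\sum_a \pi_\theta(a) \nabla_{\theta} \log \pi_\theta(a) R(a)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# One-state "bandit" MDP: the reward depends only on the sampled action.
rewards = np.array([1.0, 0.0, 2.0])
theta = np.zeros(3)          # softmax policy parameters
p = softmax(theta)

# For a softmax policy, ∇_θ log π_θ(a) = one_hot(a) − p.
N = 100_000
a = rng.choice(3, size=N, p=p)
grad_log_pi = np.eye(3)[a] - p
g_hat = (grad_log_pi * rewards[a][:, None]).mean(axis=0)

# Exact gradient for comparison: Σ_a π(a) ∇ log π(a) R(a).
g_exact = sum(p[i] * (np.eye(3)[i] - p) * rewards[i] for i in range(3))
print(g_hat, g_exact)
```

Both should come out near $(0, -1/3, 1/3)$: the estimator pushes probability away from the worst action and toward the best one.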

Next we will consider baseline subtraction to reduce the variance of the gradient estimate. Suppose we have a baseline $b$: $$\nabla U(\theta) \approx \hat{g} = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P(\tau_i \mid \pi_\theta) (R(\tau_i) - b)$$ We will first show that this is still an unbiased estimate of the gradient: \[ \begin{align*} E[ \nabla_{\theta} \log P(\tau \mid \pi_\theta) b ] &= \sum_{\tau} P(\tau \mid \pi_\theta) \nabla_{\theta} \log P(\tau \mid \pi_\theta) b \\ &= \sum_{\tau} P(\tau \mid \pi_\theta) \frac{\nabla_{\theta} P(\tau \mid \pi_\theta)}{P(\tau \mid \pi_\theta)} b \\ &= \sum_{\tau} \nabla_{\theta} P(\tau \mid \pi_\theta) b \\ &= b \nabla_{\theta} (\sum_{\tau} P(\tau \mid \pi_\theta)) \\ &= b \nabla_{\theta} (1) \\ &= 0 \\ \end{align*} \] This holds as long as $b$ does not depend on $\tau$. Another trick is that we can do temporal decomposition on the reward too: \[ \begin{align*} \hat{g} &= \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P(\tau_i \mid \pi_\theta) (R(\tau_i) - b) \\ &= \frac{1}{N} \sum_{i=1}^{N} (\sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_\theta(a_t^i \mid s_t^i)) (\sum_{t=0}^{H-1} R(s_t^i, a_t^i) - b) \\ &= \frac{1}{N} \sum_{i=1}^{N} (\sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_\theta(a_t^i \mid s_t^i)) ((\sum_{k=0}^{t-1} R(s_k^i, a_k^i)) + (\sum_{k=t}^{H-1} R(s_k^i, a_k^i)) - b) \\ \end{align*} \] Since the past rewards don't depend on the current action $a_t$, we can drop that part without changing the expectation, and obtain our unbiased lower-variance estimator: $$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} (\sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_\theta(a_t^i \mid s_t^i)) (\sum_{k=t}^{H-1} R(s_k^i, a_k^i) - b(s_t^i))$$ where the baseline $b$ can depend on the state $s_t$ (but doesn't have to). In fact, there are a couple of common choices for $b$.
The first choice is to use the average reward across all trajectories: $$b = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)$$ The second is to use the state-dependent expected return (i.e., the value function): $$b(s_t) = E [R_t + R_{t+1} + \ldots + R_{H-1} \mid s_t] = V^\pi(s_t)$$ There are two ways to estimate the value function: Monte Carlo estimation and bootstrapping.
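As a quick illustration (a hypothetical NumPy sketch), the reward-to-go term $\sum_{k=t}^{H-1} R(s_k, a_k)$ can be computed in one backward pass, and a baseline is then subtracted from it; with a single trajectory, the trajectory-average baseline is just $R(\tau)$:

```python
import numpy as np

def reward_to_go(rewards):
    """Σ_{k=t}^{H-1} r_k for every t, computed in one backward pass."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

rewards = np.array([1.0, 0.0, 2.0, 1.0])
rtg = reward_to_go(rewards)   # [4, 3, 3, 1]
b = rewards.sum()             # trajectory-average baseline (N = 1 here)
adv = rtg - b
print(rtg, adv)
```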

Monte Carlo Estimation of V(s)
  1. Initialize $V_{\phi_0}^{\pi}$
  2. Collect trajectories $\tau_1, \ldots, \tau_N$
  3. Regress against empirical returns:
  4. $$\phi_{i+1} \leftarrow \arg \min_{\phi} \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{H-1} (V_{\phi}^{\pi}(s_t^n) - (\sum_{k=t}^{H-1} R(s_k^n, a_k^n)))^2$$
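The regression in steps 3-4 can be sketched as follows. This is a toy example with made-up 1-D state features whose true value is $V(s) = 2s$; for a linear value function $V_\phi(s) = \phi s$, the arg-min has a closed-form least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 1-D state features, with empirical returns that are
# noisy observations of the true value function V(s) = 2s.
states = rng.uniform(-1, 1, size=200)
returns = 2.0 * states + 0.1 * rng.standard_normal(200)

# Regressing V_phi(s) = phi * s onto the empirical returns: the arg-min
# over phi is the ordinary least-squares solution.
phi = (states @ returns) / (states @ states)
print(phi)   # close to the true slope 2.0
```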
Bootstrap Estimation of V(s)
  1. Initialize $V_{\phi_0}^{\pi}$
  2. Collect data $\{s, a, s', r(s, a, s')\}$
  3. V-iteration:
  4. $$\phi_{i+1} \leftarrow \arg \min_{\phi} \sum_{(s, a, s', r)} (r + \gamma V_{\phi_i}^{\pi}(s') - V_{\phi}^{\pi}(s))^2 + \lambda \| \phi - \phi_i \|_2^2$$
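The V-iteration step can be sketched on a toy two-state chain (a made-up example: state 0 transitions to state 1 with reward 1, and state 1 back to state 0 with reward 0). With a tabular $V$, regressing onto the frozen target $r + \gamma V_{\phi_i}(s')$ just sets each entry to its target, and repeating this converges to the chain's true values:

```python
import numpy as np

gamma = 0.9
# Toy chain: (s, s', r) transitions. 0 -> 1 pays 1, 1 -> 0 pays 0.
data = [(0, 1, 1.0), (1, 0, 0.0)]

V = np.zeros(2)
for _ in range(200):
    # One V-iteration: regress V(s) onto the frozen target r + γ V_old(s').
    # For a tabular V, the least-squares solution equals the target itself.
    V_old = V.copy()
    V = np.array([r + gamma * V_old[s2] for s, s2, r in data])
print(V)   # converges to [1/(1-γ²), γ/(1-γ²)]
```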
Putting it all together, we get a version of the vanilla policy gradient algorithm:
Algorithm: Policy Gradient (REINFORCE with baseline)
  1. Initialize policy parameters $\theta$ and baseline $b$
  2. For iteration = $1, 2, \ldots, $ do:
    • Collect trajectories by executing the current policy
    • At each timestep in each trajectory, compute:
      • $R_t = \sum_{k=t}^{H-1} \gamma^{k-t} R(s_k, a_k)$
      • $\hat{A}_t = R_t - b(s_t)$
    • Re-fit the baseline by minimizing $\sum_{i} \sum_{t} \| b(s_t) - R_t^i \|^2$ over all trajectories and timesteps in the batch (Monte Carlo Estimation)
    • Update the policy using the policy gradient estimate:
      • $\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_\theta(a_t^i \mid s_t^i) \hat{A}_t^i$
      • $\theta \leftarrow \theta + \alpha \hat{g} \quad$ (plug $\hat{g}$ into SGD or Adam)
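The loop above can be sketched end-to-end on a hypothetical two-armed bandit (so $H = 1$: the return is just the immediate reward, and the baseline is the batch average). This is a minimal illustration, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy 2-armed bandit: action 1 pays reward 1.0, action 0 pays 0.0.
theta = np.zeros(2)
alpha, N = 0.5, 64
for it in range(200):
    p = softmax(theta)
    actions = rng.choice(2, size=N, p=p)       # collect "trajectories"
    returns = (actions == 1).astype(float)     # H = 1: return = reward
    b = returns.mean()                         # batch-average baseline
    glog = np.eye(2)[actions] - p              # ∇ log π(a) per sample
    g = (glog * (returns - b)[:, None]).mean(axis=0)
    theta += alpha * g                         # gradient ascent step
print(softmax(theta))   # the policy should strongly prefer action 1
```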

Advantage Estimation
Looking back at our policy gradient estimate: $$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_\theta(a_t^i \mid s_t^i) (\sum_{k=t}^{H-1} R(s_k^i, a_k^i) - V^\pi(s_t^i))$$ The term $\sum_{k=t}^{H-1} R(s_k^i, a_k^i)$ is the single-sample estimate of the Q-value: $Q^\pi(s_t^i, a_t^i)$ (ignoring the discount factors). This is the pure Monte Carlo way of estimating the Q-value. It is unbiased but has high variance, and requires waiting until the end of the trajectory. Alternatively, we can also use Bootstrapping to estimate the Q-value, and let's write it out incorporating the discount factor $\gamma$ too: \begin{align*} Q^{\pi}(s, a) &= E[r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots \mid s_0=s, a_0=a] \quad (\text{Monte Carlo}) \\ &= E[r_0 + \gamma V^\pi(s_1) \mid s_0=s, a_0=a] \\ &= E[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2) \mid s_0=s, a_0=a] \\ &= E[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V^\pi(s_3) \mid s_0=s, a_0=a] \\ &= \ldots \\ \end{align*}
If we use the 1-step bootstrap estimator $\hat{Q}(s_t, a_t) = r_t + \gamma V_{\phi}(s_{t+1})$, then our advantage estimate becomes $\hat{A}(s_t, a_t) = \hat{Q}(s_t, a_t) - V_{\phi}(s_t)$. Plugging this into the policy gradient algorithm, we get the following algorithm, known as Actor-Critic:

Algorithm: Actor-Critic
  1. Initialize policy parameters $\theta$ (actor) and value parameters $\phi$ (critic)
  2. For iteration = $1, 2, \ldots,$ do:
    • Collect trajectories by executing the current policy $\pi_\theta$
    • For each timestep in each trajectory, compute the TD target and TD error:
      • $y_t = r_t + \gamma V_\phi(s_{t+1})$
      • $\delta_t = y_t - V_\phi(s_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$
    • Update the critic by regressing onto the TD target (in practice, take a gradient step on the loss):
      • $\phi \leftarrow \arg \min_{\phi} \sum_i \sum_t \left( V_\phi(s_t^i) - \left(r_t^i + \gamma V_{\phi_{\text{old}}}(s_{t+1}^i)\right) \right)^2$
    • Update the actor using the policy gradient estimate with the critic's TD error:
      • $\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_\theta(a_t^i \mid s_t^i)\, \delta_t^i$
      • $\theta \leftarrow \theta + \alpha \hat{g} \quad$ (plug $\hat{g}$ into SGD or Adam)
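A tabular sketch of this loop on a made-up two-state MDP (action $a$ jumps to state $a$; landing in state 1 pays reward 1, so always picking $a = 1$ is optimal). The critic is a lookup table updated by a step on the TD loss, and the actor is a per-state softmax updated with $\nabla_\theta \log \pi_\theta(a \mid s)\,\delta$:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

gamma, alpha_pi, alpha_v = 0.9, 0.2, 0.1
theta = np.zeros((2, 2))   # actor: per-state softmax logits
V = np.zeros(2)            # critic: tabular state values

s = 0
for step in range(10_000):
    p = softmax(theta[s])
    a = int(rng.choice(2, p=p))
    s_next, r = a, float(a == 1)           # action a jumps to state a
    delta = r + gamma * V[s_next] - V[s]   # TD error
    V[s] += alpha_v * delta                # critic: step on the TD loss
    theta[s] += alpha_pi * (np.eye(2)[a] - p) * delta   # actor step
    s = s_next
print(softmax(theta[1]), V[1])   # prefers a=1; V[1] ≈ 1/(1-γ) = 10
```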
Note that here we are no longer using the sampled return minus the baseline to estimate the advantage, but rather the TD error (TD: temporal difference). While pure Monte Carlo is unbiased but has high variance, 1-step TD is biased but has lower variance. We can often interpolate between these two. One way to do so is to take the n-step return: $\hat{Q}(s_t, a_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_{\phi}(s_{t+n})$, as used in A3C. Another way to do so is through Generalized Advantage Estimation (GAE), where instead of taking the n-step TD estimator for a particular value of $n$, we take a weighted average of all the n-step TD estimators. Specifically, if we write out all the n-step estimators as: \begin{align*} \hat{A}_t^{(1)} \;\; &:=\;\; \delta_t^V &&= -V(s_t) + r_t + \gamma V(s_{t+1}) \\ \hat{A}_t^{(2)} \;\; &:=\;\; \delta_t^V + \gamma \delta_{t+1}^V &&= -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) \\ \hat{A}_t^{(3)} \;\; &:=\;\; \delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V &&= -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3}) \\ \vdots \\ \hat{A}_t^{(\infty)} \;\; &:=\;\; \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V &&= -V(s_t) + \sum_{l=0}^{\infty} \gamma^l r_{t+l} \quad (\text{Monte Carlo}) \\ \end{align*} Then the GAE is defined as the exponentially-weighted average of all these n-step estimators: \begin{align*} \hat{A}_t^{\text{GAE}(\gamma, \lambda)} &= (1 - \lambda) (\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \ldots) \\ &= (1 - \lambda) (\delta_t^V + \lambda (\delta_t^V + \gamma \delta_{t+1}^V) + \lambda^2 (\delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V) + \ldots) \\ &= \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^V \\ \end{align*} When $\lambda = 1$, the GAE is equivalent to Monte Carlo.
When $\lambda = 0$, the GAE is equivalent to the 1-step TD: \begin{align*} \text{GAE}(\gamma, 0) : \hat{A}_t \;\; &:=\;\; \delta_t &&= -V(s_t) + r_t + \gamma V(s_{t+1}) \\ \text{GAE}(\gamma, 1) : \hat{A}_t \;\; &:=\;\; \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V &&= -V(s_t) + \sum_{l=0}^{\infty} \gamma^l r_{t+l} \\ \end{align*}
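The sum $\sum_{l} (\gamma\lambda)^l \delta_{t+l}^V$ satisfies the recursion $\hat{A}_t = \delta_t^V + \gamma\lambda \hat{A}_{t+1}$, which gives a one-pass implementation. A sketch (assuming a finite trajectory, with `values` holding one extra bootstrap entry for the final state):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one finite trajectory via A_t = δ_t + γλ A_{t+1}.
    `values` has length len(rewards)+1 (bootstrap value for the last state)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

r = np.array([1.0, 1.0, 1.0])
v = np.array([0.5, 0.5, 0.5, 0.0])
print(gae(r, v, gamma=1.0, lam=0.0))   # λ=0: just the 1-step TD errors
print(gae(r, v, gamma=1.0, lam=1.0))   # λ=1: Monte Carlo return minus V(s_t)
```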

TRPO & PPO
Let's first introduce importance sampling. Suppose we want to estimate the expectation of a function $f(x)$ under a distribution $p(x)$, but we can only sample from a different distribution $q(x)$. We can use importance sampling to reweight the samples: $$E_{x \sim p(x)}[f(x)] = E_{x \sim q(x)}[\frac{p(x)}{q(x)} f(x) ]$$ where the term $\frac{p(x)}{q(x)}$ is the importance weight. Now let's denote the policy that we use to collect data as $\pi_{\theta_{\text{old}}}$ and the policy that we want to update as $\pi_\theta$ (before any gradient updates, these two are the same). We sample trajectories $\tau$ from the old policy and obtain their returns, and we want to use this data to estimate the expected return of the new policy. We can do so via importance sampling: $$U(\theta) = E_{\tau \sim \pi_{\theta_{\text{old}}}}[\frac{P(\tau \mid \pi_\theta)}{P(\tau \mid \pi_{\theta_{\text{old}}})} R(\tau)]$$ If we take its gradient: \begin{align*} \nabla_{\theta} U(\theta) &= E_{\tau \sim \pi_{\theta_{\text{old}}}}[\frac{\nabla_{\theta} P(\tau \mid \pi_\theta)}{P(\tau \mid \pi_{\theta_{\text{old}}})} R(\tau)] \\ \nabla_{\theta} U(\theta) \mid_{\theta = \theta_{\text{old}}} &= E_{\tau \sim \pi_{\theta_{\text{old}}}}[\frac{\nabla_{\theta} P(\tau \mid \pi_\theta) \mid_{\theta = \theta_{\text{old}}}}{P(\tau \mid \pi_{\theta_{\text{old}}})} R(\tau)] \\ &= E_{\tau \sim \pi_{\theta_{\text{old}}}} [\nabla_{\theta} \log P(\tau \mid \pi_\theta)\mid_{\theta = \theta_{\text{old}}} R(\tau)]\\ \end{align*} This is telling us: when evaluated at $\theta_{\text{old}}$, the gradient of our objective $U(\theta)$ is exactly the policy gradient form we've seen before.
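Here is a tiny numerical sanity check of importance sampling (with made-up discrete distributions, not from the derivation above): sampling from a uniform $q$ but reweighting by $p(x)/q(x)$ recovers the expectation under $p$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution p, proposal distribution q, and a function f.
p = np.array([0.7, 0.2, 0.1])
q = np.array([1/3, 1/3, 1/3])
f = np.array([1.0, 2.0, 3.0])

# Sample from q, reweight each sample by the importance weight p(x)/q(x).
x = rng.choice(3, size=100_000, p=q)
is_estimate = np.mean((p[x] / q[x]) * f[x])

exact = np.dot(p, f)   # E_{x~p}[f(x)] computed directly
print(is_estimate, exact)   # both ≈ 1.4
```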
Now we introduce the surrogate objective of TRPO: $$\max_{\pi} L(\pi) = E_{s,a \sim \pi_{\text{old}}} \left[\frac{\pi(a | s)}{\pi_{\text{old}}(a | s)} A^{\pi_{\text{old}}}(s,a)\right]$$ $$\text{s.t.} \;\; E_{\pi_{\text{old}}} [KL(\pi(\cdot|s) || \pi_{\text{old}}(\cdot | s) )] \leq \delta $$ This surrogate is a local first-order approximation of the true return $U(\theta)$ around the old policy (note that here we are reweighting the action choice at a given old-policy state, rather than reweighting the whole trajectories). The full TRPO algorithm looks like:

Algorithm: Trust Region Policy Optimization
  1. For each iteration $k = 1, 2, \ldots$ do:
    • Run policy for T timesteps or N trajectories
    • Estimate advantage function at all timesteps
    • Compute policy gradient $g$
    • Use conjugate gradient (with Hessian-vector products) to compute $F^{-1}g$
    • Do line search on surrogate loss and KL constraint
  2. End for
In fact, we don't really need to worry about all the conjugate gradient and line search machinery, because we can massively simplify this algorithm by replacing the hard KL constraint with a soft KL penalty directly in the objective: $$\max_\theta \hat{\mathbb{E}}_t [\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t - \beta KL[\pi_{\theta}(\cdot | s_t) \,\|\, \pi_{\theta_{\text{old}}}(\cdot | s_t)]]$$ Below is the PPO algorithm with adaptive KL penalty.
Algorithm: PPO with adaptive KL penalty
  1. Initialize policy parameters $\theta_0$, initial KL penalty $\beta_0$, target KL divergence $\delta$
  2. For each iteration $k = 0, 1, 2, \ldots$ do:
    • Collect a set of (partial) trajectories $\mathcal{D_k}$ using current policy $\pi_{\theta_k}$
    • Estimate advantages $\hat{A}_t^{\pi_{\theta_k}}$ using any advantage estimation algorithm
    • Compute the policy update by taking K steps of minibatch SGD (via Adam) on: $$ \theta_{k+1} \leftarrow \arg \max_{\theta} \frac{1}{|\mathcal{D}_k|}\sum_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} \hat{A}_t - \beta_k \, \bar{D}_{\text{KL}}\left[\pi_{\theta}(\cdot|s_t) \,\|\, \pi_{\theta_k}(\cdot|s_t)\right] \right] $$
    • if $\bar{D}_{\text{KL}}(\theta_{k+1} || \theta_{k}) \geq 1.5 \delta$ then:
      • $\beta_{k+1} = 2 \beta_k$
    • else if $\bar{D}_{\text{KL}}(\theta_{k+1} || \theta_{k}) \leq \delta / 1.5$ then:
      • $\beta_{k+1} = \beta_{k} / 2$
  3. End for
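The $\beta$ adaptation in the last two branches is easy to state in code; this is a direct transcription of the rule above:

```python
def update_beta(beta, kl, target_kl):
    """Adaptive KL penalty coefficient: double beta when the measured KL
    overshoots the target by 1.5x, halve it when it undershoots by 1.5x."""
    if kl >= 1.5 * target_kl:
        return 2.0 * beta
    if kl <= target_kl / 1.5:
        return beta / 2.0
    return beta

print(update_beta(1.0, 0.05, 0.02))  # KL too large  -> 2.0
print(update_beta(1.0, 0.01, 0.02))  # KL too small  -> 0.5
print(update_beta(1.0, 0.02, 0.02))  # within range  -> 1.0
```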
Note that we are using reverse KL throughout the derivation to be consistent with the modern standard, even though the original TRPO and PPO papers used forward KL. We will elaborate more on KL later. Instead of incorporating KL in the objective and directly maximizing it, an alternative approach is to use the clipped objective. Let $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_k}(a_t | s_t)}$, then: $$L_{\theta_k}^{CLIP} = \hat{\mathbb{E}}_t [\min (r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t)]$$ where $\epsilon$ is a hyper-parameter (usually $\epsilon=0.2$), and the policy update would be to do gradient ascent on this clipped objective. To reason about this clipping intuitively, suppose $A > 0$, meaning the advantage for the state-action is positive; then the optimization wants to make $\pi_\theta(a | s)$ larger, but there's no additional benefit to increasing it beyond $(1+\epsilon)$ because L will be bounded by $\min (\frac{\pi_\theta(a | s)}{\pi_{\theta_k}(a | s)}, (1+\epsilon)) A$. Similarly, when $A \leq 0$, the optimization wants to make $\pi_\theta(a | s)$ smaller, but L will be bounded by $\max (\frac{\pi_\theta(a | s)}{\pi_{\theta_k}(a | s)}, (1-\epsilon)) A$. The figure below illustrates this idea (left: when $A > 0$; right: when $A < 0$).
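The clipped surrogate is a one-liner; the example below (with made-up ratios and advantages, $\epsilon = 0.2$) shows the bounding behavior just described:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r·A, clip(r, 1−ε, 1+ε)·A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# A > 0: once the ratio exceeds 1+ε, the objective stops increasing.
print(ppo_clip_objective(np.array([1.0, 1.5]), np.array([2.0, 2.0])))
# A < 0: once the ratio drops below 1−ε, the objective stops changing.
print(ppo_clip_objective(np.array([1.0, 0.5]), np.array([-2.0, -2.0])))
```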

[Figure: clipping diagram (left: $A > 0$; right: $A < 0$)]

KL Estimator
Let's first discuss forward and reverse KL.

  • Forward KL: $D_{KL} (P || Q) = E_{x \sim P} [\log \frac{P(x)}{Q(x)}]$ where you sample from the true distribution $P$ and see how bad $Q$ (e.g., your model distribution) is at predicting those samples. It penalizes $Q$ heavily when $Q(x)$ is small in regions where $P(x)$ is large, so $Q$ learns to be mode-covering.
  • Reverse KL: $D_{KL} (Q || P) = E_{x \sim Q} [\log \frac{Q(x)}{P(x)}]$ where you sample from your model $Q$ and measure how much those samples disagree with the true $P$. It penalizes $Q$ when it puts probability mass where $P(x)$ is small. Intuitively, it's ok for $Q$ to ignore some parts of $P$ entirely, as long as the areas it does cover have high $P(x)$, so $Q$ learns to be mode-seeking.
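A small numeric illustration (made-up discrete distributions): for a bimodal $P$ and a unimodal $Q$ concentrated on one of $P$'s modes, the forward KL is much larger than the reverse KL, reflecting the mode-covering vs. mode-seeking asymmetry:

```python
import numpy as np

def kl(p, q):
    """Discrete KL(p || q) = Σ p(x) log(p(x)/q(x))."""
    return float(np.sum(p * np.log(p / q)))

# Bimodal "true" P; unimodal Q sitting on only one of P's modes.
P = np.array([0.49, 0.02, 0.49])
Q = np.array([0.98, 0.01, 0.01])

print(kl(P, Q))  # forward KL: large, since Q misses P's second mode
print(kl(Q, P))  # reverse KL: smaller, since Q only covers high-P regions
```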

Next, let's talk about the different estimators for approximating KL through Monte Carlo. Let's say we want to estimate $KL[q,p]$ and we denote $r = \frac{p(x)}{q(x)}$. There are three options (summarizing from John Schulman's blog post on approximating KL divergence):
  • $k_1$ estimator: $k_1 = \log \frac{q(x)}{p(x)} = - \log r$. It is unbiased but has high variance: it's negative for half of the samples whereas KL is always non-negative.
  • $k_2$ estimator: $k_2 = \frac{1}{2} (\log \frac{p(x)}{q(x)})^2 = \frac{1}{2} (\log r)^2$. Intuitively, $k_2$ seems to be better because each sample tells you how far apart $p$ and $q$ are, and it's always non-negative. Empirically, $k_2$ gives much lower variance than $k_1$, while also coming with low bias.
  • $k_3$ estimator: $k_3 = (r - 1) - \log r$. To see how we derived this: the motivation is to introduce a control variate in $k_1$ to reduce variance, and we notice that $\frac{p(x)}{q(x)} - 1 = r - 1$ has zero expectation, so for any $\lambda$, $- \log r + \lambda (r - 1)$ is an unbiased estimator of $KL[q, p]$. Then, notice that if we pick $\lambda = 1$, $(r - 1)$ is exactly the tangent line of $\log(r)$ at $r = 1$; we have $\log r \leq r - 1$ for all $r > 0$, and the expression $(r - 1) - \log r$ measures the vertical distance between the tangent line and $\log(r)$, so it is always non-negative. Empirically, the $k_3$ estimator is not only unbiased but also has low variance.
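The three estimators can be compared empirically in a Gaussian setup similar to the one in Schulman's post (here $q = N(0,1)$ and $p = N(\mu, 1)$ with $\mu = 0.5$, for which the true $KL[q, p] = \mu^2/2$):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 0.5
x = rng.standard_normal(500_000)           # samples from q = N(0, 1)
# log-ratio log p(x)/q(x) for p = N(mu, 1), q = N(0, 1)
logr = -0.5 * (x - mu) ** 2 + 0.5 * x ** 2
r = np.exp(logr)

true_kl = mu ** 2 / 2                       # KL[q, p] = μ²/2 here

k1 = -logr                 # unbiased, high variance
k2 = 0.5 * logr ** 2       # low variance, slightly biased
k3 = (r - 1) - logr        # unbiased AND low variance
for name, k in [("k1", k1), ("k2", k2), ("k3", k3)]:
    print(name, k.mean(), k.std())
print("true KL:", true_kl)
```

The means of $k_1$ and $k_3$ both match the true KL, while $k_3$'s per-sample standard deviation is far smaller than $k_1$'s.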

