Learning RL (Part 3)

From PPO to GRPO
We talked about PPO in the previous post, now let's first think about how PPO can be applied to finetune an LM. In particular, it optimizes the LM by maximizing the following objective: $$J_{\text{PPO}} (\theta) = E_{[q \sim P(Q), o \sim \pi_{\theta_{\text{old}}}]} \frac{1}{|o|} \sum_{t=1}^{|o|}\min[ \frac{\pi_\theta(o_t | q, o_{< t})}{\pi_{\theta_{\text{old}}} (o_t | q, o_{< t})}A_t, \text{clip}(\frac{\pi_\theta(o_t | q, o_{< t})}{\pi_{\theta_{\text{old}}}(o_t | q, o_{< t})}, 1 - \epsilon, 1 + \epsilon)A_t]$$ where $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$ are the current and old policy models, and $q, o$ are questions and responses sampled from the question dataset and the old policy $\pi_{\theta_{\text{old}}}$. $A_t$ is advantage for the $t$-th token, which is computed via GAE based on the rewards $\{r_{\geq t}\}$ and a learned value function $V_\phi$ (reminder: $\hat{A}^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} \gamma^{l} \delta_{t+l}^V$, where $\delta_t^V = - V(s_t) + r_t + \gamma V(s_{t+1})$). The value function $V$ is trained alongside the policy model. Typically people add a per-token KL penalty from a reference model in the reward at each token: $$r_t = r(q, o_{\leq t}) - \beta \log \frac{\pi_\theta(o_t | q, o_{< t})}{\pi_{ref} (o_t | q, o_{< t})}$$ where $\pi_{ref}$ is the reference model, which is the SFT or base model checkpoint before RL. The raw reward of each token $r(q, o_{\leq t})$ is a design choice, sometimes people only assign an outcome reward to the final token, sometimes people train a reward model to score the intermediate steps. The main complaint that people have about the above PPO objective is that computing the advantage through GAE involves training a value function model, which takes substantial memory and compute. To avoid this, DeepSeek came up with an easier way to compute the advantage, which leads to GRPO.

As a reminder, when we introduced value function back in the previous blog post, the motivation is to use the value function $V$ as a baseline for variance reduction. GRPO achieves this without training a value function. Specially, for each question $q$, we sample a group of outputs $\{o_1, o_2, \dots, o_G\}$ from $\pi_{\theta_{\text{old}}}$, and we get their rewards $\mathbf{r} = \{r_1, r_2, ..., r_G\}$ through either a reward model or rule-based rewards. And then we compute the advantage $\hat{A}_{i,t}$ of each token in each response as: $$\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$ which means we use the group mean as a baseline, and assign the same advantage for every token in the same response. Plug this new advantage formula back into the overall training objective: $$J_{\text{GRPO}} = E_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}]} \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \{ \min [\frac{\pi_\theta(o_{i,t} | q, o_{i, < t})}{\pi_{\theta_{\text{old}}}(o_{i,t}| q, o_{i,< t})} \hat{A}_{i,t}, \text{clip} (\frac{\pi_\theta(o_{i,t} | q, o_{i, < t})}{\pi_{\theta_{\text{old}}}(o_{i,t}| q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon) \hat{A}_{i,t}] - \beta \mathbb{D}_{KL} [\pi_\theta || \pi_{ref}] \}$$ Note that here we are using the reverse KL, and if we use the $k_3$ estimator, this will become: $$\mathbb{D}_{KL} [\pi_\theta || \pi_{ref}] = \frac{\pi_{ref}(o_{i,t} | q, o_{i, < t})}{\pi_{\theta}(o_{i,t}| q, o_{i,< t})} - \log \frac{\pi_{ref}(o_{i,t} | q, o_{i, < t})}{\pi_{\theta}(o_{i,t}| q, o_{i,< t})} - 1$$ Now you might notice a slight mismatch here: when computing $\mathbb{D}_{KL} [\pi_\theta || \pi_{ref}]$, we should be sampling from $\pi_\theta$ while in reality we are sampling from $\pi_{\theta_{\text{old}}}$. This is fine if we are operating in the on-policy setting. If in the off-policy setting, we should multiple the $k_3$ estimator with an importance sampling weight $w_{i,t} = \frac{\pi_\theta(o_{i,t} | q, o_{i,< t})}{\pi_{\text{old}}(o_{i,t} | q, o_{i,< t})}$ (see the RPG paper for more discussion on this).

The gradient of this GRPO objective would be similar to policy gradient, if we ignore the KL term, the gradient is: $$\nabla_{\theta} J_{\text{GRPO}} = E_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}]} $$

Various modifications to GRPO have been explored in follow-up works. For example, GSPO treats each response generation as one action instead of treating each token generation as an action, which resulted in the following objective (KL term omitted): $$J_{\text{GSPO}} = E_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}]} [\frac{1}{G} \sum_{i=1}^{G} \min (s_i(\theta) \hat{A}_i, \text{clip}(s_i(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_i)]$$ where $s_i(\theta) = (\frac{\pi_\theta (o_i | q)}{\pi_{\theta_{\text{old}}}(o_i | q)})^{\frac{1}{|o_i|}} = \exp (\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \log \frac{\pi_\theta(o_{i,t} | q, o_{i, < t})}{\pi_{\theta_{\text{old}}}(o_{i,t}| q, o_{i,< t})})$ defines the importance sampling ratio at the sequence level, and $\hat{A}_i$ represents the reward for the whole response: $\hat{A}_i = \frac{r(x, y_i) - \text{mean}(\{ r(x,y_i) \}_{i=1}^G)}{\text{std}(\{ r(x,y_i) \}_{i=1}^G)}$.

References:

Learning RL (Part 3): RL for Modern Post-Training