Scaling Online RLVR Done Right with Decoupled Generation & Optimization

Reinforcement Learning with Verifiable Rewards (RLVR) optimizes large language models on tasks with objective correctness criteria by directly leveraging deterministic reward signals rather than learned preferences. While theoretically principled, online RLVR remains computationally prohibitive due to tight coupling of generation and optimization, which inflates memory and severely limits training throughput. We prove this gap is architectural, not fundamental. Online RLVR can be reformulated exactly as offline supervised fine-tuning with importance-weighted samples. We introduce Decoupled Generation & Optimization (DGO), a two-phase paradigm that separates generation from optimization, reducing peak memory by ~18-31% and training time by ~75-85% while enabling multi-epoch training. Our framework unifies existing offline methods, exposes systematic theory-practice mismatches, and establishes DGO as the first method where theoretical optimal weights align perfectly with implementation. We show scaling online RLVR is achievable when done right, through principled decoupling and theoretically-grounded design.

Introduction

Large language models (LLMs) are typically trained in two stages: pre-training on massive corpora to learn general language understanding, followed by fine-tuning to align the model with specific tasks or outcomes. Fine-tuning can be broadly categorized into supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), each with distinct objectives and optimization strategies. RLVR is particularly powerful for tasks where correctness can be objectively verified, such as mathematical problem-solving, code generation, and logical reasoning, where the reward signal comes from deterministic verification rather than learned preference models.

Supervised Fine-tuning

The first approach, supervised fine-tuning (SFT), constitutes a fundamental paradigm for adapting pre-trained language models to downstream tasks through maximum likelihood estimation over curated datasets. Given a pre-collected dataset \(\mathcal{D}\) comprising prompt-response pairs \((x, y)\), the standard SFT objective seeks to minimize the negative log-likelihood of target sequences under the parameterized policy \(\pi_{\theta}\). Leveraging the autoregressive factorization inherent to transformer-based language models, this objective decomposes into a sum of token-level cross-entropy losses, where each token \(y_l\) is predicted conditioned on the preceding context \(y_{< l}\) and the input prompt \(x\). This formulation can be viewed as a special case of behavioral cloning from imitation learning, where the model learns to replicate expert demonstrations encoded in the training corpus. Formally, the objective is:

\[\begin{aligned} \min_{\theta}\mathcal{J}_{\mathrm{SFT}}(\theta) &\triangleq \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[-\log\pi_{\theta}(y \mid x)\right]\\ &=\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\sum_{l=1}^L-\log\pi_{\theta}(y_l\mid y_{< l},x)\right]. \end{aligned}\]
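Concretely, the token-level decomposition can be sketched with a toy example (illustrative numbers only, not an actual model):

```python
import math

def sft_loss(log_probs):
    """Negative log-likelihood of one response, summed over tokens.

    `log_probs[l]` is log pi_theta(y_l | y_<l, x) for token l, so the
    sequence NLL is the negated sum, mirroring the token-level
    cross-entropy decomposition of the SFT objective.
    """
    return -sum(log_probs)

# Toy 3-token response where the model assigns the tokens probabilities
# 0.5, 0.8, and 0.9 (hypothetical numbers).
token_probs = [0.5, 0.8, 0.9]
loss = sft_loss([math.log(p) for p in token_probs])

# The sequence-level and token-level views agree:
seq_prob = 0.5 * 0.8 * 0.9
assert abs(loss - (-math.log(seq_prob))) < 1e-12
```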

Reinforcement Learning with Verifiable Rewards

In contrast to the imitation-based approach of SFT, RLVR shifts the objective from imitating fixed demonstrations to optimizing a verifiable reward function \(r(x, y)\) that objectively measures correctness or task success. Unlike learned reward models that approximate human preferences, verifiable rewards provide ground-truth signals—such as whether a mathematical solution is correct, code passes unit tests, or a logical proof is valid. The KL-regularized RLVR objective balances maximizing expected reward against staying close to a reference policy \(\pi_{\mathrm{ref}}\), preventing the model from deviating too far and producing degenerate outputs. Formally, the objective is:

\[\max_{\theta} \mathcal{J}_{\mathrm{RL}}(\theta) \triangleq \mathbb{E}_{x\sim \mathcal{X}, y \sim \pi_{\theta}(\cdot \mid x)} [r(x, y)] - \beta \cdot \mathrm{KL}(\pi_\theta(\cdot \mid x) \parallel \pi_{\mathrm{ref}}(\cdot \mid x))\]

where \(\beta > 0\) is a temperature parameter that controls the trade-off between reward maximization and policy regularization, and \(r(x, y)\) is a verifiable reward that can be computed deterministically (e.g., \(r(x,y) = \mathbb{1}[\mathrm{answer}(y) = \mathrm{ground\_truth}(x)]\) for mathematical reasoning).
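As a concrete illustration, such a verifiable reward can be implemented as an exact-match check; the `#### <answer>` convention below is an assumption borrowed from GSM8K-style formatting, and real verifiers vary by task:

```python
import re

def extract_answer(response):
    """Pull the final answer out of a model response.

    Assumes (for illustration) answers are written as '#### <value>';
    returns None when no answer is found.
    """
    m = re.search(r"####\s*(-?[\d,\.]+)", response)
    return m.group(1).replace(",", "") if m else None

def verifiable_reward(response, ground_truth):
    """r(x, y) = 1[answer(y) == ground_truth(x)], computed deterministically."""
    return 1.0 if extract_answer(response) == ground_truth else 0.0

assert verifiable_reward("... so the total is #### 42", "42") == 1.0
assert verifiable_reward("... hence #### 41", "42") == 0.0
```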

Drawbacks of online RLVR. While the KL-regularized RLVR objective is theoretically elegant and eliminates reward hacking concerns through verifiable signals, its practical implementation faces critical challenges that hinder scalability. Specifically, online RLVR methods like PPO and GRPO tightly couple sample generation with policy optimization, requiring simultaneous loading of policy and reference models, discarding samples after minimal reuse, and performing gradient updates in lockstep with generation. This coupling not only slows training but also inflates memory consumption, severely limiting the scale of models and batch sizes that can be trained. In addition, importance sampling from an evolving policy \(\pi_{\theta}\) becomes compute-inefficient for off-policy updates: as the policy shifts during training, earlier samples drift off-policy, leading to high-variance gradient estimates and poor sample efficiency. These bottlenecks motivate a reformulation that decouples generation from optimization while maintaining theoretical soundness.

Closed-form optimal policy. To address these challenges, we begin by analyzing the theoretical solution to the KL-regularized RLVR objective. It turns out that the optimal policy \(\pi^*(y \mid x)\) has a closed-form solution that reweights the reference policy by the exponentiated reward, normalized by the partition function \(Z(x)\):

\[\pi^*(y \mid x) = \frac{1}{Z(x)}\pi_{\mathrm{ref}}(y\mid x)\exp(r(x, y) / \beta)\]

where \(Z(x) = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\left[\exp(r(x, y) / \beta)\right]=\sum_{y}\pi_{\mathrm{ref}}(y\mid x)\exp(r(x, y) / \beta)\) is the partition function ensuring \(\sum_y\pi^*(y \mid x)=1\).

Proof of Closed-Form Optimal Policy. We want to find the policy $\pi_{\theta}$ that maximizes the KL-regularized RL objective:

$$ \max_{\theta} \mathcal{J}_{\mathrm{RL}}(\theta) = \mathbb{E}_{x\sim \mathcal{X}, y \sim \pi_{\theta}(\cdot \mid x)} [r(x, y)] - \beta \cdot \mathrm{KL}(\pi_\theta(\cdot \mid x) \parallel \pi_{\mathrm{ref}}(\cdot \mid x)) $$

Step 1: Expand the KL divergence term. The KL divergence can be written as:

$$ \mathrm{KL}(\pi_\theta(\cdot \mid x) \parallel \pi_{\mathrm{ref}}(\cdot \mid x)) = \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\left[\log \pi_{\theta}(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right] $$

Substituting this into the objective:

$$ \begin{aligned} \mathcal{J}_{\mathrm{RL}}(\theta) &= \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} [r(x, y)] - \beta \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\left[\log \pi_{\theta}(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right] \\ &= \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} \left[r(x, y) - \beta \log \pi_{\theta}(y \mid x) + \beta \log \pi_{\mathrm{ref}}(y \mid x)\right] \end{aligned} $$

Step 2: Rewrite the expectation as a summation. Since $\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}[f(y)] = \sum_y \pi_{\theta}(y \mid x) f(y)$, we can write:

$$ \mathcal{J}_{\mathrm{RL}}(\theta) = \sum_y \pi_{\theta}(y \mid x) \left[r(x, y) + \beta \log \pi_{\mathrm{ref}}(y \mid x) - \beta \log \pi_{\theta}(y \mid x)\right] $$

Step 3: Solve the constrained optimization. To find the optimal policy, we use the method of Lagrange multipliers with the constraint $\sum_y \pi_{\theta}(y \mid x) = 1$. The Lagrangian is:

$$ \mathcal{L} = \sum_y \pi_{\theta}(y \mid x) \left[r(x, y) + \beta \log \pi_{\mathrm{ref}}(y \mid x) - \beta \log \pi_{\theta}(y \mid x)\right] - \lambda\left(\sum_y \pi_{\theta}(y \mid x) - 1\right) $$

Taking the derivative with respect to $\pi_{\theta}(y \mid x)$ and setting it to zero:

$$ \frac{\partial \mathcal{L}}{\partial \pi_{\theta}(y \mid x)} = r(x, y) + \beta \log \pi_{\mathrm{ref}}(y \mid x) - \beta \log \pi_{\theta}(y \mid x) - \beta - \lambda = 0 $$

Solving for $\pi_{\theta}(y \mid x)$:

$$ \begin{aligned} \beta \log \pi_{\theta}(y \mid x) &= r(x, y) + \beta \log \pi_{\mathrm{ref}}(y \mid x) - \beta - \lambda \\ \log \pi_{\theta}(y \mid x) &= \frac{r(x, y)}{\beta} + \log \pi_{\mathrm{ref}}(y \mid x) - 1 - \frac{\lambda}{\beta} \\ \pi_{\theta}(y \mid x) &= \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{r(x, y)}{\beta} - 1 - \frac{\lambda}{\beta}\right) \end{aligned} $$

Step 4: Determine the normalization constant. Using the constraint $\sum_y \pi_{\theta}(y \mid x) = 1$:

$$ \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{r(x, y)}{\beta} - 1 - \frac{\lambda}{\beta}\right) = 1 $$

Let $C = \exp\left(-1 - \frac{\lambda}{\beta}\right)$. Then:

$$ C \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{r(x, y)}{\beta}\right) = 1 $$

Therefore:

$$ C = \frac{1}{\sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{r(x, y)}{\beta}\right)} = \frac{1}{Z(x)} $$

where $Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{r(x, y)}{\beta}\right)$ is the partition function.

Step 5: Final optimal policy. Substituting back, we obtain the optimal policy:

$$ \pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{r(x, y)}{\beta}\right) $$

This completes the proof. $\square$
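The closed-form solution is easy to sanity-check on a toy discrete support (hypothetical probabilities and rewards):

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Closed-form pi*(y|x) on a small discrete support.

    pi_ref: dict y -> reference probability; rewards: dict y -> r(x, y).
    Reweights pi_ref by exp(r/beta) and normalizes by the partition
    function Z(x).
    """
    unnorm = {y: p * math.exp(rewards[y] / beta) for y, p in pi_ref.items()}
    Z = sum(unnorm.values())
    return {y: v / Z for y, v in unnorm.items()}, Z

# Toy prompt with three candidate responses (hypothetical numbers):
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
rewards = {"a": 0.0, "b": 1.0, "c": 1.0}   # binary verifiable reward
pi_star, Z = optimal_policy(pi_ref, rewards, beta=1.0)

assert abs(sum(pi_star.values()) - 1.0) < 1e-12
# Correct responses gain mass relative to pi_ref, incorrect ones lose it.
assert pi_star["b"] > 0.3 and pi_star["a"] < 0.5
```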

This theoretical result suggests a natural question: Can we directly learn the optimal policy through offline supervised learning? The answer is yes, and this insight forms the foundation of our Decoupled Generation & Optimization (DGO) paradigm.


From Online RLVR to Offline SFT: A Theoretical Bridge

The key insight is that minimizing the Kullback-Leibler (KL) divergence from the optimal policy is equivalent to the original RLVR objective. We establish this through two complementary perspectives: reverse KL and forward KL. While some of the underlying theoretical connections have been explored in prior work, e.g., the forward KL formulation in RAML and the weighted policy learning framework in AWR, our contribution lies in providing a more comprehensive theoretical framework to reveal how these principles unlock scalable online RLVR for LLMs. Crucially, the verifiable nature of RLVR rewards means we can compute exact reward values without approximation error, unlike learned reward models.

Reverse KL: Equivalence to Original RLVR Objective

The reverse KL perspective $\mathrm{KL}(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x))$ in fact has a deep connection to the original RLVR objective. Expanding the reverse KL:

\[\begin{aligned} \mathrm{KL}\left(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x)\right) &= \mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }\left[\log \pi_{\theta}(y \mid x) - \log \pi^*(y \mid x)\right] \\ &= \mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }\left[\log \pi_{\theta}(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) - \tfrac{1}{\beta} r(x,y) + \log Z(x)\right] \\ &= \mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }\left[\log \pi_{\theta}(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right] - \tfrac{1}{\beta}\mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }[r(x,y)] + \log Z(x) \\ &= \mathrm{KL}(\pi_{\theta}(\cdot \mid x)\parallel\pi_{\mathrm{ref}}(\cdot \mid x)) - \tfrac{1}{\beta}\mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }[r(x,y)] + \log Z(x). \end{aligned}\]

Rearranging and multiplying by $-\beta$:

\[\begin{aligned} \mathcal{J}_{\mathrm{RL}}(\theta) &= \mathbb{E}_{x \sim \mathcal{X}}\left[\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}[r(x,y)] - \beta\,\mathrm{KL}(\pi_{\theta}(\cdot \mid x)\parallel\pi_{\mathrm{ref}}(\cdot \mid x))\right] \\ &= \mathbb{E}_{x \sim \mathcal{X}}\left[\beta\log Z(x)\right] - \beta\,\mathbb{E}_{x \sim \mathcal{X}}\left[\mathrm{KL}\left(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x)\right)\right]. \end{aligned}\]

Since $\log Z(x)$ is independent of $\theta$, maximizing the original RLVR objective is equivalent to minimizing the reverse KL to the optimal policy, i.e.,

\[\max_{\theta} \, \mathcal{J}_{\mathrm{RL}}(\theta) \iff \min_{\theta} \, \mathbb{E}_{x \sim \mathcal{X}}\left[\mathrm{KL}(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x))\right].\]

While the reverse KL perspective provides an exact equivalence to the original RL objective, it still requires sampling from the current policy $\pi_{\theta}$ during optimization. This means we must perform online rollouts at each training step, which remains computationally expensive and memory-intensive. This limitation motivates us to consider the forward KL perspective, which enables a fully offline approach by sampling from a fixed reference policy $\pi_{\mathrm{ref}}$ instead.

Forward KL: Reformulation to Weighted Supervised Learning

Starting from the forward KL divergence \(\mathrm{KL}(\pi^*(\cdot\mid x) \parallel \pi_{\theta}(\cdot \mid x))\), we can derive an equivalent weighted SFT objective in an offline manner. The forward KL measures how well our learned policy $\pi_{\theta}$ approximates the optimal policy \(\pi^*\):

\[\min_{\theta} \mathbb{E}_{x \sim \mathcal{X}}\left[\mathrm{KL}(\pi^*(\cdot\mid x) \parallel \pi_{\theta}(\cdot \mid x))\right].\]

Expanding the KL divergence:

\[\begin{aligned} \mathrm{KL}(\pi^*(\cdot\mid x) \parallel \pi_{\theta}(\cdot\mid x)) &= \mathbb{E}_{y \sim \pi^*(\cdot \mid x)}\left[\log \pi^*(y\mid x) - \log\pi_{\theta}(y\mid x)\right] \\[8pt] &= -H(\pi^*(\cdot \mid x)) - \mathbb{E}_{y \sim \pi^*(\cdot\mid x) }\left[\log\pi_{\theta}(y\mid x)\right]. \end{aligned}\]

where $H(\pi^*(\cdot \mid x))$ is the entropy of the optimal policy. Since the entropy does not depend on $\theta$, we can drop it from the optimization objective.

Key insight: This formulation reveals a theoretical foundation behind standard SFT. Standard SFT can be viewed as a special case where we minimize $\mathrm{KL}(\pi^*(\cdot\mid x) \parallel \pi_{\theta}(\cdot \mid x))$ with the training data $\mathcal{D}$ being samples curated or selected to follow an implicit optimal policy $\pi^*$. In other words, when we perform SFT on carefully curated demonstrations, we are implicitly learning to match an optimal policy defined by those demonstrations.

Substituting the closed-form expression for \(\pi^*\):

\[\begin{aligned} \mathbb{E}_{y \sim \pi^*(\cdot \mid x)}\left[\log\pi_{\theta}(y\mid x)\right] &= \sum_{y} \pi^*(y\mid x)\log \pi_{\theta}(y\mid x) \\ &= \sum_{y} \frac{\exp(r(x, y) / \beta)}{Z(x)}\pi_{\mathrm{ref}}(y\mid x)\log \pi_{\theta}(y\mid x) \\ &= \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\left[\frac{\exp(r(x, y) / \beta)}{Z(x)}\log \pi_{\theta}(y\mid x)\right]. \end{aligned}\]

This leads to the weighted SFT objective:

\[\min_{\theta} \mathcal{J}_{\mathrm{W-SFT}}(\theta) \triangleq \mathbb{E}_{x \sim \mathcal{X}}\left[\mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\left[-w(x,y)\log \pi_{\theta}(y\mid x)\right]\right].\]

where $w(x,y) = \frac{\exp(r(x, y) / \beta)}{Z(x)}$ is the sample weight.

Key insight: This shows that optimal RLVR can be solved by sampling responses from the reference policy $\pi_{\mathrm{ref}}$, computing sample weights based on verifiable rewards (which can be computed efficiently and exactly), and performing weighted supervised learning—no online policy rollouts needed! The verifiable nature of rewards means no reward model needs to be loaded during training, further reducing memory requirements.
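A small numeric check (toy numbers) confirms the change of measure: the importance-weighted expectation under $\pi_{\mathrm{ref}}$ reproduces the expectation under $\pi^*$ exactly:

```python
import math

# Small discrete example verifying E_{pi*}[log pi_theta]
# = E_{pi_ref}[w * log pi_theta] with w = exp(r/beta)/Z.
# All probabilities and rewards below are hypothetical.
beta = 1.0
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 0.0, "b": 1.0, "c": 1.0}
pi_theta = {"a": 0.4, "b": 0.4, "c": 0.2}   # some current policy

Z = sum(pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref)
w = {y: math.exp(reward[y] / beta) / Z for y in pi_ref}
pi_star = {y: pi_ref[y] * w[y] for y in pi_ref}

# Expectation under pi* ...
lhs = sum(pi_star[y] * math.log(pi_theta[y]) for y in pi_ref)
# ... equals the weighted expectation under pi_ref.
rhs = sum(pi_ref[y] * w[y] * math.log(pi_theta[y]) for y in pi_ref)
assert abs(lhs - rhs) < 1e-12
```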

Connecting Forward and Reverse KL

Optimal Connection

Both forward and reverse KL perspectives lead to the same optimal policy \(\pi^*\), but they connect to the original RL objective in different ways.

Reverse KL is exactly equivalent to the original RL objective. As shown in the previous section, the reverse KL derivation reveals:

\[\begin{aligned} \mathcal{J}_{\mathrm{RL}}(\theta) &= \mathbb{E}_{x \sim \mathcal{X}}\left[\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}[r(x,y)] - \beta\,\mathrm{KL}(\pi_{\theta}(\cdot \mid x)\parallel\pi_{\mathrm{ref}}(\cdot \mid x))\right] \\ &= \mathbb{E}_{x \sim \mathcal{X}}\left[\beta\log Z(x)\right] - \beta\,\mathbb{E}_{x \sim \mathcal{X}}\left[\mathrm{KL}\left(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x)\right)\right]. \end{aligned}\]

Since $\log Z(x)$ is independent of $\theta$, this establishes a direct equivalence: maximizing the RL objective \(\mathcal{J}_{\mathrm{RL}}(\theta)\) is exactly equivalent to minimizing \(\mathrm{KL}(\pi_{\theta} \parallel \pi^*)\). This is not an approximation; it is an algebraic identity.

Forward KL is closely related through the shared optimal policy. While forward KL \(\mathrm{KL}(\pi^* \parallel \pi_{\theta})\) does not directly equal the RL objective, it shares the same global optimum. Both KL divergences achieve their minimum value of zero at $\pi_{\theta} = \pi^*$:

\[\min_{\theta} \, \mathrm{KL}(\pi^* \parallel \pi_{\theta}) = 0 \iff \pi_{\theta} = \pi^* \iff \min_{\theta} \, \mathrm{KL}(\pi_{\theta} \parallel \pi^*) = 0.\]

Therefore, minimizing forward KL optimizes toward the same policy \(\pi^*\) that maximizes the RL objective. The key insight is that forward KL enables offline optimization: by sampling from a fixed reference policy \(\pi_{\mathrm{ref}}\) with importance weights $w(x,y) = \exp(r/\beta)/Z(x)$, we can approximate samples from \(\pi^*\) and perform standard supervised learning. This transforms the online RL problem into an offline weighted SFT problem.

Asymptotic Equivalence

While forward and reverse KL lead to different optimization procedures, a remarkable result shows they become nearly indistinguishable as the learned policy approaches the optimum. Specifically, the difference between the two KL directions vanishes quadratically as $\Delta_x \to 0$:

\[\big|\mathrm{KL}(\pi^{*}(\cdot\mid x)\Vert \pi_{\theta}(\cdot\mid x)) - \mathrm{KL}(\pi_{\theta}(\cdot\mid x)\Vert \pi^{*}(\cdot\mid x))\big| = \mathcal{O}(\Delta_x^{2}),\]

where \(\Delta_x := \mathrm{TV}(\pi^{*}(\cdot\mid x),\,\pi_{\theta}(\cdot\mid x))\) measures the total variation distance between $\pi_{\theta}$ and \(\pi^*\).

Proof of Quadratic Convergence. Fix a reference distribution $Q$ on a finite support with $Q(y) \ge c > 0$ for all $y$, and let $P$ be any distribution on the same support. We bound the difference of the two KL directions by their sum, the (termwise nonnegative) Jeffreys divergence:

$$ \big|\mathrm{KL}(P\Vert Q) - \mathrm{KL}(Q\Vert P)\big| \le \mathrm{KL}(P\Vert Q) + \mathrm{KL}(Q\Vert P) = \sum_y \big(P(y)-Q(y)\big)\log\Big(\tfrac{P(y)}{Q(y)}\Big). $$

Let $\Delta := \mathrm{TV}(P,Q) = \tfrac{1}{2}\sum_y |P(y)-Q(y)|$ and write $P = Q + \delta$ with $\sum_y \delta(y)=0$ and $\tfrac{1}{2}\sum_y |\delta(y)|=\Delta$. In the small-distance regime $\Delta \ll 1$, a Taylor expansion of the logarithm for small $\delta(y)/Q(y)$ gives, for each $y$,

$$ \log\Big(\tfrac{Q(y)+\delta(y)}{Q(y)}\Big) = \tfrac{\delta(y)}{Q(y)} + \mathcal{O}\Big(\tfrac{\delta(y)^2}{Q(y)^2}\Big), $$

where the big-$\mathcal{O}$ constant is universal (independent of $P$) and the dependence on $Q$ is controlled by the lower bound $Q(y)\ge c$. Substituting into the Jeffreys divergence:

$$ \sum_y \delta(y)\Big[\tfrac{\delta(y)}{Q(y)} + \mathcal{O}\Big(\tfrac{\delta(y)^2}{Q(y)^2}\Big)\Big] = \sum_y \tfrac{\delta(y)^2}{Q(y)} + \mathcal{O}\Big(\sum_y \tfrac{|\delta(y)|^3}{Q(y)^2}\Big). $$

Since $Q(y)\ge c>0$, all denominators are bounded by constants. Moreover, $\|\delta\|_1 = 2\Delta$ and $\|\delta\|_2 \le \|\delta\|_1$, so

$$ \sum_y \delta(y)^2 = \mathcal{O}(\Delta^{2}),\qquad \sum_y |\delta(y)|^3 = \mathcal{O}(\Delta^{3}). $$

As a result,

$$ \big|\mathrm{KL}(P\Vert Q) - \mathrm{KL}(Q\Vert P)\big| \le \mathrm{KL}(P\Vert Q) + \mathrm{KL}(Q\Vert P) = \mathcal{O}(\Delta^2)\quad\text{as }\Delta\to 0. $$

Applying this to $P=\pi_{\theta}$ and $Q=\pi^{*}$ yields the claimed result with $\Delta = \Delta_x$. $\square$

As the learned policy $\pi_{\theta}$ approaches the optimal policy \(\pi^*\) (i.e., $\Delta_x \to 0$), the difference between forward and reverse KL objectives diminishes quadratically, meaning both converge to the same optimum. Crucially, the KL-regularized RLVR objective constrains \(\pi^*\) to remain close to $\pi_{\mathrm{ref}}$ by design: the closed-form solution \(\pi^*(y\mid x) = \pi_{\mathrm{ref}}(y\mid x)\exp(r(x,y)/\beta)/Z(x)\) shows that \(\pi^*\) is merely a reweighted version of $\pi_{\mathrm{ref}}$, with the temperature $\beta$ controlling the deviation. Since $\pi_{\theta}$ starts from (or near) $\pi_{\mathrm{ref}}$ and optimizes toward \(\pi^*\), both remain in a neighborhood of $\pi_{\mathrm{ref}}$ throughout training, ensuring $\Delta_x = \mathrm{TV}(\pi^*(\cdot\mid x), \pi_{\theta}(\cdot\mid x))$ is naturally small and validating the quadratic bound.
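This bound can be checked numerically on a toy distribution (illustrative numbers): as the perturbation shrinks, the gap between the two KL directions scales at least quadratically in the total variation distance:

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def tv(p, q):
    """Total variation distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

q = [0.5, 0.3, 0.2]              # stands in for pi*
direction = [1.0, -0.6, -0.4]    # zero-sum perturbation direction

ratios = []
for eps in [0.08, 0.04, 0.02, 0.01]:
    p = [qi + eps * di for qi, di in zip(q, direction)]
    delta = tv(p, q)
    diff = abs(kl(p, q) - kl(q, p))
    ratios.append(diff / delta**2)

# If the gap were only linear in the TV distance, diff/delta^2 would
# blow up as eps shrinks; under the quadratic bound it stays bounded
# (here it even decays, consistent with higher-order cancellation).
assert ratios[-1] <= ratios[0]
```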

Comprehensive Comparison

To systematically understand the relationship between forward and reverse KL, we provide a detailed comparison across four key dimensions: their connection to the optimal policy \(\pi^*\), sampling requirements, optimization behavior, and practical implementation.

Table 1. A Comprehensive Comparison between Forward and Reverse KL.
| Aspect | Forward KL: \(\mathrm{KL}(\pi^{*} \parallel \pi_{\theta})\) | Reverse KL: \(\mathrm{KL}(\pi_{\theta} \parallel \pi^{*})\) |
| --- | --- | --- |
| Connection to \(\pi^{*}\) | Equivalent to maximum-likelihood estimation of \(\pi_{\theta}\) from \(\pi^{*}\). | Equivalent to original RL objective: \(\max \mathcal{J}_{\mathrm{RL}} \iff \min \mathbb{E}\big[\mathrm{KL}(\pi_{\theta} \parallel \pi^{*})\big]\). |
| Sampling Strategy | Samples from fixed \(\pi_{\mathrm{ref}}\) (offline). | Samples from current policy \(\pi_{\theta}\) (online). |
| Optimization Behavior | Mode-covering: encourages \(\pi_{\theta}\) to place mass on all modes of \(\pi^{*}\); more diverse, explores broadly. | Mode-seeking: encourages \(\pi_{\theta}\) to focus on dominant modes of \(\pi^{*}\); more concentrated, focuses narrowly. |
| Practical Implementation | ✓ Offline optimization; ✓ fixed reference policy; ✓ decoupled generation-optimization; ✓ multi-epoch training | ✗ Online rollouts required; ✗ moving policy \(\pi_{\theta}\); ✗ coupled generation-optimization; ✗ limited data reuse |
Key takeaway: Forward KL $\mathrm{KL}(\pi^* \parallel \pi_{\theta})$ and reverse KL $\mathrm{KL}(\pi_{\theta} \parallel \pi^*)$ both optimize toward $\pi^*$: the reverse direction is exactly equivalent to the original RLVR objective, while the forward direction shares its global optimum. Forward KL enables offline optimization through importance-weighted sampling from a fixed reference, making it ideal for resource-efficient RLVR. For verifiable rewards, sample weights can be computed efficiently without a reward model, and the deterministic nature of verification eliminates reward hacking concerns. The choice determines sampling strategy (offline vs. online) and exploration behavior (broad vs. focused), but not the destination.

The Decoupled Generation & Optimization (DGO) Paradigm

Having established the theoretical equivalence between online RLVR and offline weighted SFT, we now present Decoupled Generation & Optimization (DGO), a practical paradigm that implements the forward KL objective correctly while addressing the scaling challenges of current approaches. DGO is particularly well-suited for RLVR because verifiable rewards can be computed efficiently during the generation phase without requiring a separate reward model during optimization.

From Theory to Paradigm

The theoretical results from the preceding sections, e.g., the closed-form optimal policy \(\pi^*\) and the forward KL reformulation to weighted SFT, establish that online RLVR can be solved offline. However, direct implementation requires addressing two practical challenges: (1) How to estimate the prompt-specific partition function $Z(x)$, and (2) How to compute and apply sample weights $w(x,y)$ in a scalable two-phase algorithm.

Partition function estimation. The sample weight $w(x,y) = \exp(r(x,y)/\beta)/Z(x)$ requires computing \(Z(x) = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot\mid x)}[\exp(r(x,y)/\beta)]\), which is intractable for large output spaces. DGO employs Monte Carlo estimation: for each prompt $x$, sample $N$ responses \(\{y_n\}_{n=1}^N \sim \pi_{\mathrm{ref}}(\cdot\mid x)\) and compute

\[\hat{Z}(x) = \frac{1}{N}\sum_{n=1}^N \exp(r(x,y_n)/\beta).\]

This unbiased estimator converges to $Z(x)$ as $N \to \infty$ and enables prompt-level normalization, ensuring that training is not biased toward prompts where $\pi_{\mathrm{ref}}$ already generates high-reward responses.

Key insight: $\hat{Z}(x)$ acts as a performance baseline, i.e., the average exponentiated reward across all sampled responses, revealing what $\pi_{\mathrm{ref}}$ can truly achieve on prompt $x$. Larger $N$ sharpens this estimate, leading to more stable, confident optimization.

Sample weight computation and application. Once $\hat{Z}(x)$ is estimated, the normalized importance weight for each sample is $w(x,y) = \exp(r(x,y)/\beta)/\hat{Z}(x)$. These weights transform the intractable forward KL objective into a tractable weighted SFT objective that can be optimized via standard mini-batch gradient descent. Crucially, because $w(x,y)$ depends only on the fixed reference policy $\pi_{\mathrm{ref}}$ and not on the evolving policy $\pi_{\theta}$, the weights remain valid throughout training, enabling multi-epoch optimization without regenerating samples.

Key insight: DGO automatically amplifies what works and suppresses what doesn't: samples beating the baseline get $w(x,y) > 1$ (teaching the model to do more of this), while under-performers get $w(x,y) < 1$ (teaching the model to avoid this). This method extracts signal from both successful and unsuccessful attempts, enabling more data-efficient learning compared to methods that rely solely on positive examples.
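In code, the estimator and weights amount to a few lines; the rewards below are hypothetical binary verification outcomes for a single prompt:

```python
import math

def dgo_weights(rewards, beta):
    """Monte Carlo partition-function estimate and per-sample weights.

    rewards: list of r(x, y_n) for N responses sampled from pi_ref.
    Returns (Z_hat, weights); by construction the weights average to
    exactly 1 for each prompt, giving prompt-level normalization.
    """
    exp_r = [math.exp(r / beta) for r in rewards]
    z_hat = sum(exp_r) / len(exp_r)
    return z_hat, [e / z_hat for e in exp_r]

# N = 8 binary verifiable rewards for one prompt (hypothetical outcomes).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
z_hat, weights = dgo_weights(rewards, beta=0.1)

# Weights average to exactly 1 per prompt...
assert abs(sum(weights) / len(weights) - 1.0) < 1e-9
# ...and samples beating the baseline are up-weighted (w > 1) while
# under-performers are down-weighted (w < 1).
assert all((w > 1.0) == (r == 1.0) for w, r in zip(weights, rewards))
```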

These implementation choices coalesce into a two-phase paradigm that fully decouples sample generation from policy optimization (see Algorithm 1 below):

Algorithm Overview

Algorithm 1: The Decoupled Generation & Optimization (DGO) Paradigm

Input: Prompts \(\mathcal{X}\), responses per prompt \(N\), initial policy \(\pi_{\theta_{0}}\), reward function \(r(\cdot, \cdot)\), temperature \(\beta\), learning rate \(\eta\), training iterations \(T\).

Phase 1: Generation (scalable generation)
1. Initialize reference model $\pi_{\mathrm{ref}} \leftarrow \pi_{\theta_{0}}$.
2. For each prompt $x \in \mathcal{X}$:
   (a) Sample \(N\) responses: $\mathcal{Y}_{x} = \{y_n \sim \pi_{\mathrm{ref}}(\cdot \mid x)\}_{n=1}^N$.
   (b) Evaluate rewards: $\{r(x, y_n)\}_{n=1}^N$.
   (c) Estimate partition function: $\hat{Z}(x) = \frac{1}{N}\sum_{n=1}^N \exp(r(x,y_n)/\beta)$.
   (d) Compute sample weights: $w(x,y_n) = \frac{\exp(r(x,y_n)/\beta)}{\hat{Z}(x)}$.
3. Store dataset $\mathcal{D} = \{(x, y, w(x,y)) : x \in \mathcal{X}, y \in \mathcal{Y}_x\}$.

Phase 2: Optimization (efficient training)
1. Initialize policy model \(\pi_{\theta} \leftarrow \pi_{\theta_{0}}\).
2. For iterations \(t = 1, \dots, T\):
   (a) Sample minibatch \(\mathcal{B} = \{(x_b, y_b, w_b)\}_{b=1}^B\) from \(\mathcal{D}\).
   (b) Compute weighted loss: \(\ell = -\frac{1}{B}\sum_{b=1}^B w_b \log \pi_{\theta}(y_b \mid x_b)\).
   (c) Update parameters: \(\theta_{t} \leftarrow \theta_{t-1} - \eta \nabla_{\theta} \ell\).

Output: Optimized policy \(\pi_{\theta}\).
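The two phases above can be sketched end-to-end on a toy softmax policy over a handful of candidate responses; everything here (number of candidates, rewards, hyperparameters) is illustrative, not the paper's LLM setup:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# --- Toy setup: one prompt, K candidate responses; response 2 is the
# --- only verified-correct one (illustrative numbers).
K, N, beta, eta, T = 4, 64, 0.5, 0.5, 200
reward = [0.0, 0.0, 1.0, 0.0]
theta0 = [0.0] * K                      # initial/reference logits
random.seed(0)

# Phase 1: Generation -- sample from the *fixed* reference policy,
# verify rewards, estimate Z(x), and store the weighted dataset.
pi_ref = softmax(theta0)
samples = random.choices(range(K), weights=pi_ref, k=N)
exp_r = [math.exp(reward[y] / beta) for y in samples]
z_hat = sum(exp_r) / N
dataset = [(y, e / z_hat) for y, e in zip(samples, exp_r)]

# Phase 2: Optimization -- weighted SFT on the stored dataset only;
# no further generation, so multiple epochs reuse the same samples.
theta = list(theta0)
for _ in range(T):
    probs = softmax(theta)
    grad = [0.0] * K                    # gradient of mean weighted NLL
    for y, w in dataset:
        for k in range(K):
            grad[k] += w * (probs[k] - (1.0 if k == y else 0.0)) / N
    theta = [t - eta * g for t, g in zip(theta, grad)]

# Training concentrates mass on the verified-correct response.
assert softmax(theta)[2] > pi_ref[2]
```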

Scaling Advantages

DGO’s two-phase decoupling delivers critical scaling advantages: generation can be scaled independently of training, optimization requires neither a rollout engine nor a reference or reward model in memory, and the stored weighted dataset can be reused across multiple epochs. These properties underlie the peak-memory (~18-31%) and training-time (~75-85%) reductions reported above.

Analytical Comparison

Having established the theoretical foundation of DGO, we now reveal how it relates to and fundamentally differs from existing offline RL approaches. The key insight: all offline RL methods can be unified under a single weighted SFT framework, but they differ critically in where they sample from and how they construct weights. This comparison exposes three fundamental design choices that determine whether a method is theoretically grounded, scalable, and practically effective.

The Unified Weighted SFT Framework

Every offline RL variant (including RLVR) performs weighted supervised fine-tuning of the form:

\[\min_{\theta} \mathbb{E}_{x}\,\mathbb{E}_{y\sim q(\cdot\mid x)}\big[-w(x,y)\,\log \pi_{\theta}(y\mid x)\big],\]

where two design choices fully specify the algorithm: (1) the sampling distribution $q(\cdot\mid x)$ from which training responses are drawn, and (2) the weight $w(x,y)$ assigned to each sample.

While this template appears simple, the devil is in the details. As Table 2 reveals, seemingly minor differences in these choices lead to drastically different scalability, sample efficiency, and theoretical guarantees.

Table 2. Comparison of different offline RL algorithms for LLM finetuning.
| Algorithm | Sampling Distribution | Data Source | Weight Estimate |
| --- | --- | --- | --- |
| VAR | \(q = \pi^*\) (implicit) | Limited, pre-collected positive samples | \(w(x,y) = \frac{\pi_{\mathrm{ref}}(y\mid x) \exp(r(x,y)/\beta)}{\sum_y \pi_{\mathrm{ref}}(y\mid x) \exp(r(x,y)/\beta)}\) |
| SPR | \(q = \pi^*\) (implicit) | Limited, pre-collected positive samples | \(w(x,y) = \exp((Q(x,y) - W(x)) / \beta)\) |
| DFT | \(q = \pi^*\) (implicit) | Limited, pre-collected positive samples | \(w(x,y) = \mathrm{stop\_grad}(\pi_{\theta}(y \mid x))\) |
| iw-SFT | \(q = \pi_{\mathrm{ref}}\) | Unlimited, generated & curated positive samples | \(w(x,y) = q(y\mid x)/\pi_{\mathrm{ref}}(y \mid x)\) |
| Refit | \(q = \pi_{\mathrm{ref}}\) | Unlimited, generated samples | \(w(x,y) = r(x,y)\) |
| DGO (Ours) | \(q = \pi_{\mathrm{ref}}\) | Unlimited, generated samples | \(w(x,y) = \frac{\exp(r(x,y)/\beta)}{\frac{1}{N} \sum_{n=1}^N \exp(r(x,y_n)/\beta)}\) |

Fundamental Difference

Table 2 reveals two fundamental dimensions along which offline RL methods differ, each with profound implications.

Dimension 1: Where do samples come from? (Implicit \(\pi^*\) vs. Explicit $\pi_{\mathrm{ref}}$)

Sampling distribution is the most striking division, separating VAR, SPR, and DFT from iw-SFT, Refit, and DGO. The former group assumes an implicit optimal policy \(q = \pi^*\) and relies on pre-collected positive samples, which is its Achilles’ heel. These methods inherit two crippling limitations: (1) Training is fundamentally limited by the size of pre-collected datasets; you cannot scale by simply generating more samples. (2) Sampling from \(\pi^*\) risks catastrophic forgetting, since the data-derived \(\pi^*\) (defined by curated samples) can deviate substantially from $\pi_{\mathrm{ref}}$, potentially covering regions beyond what the KL-regularized RL objective would prescribe. As a result, training exclusively on such samples may cause catastrophic forgetting of capabilities encoded in $\pi_{\mathrm{ref}}$, leading to degraded performance on the broader distribution. In stark contrast, iw-SFT, Refit, and DGO sample from an explicit, controllable reference policy $q = \pi_{\mathrm{ref}}$. This unlocks unlimited scalability: generate as many samples as you need, whenever you need them. The cost of data is now just inference time, not expensive human annotation or cherry-picking. Moreover, sampling around $\pi_{\mathrm{ref}}$ naturally preserves the reference policy’s capabilities, mitigating catastrophic forgetting while still improving task performance through importance weighting.

Key insight: Sampling from $\pi_{\mathrm{ref}}$ transforms RL from a data-limited problem into a compute-limited problem, unlocking two critical advantages: (1) Unlimited scalability: it can generate as much data as needed at inference cost; (2) Preservation of capabilities: training data stays anchored around $\pi_{\mathrm{ref}}$, mitigating catastrophic forgetting while optimizing toward $\pi^*$ through importance weighting. For RLVR, the cost is even lower because verifiable rewards require no reward model—just lightweight verification functions (e.g., test execution, answer checking).

Dimension 2: How do you weight samples? (Theoretically grounded vs. Heuristic)

The weight formulas in Table 2 tell a revealing story about the gap between theoretical and empirical weights:

Key insight: DGO closes the theory-practice gap by explicitly accounting for the sampling distribution through theoretically-grounded importance weights $w(x,y) = \exp(r/\beta)/Z(x)$, ensuring that what we optimize in practice exactly matches what theory prescribes.

Empirical Comparison

To validate our theoretical framework and demonstrate DGO’s practical advantages, we conduct comprehensive experiments on mathematical reasoning tasks using the GSM8K and MATH benchmarks. Our experimental setup uses Qwen3-8B as the base model with LoRA fine-tuning, comparing DGO against representative baselines across three dimensions that directly map to our theoretical claims: (1) Sample efficiency and performance (does theoretically-grounded weighting improve learning?), (2) Resource efficiency (does decoupling reduce computational and memory costs?), and (3) Capability preservation (does sampling from $\pi_{\mathrm{ref}}$ mitigate catastrophic forgetting?).

Experimental Setup

Datasets. We evaluate on two mathematical reasoning benchmarks: GSM8K (grade school math, 7,473 training problems, max completion length 1024) and MATH (high school competition problems, 12,000 training problems, max completion length 2048). For each prompt, we generate $N=8$ rollout responses from the reference policy and evaluate them using exact-match verification against ground-truth answers, a canonical example of verifiable rewards.
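For concreteness, a minimal exact-match verifier might look like the following. The `####` answer marker follows GSM8K's output convention; the extraction and normalization details here are illustrative assumptions, not the exact verifier used in our experiments.

```python
def exact_match_reward(completion: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 iff the extracted final answer
    matches the ground truth after light normalization.

    Assumes the model emits its final answer after a '####' marker
    (GSM8K convention); completions without a marker score 0.
    """
    marker = "####"
    if marker not in completion:
        return 0.0
    prediction = completion.split(marker)[-1].strip().replace(",", "").rstrip(".")
    return 1.0 if prediction == ground_truth.strip() else 0.0
```

Because the reward is a deterministic function of the completion, it can be recomputed at any time and never drifts, which is what makes caching rollouts for offline optimization sound.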

Baselines. We compare against five representative methods spanning the design space: the untuned Baseline (the reference model with no fine-tuning), SFT (standard supervised fine-tuning on curated demonstrations), VAR (weighted training on demonstrations), GRPO (fully online RL on rollouts, with the reference model loaded during optimization), and Refit (offline training on rollouts with raw reward weighting).

Implementation. All methods use LoRA (rank 8, alpha 64, dropout 0.05) with AdamW optimizer (lr=5e-6, weight decay=0.1, warmup ratio=0.1, cosine lr scheduler). SFT and VAR train on curated demonstrations; GRPO, Refit, and DGO generate $N=8$ rollouts per prompt. GRPO performs online optimization with reference model loaded; Refit and DGO perform offline optimization without reference model. We set temperature $\beta=0.1$ for DGO’s importance weights. All methods train for the same number of gradient steps to ensure fair comparison.
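The decoupled two-phase structure can be sketched as follows. This is a schematic of DGO's control flow under stated assumptions, not our training code: `generate`, `verify`, and `sft_step` are caller-supplied stand-ins for the rollout sampler, the verification function, and one weighted-SFT gradient step.

```python
import math

def _weights(rewards, beta):
    # Self-normalized w_i = exp(r_i / beta) / Z-hat over one prompt's rollouts
    m = max(r / beta for r in rewards)
    exps = [math.exp(r / beta - m) for r in rewards]
    z_hat = sum(exps)
    return [e / z_hat for e in exps]

def dgo_train(prompts, generate, verify, sft_step,
              n_rollouts=8, beta=0.1, epochs=3):
    # Phase 1: generation. The frozen reference policy produces rollouts;
    # no optimizer state or gradients are resident, so peak memory stays low.
    dataset = []
    for x in prompts:
        ys = [generate(x) for _ in range(n_rollouts)]
        ws = _weights([verify(x, y) for y in ys], beta)
        dataset.extend((x, y, w) for y, w in zip(ys, ws))
    # Phase 2: optimization. Plain weighted SFT over the cached rollouts,
    # with no reference model loaded; multi-epoch reuse comes for free.
    for _ in range(epochs):
        for x, y, w in dataset:
            sft_step(x, y, w)  # scales the NLL loss of (x, y) by w
    return dataset
```

Because phase 2 is ordinary supervised fine-tuning with per-sample loss scaling, it can run on standard SFT infrastructure and revisit the same cached rollouts across epochs.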

Main Results: Performance and Efficiency

We now present a comprehensive empirical comparison that directly tests our theoretical claims. Table 3 compares DGO against representative baselines across both performance and resource efficiency dimensions, revealing how our framework translates from theory to practice.

Table 3. Performance and resource efficiency on GSM8K and MATH benchmarks.
| Method | Data Source | Accuracy | Time Cost (h) | Peak Memory (GB) |
|---|---|---|---|---|
| GSM8K | | | | |
| Baseline | — | 87.72% | — | — |
| SFT | Demonstrations | 88.40% (+0.68) | 2.85 | 57.59 |
| VAR | Demonstrations | 88.64% (+0.92) | 6.46 | 58.39 |
| GRPO | Rollouts (Online) | 90.01% (+2.29) | 45.35 | 82.95 |
| Refit | Rollouts (Offline) | 89.23% (+1.51) | 6.96 | 57.24 |
| DGO (Ours) | Rollouts (Offline) | 90.67% (+2.95) | 6.79 | 57.23 |
| MATH | | | | |
| Baseline | — | 55.80% | — | — |
| SFT | Demonstrations | 57.26% (+1.46) | 5.93 | 70.68 |
| VAR | Demonstrations | 60.17% (+4.37) | 12.82 | 71.24 |
| GRPO | Rollouts (Online) | 61.37% (+5.57) | 55.68 | 83.96 |
| Refit | Rollouts (Offline) | 58.36% (+2.56) | 14.31 | 69.37 |
| DGO (Ours) | Rollouts (Offline) | 61.96% (+6.16) | 13.79 | 68.96 |

The empirical results presented in Table 3 provide strong evidence for the theoretical principles underlying DGO, demonstrating that our framework achieves superior performance while maintaining computational efficiency across both benchmarks. We summarize the main observations as follows: DGO attains the highest accuracy on both benchmarks (90.67% on GSM8K, 61.96% on MATH), surpassing even fully online GRPO while using roughly 6.7× and 4× less training time, respectively, and reducing peak memory from 82.95 GB to 57.23 GB (≈31%) on GSM8K and from 83.96 GB to 68.96 GB (≈18%) on MATH.

Figure 1. Convergence comparison on GSM8K and MATH benchmarks.

Convergence dynamics. Figure 1 tracks convergence across 1000 training steps. DGO exhibits the fastest and most stable convergence. Demonstration-based methods (SFT, VAR) show slower, erratic convergence, reflecting the mismatch between their implicit \(\pi^*\) (defined by demonstrations) and the KL-regularized objective when demonstrations deviate from \(\pi_{\text{ref}}\). GRPO converges more slowly than DGO despite identical rollout data, likely due to variance from online importance sampling. Refit shows intermediate behavior, confirming that raw reward weighting lacks DGO’s theoretically-grounded normalization. These dynamics demonstrate that theoretically-grounded importance weights enable faster, more stable convergence than either demonstration-based methods or heuristic alternatives.

Catastrophic Forgetting Analysis

To test our claim that sampling from $\pi_{\mathrm{ref}}$ preserves general capabilities, we evaluate all GSM8K-trained models on four out-of-domain benchmarks: PIQA (physical commonsense), HellaSwag (commonsense reasoning), Winogrande (pronoun resolution), and RACE-high (reading comprehension).

Table 4. Out-of-domain performance after GSM8K training (measuring catastrophic forgetting).
| Method | PIQA | HellaSwag | Winogrande | RACE-high |
|---|---|---|---|---|
| Baseline | 71.22 | 81.80 | 65.19 | 79.25 |
| SFT | 70.35 (−0.87) | 80.56 (−1.24) | 63.14 (−2.05) | 84.33 (+5.08) |
| VAR | 70.26 (−0.96) | 80.73 (−1.07) | 63.24 (−1.95) | 82.13 (+2.88) |
| GRPO | 70.69 (−0.53) | 81.03 (−0.77) | 64.26 (−0.93) | 80.26 (+1.01) |
| Refit | 70.53 (−0.69) | 81.26 (−0.54) | 64.01 (−1.18) | 80.11 (+0.86) |
| DGO (Ours) | 71.01 (−0.21) | 81.13 (−0.67) | 64.94 (−0.25) | 79.93 (+0.68) |

Table 4 reports out-of-domain benchmark performance after fine-tuning on GSM8K. The results validate that DGO’s explicit sampling from $\pi_{\mathrm{ref}}$ provides inherent regularization against catastrophic forgetting. We highlight the following key observations: demonstration-based methods (SFT, VAR) suffer the largest drops on the commonsense and coreference benchmarks, while DGO stays closest to the baseline on three of the four tasks (PIQA, Winogrande, and RACE-high), with a worst-case degradation of only 0.67 points.

Summary

Our empirical results comprehensively validate the theoretical claims established in previous sections: theoretically-grounded importance weighting delivers the best accuracy and the fastest, most stable convergence; decoupling generation from optimization recovers online-level performance at a fraction of the time and memory cost; and explicit sampling from $\pi_{\mathrm{ref}}$ keeps out-of-domain degradation minimal.

Together, these results establish DGO as the first practical implementation where theory, scalability, and performance converge, proving that online RLVR can indeed be scaled when done right through principled decoupling and theoretically-grounded design.


Conclusion

The computational barriers limiting online RLVR are architectural, not fundamental. By reformulating online RL as offline weighted SFT through forward KL, DGO resolves this: decoupling generation from optimization with theoretically-grounded importance weights $w(x,y) = \exp(r/\beta)/\hat{Z}(x)$ achieves 4-6× speedup, 18-31% memory reduction, and superior performance while preserving capabilities. Unlike existing methods that sample from implicit optimal policies or use heuristic weights, DGO perfectly aligns theory with implementation: what we optimize matches what theory prescribes. For RLVR, verifiable rewards eliminate reward hacking and remove reward model overhead. Scaling online RLVR is achievable when done right: through principled decoupling, theoretical grounding, and explicit reference sampling.
