Reinforcement Learning with Verifiable Rewards (RLVR) optimizes large language models on tasks with objective correctness criteria by directly leveraging deterministic reward signals rather than learned preferences. While theoretically principled, online RLVR remains computationally prohibitive due to tight coupling of generation and optimization, which inflates memory and severely limits training throughput. We prove this gap is architectural, not fundamental. Online RLVR can be reformulated exactly as offline supervised fine-tuning with importance-weighted samples. We introduce Decoupled Generation & Optimization (DGO), a two-phase paradigm that separates generation from optimization, reducing peak memory by ~18-31% and training time by ~75-85% while enabling multi-epoch training. Our framework unifies existing offline methods, exposes systematic theory-practice mismatches, and establishes DGO as the first method where theoretical optimal weights align perfectly with implementation. We show scaling online RLVR is achievable when done right, through principled decoupling and theoretically-grounded design.
Large language models (LLMs) are typically trained in two stages: pre-training on massive corpora to learn general language understanding, followed by fine-tuning to align the model with specific tasks or outcomes. Fine-tuning approaches can be broadly categorized into supervised fine-tuning (SFT) and reinforcement learning (RL)-based methods.
The first approach, supervised fine-tuning (SFT), constitutes a fundamental paradigm for adapting pre-trained language models to downstream tasks through maximum likelihood estimation over curated datasets. Given a pre-collected dataset \(\mathcal{D}\) comprising prompt-response pairs \((x, y)\), the standard SFT objective seeks to minimize the negative log-likelihood of target sequences under the parameterized policy \(\pi_{\theta}\). Leveraging the autoregressive factorization inherent to transformer-based language models, this objective decomposes into a sum of token-level cross-entropy losses, where each token \(y_l\) is predicted conditioned on the preceding context \(y_{< l}\) and the input prompt \(x\). This formulation can be viewed as a special case of behavioral cloning from imitation learning, where the model learns to replicate expert demonstrations encoded in the training corpus. Formally, the objective is:
\[\begin{aligned} \min_{\theta}\mathcal{J}_{\mathrm{SFT}}(\theta) &\triangleq \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[-\log\pi_{\theta}(y \mid x)\right]\\ &=\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\sum_{l=1}^L-\log\pi_{\theta}(y_l\mid y_{< l},x)\right]. \end{aligned}\]In contrast to the imitation-based approach of SFT, RLVR shifts the objective from imitating fixed demonstrations to optimizing a verifiable reward function \(r(x, y)\) that objectively measures correctness or task success. Unlike learned reward models that approximate human preferences, verifiable rewards provide ground-truth signals—such as whether a mathematical solution is correct, code passes unit tests, or a logical proof is valid. The KL-regularized RLVR objective balances maximizing expected reward against staying close to a reference policy \(\pi_{\mathrm{ref}}\), preventing the model from deviating too far and producing degenerate outputs. Formally, the objective is:
\[\max_{\theta} \mathcal{J}_{\mathrm{RL}}(\theta) \triangleq \mathbb{E}_{x\sim \mathcal{X}}\left[\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} [r(x, y)] - \beta \cdot \mathrm{KL}(\pi_\theta(\cdot \mid x) \parallel \pi_{\mathrm{ref}}(\cdot \mid x))\right]\]where \(\beta > 0\) is a temperature parameter that controls the trade-off between reward maximization and policy regularization, and \(r(x, y)\) is a verifiable reward that can be computed deterministically (e.g., \(r(x,y) = \mathbb{1}[\mathrm{answer}(y) = \mathrm{ground\_truth}(x)]\) for mathematical reasoning).
Drawbacks of online RLVR. While the KL-regularized RLVR objective is theoretically elegant and eliminates reward hacking concerns through verifiable signals, its practical implementation faces critical challenges that hinder scalability. Specifically, online RLVR methods like PPO and GRPO tightly couple rollout generation with policy optimization: every gradient update requires fresh on-policy rollouts, so the inference engine, the training policy, and the optimizer state must coexist in memory, which inflates peak memory and severely limits training throughput.
Closed-form optimal policy. To address these challenges, we begin by analyzing the theoretical solution to the KL-regularized RLVR objective. It turns out that the optimal policy \(\pi^*(y \mid x)\) has a closed-form solution that reweights the reference policy by the exponentiated reward, normalized by the partition function \(Z(x)\):
\[\pi^*(y \mid x) = \frac{1}{Z(x)}\pi_{\mathrm{ref}}(y\mid x)\exp(r(x, y) / \beta)\]where \(Z(x) = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\left[\exp(r(x, y) / \beta)\right]=\sum_{y}\pi_{\mathrm{ref}}(y\mid x)\exp(r(x, y) / \beta)\) is the partition function ensuring \(\sum_y\pi^*(y \mid x)=1\).
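As a concrete illustration, the closed-form solution can be evaluated on a toy four-response output space (all numbers hypothetical):

```python
import math

# Toy output space of 4 candidate responses for one prompt x.
# pi_ref: reference policy probabilities; r: verifiable 0/1 rewards.
pi_ref = [0.4, 0.3, 0.2, 0.1]
r = [1.0, 0.0, 1.0, 0.0]   # responses 0 and 2 are verifiably correct
beta = 0.5

# Partition function Z(x) = sum_y pi_ref(y|x) * exp(r(x,y)/beta)
Z = sum(p * math.exp(ri / beta) for p, ri in zip(pi_ref, r))

# Optimal policy: reweight pi_ref by exp(r/beta), normalize by Z
pi_star = [p * math.exp(ri / beta) / Z for p, ri in zip(pi_ref, r)]

print(sum(pi_star))             # ~1.0: a valid distribution
print(pi_star[0] > pi_ref[0])   # True: mass shifts toward correct answers
```

Note how \(\pi^*\) raises the probability of the verifiably correct responses while leaving the relative ordering within each reward level untouched.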
This theoretical result suggests a natural question: Can we directly learn the optimal policy through offline supervised learning? The answer is yes, and this insight forms the foundation of our Decoupled Generation & Optimization (DGO) paradigm.
The key insight is that minimizing the Kullback-Leibler (KL) divergence from the optimal policy is equivalent to the original RLVR objective. We establish this through two complementary perspectives: reverse KL and forward KL. While some of the underlying theoretical connections have been explored in prior work, e.g., the forward KL formulation in RAML, we present both directions in a unified view, making explicit which one is exactly equivalent to the RL objective and which one enables fully offline optimization.
The reverse KL divergence $\mathrm{KL}(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x))$ has a direct connection to the original RLVR objective. Expanding the reverse KL:
\[\begin{aligned} \mathrm{KL}\left(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x)\right) &= \mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }\left[\log \pi_{\theta}(y \mid x) - \log \pi^*(y \mid x)\right] \\ &= \mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }\left[\log \pi_{\theta}(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) - \tfrac{1}{\beta} r(x,y) + \log Z(x)\right] \\ &= \mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }\left[\log \pi_{\theta}(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right] - \tfrac{1}{\beta}\mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }[r(x,y)] + \log Z(x) \\ &= \mathrm{KL}(\pi_{\theta}(\cdot \mid x)\parallel\pi_{\mathrm{ref}}(\cdot \mid x)) - \tfrac{1}{\beta}\mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x) }[r(x,y)] + \log Z(x). \end{aligned}\]Rearranging and multiplying by $-\beta$:
\[\begin{aligned} \mathcal{J}_{\mathrm{RL}}(\theta) &= \mathbb{E}_{x \sim \mathcal{X}}\left[\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}[r(x,y)] - \beta\,\mathrm{KL}(\pi_{\theta}(\cdot \mid x)\parallel\pi_{\mathrm{ref}}(\cdot \mid x))\right] \\ &= \mathbb{E}_{x \sim \mathcal{X}}\left[\beta\log Z(x)\right] - \beta\,\mathbb{E}_{x \sim \mathcal{X}}\left[\mathrm{KL}\left(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x)\right)\right]. \end{aligned}\]Since $\log Z(x)$ is independent of $\theta$, maximizing the original RLVR objective is equivalent to minimizing the reverse KL to the optimal policy, i.e.,
\[\max_{\theta} \, \mathcal{J}_{\mathrm{RL}}(\theta) \iff \min_{\theta} \, \mathbb{E}_{x \sim \mathcal{X}}\left[\mathrm{KL}(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x))\right].\]While the reverse KL perspective provides an exact equivalence to the original RL objective, it still requires sampling from the current policy $\pi_{\theta}$ during optimization. This means we must perform online rollouts at each training step, which remains computationally expensive and memory-intensive. This limitation motivates us to consider the forward KL perspective, which enables a fully offline approach by sampling from a fixed reference policy $\pi_{\mathrm{ref}}$ instead.
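The equivalence above can be checked numerically on a small discrete example (toy probabilities, not from the paper):

```python
import math

# Verify: J_RL(theta) == beta*log Z(x) - beta*KL(pi_theta || pi*)
pi_ref   = [0.4, 0.3, 0.2, 0.1]
pi_theta = [0.5, 0.2, 0.2, 0.1]   # an arbitrary current policy
r        = [1.0, 0.0, 1.0, 0.0]
beta     = 0.5

Z = sum(p * math.exp(ri / beta) for p, ri in zip(pi_ref, r))
pi_star = [p * math.exp(ri / beta) / Z for p, ri in zip(pi_ref, r)]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Left-hand side: expected reward minus beta * KL(pi_theta || pi_ref)
J_rl = sum(pt * ri for pt, ri in zip(pi_theta, r)) - beta * kl(pi_theta, pi_ref)

# Right-hand side: beta*log Z(x) - beta*KL(pi_theta || pi*)
rhs = beta * math.log(Z) - beta * kl(pi_theta, pi_star)

print(abs(J_rl - rhs) < 1e-9)   # True: the identity holds
```

Because the relationship is an algebraic identity rather than a bound, the two sides agree up to floating-point precision for any choice of $\pi_{\theta}$.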
Starting from the forward KL divergence \(\mathrm{KL}(\pi^*(\cdot\mid x) \parallel \pi_{\theta}(\cdot \mid x))\), we can derive an equivalent weighted SFT objective in an offline manner. The forward KL measures how well our learned policy $\pi_{\theta}$ approximates the optimal policy \(\pi^*\):
\[\min_{\theta} \mathbb{E}_{x \sim \mathcal{X}}\left[\mathrm{KL}(\pi^*(\cdot\mid x) \parallel \pi_{\theta}(\cdot \mid x))\right].\]Expanding the KL divergence:
\[\begin{aligned} \mathrm{KL}(\pi^*(\cdot\mid x) \parallel \pi_{\theta}(\cdot\mid x)) &= \mathbb{E}_{y \sim \pi^*(\cdot \mid x)}\left[\log \pi^*(y\mid x) - \log\pi_{\theta}(y\mid x)\right] \\[8pt] &= -H(\pi^*(\cdot \mid x)) - \mathbb{E}_{y \sim \pi^*(\cdot\mid x) }\left[\log\pi_{\theta}(y\mid x)\right]. \end{aligned}\]where $H(\pi^*(\cdot \mid x))$ is the entropy of the optimal policy. Since the entropy does not depend on $\theta$, we can drop it from the optimization objective.
Substituting the closed-form expression for \(\pi^*\):
\[\begin{aligned} \mathbb{E}_{y \sim \pi^*(\cdot \mid x)}\left[\log\pi_{\theta}(y\mid x)\right] &= \sum_{y} \pi^*(y\mid x)\log \pi_{\theta}(y\mid x) \\ &= \sum_{y} \frac{\exp(r(x, y) / \beta)}{Z(x)}\pi_{\mathrm{ref}}(y\mid x)\log \pi_{\theta}(y\mid x) \\ &= \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\left[\frac{\exp(r(x, y) / \beta)}{Z(x)}\log \pi_{\theta}(y\mid x)\right]. \end{aligned}\]This leads to the weighted SFT objective:
\[\min_{\theta} \mathcal{J}_{\mathrm{W-SFT}}(\theta) \triangleq \mathbb{E}_{x \sim \mathcal{X}}\left[\mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\left[-w(x,y)\log \pi_{\theta}(y\mid x)\right]\right].\]where $w(x,y) = \frac{\exp(r(x, y) / \beta)}{Z(x)}$ is the sample weight.
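On an enumerable toy output space, one can verify that the weighted objective under \(\pi_{\mathrm{ref}}\) coincides with the cross-entropy under \(\pi^*\) (a sketch with made-up probabilities):

```python
import math

# Toy enumerable output space: check that E_{pi_ref}[-w * log pi_theta]
# equals E_{pi*}[-log pi_theta], the identity derived above.
pi_ref   = [0.4, 0.3, 0.2, 0.1]
pi_theta = [0.45, 0.25, 0.2, 0.1]
r        = [1.0, 0.0, 1.0, 0.0]
beta = 0.5

Z = sum(p * math.exp(ri / beta) for p, ri in zip(pi_ref, r))
pi_star = [p * math.exp(ri / beta) / Z for p, ri in zip(pi_ref, r)]
w = [math.exp(ri / beta) / Z for ri in r]   # sample weights w(x, y)

# Weighted SFT loss: expectation under pi_ref of -w(x,y) log pi_theta(y|x)
loss_weighted = -sum(p * wi * math.log(pt)
                     for p, wi, pt in zip(pi_ref, w, pi_theta))
# Cross-entropy under the optimal policy pi*
loss_star = -sum(ps * math.log(pt) for ps, pt in zip(pi_star, pi_theta))

print(abs(loss_weighted - loss_star) < 1e-9)   # True: same objective
```

This is exactly the reweighting trick: since $\pi_{\mathrm{ref}}(y\mid x)\,w(x,y) = \pi^*(y\mid x)$, the two expectations agree term by term.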
Both forward and reverse KL perspectives lead to the same optimal policy \(\pi^*\), but they connect to the original RL objective in different ways.
Reverse KL is exactly equivalent to the original RL objective. As shown in the previous section, the reverse KL derivation reveals:
\[\begin{aligned} \mathcal{J}_{\mathrm{RL}}(\theta) &= \mathbb{E}_{x \sim \mathcal{X}}\left[\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}[r(x,y)] - \beta\,\mathrm{KL}(\pi_{\theta}(\cdot \mid x)\parallel\pi_{\mathrm{ref}}(\cdot \mid x))\right] \\ &= \mathbb{E}_{x \sim \mathcal{X}}\left[\beta\log Z(x)\right] - \beta\,\mathbb{E}_{x \sim \mathcal{X}}\left[\mathrm{KL}\left(\pi_{\theta}(\cdot \mid x) \parallel \pi^*(\cdot \mid x)\right)\right]. \end{aligned}\]Since $\log Z(x)$ is independent of $\theta$, this establishes a direct equivalence: maximizing the RL objective \(\mathcal{J}_{\mathrm{RL}}(\theta)\) is exactly equivalent to minimizing \(\mathrm{KL}(\pi_{\theta} \parallel \pi^*)\). This is not an approximation but an algebraic identity.
Forward KL is closely related through the shared optimal policy. While forward KL \(\mathrm{KL}(\pi^* \parallel \pi_{\theta})\) does not directly equal the RL objective, it shares the same global optimum. Both KL divergences achieve their minimum value of zero at $\pi_{\theta} = \pi^*$:
\[\min_{\theta} \, \mathrm{KL}(\pi^* \parallel \pi_{\theta}) = 0 \iff \pi_{\theta} = \pi^* \iff \min_{\theta} \, \mathrm{KL}(\pi_{\theta} \parallel \pi^*) = 0.\]Therefore, minimizing forward KL optimizes toward the same policy \(\pi^*\) that maximizes the RL objective. The key insight is that forward KL enables offline optimization: by sampling from a fixed reference policy \(\pi_{\mathrm{ref}}\) with importance weights $w(x,y) = \exp(r/\beta)/Z(x)$, we can approximate samples from \(\pi^*\) and perform standard supervised learning. This transforms the online RL problem into an offline weighted SFT problem.
While forward and reverse KL lead to different optimization procedures, a remarkable result shows they become nearly indistinguishable as the learned policy approaches the optimum. Specifically, the difference between the two KL directions vanishes quadratically as $\Delta_x \to 0$:
\[\big|\mathrm{KL}(\pi^{*}(\cdot\mid x)\Vert \pi_{\theta}(\cdot\mid x)) - \mathrm{KL}(\pi_{\theta}(\cdot\mid x)\Vert \pi^{*}(\cdot\mid x))\big| = \mathcal{O}(\Delta_x^{2}),\]where \(\Delta_x := \mathrm{TV}(\pi^{*}(\cdot\mid x),\,\pi_{\theta}(\cdot\mid x))\) measures the total variation distance between $\pi_{\theta}$ and \(\pi^*\).
As the learned policy $\pi_{\theta}$ approaches the optimal policy \(\pi^*\) (i.e., $\Delta_x \to 0$), the difference between forward and reverse KL objectives diminishes quadratically, meaning both converge to the same optimum. Crucially, the KL-regularized RLVR objective constrains \(\pi^*\) to remain close to $\pi_{\mathrm{ref}}$ by design: the closed-form solution \(\pi^*(y\mid x) = \pi_{\mathrm{ref}}(y\mid x)\exp(r(x,y)/\beta)/Z(x)\) shows that \(\pi^*\) is merely a reweighted version of $\pi_{\mathrm{ref}}$, with the temperature $\beta$ controlling the deviation. Since $\pi_{\theta}$ starts from (or near) $\pi_{\mathrm{ref}}$ and optimizes toward \(\pi^*\), both remain in a neighborhood of $\pi_{\mathrm{ref}}$ throughout training, ensuring $\Delta_x = \mathrm{TV}(\pi^*(\cdot\mid x), \pi_{\theta}(\cdot\mid x))$ is naturally small and validating the quadratic bound.
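A quick numeric sanity check on hypothetical three-outcome distributions illustrates the bound: each time the total variation distance halves, the gap between the two KL directions shrinks by at least a factor of four, consistent with the quadratic rate:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

pi_star = [0.5, 0.3, 0.2]

gaps, tvs = [], []
for eps in (0.04, 0.02, 0.01):
    # pi_theta drifts from pi_star by eps (still sums to 1)
    pi_theta = [0.5 + eps, 0.3 - eps, 0.2]
    gaps.append(abs(kl(pi_star, pi_theta) - kl(pi_theta, pi_star)))
    tvs.append(tv(pi_star, pi_theta))

# Each TV distance here is (approximately) eps; halving it shrinks
# the KL gap by at least 4x, i.e., quadratically or faster.
print(gaps[0] / gaps[1] >= 4.0)   # True
print(gaps[1] / gaps[2] >= 4.0)   # True
```

In this symmetric perturbation the gap in fact vanishes even faster than quadratically; the $\mathcal{O}(\Delta_x^2)$ statement is an upper bound on the discrepancy, not an exact rate.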
To systematically understand the relationship between forward and reverse KL, we provide a detailed comparison across four key dimensions: their connection to the optimal policy \(\pi^*\), sampling requirements, optimization behavior, and practical implementation.
| Aspect | Forward KL: \(\mathrm{KL}(\pi^{*} \parallel \pi_{\theta})\) | Reverse KL: \(\mathrm{KL}(\pi_{\theta} \parallel \pi^{*})\) |
|---|---|---|
| Connection to \(\pi^{*}\) | Equivalent to maximum-likelihood estimation of \(\pi_{\theta}\) from \(\pi^{*}\). | Equivalent to original RL objective: \(\max \mathcal{J}_{\mathrm{RL}} \iff \min \mathbb{E}\big[\mathrm{KL}(\pi_{\theta} \parallel \pi^{*})\big]\). |
| Sampling Strategy | Samples from fixed \(\pi_{\mathrm{ref}}\) (offline). | Samples from current policy \(\pi_{\theta}\) (online). |
| Optimization Behavior | Mode-covering: encourages \(\pi_{\theta}\) to place mass on all modes of \(\pi^{*}\). More diverse, explores broadly. | Mode-seeking: encourages \(\pi_{\theta}\) to focus on dominant modes of \(\pi^{*}\). More concentrated, focuses narrowly. |
| Practical Implementation | ✓ Offline optimization ✓ Fixed reference policy ✓ Decoupled generation-optimization ✓ Multi-epoch training | ✗ Online rollouts required ✗ Moving policy \(\pi_{\theta}\) ✗ Coupled generation-optimization ✗ Limited data reuse |
Having established the theoretical equivalence between online RLVR and offline weighted SFT, we now present Decoupling Generation & Optimization (DGO), a practical paradigm that implements the forward KL objective correctly while addressing the scaling challenges of current approaches. DGO is particularly well-suited for RLVR because verifiable rewards can be computed efficiently during the generation phase without requiring a separate reward model during optimization.
The theoretical results from the preceding sections, e.g., the closed-form optimal policy \(\pi^*\) and the forward KL reformulation to weighted SFT, establish that online RLVR can be solved offline. However, direct implementation requires addressing two practical challenges: (1) How to estimate the prompt-specific partition function $Z(x)$, and (2) How to compute and apply sample weights $w(x,y)$ in a scalable two-phase algorithm.
Partition function estimation. The sample weight $w(x,y) = \exp(r(x,y)/\beta)/Z(x)$ requires computing \(Z(x) = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot\mid x)}[\exp(r(x,y)/\beta)]\), which is intractable for large output spaces. DGO employs Monte Carlo estimation: for each prompt $x$, sample $N$ responses \(\{y_n\}_{n=1}^N \sim \pi_{\mathrm{ref}}(\cdot\mid x)\) and compute
\[\hat{Z}(x) = \frac{1}{N}\sum_{n=1}^N \exp(r(x,y_n)/\beta).\]This unbiased estimator converges to $Z(x)$ as $N \to \infty$ and enables prompt-level normalization, ensuring that training is not biased toward prompts where $\pi_{\mathrm{ref}}$ already generates high-reward responses.
Sample weight computation and application. Once $\hat{Z}(x)$ is estimated, the normalized importance weight for each sample is $w(x,y) = \exp(r(x,y)/\beta)/\hat{Z}(x)$. These weights transform the intractable forward KL objective into a tractable weighted SFT objective that can be optimized via standard mini-batch gradient descent. Crucially, because $w(x,y)$ depends only on the fixed reference policy $\pi_{\mathrm{ref}}$ and not on the evolving policy $\pi_{\theta}$, the weights remain valid throughout training, enabling multi-epoch optimization without regenerating samples.
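A sketch of the estimator and weights for a single prompt (hypothetical rollout rewards); note that the weights average to one per prompt by construction, which is exactly the prompt-level normalization described above:

```python
import math

beta = 0.1
# Hypothetical N=8 rollouts for one prompt: 3 verifiably correct, 5 incorrect.
rewards = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
N = len(rewards)

# Monte Carlo partition estimate Z_hat(x) = (1/N) sum_n exp(r_n / beta)
Z_hat = sum(math.exp(r / beta) for r in rewards) / N

# Normalized importance weights w(x, y_n) = exp(r_n / beta) / Z_hat(x)
w = [math.exp(r / beta) / Z_hat for r in rewards]

# The weights average to 1 for every prompt, so no prompt dominates
# training merely because pi_ref is already accurate on it.
print(sum(w) / N)    # ~1.0
print(w[0] > w[3])   # True: correct rollouts get (much) larger weight
```

With binary rewards and a small temperature ($\beta = 0.1$), the weight ratio between correct and incorrect rollouts is $e^{1/\beta} = e^{10}$, so the objective concentrates almost entirely on verified-correct samples.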
These implementation choices coalesce into a two-phase paradigm that fully decouples sample generation from policy optimization (see Algorithm 1).
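A minimal runnable sketch of the two phases, with hypothetical stubs standing in for the rollout engine, the verifier, and the policy log-probabilities:

```python
import math, random

random.seed(0)
beta = 0.1

def generate(prompt, n):        # stub: rollout generation from pi_ref
    return [f"{prompt}-y{i}" for i in range(n)]

def verify(prompt, response):   # stub: verifiable reward r(x, y) in {0, 1}
    return float(random.random() < 0.4)

# ---- Phase 1: generation (no optimizer state in memory) ----
prompts = ["x1", "x2", "x3"]
dataset = []
for x in prompts:
    ys = generate(x, n=8)
    rs = [verify(x, y) for y in ys]
    Z_hat = sum(math.exp(r / beta) for r in rs) / len(rs)
    for y, r in zip(ys, rs):
        dataset.append((x, y, math.exp(r / beta) / Z_hat))

# ---- Phase 2: optimization (no inference engine in memory) ----
def log_prob(x, y):             # stub: log pi_theta(y | x)
    return -4.0

# Weights are fixed, so the same dataset supports multiple epochs.
for epoch in range(2):
    loss = sum(-w * log_prob(x, y) for x, y, w in dataset) / len(dataset)

print(len(dataset))             # 24 weighted samples, reusable every epoch
```

In a real system, Phase 1 would run on a dedicated inference engine and Phase 2 would be an ordinary SFT training loop consuming the cached `(prompt, response, weight)` triples.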
DGO’s two-phase decoupling delivers critical scaling advantages. Generation can run on a dedicated inference engine with no optimizer state in memory; optimization reduces to standard weighted SFT with no inference engine or reference model resident; and, because the importance weights are fixed, the same generated dataset can be reused across multiple epochs without regeneration.
Having established the theoretical foundation of DGO, we now reveal how it relates to and fundamentally differs from existing offline RL approaches. The key insight: all offline RL methods can be unified under a single weighted SFT framework, but they differ critically in where they sample from and how they construct weights. This comparison exposes three fundamental design choices that determine whether a method is theoretically grounded, scalable, and practically effective.
Every offline RL variant (including RLVR) performs weighted supervised fine-tuning of the form:
\[\min_{\theta} \mathbb{E}_{x}\,\mathbb{E}_{y\sim q(\cdot\mid x)}\big[-w(x,y)\,\log \pi_{\theta}(y\mid x)\big],\]where two design choices fully specify the algorithm: (1) the sampling distribution $q(\cdot\mid x)$ from which training responses are drawn, and (2) the weight function $w(x,y)$ applied to each sample.
While this template appears simple, the devil is in the details. As Table 2 reveals, seemingly minor differences in these choices lead to drastically different scalability, sample efficiency, and theoretical guarantees.
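As a sketch, the shared template can be written as one function, with each method supplying its own weight rule; here we contrast DGO's normalized weights with Refit's raw-reward heuristic (stub log-probabilities and hypothetical data):

```python
import math

def weighted_sft_loss(samples, log_prob, weight_fn):
    """Generic offline template: E_{y~q}[ -w(x,y) * log pi_theta(y|x) ].
    `samples` is a list of (x, y, r) triples drawn from some sampler q."""
    ws = weight_fn(samples)
    return -sum(w * log_prob(x, y)
                for (x, y, _), w in zip(samples, ws)) / len(samples)

# DGO's weight choice: exp(r/beta) normalized by a Monte Carlo Z-hat
def dgo_weights(samples, beta=0.1):
    es = [math.exp(r / beta) for _, _, r in samples]
    z_hat = sum(es) / len(es)
    return [e / z_hat for e in es]

# Refit's heuristic choice: the raw reward itself
def refit_weights(samples):
    return [r for _, _, r in samples]

samples = [("x", "y0", 1.0), ("x", "y1", 0.0),
           ("x", "y2", 1.0), ("x", "y3", 0.0)]
log_prob = lambda x, y: -3.0   # stub log pi_theta(y | x)

loss_dgo = weighted_sft_loss(samples, log_prob, dgo_weights)
loss_refit = weighted_sft_loss(samples, log_prob, refit_weights)
print(loss_dgo > 0 and loss_refit > 0)   # True
```

Swapping `weight_fn` (and the provenance of `samples`) is all that distinguishes the rows of Table 2; the optimization loop itself is identical.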
| Algorithm | Sampling Distribution | Data Source | Weight Estimate |
|---|---|---|---|
| VAR | $$q = \pi^*\ \mathrm{(implicit)}$$ | Limited, Pre-Collected Positive Samples | $$w(x,y) = \frac{\pi_{\mathrm{ref}}(y\mid x) \exp(r(x,y)/\beta)}{\sum_y \pi_{\mathrm{ref}}(y\mid x) \exp(r(x,y)/\beta)}$$ |
| SPR | $$q = \pi^*\ \mathrm{(implicit)}$$ | Limited, Pre-Collected Positive Samples | $$w(x,y) = \exp((Q(x,y) - W(x)) / \beta)$$ |
| DFT | $$q = \pi^* \ \mathrm{(implicit)}$$ | Limited, Pre-Collected Positive Samples | $$w(x,y) = \mathrm{stop\_grad}(\pi_{\theta}(y \mid x))$$ |
| iw-SFT | $$q = \pi_{\mathrm{ref}}$$ | Unlimited, Generated & Curated Positive Samples | $$w(x,y) = q(y\mid x)/\pi_{\mathrm{ref}}(y \mid x)$$ |
| Refit | $$q = \pi_{\mathrm{ref}}$$ | Unlimited, Generated Samples | $$w(x,y) = r(x,y)$$ |
| DGO (Ours) | $$q = \pi_{\mathrm{ref}}$$ | Unlimited, Generated Samples | $$w(x,y) = \frac{\exp(r(x,y)/\beta)}{\frac{1}{N} \sum_{n=1}^N \exp(r(x,y_n)/\beta)}$$ |
Table 2 reveals two fundamental dimensions along which offline RL methods differ, each with profound implications.
Dimension 1: Where do samples come from? (Implicit \(\pi^*\) vs. Explicit $\pi_{\mathrm{ref}}$)
Sampling distribution is the most striking division, separating VAR, SPR, and DFT from iw-SFT, Refit, and DGO. The former group assumes an implicit optimal policy \(q = \pi^*\) and relies on pre-collected positive samples, and this is their Achilles’ heel. It imposes two crippling limitations. (1) Training is fundamentally limited by the size of pre-collected datasets: you cannot scale by simply generating more samples. (2) Sampling from \(\pi^*\) risks catastrophic forgetting: the data-derived \(\pi^*\) (defined by curated samples) can deviate substantially from $\pi_{\mathrm{ref}}$, potentially covering regions beyond what the KL-regularized RL objective would prescribe. Training exclusively on such samples may therefore erase capabilities encoded in $\pi_{\mathrm{ref}}$, degrading performance on the broader distribution. In stark contrast, iw-SFT, Refit, and DGO sample from an explicit, controllable reference policy $q = \pi_{\mathrm{ref}}$. This unlocks unlimited scalability: generate as many samples as you need, whenever you need them; the cost of data becomes inference time rather than expensive human annotation or cherry-picking. Moreover, sampling around $\pi_{\mathrm{ref}}$ naturally preserves the reference policy’s capabilities, mitigating catastrophic forgetting while still improving task performance through importance weighting.
Dimension 2: How do you weight samples? (Theoretically grounded vs. Heuristic)
The weight formulas in Table 2 tell a revealing story about the gap between theoretical and empirical weights: only DGO’s weight $w(x,y) = \exp(r(x,y)/\beta)/\hat{Z}(x)$ matches the closed-form optimal policy, whereas Refit falls back on the raw reward $r(x,y)$ and DFT on the model’s own stop-gradient probability, heuristics that approximate rather than implement what the theory prescribes.
To validate our theoretical framework and demonstrate DGO’s practical advantages, we conduct comprehensive experiments on mathematical reasoning tasks using the GSM8K and MATH benchmarks.
Datasets. We evaluate on two mathematical reasoning benchmarks: GSM8K (grade school math, 7,473 training problems, max completion length 1024) and MATH (high school competition problems, 12,000 training problems, max completion length 2048). For each prompt, we generate $N=8$ rollout responses from the reference policy and evaluate them using exact-match verification against ground-truth answers, a canonical example of verifiable rewards.
Baselines. We compare against five representative methods spanning the design space: the untuned base model (Baseline), SFT, VAR, GRPO, and Refit (see Table 3).
Implementation. All methods use LoRA (rank 8, alpha 64, dropout 0.05) with AdamW optimizer (lr=5e-6, weight decay=0.1, warmup ratio=0.1, cosine lr scheduler). SFT and VAR train on curated demonstrations; GRPO, Refit, and DGO generate $N=8$ rollouts per prompt. GRPO performs online optimization with reference model loaded; Refit and DGO perform offline optimization without reference model. We set temperature $\beta=0.1$ for DGO’s importance weights. All methods train for the same number of gradient steps to ensure fair comparison.
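For concreteness, these settings might be collected into a single configuration object (field names are illustrative, not tied to any particular trainer):

```python
# Illustrative hyperparameter dictionary mirroring the setup described above.
config = {
    "lora": {"rank": 8, "alpha": 64, "dropout": 0.05},
    "optimizer": {"name": "AdamW", "lr": 5e-6, "weight_decay": 0.1},
    "schedule": {"warmup_ratio": 0.1, "lr_scheduler": "cosine"},
    "rollouts_per_prompt": 8,     # N rollouts sampled from pi_ref
    "beta": 0.1,                  # temperature for DGO importance weights
}

print(config["rollouts_per_prompt"])  # 8
```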
We now present a comprehensive empirical comparison that directly tests our theoretical claims. Table 3 compares DGO against representative baselines across both performance and resource efficiency dimensions, revealing how our framework translates from theory to practice.
| Method | Data Source | Accuracy | Time Cost | Peak Memory |
|---|---|---|---|---|
| GSM8K | ||||
| Baseline | — | 87.72% | — | — |
| SFT | Demonstrations | 88.40% (+0.68) | 2.85h | 57.59G |
| VAR | Demonstrations | 88.64% (+0.92) | 6.46h | 58.39G |
| GRPO | Rollouts (Online) | 90.01% (+2.29) | 45.35h | 82.95G |
| Refit | Rollouts (Offline) | 89.23% (+1.51) | 6.96h | 57.24G |
| DGO (Ours) | Rollouts (Offline) | 90.67% (+2.95) | 6.79h | 57.23G |
| MATH | ||||
| Baseline | — | 55.80% | — | — |
| SFT | Demonstrations | 57.26% (+1.46) | 5.93h | 70.68G |
| VAR | Demonstrations | 60.17% (+4.37) | 12.82h | 71.24G |
| GRPO | Rollouts (Online) | 61.37% (+5.57) | 55.68h | 83.96G |
| Refit | Rollouts (Offline) | 58.36% (+2.56) | 14.31h | 69.37G |
| DGO (Ours) | Rollouts (Offline) | 61.96% (+6.16) | 13.79h | 68.96G |
The empirical results presented in Table 3 provide strong evidence for the theoretical principles underlying DGO, demonstrating that our framework achieves superior performance while maintaining computational efficiency across both benchmarks. We summarize the main observations as follows:
DGO achieves the best task performance while maintaining resource efficiency. On GSM8K, DGO reaches 90.67% accuracy (+2.95% over baseline), outperforming GRPO (90.01%). On the more challenging MATH benchmark, DGO achieves 61.96% accuracy (+6.16%), surpassing both GRPO (61.37%) and VAR (60.17%) with significantly lower resource requirements. This validates our core claim: theoretically-grounded importance weighting enables better learning from the same data.
Decoupling generation from optimization unlocks dramatic efficiency gains. Comparing GRPO (online, coupled) vs. DGO (offline, decoupled) with identical rollout budgets ($N=8$), DGO reduces training time by 85% on GSM8K and 75% on MATH while cutting peak memory by 31% on GSM8K and 18% on MATH. This directly validates our architectural insight: the memory overhead and slowdown of online RLVR are not fundamental; they stem from unnecessary coupling.
Theoretically-grounded weighting outperforms heuristics. DGO consistently outperforms Refit (which uses raw rewards $r(x,y)$ as weights) by +1.44% on GSM8K and +3.60% on MATH, despite both methods using identical rollout data and offline optimization. This gap directly reflects the value of DGO’s principled importance weights $w(x,y) = \exp(r/\beta)/\hat{Z}(x)$, which correctly normalize across prompts and apply temperature scaling. Theory-practice alignment matters.
Demonstration quality vs. data scalability trade-off. VAR achieves strong results on MATH (60.17%) by leveraging high-quality curated demonstrations, but is fundamentally bottlenecked by dataset size: performance is capped by what humans have annotated. In contrast, DGO’s ability to generate unlimited rollouts from $\pi_{\mathrm{ref}}$ enables it to surpass VAR (+1.79% on MATH) while maintaining scalability. On GSM8K, where demonstrations are more abundant, the gap is smaller but DGO still leads, demonstrating that compute-limited scaling beats data-limited scaling.
Convergence dynamics. Figure 1 tracks convergence across 1000 training steps. DGO exhibits the fastest and most stable convergence. Demonstration-based methods (SFT, VAR) show slower, erratic convergence, reflecting the mismatch between their implicit \(\pi^*\) (defined by demonstrations) and the KL-regularized objective when demonstrations deviate from \(\pi_{\text{ref}}\). GRPO converges more slowly than DGO despite identical rollout data, likely due to variance from online importance sampling. Refit shows intermediate behavior, confirming that raw reward weighting lacks DGO’s theoretically-grounded normalization. These dynamics demonstrate that theoretically-grounded importance weights enable faster, more stable convergence than either demonstration-based methods or heuristic alternatives.
To test our claim that sampling from $\pi_{\mathrm{ref}}$ preserves general capabilities, we evaluate all GSM8K-trained models on four out-of-domain benchmarks: PIQA (physical commonsense), HellaSwag (commonsense inference), Winogrande (commonsense coreference resolution), and RACE-high (reading comprehension).
| Method | PIQA | HellaSwag | Winogrande | RACE-high |
|---|---|---|---|---|
| Baseline | 71.22 | 81.80 | 65.19 | 79.25 |
| SFT | 70.35 (−0.87) | 80.56 (−1.24) | 63.14 (−2.05) | 84.33 (+5.08) |
| VAR | 70.26 (−0.96) | 80.73 (−1.07) | 63.24 (−1.95) | 82.13 (+2.88) |
| GRPO | 70.69 (−0.53) | 81.03 (−0.77) | 64.26 (−0.93) | 80.26 (+1.01) |
| Refit | 70.53 (−0.69) | 81.26 (−0.54) | 64.01 (−1.18) | 80.11 (+0.86) |
| DGO (Ours) | 71.01 (−0.21) | 81.13 (−0.67) | 64.94 (−0.25) | 79.93 (+0.68) |
Table 4 reports out-of-domain benchmark performance after fine-tuning on GSM8K. The results validate that DGO’s explicit sampling from $\pi_{\mathrm{ref}}$ provides inherent regularization against catastrophic forgetting. We highlight the following key observations:
DGO exhibits minimal forgetting across general benchmarks. DGO shows the smallest performance drops on PIQA (−0.21%) and Winogrande (−0.25%), with near-baseline retention on RACE-high (+0.68%). In contrast, demonstration-based methods (SFT, VAR) suffer larger degradation, particularly on Winogrande (−2.05% and −1.95%), validating our theoretical insight: sampling from $\pi_{\mathrm{ref}}$ keeps training data anchored around the reference distribution, naturally preserving its capabilities.
Demonstration-based methods (SFT, VAR) show unexpected gains on RACE-high. The large positive shifts (+5.08% for SFT, +2.88% for VAR) on reading comprehension likely reflect dataset artifacts: GSM8K demonstrations contain rich mathematical reasoning narratives that transfer positively to RACE-high’s comprehension format. However, this comes at the cost of larger drops on other tasks, suggesting overfitting to demonstration distribution rather than robust capability preservation.
Online RL (GRPO) achieves balanced retention. GRPO’s moderate forgetting across all tasks (−0.53% to −0.93% on most benchmarks) suggests that KL regularization provides some protection against catastrophic forgetting, but at massive computational cost (45.35h vs. 6.79h for DGO). DGO matches or exceeds GRPO’s forgetting resistance with 4-6× speedup.
All rollout-based methods (GRPO, Refit, DGO) avoid overfitting. Unlike demonstration-based approaches, methods that generate diverse rollouts from $\pi_{\mathrm{ref}}$ show minimal anomalous gains, maintaining near-baseline performance on out-of-domain tasks. This confirms that explicit $\pi_{\mathrm{ref}}$ sampling not only enables scalability but also provides implicit regularization against distribution shift.
Our empirical results comprehensively validate the theoretical claims established in previous sections:
Theoretically-grounded weighting improves learning: DGO’s importance weights $w(x,y) = \exp(r/\beta)/\hat{Z}(x)$ consistently outperform heuristic alternatives (Refit) and match or exceed online RL (GRPO) in task performance.
Decoupling eliminates architectural bottlenecks: By separating generation from optimization, DGO achieves 4-6× speedup and 18-31% memory reduction compared to online RL, proving the coupling is unnecessary.
Sampling from $\pi_{\mathrm{ref}}$ preserves capabilities: DGO exhibits minimal catastrophic forgetting (≤0.67% drop on most benchmarks), validating that explicit reference sampling provides implicit regularization.
Compute-limited beats data-limited: DGO surpasses demonstration-based methods (VAR) despite using generated rollouts, demonstrating that unlimited scalability through $\pi_{\mathrm{ref}}$ sampling outweighs curated dataset quality.
Together, these results establish DGO as the first practical implementation where theory, scalability, and performance converge, proving that online RLVR can indeed be scaled when done right through principled decoupling and theoretically-grounded design.
The computational barriers limiting online RLVR are architectural, not fundamental. By reformulating online RL as offline weighted SFT through forward KL, DGO resolves this: decoupling generation from optimization with theoretically-grounded importance weights $w(x,y) = \exp(r/\beta)/\hat{Z}(x)$ achieves 4-6× speedup, 18-31% memory reduction, and superior performance while preserving capabilities. Unlike existing methods that sample from implicit optimal policies or use heuristic weights, DGO perfectly aligns theory with implementation: what we optimize matches what theory prescribes. For RLVR, verifiable rewards eliminate reward hacking and remove reward model overhead. Scaling online RLVR is achievable when done right: through principled decoupling, theoretical grounding, and explicit reference sampling.