dLLM - Rethinking Generation Beyond Autoregressive Models

Diffusion large language models (dLLMs) provide an alternative to autoregressive Transformers, supporting parallel token generation and flexible infilling. They excel in structured, long-horizon, or data-constrained settings, though challenges remain with output length, denoising, and blockwise generation. Hybrid approaches combining diffusion for reasoning and autoregressive for generation show promise.

Introduction

Despite the existence of a plethora of architectures and learning objectives, most language models in the current pantheon follow a time-tested recipe: a Transformer backbone trained using a next-token prediction objective, with model outputs generated autoregressively. This recipe, while simple, has proved surprisingly resilient and is still the predominant paradigm for training language models.

To recap, autoregressive models (ARMs) factorize the joint probability of a token sequence \(x_{1:T}\) of length \(T\) into a product of next-step conditionals:

\[p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{< t})\] \[x_{< t} = (x_1, x_2, \ldots, x_{t-1})\]

Here, \(p(x_{1:T})\) denotes the joint probability of the entire sequence, \(p(x_t \mid x_{< t})\) represents the probability of the next token given all previous tokens, and \(x_{< t}\) refers to the prefix of the sequence up to time step \(t-1\). This decomposition expresses a high-dimensional distribution as a product of simpler conditional distributions, which is the defining property of autoregressive models.
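To make the factorization concrete, here is a minimal sketch that scores a sequence by summing the log-probabilities of its next-token conditionals. The bigram table and the helper names (`BIGRAM`, `next_token_prob`, `sequence_log_prob`) are made up for illustration; a real ARM would condition on the full prefix rather than only the previous token.

```python
import math

# Hypothetical toy conditional model: p(x_t | x_<t) depends only on the
# previous token. The probabilities are invented for illustration.
BIGRAM = {
    ("<bos>", "a"): 0.6, ("<bos>", "b"): 0.4,
    ("a", "a"): 0.1, ("a", "b"): 0.9,
    ("b", "a"): 0.5, ("b", "b"): 0.5,
}

def next_token_prob(prefix, token):
    """Return p(token | prefix) under the toy model."""
    prev = prefix[-1] if prefix else "<bos>"
    return BIGRAM[(prev, token)]

def sequence_log_prob(tokens):
    """log p(x_{1:T}) = sum over t of log p(x_t | x_{<t})."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(next_token_prob(tokens[:t], token))
    return total

log_p = sequence_log_prob(["a", "b", "a"])
joint = math.exp(log_p)  # equals 0.6 * 0.9 * 0.5 for this table
```

Note that the loop is inherently sequential: each term needs the prefix that precedes it, which is the bottleneck discussed next.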

These models generate sequences one token at a time. This formulation comes with some inherent disadvantages:

Generating tokens one at a time imposes a sequential bottleneck. This means that generation latency scales with output length. While methods like KV caching make subsequent decoding faster, they are not able to eliminate the sequential dependency.
Errors can rapidly accumulate and snowball, as an incorrectly generated token becomes part of the context that all subsequent tokens rely on.
Once a token is generated, there is no possibility for editing or revision in-place; the generated token remains part of the context for the rest of the sequence.
ARMs optimize for token-level likelihood, which does not always correlate with sequence-level goals. The resulting tunnel vision can impede the ability of the model to plan over long horizons or maintain coherence over longer outputs.
Hallucinations may get amplified with ARMs, because once an incorrect token is generated, it is used as context for subsequent generations, thereby leading the model to generate coherent narratives over incorrect facts.
Theoretically, certain distributions can be represented more efficiently by models that marginalize over latent variables; representing the same distributions via a purely autoregressive formulation can require scaling up the model parameter size super-polynomially with input length.

As the bitter lesson shows, working towards learning more general capabilities and scaling them has proven more fruitful than injecting task or domain-specific rules into a model. In a similar vein, we can challenge the inductive biases of ARMs (i.e. that language sequences are generated left-to-right), thus providing the model with more freedom.

One such alternative to ARMs is the diffusion paradigm. Diffusion models have seen great success in computer vision, but just like many other techniques that have found success in computer vision, adapting them to text has been hard, primarily due to the discrete nature of language. Therefore, in language modeling, most diffusion research has focused on discrete diffusion models.

In this blog, we focus specifically on masked discrete diffusion models, termed dLLMs. Masked diffusion models have garnered a lot of interest recently, with a rapidly growing body of work in this paradigm.

Masked diffusion works by:

Defining a forward noising process that gradually replaces tokens with a special token (typically [MASK]), and
Learning a reverse denoising process that iteratively predicts and “locks in” tokens until the sequence is fully unmasked.

The masked diffusion learning objective looks similar to masked language modeling (e.g. BERT), or denoising autoencoding (e.g. BART). However, there is a key difference differentiating diffusion from these other objectives. Diffusion models are trained across a range of corruption levels and use an iterative sampling process that starts from a fully or almost fully noisy sequence and progressively denoises it, rather than performing a single reconstruction from a fixed corruption scheme, as in BERT or BART.

Diffusion models are still generative models. Unlike masked language modeling, where the masking rate is fixed during training, diffusion models randomly sample a masking rate (or “time”) between 0 and 1 for each example.

A typical masked diffusion training objective can be expressed as a time-weighted cross entropy calculated over masked tokens:

\[t \sim \mathcal{U}(0,1), \qquad x_t \sim q(x_t \mid x_0, t)\] \[\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, t,\, x_t} \Big[ w(t)\, \sum_{i \in \mathcal{M}(x_t)} -\log p_\theta(x_{0,i} \mid x_t, t) \Big]\]

In this formulation, we first sample a time variable

\[t \sim \mathcal{U}(0,1)\]

representing a point along a continuous noising schedule. Given the original data \(x_0\), we then generate a partially corrupted version

\[x_t \sim q(x_t \mid x_0, t)\]

where the corruption is controlled by the sampled time \(t\).

The model is trained to reverse this corruption using the following loss function:

\[\mathcal{L}(\theta) = \mathbb{E}_{x_0, t, x_t} \Big[ w(t) \sum_{i \in \mathcal{M}(x_t)} -\log p_\theta(x_{0,i} \mid x_t, t) \Big]\]

Here, \(\mathcal{M}(x_t)\) indicates the positions in \(x_t\) that have been masked or corrupted, and \(w(t)\) is a weighting term that ensures heavily corrupted examples do not dominate the learning signal.

Intuitively, the loss encourages the model to predict the original tokens \(x_{0,i}\) at the masked positions, given the noisy input \(x_t\) and the noise level \(t\). By learning to undo the corruption at every point along the noise schedule, the model effectively learns a denoising process that can reconstruct clean data from partially corrupted inputs. This principle is central to diffusion-based generative modeling and related reconstruction tasks.
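The loss above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a training implementation: `predict_probs` stands in for \(p_\theta\) and is a hypothetical callable, the uniform model is a placeholder, and \(w(t) = 1/t\) is just one possible choice of weighting.

```python
import math

MASK = "[MASK]"

def masked_diffusion_loss(x0, xt, predict_probs, t, weight=lambda t: 1.0 / t):
    """w(t) * sum over masked positions i of -log p_theta(x0_i | x_t, t)."""
    nll = 0.0
    for i, (clean, noisy) in enumerate(zip(x0, xt)):
        if noisy == MASK:  # only masked positions contribute to the loss
            probs = predict_probs(xt, t, i)  # dict: token -> probability
            nll += -math.log(probs[clean])
    return weight(t) * nll

# Hypothetical stand-in for the model: uniform over a tiny vocabulary.
VOCAB = ["he", "invented", "the", "parallelogram"]

def uniform_model(xt, t, i):
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

x0 = ["he", "invented", "the", "parallelogram"]
xt = ["he", MASK, "the", MASK]  # one possible draw from q(x_t | x_0, t)
loss = masked_diffusion_loss(x0, xt, uniform_model, t=0.5)
```

With two masked positions, a uniform model over four tokens, and \(w(0.5) = 2\), the loss comes out to \(2 \cdot 2 \cdot \log 4\). In a real training loop the expectation over \(x_0\), \(t\), and \(x_t\) is approximated by averaging this quantity over a batch.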

Autoregressive models optimize the maximum likelihood objective directly. Diffusion models, by contrast, are typically derived from a variational formulation (ELBO / NELBO), though many practical implementations use a weighted mixture of masked-token cross-entropies, as shown above.

Characteristics of Diffusion Models

From an end-user standpoint, diffusion models are said to generate by infilling (iterative refinement of a partially completed sequence). This is especially suitable for tasks like coding or reasoning, which are often non-linear. Infilling also provides opportunities for personalization and enhances structured generation. The decoding order is also configurable.

Some obvious benefits of diffusion models include the ability to perform any-order modeling, in-place context modification, and parallel token prediction.

Let’s now explore the mechanics of masked diffusion in detail.

Masked Diffusion Explained

Masked diffusion can be implemented independent of the architecture. For example, masked diffusion can use state-space models as the backbone.

Forward Process

Consider an example sequence x in the training dataset:

‘He invented the parallelogram as a means to exact vengeance upon his detractors’

A number \(t\), sampled randomly between 0 and 1 (in practice often drawn from a discrete mask schedule), serves as the mask strength. Each token in the sequence is replaced by a [MASK] token with probability \(t\).

For example, if \(t=0.2\),

‘He invented the [MASK] as a means to [MASK] vengeance upon [MASK] detractors’

If \(t=0.8\),

‘[MASK] invented [MASK] [MASK] [MASK] [MASK] means [MASK] exact vengeance [MASK] [MASK] [MASK]’

Let’s refer to the masked sequence as \(x_t\).
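The forward process is simple enough to sketch directly. The snippet below assumes whitespace tokenization purely for illustration (real models operate on subword tokens), and `forward_mask` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    """Replace each token with [MASK] independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

sentence = ("He invented the parallelogram as a means to exact "
            "vengeance upon his detractors")
tokens = sentence.split()  # whitespace tokenization, for illustration only

rng = random.Random(0)  # seeded so that repeated runs give the same masks
for t in (0.2, 0.8):
    x_t = forward_mask(tokens, t, rng)
    print(t, " ".join(x_t))
```

Because each position is masked independently, the realized fraction of masked tokens only matches \(t\) on average; a short sequence can easily end up with more or fewer masks than \(t\) suggests.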

The objective of the model is to predict the masked tokens in \(x_t\). The training loss is typically the cross entropy over the masked tokens, with a normalization or weighting term such that examples with higher masking rates do not contribute disproportionately to the training signal. One common normalization scheme is to divide the loss by the masking strength \(t\). Alternatively, some implementations normalize by the number of masked tokens.

The key difference between masked language models like BERT and diffusion models is that in BERT, the corruption policy (masking rate) is fixed throughout training, while in masked diffusion models the masking rate \(t\) varies per example across a range of masking rates.

Reverse Process

Prompt: ‘Is Socotra a real place?’ Response: ‘Yes, Socotra is an island in Yemen.’

A typical reverse process proceeds as follows:

For a given prompt \(p\), an initial answer is generated consisting entirely of [MASK] tokens. The response length is typically a hyperparameter.

The reverse process (i.e. denoising) runs for K steps, which is typically a hyperparameter.

At each step, the model predicts tokens for all the [MASK] positions at once, conditioned on the prompt and the currently unmasked tokens. It then commits some tokens (unmasks them) and remasks a portion of tokens (often low-confidence ones), either randomly or via heuristics. Generation stops after all denoising steps are completed. If an <EOS> token is present in the final output, then the tokens after it are discarded.

In practice, there is a discrepancy between training and inference: during training, the model sees sequences corrupted at a randomly sampled masking rate, whereas during inference the whole output often starts off fully masked and is gradually unmasked.

Training

Diffusion language models can continue using the same Transformer backbones that underpin today’s language models. The primary change is in the learning objective, where instead of predicting the next token in a sequence as autoregressive models do, diffusion models are taught to predict all the masked tokens in a sequence simultaneously. This also means that dLLMs can be built using non-Transformer backbones (e.g. state-space models), as long as the architecture supports conditioning on a partially observed sequence.

Typical Training Pipeline

Pre-train from scratch or perform continued pre-training
Midtraining/annealing
Instruction tuning
Reinforcement learning

Note that dLLMs can either be trained from scratch or adapted from a base ARM.

Pre-training From Scratch

When training from scratch, the next-token prediction objective is typically more sample and compute efficient than diffusion in practice. This is because in dLLMs, the loss is typically calculated only over the masked tokens, so each forward pass supervises fewer target positions than an AR pass. As a result, given the same architecture, compute, and data, AR baselines typically train faster and reach higher quality, though the exact gap depends on the masking schedules, weighting schemes, and decoding strategies used.

Adapting AR Models to dLLMs

Pre-training from scratch is not the only option; one can also adapt existing autoregressive models to support diffusion. The adaptation is typically carried out using continued pre-training. In this technique, we take a stable checkpoint of an ARM, replace the causal mask with a bidirectional mask, and then continue pre-training it with the diffusion learning objective.

Chandrasegaran et al. propose that self-attention weights should be updated with a relatively higher learning rate to help adapt the model to the diffusion paradigm. The feed forward layers are trained at a relatively lower learning rate so that world knowledge and other capabilities learned during the AR pre-training stage are retained. This helps mitigate catastrophic forgetting. They also observe that dLLMs benefit from larger batch sizes during continued pre-training.

Other techniques for adaptation from ARMs include:

Grafting, where the architecture is edited by swapping causal attention blocks for bidirectional attention blocks.
Attention mask annealing, where the causal mask is gradually converted into a bidirectional one during training.

Masked language models like BERT can also be converted into diffusion models using the continued pre-training approach.

Inference

During inference, the model starts from a masked output sequence and generates tokens through a series of denoising steps. Within a denoising step, the masked positions are typically predicted in parallel, followed by a commit-or-remask decision.

A basic denoise-and-remask procedure works as follows. We first initialize the output sequence to be entirely masked:

\[y^{(0)} = [\text{MASK}]^L\]

where \(L\) is the sequence length, typically chosen as a hyperparameter. Then, for a fixed number of iterations \(k = 1, \dots, K\), we perform the following steps:

Predict masked token distributions: Using the model, we estimate the probability distribution over tokens at the currently masked positions:

\[p_\theta(\cdot \mid p, y^{(k-1)}, k)\]

where \(p\) may represent any conditioning information (e.g., a prompt or context), \(y^{(k-1)}\) is the sequence from the previous iteration, and \(k\) indicates the current step.

Commit a subset of positions: We select a subset of tokens to “commit” to the output sequence. This is usually based on a confidence criterion, such as selecting the highest-probability tokens or those with the lowest entropy.
Optional remasking: To refine the sequence, a heuristic or schedule may remask a subset of previously committed tokens that are considered low-confidence. This allows the model to revisit uncertain predictions in subsequent iterations.
Update the sequence: The newly committed tokens replace the previous masked positions to form the updated sequence \(y^{(k)}\).

After completing all \(K\) iterations, the final sequence \(y^{(K)}\) is returned. If an end-of-sequence token <EOS> appears, any tokens following it are discarded.

This iterative denoising procedure gradually replaces masks with high-confidence predictions, while optionally revisiting uncertain tokens. Over multiple steps, the sequence converges toward a coherent output that reflects both the learned model distribution and any conditioning context.
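The loop above can be sketched as follows. This is a simplified illustration: it commits an equal share of the remaining masks at each step by confidence and omits the optional remasking step. The `predict` callable is a hypothetical stand-in for the model, and `toy_predict` simply returns a fixed target with made-up confidences.

```python
import math

MASK = "[MASK]"
EOS = "<EOS>"

def denoise(prompt, length, num_steps, predict):
    """Confidence-based iterative unmasking (no remasking).

    predict(prompt, y, k) is a hypothetical model call that returns a
    (token, confidence) pair for every position of y.
    """
    y = [MASK] * length  # y^(0): fully masked output
    for k in range(1, num_steps + 1):
        masked = [i for i, tok in enumerate(y) if tok == MASK]
        if not masked:
            break
        preds = predict(prompt, y, k)
        # Spread the remaining masks evenly over the remaining steps.
        n_commit = math.ceil(len(masked) / (num_steps - k + 1))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in best[:n_commit]:
            y[i] = preds[i][0]  # commit the most confident predictions
    if EOS in y:
        y = y[: y.index(EOS)]  # discard everything from the first <EOS>
    return y

# Toy stand-in for the model: always predicts a fixed target and is
# more confident about earlier positions.
TARGET = ["Yes", ",", "Socotra", "is", "an", "island", EOS, EOS]

def toy_predict(prompt, y, k):
    return [(TARGET[i], 1.0 / (i + 1)) for i in range(len(y))]

out = denoise("Is Socotra a real place?", length=8, num_steps=4,
              predict=toy_predict)
```

With 8 positions and 4 steps, two positions are committed per step, and the two trailing `<EOS>` tokens are dropped from the returned sequence.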

The initial output sequence can be fully masked, or it can contain parts of the output that are already known, leaving the model to infill the tokens that are not yet known. This can be operationalized in a few ways, such as using constrained endings or structured infilling. We will explore structured infilling in detail.

Structured Infilling

Instead of asking the model to generate output in a specific structured format (like JSON), we use a structured format template and let the dLLM fill in the blanks.

For a given structured format, we have:

Invariant tokens (syntax, labels, brackets etc.), which stay unmasked
Variable tokens (values, contents) which are masked

An advantage of infilling templates is that they shrink the search space during generation, since the model need only choose content tokens and not tokens related to the syntax. Another advantage is that they implicitly ensure that the structured format is adhered to during generation.

The tricky part of this technique is in deciding how many masked tokens to allocate for the variables. If the masked tokens added are inadequate, the output has to be truncated. If too many masked tokens are added, then the model tries to fill in the extraneous masked tokens with content, leading to unpredictable outcomes.

Self-adaptive schema scaffolding (S3) addresses this issue by allowing the model to output a special null token upon which generation for that variable block stops, leaving the remaining slots empty.
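As an illustration of the idea, the sketch below builds a JSON-like scaffold in which syntax tokens are fixed and each value gets a fixed number of masked slots, then trims a filled value at a null token. The helper names, the token granularity, and the `<NULL>` token string are assumptions made for this example and are not taken from the S3 paper.

```python
MASK = "[MASK]"
NULL = "<NULL>"  # hypothetical spelling of the special null token

def build_template(fields, slots_per_field):
    """Return (tokens, variable_positions) for a JSON-like scaffold.

    Invariant tokens (braces, keys, quotes, commas) stay unmasked;
    only the value slots are masked and left for the dLLM to fill.
    """
    tokens, variable_positions = ["{"], []
    for n, name in enumerate(fields):
        tokens += [f'"{name}"', ":", '"']
        for _ in range(slots_per_field):
            variable_positions.append(len(tokens))
            tokens.append(MASK)
        tokens.append('"')
        if n < len(fields) - 1:
            tokens.append(",")
    tokens.append("}")
    return tokens, variable_positions

def strip_after_null(value_tokens):
    """Keep only the tokens before the first null token, if any."""
    if NULL in value_tokens:
        return value_tokens[: value_tokens.index(NULL)]
    return value_tokens

tokens, variable_positions = build_template(["name", "country"],
                                            slots_per_field=3)
filled_value = strip_after_null(["Socotra", NULL, NULL])
```

Over-allocating slots becomes less risky with a null token, because the model can end a value early instead of padding it with extraneous content.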

Decoding Strategies

The behavior of dLLMs at inference time depends not only on the model itself, but also on the decoding strategy used to decide which tokens to commit or remask at each denoising step. Different decoding strategies introduce different trade-offs between efficiency, stability, diversity, and output quality. Below, we describe several common decoding strategies used in dLLMs.

Random

A simple baseline is random unmasking, where the positions to commit/unmask at each step are chosen uniformly at random. However, in practice, heuristics tend to be more efficient and produce higher quality outputs.

Confidence-Based Sampling

Confidence-based sampling is a common strategy in iterative denoising or masked sequence generation. In this approach, tokens with high confidence are “locked in,” while low-confidence tokens may be remasked for further refinement.

However, this strategy is not always optimal. High-confidence tokens are often syntactic or structurally predictable, which can cause the model to commit to the surface structure of the sequence too early, potentially limiting the flexibility of subsequent generation.

A typical way to quantify the confidence of a token at position \(i\) is:

\[c_i = \max_v p_\theta(v \mid \text{context})\]

where \(p_\theta(v \mid \text{context})\) is the predicted probability of token \(v\) given the current context, and \(c_i\) represents the confidence score for position \(i\). Tokens with higher \(c_i\) are more likely to be committed, while those with lower confidence can be remasked and reconsidered in future iterations.

This method provides a simple and interpretable heuristic for guiding which positions to finalize versus which to refine, balancing stability and flexibility in the generated sequence.

Entropy-Based Sampling

This technique uses entropy as a confidence measure, where lower entropy implies higher confidence. This is often more robust than raw probability thresholds. A common way to calculate the entropy at position \(i\) is:

\[H_i = - \sum_v p_i(v) \log p_i(v)\]

where \(p_i(v)\) is the probability assigned to token \(v\) at position \(i\). Here, \(H_i\) measures the uncertainty of the model’s prediction: positions with low entropy correspond to confident predictions, while positions with high entropy indicate ambiguity.

Margin-Based Sampling

Margin-based sampling uses a second-order confidence measure: we take the difference between the confidence of the most probable token and the second most probable token as the margin, and select only tokens that have a high enough margin.

Formally, let \(v_1\) and \(v_2\) be the most probable and second-most probable tokens at position \(i\). The margin is defined as:

\[m_i = p_i(v_1) - p_i(v_2)\]

where \(p_i(v_1)\) and \(p_i(v_2)\) are the probabilities assigned to these tokens.

A higher margin \(m_i\) indicates that the model is strongly favoring the top token over the runner-up, while a small margin suggests uncertainty. During iterative generation, margin-based sampling allows the model to commit tokens with high certainty while deferring those with ambiguous predictions for further refinement.
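The confidence, entropy, and margin scores can be compared side by side on the same predicted distribution. The sketch below uses plain Python lists as probability vectors; the two example distributions are made up to show a case where the scores disagree about how much to trust a position.

```python
import math

def confidence(p):
    """c_i: probability of the most likely token."""
    return max(p)

def entropy(p):
    """H_i = -sum_v p(v) log p(v), skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in p if q > 0)

def margin(p):
    """m_i: gap between the top two token probabilities."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]

peaked = [0.90, 0.05, 0.05]   # one clear winner
two_way = [0.48, 0.47, 0.05]  # two near-tied candidates

for name, p in (("peaked", peaked), ("two_way", two_way)):
    print(name, confidence(p), entropy(p), margin(p))
```

The second distribution still has a moderately high top probability, but its margin is close to zero, which is the kind of ambiguity that margin-based selection is meant to catch.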

EB Sampler

Entropy-bounded (EB) sampling typically commits tokens until an entropy budget/constraint is met (e.g. keep committing the lowest-entropy positions until the remaining masked positions have entropy above a target, or until a step-wise budget is exhausted).
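One possible reading of the step-wise budget variant is sketched below: positions are visited in order of increasing entropy and committed until adding the next one would exceed the budget. This is a simplified illustration and not necessarily the exact rule used by any published EB sampler; `eb_select` and its arguments are hypothetical names.

```python
def eb_select(entropies, budget):
    """Pick masked positions by increasing entropy until the budget is spent.

    entropies: dict mapping masked position -> entropy H_i
    budget:    total entropy allowed to be committed in this step
    At least one position is always returned so decoding makes progress.
    """
    order = sorted(entropies, key=entropies.get)
    chosen, spent = [], 0.0
    for i in order:
        if chosen and spent + entropies[i] > budget:
            break
        chosen.append(i)
        spent += entropies[i]
    return chosen

step_entropies = {0: 0.10, 1: 0.90, 2: 0.20, 3: 0.05}
to_commit = eb_select(step_entropies, budget=0.4)
```

A useful property of this scheme is that the number of tokens committed per step adapts to the model's certainty: many tokens are unmasked when predictions are sharp, and few when they are not.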

PC Sampler

Position-calibrated (PC) samplers add a position-aware calibration term to avoid pathological early commitments to “easy” regions (e.g. always unmasking near the prefix first). Without calibration, models may tend to unmask or commit tokens near the beginning of a sequence first, which can reduce diversity and flexibility in later steps.

One way to implement this is to adjust the confidence score of each position with a position-dependent bias:

\[\tilde{s}_i = s_i + b(i)\]

where \(s_i\) is a base confidence score (such as the negative entropy \(-H_i\) or the raw probability \(c_i\)), and \(b(i)\) is a bias or penalty term that depends on the position \(i\).

By adding \(b(i)\), positions that are typically “easy” to commit (like the prefix) can be down-weighted, encouraging the sampler to consider less obvious positions first. The calibrated score \(\tilde{s}_i\) is then used to select which positions to commit or remask in the current iteration, promoting a more balanced and robust sequence generation process.
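A small sketch of the calibrated score follows. The exponentially decaying penalty used for \(b(i)\) is an assumption chosen for illustration, since the text above does not specify the functional form; `calibrated_scores` and `early_position_penalty` are hypothetical names.

```python
import math

def calibrated_scores(base_scores, bias):
    """Compute s~_i = s_i + b(i) for every candidate position."""
    return {i: s + bias(i) for i, s in base_scores.items()}

def early_position_penalty(i, strength=0.3, decay=0.5):
    """Hypothetical b(i): a penalty that fades with distance from the prefix."""
    return -strength * math.exp(-decay * i)

# Base confidence scores c_i for three masked positions (made-up values).
base = {0: 0.90, 1: 0.85, 6: 0.80}

scores = calibrated_scores(base, early_position_penalty)
ranked_raw = sorted(base, key=base.get, reverse=True)
ranked_calibrated = sorted(scores, key=scores.get, reverse=True)
```

Without calibration the positions nearest the prefix rank first; with the penalty applied, the later position overtakes them even though its raw confidence is slightly lower.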

Unmasking/Remasking Schedules

Once a decoding strategy determines which token predictions appear most reliable, the model must still decide how aggressively to commit or revisit tokens over the course of denoising. A commitment schedule specifies how many tokens are unmasked, retained, or remasked at each step of the reverse process. Below, we describe several common schedules used in dLLMs.

Static Low-Confidence Remasking

In this masking regime, denoising occurs over K steps. At each step, a fixed number of tokens, N/K (where N is the length of the output), is unmasked, usually chosen by a confidence criterion. The low-confidence tokens are then remasked for further refinement.

Dynamic Low-Confidence Remasking

A confidence threshold τ is set. At each denoising step, a token is unmasked only if its confidence crosses the threshold τ. If too few positions cross the threshold, then a minimum number of the highest confidence tokens are unmasked.
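The static and dynamic schedules differ only in how many positions they select per step, which the following sketch makes explicit. Function names and the `min_tokens` parameter are hypothetical; `confidences` maps each currently masked position to its confidence score.

```python
def static_select(confidences, n_total, num_steps):
    """Unmask a fixed N/K positions per step, highest confidence first."""
    per_step = max(1, n_total // num_steps)
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:per_step]

def dynamic_select(confidences, tau, min_tokens=1):
    """Unmask every position whose confidence exceeds tau.

    If fewer than min_tokens positions qualify, fall back to the
    min_tokens most confident ones so that decoding still progresses.
    """
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    above = [i for i in ranked if confidences[i] > tau]
    return above if len(above) >= min_tokens else ranked[:min_tokens]

conf = {0: 0.95, 1: 0.40, 2: 0.92, 3: 0.60}

static_pick = static_select(conf, n_total=4, num_steps=2)
dynamic_pick = dynamic_select(conf, tau=0.9)
fallback_pick = dynamic_select(conf, tau=0.99)
```

The static schedule takes a predictable number of steps, while the dynamic schedule can finish early on easy inputs at the cost of a less predictable step count.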

Dilated Unmasking Schedule

Instead of committing aggressively at every step, the schedule “dilates” commits by committing fewer tokens early, more in the middle, and fewer near the end, so that more global context can settle before too many tokens are locked in.

Speculative Decoding

Speculative decoding in diffusion models is more challenging than in ARMs because generation can happen in parallel, and some models use block-based decoding. Gao et al. propose Self-Speculative Decoding (SSD) to address these challenges. SSD consists of two main phases: self-drafting and verification.

Self-Drafting

In this step, we construct a partial sequence that includes the prompt tokens, the tokens already committed in previous steps, and the currently masked positions. We then perform a single denoising step on this sequence to produce predictions for all masked tokens. This initial prediction step is referred to as self-drafting.

If the model uses block-based decoding, the self-drafting procedure is applied within the current block as follows:

Sort the positions in the block by a confidence measure.
Select the top-k positions as candidates for verification.
If there are fewer than \(k\) positions in the block, extend the selection into subsequent blocks until \(k\) positions are chosen.

Verification

The \(k\) drafted tokens are then verified using a verification tree, which efficiently checks multiple token proposals at once. Tokens that pass verification are committed to the sequence, while tokens that fail are remasked and will be reconsidered in later iterations.

This two-phase procedure allows speculative decoding to leverage parallel generation while maintaining reliability, committing only those tokens for which the model demonstrates sufficient confidence.

Hyperparameters

Beyond the decoding and commitment strategies used during inference, diffusion language models are also sensitive to a number of hyperparameters that shape generation behavior. These hyperparameters control generation properties such as output length, the number of denoising steps, the granularity of blockwise generation, and the balance between diversity and adherence to conditioning. Below, we highlight the most important hyperparameters used in practice.

Generation length

In ARMs, the generation length is dynamic for a given query, and generation stops when the <EOS> token is generated. However, in diffusion models, tokens are predicted in parallel, so we typically allocate a maximum output length in advance, and discard tokens after the first <EOS> in the final output. This means that the generation length becomes a hyperparameter. Large generation lengths will be expensive due to the quadratic nature of self-attention.

Number of denoising steps

The number of denoising steps K is also a hyperparameter. More steps give the model more chances to revise tokens. If optimizing for latency, K can be lower. If optimizing for task performance, K can be larger.

Block length

In practice, large output sequences are not amenable to being generated in one go. Hence semi-autoregressive (blockwise) diffusion is used. The output sequence is divided into blocks, and each block is generated sequentially. Within each block, tokens are generated through diffusion. Using this method, it becomes possible to perform diffusion across very long horizons.

Temperature

Similar to ARMs, temperature is a key hyperparameter in the dLLM paradigm. Increasing temperature not only increases token diversity but also diversity in generation order (Gong et al., 2025).

Classifier-Free Guidance (CFG)

Classifier-free guidance (CFG) for discrete diffusion combines a conditional prediction with an “unconditional” prediction (often a null prompt or dropped conditioning).

A common formulation is:

\[z_{\text{cfg}} = z_{\text{uncond}} + s \cdot (z_{\text{cond}} - z_{\text{uncond}})\]

where \(z_{\text{uncond}}\) is the unconditional prediction, \(z_{\text{cond}}\) is the conditional prediction, and \(s\) is the guidance scale. Increasing \(s\) strengthens adherence to conditioning, while smaller values typically yield more diverse outputs.
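The guidance formula is a per-entry linear combination, which the sketch below applies to two made-up prediction vectors. Treating \(z\) as a list of logits over the vocabulary is an assumption made for this example.

```python
def cfg_combine(z_cond, z_uncond, s):
    """z_cfg = z_uncond + s * (z_cond - z_uncond), applied per entry."""
    return [u + s * (c - u) for c, u in zip(z_cond, z_uncond)]

z_cond = [2.0, 0.0]    # prediction with the prompt (made-up values)
z_uncond = [1.0, 1.0]  # prediction with a null prompt (made-up values)

no_guidance = cfg_combine(z_cond, z_uncond, s=0.0)    # equals z_uncond
plain_cond = cfg_combine(z_cond, z_uncond, s=1.0)     # equals z_cond
extrapolated = cfg_combine(z_cond, z_uncond, s=2.0)   # pushes past z_cond
```

Setting \(s = 1\) recovers the conditional prediction, while \(s > 1\) extrapolates beyond it in the direction that the conditioning favors.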

Pitfalls & Solutions

dLLMs introduce several inference-time hyperparameters and failure modes that do not arise in quite the same way for ARMs. A key challenge is that important generation decisions such as output length, the number of denoising steps, block size, and confidence thresholds must often be specified externally rather than inferred optimally by the model itself. As a result, performance can depend heavily on manual tuning, and the best settings may vary across queries, tasks, and domains. Below, we discuss some of the most important pitfalls that arise in practice, along with proposed solutions.

The Cost of Fixed Output Length

Unlike ARMs, where the model continues generating until an end of sequence (<EOS>) token is seen, the output length in dLLMs is typically a hyperparameter that is chosen before generation.

If the preset output length is too short for a given query, the model may skip steps, be very terse, or just fail entirely. If the preset output length is too long for a given query, neural degeneration may occur and the performance may drop. Longer output lengths also result in significantly more computation due to the quadratic nature of self-attention.

These problems can be resolved if the model can dynamically adapt its output length for each query. To this end, Li et al. introduce the creatively named DAEDAL (Dynamic Adaptive Length Expansion for Diffusion Large Language Models), a training-free decoding technique that leverages <EOS> token probabilities to dynamically adjust output length.

DAEDAL has two stages: an initial global length estimate and iterative local mask insertion.

Initial Length Adjustment

A short initial length (say 64 or 128 tokens) is assigned for generation. The model then goes through a single denoising step to produce its initial predictions. For the last few tokens of the sequence, the probabilities of the <EOS> token are averaged. If the average probability exceeds a threshold \(T\), then the current length is likely to be sufficient, otherwise the length is extended by a predetermined amount.

This step is repeated until the <EOS> token confidence exceeds the threshold \(T\) or the global maximum length \(L\) is reached.
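The length-adjustment loop described above might look roughly like the sketch below. It is an illustration of the described procedure, not the authors' implementation: `eos_probs_fn` is a hypothetical callable that runs one denoising step on a fully masked sequence of the given length and returns the per-position <EOS> probabilities, and all default values are placeholders.

```python
def estimate_initial_length(eos_probs_fn, start_len=64, expand_by=64,
                            max_len=512, threshold=0.5, tail=8):
    """Grow the output length until the tail looks like <EOS>.

    Stops when the mean <EOS> probability over the last `tail`
    positions exceeds `threshold`, or when max_len is reached.
    """
    length = start_len
    while True:
        probs = eos_probs_fn(length)
        tail_probs = probs[-tail:]
        avg_eos = sum(tail_probs) / len(tail_probs)
        if avg_eos > threshold or length >= max_len:
            return length
        length = min(length + expand_by, max_len)

def make_toy_eos_fn(needed_tokens):
    """Toy stand-in: <EOS> is certain at every position past the answer."""
    def eos_probs_fn(length):
        return [1.0 if i >= needed_tokens else 0.0 for i in range(length)]
    return eos_probs_fn

chosen = estimate_initial_length(make_toy_eos_fn(150))
capped = estimate_initial_length(make_toy_eos_fn(1000))
```

For a toy answer that needs about 150 tokens, the loop expands from 64 to 128 to 192 and then stops; for one that needs more than the global maximum, it stops at the cap.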

Iterative Mask Insertion

During the denoising process, there might be local regions in the output where more elaboration is merited. An example would be a tricky code block, where it might serve well to reserve more lines of code for it. To facilitate this, after each denoising step, the lowest-confidence masked positions are taken as expansion points. At these expansion points, multiple [MASK] tokens are inserted, thus dynamically allowing the model to expand generation in that output region.

The authors show that DAEDAL leads to a massive jump in the effective token ratio (proportion of tokens in the output sequence actually used for the output) compared to fixed length baselines.

The Cost of Fixed Denoising Budgets

In most contemporary dLLMs, the number of denoising steps is also a hyperparameter. However, the optimal number of denoising steps depends on the query. The number of denoising steps is analogous to test-time compute in reasoning models, and it runs into issues similar to those encountered when scaling test-time compute: (1) if the number of denoising steps is too low, the model might not have arrived at the answer yet, and (2) if it is too high, the model may have overshot the answer. This phenomenon of drifting away from a correct answer that was generated during an intermediate denoising step is called temporal oscillation.

In order to quantify temporal oscillation, we can use the ever pass rate metric. The ever pass rate is the accuracy of the model as measured across all denoising steps.

Let \(N\) be the number of evaluation queries, \(K\) the number of denoising steps, and let
\(\mathrm{Correct}(i,k) \in \{0,1\}\)
indicate whether query \(i\) is solved correctly at step \(k\). Then:

\[\mathrm{EverPass} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\max_{1 \le k \le K} \mathrm{Correct}(i,k) = 1\right].\]

A query is counted as correct under this metric if the model produces the right answer at least once during the denoising trajectory. This is contrasted with the full pass rate, which measures accuracy only at the final denoising step:

\[\mathrm{FullPass} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Correct}(i, K).\]

Temporal oscillation can then be summarized as:

\[\mathrm{Oscillation} = \mathrm{EverPass} - \mathrm{FullPass}.\]
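These three quantities are straightforward to compute from a table of per-step correctness. In the sketch below, `correct[i][k]` plays the role of \(\mathrm{Correct}(i,k)\) with zero-based step indices, and the example table is made up.

```python
def pass_rates(correct):
    """Return (EverPass, FullPass, Oscillation).

    correct[i][k] is 1 if query i is answered correctly at denoising
    step k (0-indexed), else 0. The last column is the final step.
    """
    n = len(correct)
    ever = sum(1 for row in correct if max(row) == 1) / n
    full = sum(row[-1] for row in correct) / n
    return ever, full, ever - full

# Four queries, three denoising steps each (made-up outcomes).
correct = [
    [0, 1, 1],  # correct from step 2 onward
    [0, 1, 0],  # correct at step 2, then drifts away
    [0, 0, 0],  # never correct
    [1, 1, 1],  # always correct
]

ever_pass, full_pass, oscillation = pass_rates(correct)
```

Here three of four queries are correct at some step but only two at the final step, so the gap of 0.25 is attributable to the second query, which had the right answer mid-trajectory and lost it.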

Ideally, the model would adaptively choose the number of denoising steps based on query difficulty, similar to thought budgeting in reasoning models. Wang et al. propose an interim approach that leverages intermediate outputs via the concept of temporal semantic entropy.

Temporal semantic entropy

For a given query \(q\), the model produces a sequence of intermediate answers across the \(K\) denoising steps. These answers are grouped into clusters based on semantic equivalence. Temporal semantic entropy (TSE) is defined as the entropy of the resulting cluster distribution:

\[\mathrm{TSE}(q) = - \sum_{c \in \mathcal{C}} p_c \log p_c,\]

where \(p_c\) is the proportion of intermediate answers assigned to cluster \(c\).

If all intermediate answers fall into a single cluster, TSE is low. If the model oscillates between semantically distinct answers, TSE increases. Empirically, datasets on which the model performs poorly tend to exhibit higher mean TSE. Individual queries answered correctly typically have lower TSE than those answered incorrectly.
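Given cluster assignments for the intermediate answers, TSE is an ordinary entropy computation. The sketch below assumes the semantic clustering has already been done and takes a list of cluster labels, one per denoising step; how answers are clustered is outside its scope.

```python
import math
from collections import Counter

def temporal_semantic_entropy(cluster_ids):
    """TSE = -sum_c p_c log p_c over clusters of intermediate answers."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

stable = ["A", "A", "A", "A"]        # same answer at every step
oscillating = ["A", "B", "A", "B"]   # flips between two answers

tse_stable = temporal_semantic_entropy(stable)
tse_oscillating = temporal_semantic_entropy(oscillating)
```

A trajectory that stays in one cluster has zero entropy, while an even split between two clusters gives \(\log 2\), the maximum for two clusters.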

TSE can be used to mitigate temporal oscillation and improve final accuracy. Wang et al. also introduce Temporal Self-Consistency Voting, a strategy that selects the final answer via majority vote across all denoising steps.

Temporal Self-Consistency Voting

The model performs majority voting over the solution space. The output from each denoising step is normalized to a semantic form. Each output is weighted based on the denoising step it came from. The weighting scheme can be:

Fixed - all denoising steps carry the same weight
Linear - the weight increases with the number of steps, i.e. the later steps are weighted more
Exponential - similar to linear, except that the weight increases exponentially towards the final denoising steps

The authors observed that empirically, the exponential weighting scheme is the best performing.

If the model overshot the answer due to a longer-than-needed denoising process, then temporal self-consistency can ensure that intermediate solutions still stand a chance of being picked up as the final answer.
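A weighted vote over per-step answers can be sketched as below. The specific weight formulas (`k / K` for linear and `exp(alpha * (k - K))` for exponential) and the `alpha` parameter are assumptions chosen for illustration; the text above only describes the qualitative shape of each scheme.

```python
import math
from collections import defaultdict

def temporal_vote(answers, scheme="exponential", alpha=0.3):
    """Pick the answer with the largest total weight across steps.

    answers[k-1] is the normalized answer produced at denoising step k.
    """
    num_steps = len(answers)
    scores = defaultdict(float)
    for k, answer in enumerate(answers, start=1):
        if scheme == "fixed":
            weight = 1.0
        elif scheme == "linear":
            weight = k / num_steps
        elif scheme == "exponential":
            weight = math.exp(alpha * (k - num_steps))
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        scores[answer] += weight
    return max(scores, key=scores.get)

# The model holds "42" for three steps, then drifts to "12" at the end.
answers = ["12", "42", "42", "42", "12"]
voted = temporal_vote(answers, scheme="exponential")
```

In this made-up trajectory the vote recovers "42" even though the final step produced "12". With a very steep exponential (large `alpha`), the final step dominates and the vote reduces to ordinary final-step decoding.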

The Cost of Fixed Block Sizes

Diffusion over very large output sequences is suboptimal, both because of latency and because KV caches are difficult or impossible to use effectively. Therefore, in practice, it is customary to divide the output sequence into fixed-length blocks: the blocks are generated sequentially, while the tokens inside each block are generated via diffusion. Typically, the denoising steps are divided equally among the blocks.

However, having a fixed block size comes with pitfalls. In their paper, Lu et al. showcase two common inefficiencies: (1) late decoding overhead and (2) premature decoding error.

Late Decoding Overhead

Consider an output sequence broken down into three blocks. Let’s say the tokens in the second block are high confidence and easy to predict. However, they are not predicted until all tokens in the first block have been predicted. If the tokens in the first block are relatively lower confidence, then the model will wastefully perform denoising steps on the first block even when there are high-confidence tokens outside the block boundary waiting to be unmasked. The authors call this late decoding overhead.

Premature decoding error

On the flip side, with block diffusion, all the tokens in the current block need to be predicted before moving on to the next block. This means that low-confidence tokens can be locked in prematurely. These tokens can cause errors to propagate, as they are used as context when generating tokens in subsequent blocks. The authors call this premature decoding error.

In order to mitigate these issues, the model should ideally have dynamic block lengths. Lu et al. introduce AdaBlock, an adaptive block size scheduler that leverages token confidence dynamics to draw semantically aware block boundaries.

The authors observed that the confidence dynamics of the output tokens vary across token positions. The confidence landscape of output tokens can be divided into three types of regions:

High-confidence plateau - this includes the already decoded tokens and mask tokens in their vicinity.
Low-confidence floor - this includes token positions at the end of the output sequence
Volatility bands - these are bands of positions where the confidence is fluctuating.

Active decoding happens in the volatility band. The volatility band encodes local semantic structure. With this in mind, the block size can be made adaptive by dividing blocks based on semantic steps. A semantic step can be a span of tokens that are potentially self-contained, like a reasoning step, a line of code, or a statement.

In order to identify semantic step boundaries, we can perform the following:

Identify a set of delimiter tokens (newline, period, etc.) that can convey the end of a semantic unit
Slide a window forward from the current generation position
Pick the delimiter in the window with the highest confidence
If the delimiter with the highest confidence surpasses a threshold \(T\), it is treated as the end of the semantic step, and the block size is adapted such that the block ends at that delimiter
If no delimiters exist or none of them have confidence surpassing \(T\), then the default block size is used for the current generation
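
The boundary-selection steps above can be sketched as follows. This is a simplified illustration, not Lu et al.'s implementation: it assumes we already have the model's current top-1 token and its confidence at every position, and the function name, the delimiter set, and the parameter names are all hypothetical.

```python
def choose_block_length(tokens, confidences, start, window, threshold,
                        default_block, delimiters=frozenset({"\n", "."})):
    """Pick the length of the next block, starting at position `start`.

    tokens:      current top-1 predicted token at each position (assumed given)
    confidences: probability of that prediction at each position
    """
    end = min(start + window, len(tokens))
    # Delimiter candidates inside the sliding window.
    candidates = [i for i in range(start, end) if tokens[i] in delimiters]
    if candidates:
        best = max(candidates, key=lambda i: confidences[i])
        if confidences[best] > threshold:
            # End the block at the most confident delimiter (inclusive).
            return best - start + 1
    # No delimiter is confident enough: fall back to the default block size.
    return min(default_block, len(tokens) - start)

tokens = ["x", "=", "1", "\n", "y", "=", "2", "\n"]
confidences = [0.9, 0.8, 0.7, 0.95, 0.3, 0.2, 0.4, 0.5]
print(choose_block_length(tokens, confidences, start=0, window=8,
                          threshold=0.9, default_block=6))
```

With a threshold of 0.9 the block ends at the first newline, giving a length of 4; raising the threshold to 0.99 falls back to the default length of 6.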

Because each block corresponds to a self-contained semantic unit, the KV cache representations age more gracefully. This method also helps mitigate the late decoding overhead and premature decoding error issues.

The Cost of Static Confidence Thresholds

With confidence-based decoding, typically the same threshold is employed throughout the generation process regardless of whether (1) it is an earlier denoising step or a later denoising step or (2) the prompt is easy or difficult.

In practice, we could use a dynamic confidence threshold, because confidence dynamics vary throughout the diffusion phase. It has been observed that mean confidence is low during earlier denoising steps, peaks in the middle, and then drops again during the final steps, forming an inverted U-shape. The shape of this curve depends on the type of task being performed. Shen et al. note that GSM8K’s curve is different from GPQA’s curve, but the confidence dynamics are similar for problems within each dataset. This indicates that we can treat the confidence trajectory over denoising steps as a signature for a task type.

The confidence signature for a given task can then be calculated by taking into account the confidence over all diffusion steps and blocks. The metric could be mean, median, etc. This metric can then be treated as the confidence threshold for all instances of the given task. To prevent the confidence levels from being too restrictive, an upper bound \(B\) can be set.
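
A rough sketch of this calibration step is shown below. It assumes we have recorded per-step confidences for a few calibration examples of the task; the function name `calibrate_threshold` and the data layout are hypothetical, and this is not necessarily the exact procedure used by Shen et al.

```python
import statistics

def calibrate_threshold(confidence_traces, upper_bound, metric="mean"):
    """Derive one task-level confidence threshold from calibration runs.

    confidence_traces: one list per calibration example, holding the
                       confidences observed across all steps and blocks.
    upper_bound:       the cap B that keeps the threshold from being too strict.
    """
    flat = [c for trace in confidence_traces for c in trace]
    if metric == "mean":
        value = statistics.mean(flat)
    elif metric == "median":
        value = statistics.median(flat)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return min(value, upper_bound)

# Hypothetical confidence traces from two calibration prompts of the same task.
traces = [[0.5, 0.9, 0.7], [0.6, 0.95, 0.8]]
print(calibrate_threshold(traces, upper_bound=0.9))
print(calibrate_threshold(traces, upper_bound=0.7))
```

In the second call the cap \(B = 0.7\) takes effect, because the mean of the recorded confidences is higher than the bound.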

Information is not carried over across denoising steps

Consider the partially unmasked sequence:

Michael [MASK] to New York.

At denoising step \(p\), suppose the top-3 predicted tokens for the masked position are:

went: 0.4
moved: 0.3
galloped: 0.1

If the model uses confidence-based unmasking, the [MASK] token is not unmasked at step \(p\) because none of these candidates exceed the confidence threshold. However, at step \(p + 1\), the model restarts the prediction process from scratch. The information that went and moved had relatively high probabilities at the previous step is not retained.

This leads to redundant and inefficient computation: the model repeatedly re-evaluates similar candidate sets without leveraging prior signals. Ideally, diffusion-style language models would propagate information across denoising steps, allowing the model to refine or reweight earlier hypotheses instead of discarding them at each iteration.

Soft Masking

As a solution, Hersche et al. introduce soft masking, which augments the discrete [MASK] token with continuous feedback. Instead of treating denoising as a binary decision (unmask or keep masked), each masked position is represented as an interpolation between the [MASK] embedding and a weighted combination of the top-k token embeddings.

This can be expressed as:

\[\text{soft embed} = (1 - \alpha)\, \mathrm{embed}(\text{[MASK]}) + \alpha \sum_{j \in \text{top-}k} \tilde{p}_j \, \mathrm{embed}(j)\]

where:

\(\alpha\) is calculated by taking the negative entropy of the token probability vector and passing it through a sigmoid to obtain a weight between 0 and 1
\(\tilde{p}_j\) is normalized over the top-k tokens so that the weights sum to 1
\(k\) is typically 3–4 tokens

If the token distribution is flat, entropy is high, \(\alpha\) is small, and the [MASK] embedding barely changes. If the distribution is peaked, entropy is low, \(\alpha\) is larger, and the [MASK] embedding is largely replaced by the probability-weighted mean embedding of the top-k tokens.
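
The interpolation can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not Hersche et al.'s code: the embeddings are random stand-ins, and `scale` and `offset` are hypothetical knobs added for illustration (with the defaults, \(\alpha\) is exactly the sigmoid of the negative entropy and therefore never exceeds 0.5).

```python
import numpy as np

def soft_mask_embedding(probs, token_embeddings, mask_embedding,
                        k=3, scale=1.0, offset=0.0):
    """Blend the [MASK] embedding with the top-k token embeddings.

    probs:            (V,) predicted distribution at one masked position
    token_embeddings: (V, d) embedding table
    mask_embedding:   (d,) embedding of the [MASK] token
    """
    probs = np.asarray(probs, dtype=float)
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    # alpha = sigmoid(offset - scale * entropy); defaults give sigmoid(-entropy).
    alpha = 1.0 / (1.0 + np.exp(-(offset - scale * entropy)))
    top = np.argsort(probs)[-k:]             # indices of the k most likely tokens
    p_tilde = probs[top] / probs[top].sum()  # renormalize so the weights sum to 1
    mix = p_tilde @ token_embeddings[top]    # probability-weighted mean embedding
    return (1.0 - alpha) * mask_embedding + alpha * mix, alpha

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 8))  # toy vocabulary of 4 tokens
mask_embedding = rng.normal(size=8)

_, alpha_flat = soft_mask_embedding([0.25, 0.25, 0.25, 0.25],
                                    token_embeddings, mask_embedding)
_, alpha_peaked = soft_mask_embedding([0.97, 0.01, 0.01, 0.01],
                                      token_embeddings, mask_embedding)
print(alpha_flat, alpha_peaked)
```

The flat distribution yields a smaller \(\alpha\) than the peaked one, so its masked position stays closer to the plain [MASK] embedding.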

Hersche et al. report that applying soft masking on roughly 80% of denoising steps yields the best results, with most of the benefit already realized when it is applied during only the first 20% of steps.

Credit Score

Wang et al. propose CreditDecoding, which maintains a credit value \(C_t(i, v)\) for each token position \(i\), each token \(v\), and each denoising step \(t\). The credit is updated as:

\[C_t(i, v) = \begin{cases} \gamma \cdot C_{t-1}(i, v) + p_t(i)^{\beta}, & \text{if } v = v_t(i) \\ \gamma \cdot C_{t-1}(i, v), & \text{otherwise} \end{cases}\]

where:

\(v_t(i)\) is the top-1 token at position \(i\)
\(p_t(i)\) is its probability
\(\gamma\) is a decay factor in (0, 1)
\(\beta\) is an exponent that amplifies mid-range probabilities

Credit scores are incorporated into the logits of the next step:

\[\tilde{z}_t(i, v) = z_t(i, v) + \lambda \cdot C_t(i, v)\]

where \(\lambda\) controls the strength of the credit prior.
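
The two update rules can be sketched in NumPy as shown below. This is an illustration of the formulas as written here, not the authors' implementation; the default values for \(\gamma\), \(\beta\), and \(\lambda\) and the function names are hypothetical.

```python
import numpy as np

def update_credit(credit, probs, gamma=0.9, beta=0.5):
    """One credit update: decay all entries, then reward each position's top-1 token.

    credit: (L, V) credit from the previous step, C_{t-1}
    probs:  (L, V) token probabilities at the current step
    """
    new_credit = gamma * credit                # decay every (position, token) entry
    rows = np.arange(probs.shape[0])
    top = probs.argmax(axis=1)                 # v_t(i): top-1 token per position
    new_credit[rows, top] += probs[rows, top] ** beta
    return new_credit

def apply_credit(logits, credit, lam=1.0):
    """Add the credit prior to the logits."""
    return logits + lam * credit

# One masked position, vocabulary of 3 tokens; token 0 is top-1 at two steps.
credit = np.zeros((1, 3))
credit = update_credit(credit, np.array([[0.40, 0.30, 0.30]]))
credit = update_credit(credit, np.array([[0.45, 0.35, 0.20]]))

logits = np.array([[1.0, 1.2, 0.5]])           # raw logits slightly favor token 1
boosted = apply_credit(logits, credit, lam=0.5)
print(logits.argmax(), boosted.argmax())
```

In this toy case the accumulated credit for token 0 outweighs the small raw-logit advantage of token 1, so evidence from earlier steps changes which token is preferred.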

Entropy Sink Phenomenon

Many diffusion LLMs still exhibit semi-autoregressive behavior. Models adapted from autoregressive pretraining retain latent left-to-right dependencies, while models trained from scratch can be more flexible.

Confidence-based unmasking (selecting the highest-probability or lowest-entropy tokens) tends to favor positions near the prefix. The first committed token biases its immediate right neighbor, creating an entropy sink. This bias induces a left-to-right autoregressive (AR) pattern.

Degeneration to AR

Although diffusion models can, in principle, generate tokens in any order, in practice generation often degenerates toward AR behavior. Two metrics introduced by Gong et al. (2025) quantify this:

Local AR-ness: over a sliding window of length \(k\), the proportion of windows whose tokens were unmasked consecutively, one next-token prediction after another, reflects local AR behavior
Global AR-ness: each time a token is unmasked, it counts as autoregressively generated if it fills the left-most masked position remaining in the sequence. The proportion of such tokens reflects the global AR-ness
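
To make these metrics concrete, here is a simplified sketch that scores a decoding order, given as the list of positions in the order they were unmasked (one token per step). It is one plausible reading of the definitions, not the reference implementation from Gong et al.: global AR-ness counts steps that fill the left-most remaining masked position, and local AR-ness counts spans of \(k + 1\) successive unmaskings that land on consecutive positions. The function names are hypothetical.

```python
def global_ar_ness(order):
    """Fraction of steps that unmask the left-most remaining masked position."""
    remaining = set(order)
    hits = 0
    for pos in order:
        if pos == min(remaining):
            hits += 1
        remaining.remove(pos)
    return hits / len(order)

def local_ar_ness(order, k=1):
    """Fraction of length-(k+1) spans of the order that are strictly next-token."""
    spans = [order[n - k:n + 1] for n in range(k, len(order))]
    if not spans:
        return 0.0
    sequential = sum(all(b == a + 1 for a, b in zip(s, s[1:])) for s in spans)
    return sequential / len(spans)

left_to_right = [0, 1, 2, 3, 4, 5]
mostly_ar = [0, 1, 2, 5, 3, 4]
scattered = [3, 0, 5, 1, 4, 2]
for order in (left_to_right, mostly_ar, scattered):
    print(order, global_ar_ness(order), local_ar_ness(order, k=1))
```

A strict left-to-right order scores 1.0 on both metrics, while the scattered order scores 0.0 locally. The final token always fills the only remaining mask, so global AR-ness is never exactly zero under this reading.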

Models continually pre-trained from AR bases retain higher AR-ness than models trained from scratch. Empirically, math generation shows high local AR-ness, while code generation shows lower AR-ness. This mirrors human behavior: math is typically solved sequentially, while code is edited in a non-linear fashion.

Confidence-based remasking reinforces AR behavior because high-confidence tokens are usually near the prefix. Increasing the sampling temperature decreases AR-ness by flattening token distributions and increasing uncertainty in token commitments.

Future Directions

The Latency Bet

Currently, dLLMs are being promoted as a faster alternative to ARMs, as they support parallel token prediction. In practice, they need multiple denoising steps and often rely on blockwise decoding. Although latency can be reduced with fewer denoising steps, it typically comes at the cost of generation quality or task performance.

The Data Efficiency Bet

If we assume a data-limited regime where the supply of unique, high-quality data becomes scarce well before compute does, dLLMs may offer an important advantage over ARMs. In data-constrained settings, dLLMs exhibit stronger sample efficiency because they continue to benefit from additional passes over the same data across many epochs. Repeating data will likely become an increasingly common necessity when data is constrained; in ARMs, however, repeating training data for more than a few epochs yields diminishing returns and performance plateaus. ARMs tend to fit the training data after a few epochs and then saturate or overfit, leading to degraded performance (Ni et al. 2025).

dLLMs can be trained on the same data for many more epochs than ARMs: Prabhudesai et al. show that ARMs benefit from data repetition for only around four epochs, while dLLMs benefit for up to roughly a hundred epochs. They also show that diffusion’s best validation loss can occur at dramatically higher epoch counts (e.g., around 500 epochs vs around 50 for ARMs), suggesting that diffusion continues extracting signal from repeated data even after AR training has largely saturated or begun to overfit.

The Hybrid Reasoning Bet

In practice, diffusion and autoregressive modes are likely to co-exist for the foreseeable future. A plausible way to combine them is to use diffusion for reasoning and AR for answer generation.

Diffusion can be used for task decomposition, planning, and outlines, where revision is natural. After this intermediate reasoning stage, the ARM can produce a clean left-to-right final response.

This aligns with what diffusion is naturally good at (global revision and iterative refinement) while preserving AR’s strengths (fast sequential emission, stable length control, and efficient serving).

Open Problems for dLLMs

Several critiques of diffusion LLMs have emerged recently. We briefly elaborate on a few of them in this section.

Any-order model is inefficient and ineffective

Masked diffusion can be viewed as an any-order model, as it is trained to predict arbitrary masked subsets in any order. However, in practice, left-to-right and right-to-left orders are easier to learn. This is due to the Markovian nature of language, where conditioning on nearby tokens provides more predictive power. An any-order model thus risks spending model capacity on hard and unnatural orders.

Models predict marginals, not a joint distribution

The denoiser predicts a distribution for each [MASK] position, conditioned on the unmasked tokens, but not the joint distribution over all masked positions. As a result, sampling multiple tokens in parallel has no guarantee of joint coherence.
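
A toy example makes the problem concrete. Suppose two adjacent masked positions should complete a city name, and the true joint distribution puts equal mass on "New York" and "Los Angeles". The sketch below (a made-up distribution, not output from any model) samples each position independently from its marginal and measures how much probability lands on incoherent pairs.

```python
from itertools import product

# Hypothetical true joint distribution over two masked positions.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

def marginal(joint, position):
    """Marginal distribution of the token at one position."""
    out = {}
    for pair, p in joint.items():
        out[pair[position]] = out.get(pair[position], 0.0) + p
    return out

m0, m1 = marginal(joint, 0), marginal(joint, 1)

# Parallel decoding samples each position from its own marginal, which
# implicitly treats the two positions as independent.
independent = {(a, b): m0[a] * m1[b] for a, b in product(m0, m1)}

incoherent_mass = sum(p for pair, p in independent.items() if pair not in joint)
print(independent)
print(incoherent_mass)
```

Each marginal is perfectly correct, yet half of the probability mass under independent sampling falls on pairs such as "New Angeles" that the joint distribution never produces.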

Training-Inference mismatch

During training, the model is taught to predict from arbitrary masking patterns. However, during inference, denoising is typically performed via confidence-based measures, which causes tokens closer to already unmasked tokens to be generated first, making generation nearly autoregressive. Once inference becomes AR-like, most of the masking patterns covered during training are rarely encountered at inference time.

Conclusion

In summary, at present diffusion LLMs are best viewed as a complementary modeling paradigm rather than a universal replacement for autoregressive models. They offer clear advantages in certain regimes: the masked diffusion objective naturally supports infilling and in-place revision, and recent results indicate that diffusion-based training can be significantly more data-efficient than autoregressive training when the amount of unique high-quality data is limited but can be repeated many times. At the same time, current masked diffusion formulations face structural limitations. They optimize over many token orders despite language exhibiting strong directional biases, and their decoding procedures often behave in a semi-autoregressive manner in practice, reducing the practical benefits of full any-order generation.

Taken together, these observations suggest a more nuanced role for diffusion LLMs. In settings where data is the primary bottleneck and compute is relatively abundant, or where flexible infilling and structured editing are central requirements, dLLMs are a compelling choice. In contrast, for latency-sensitive, streaming, or purely left-to-right generation workloads, autoregressive models remain highly competitive and often preferable. A promising direction for future systems is therefore hybrid: using diffusion-style models for planning, reasoning, or structural refinement, and relying on autoregressive models for efficient, stable surface realization.
