Probabilistic Circuits for Uncertainty Quantification

Deep learning models struggle with epistemic uncertainty quantification, often exhibiting blind confidence on out-of-distribution data. This work reviews Probabilistic Circuits (PCs) as a versatile framework for rigorous, tractable reasoning. PCs model the joint probability distribution and, by enforcing structural constraints (specifically smoothness, decomposability, and determinism), allow for the exact computation of marginals, conditionals, and moments in polynomial time without retraining. We discuss the suitability of PCs for Uncertainty Quantification (UQ), describing their advantages and highlighting their potential for tractable UQ in high-dimensional problems.

Epistemic Uncertainty in Today’s Machine Learning

The trajectory of artificial intelligence over the last decade has been defined by a strong focus on predictive accuracy, driven largely by the scaling of deep neural networks. From Large Language Models (LLMs) to generative diffusion systems, the capacity of these models to approximate complex functions is undeniable. However, as these systems migrate from controlled academic benchmarks to high-stakes deployment in healthcare, autonomous navigation, and climate modeling, a significant limitation has emerged: the inability to reliably quantify what the model does not know.

Standard deterministic deep learning architectures, despite their expressiveness, often suffer from overconfidence when evaluated on out-of-distribution (OOD) data. Because they typically act as point-estimate function approximators rather than maintaining a rigorous representation of the underlying joint probability distribution, they can yield highly confident predictions even when presented with unfamiliar inputs, such as a rare physiological anomaly or an unprecedented weather pattern. This phenomenon represents a challenge of epistemic uncertainty quantification (UQ).

Uncertainty is generally categorized into two distinct forms: aleatoric uncertainty, which is irreducible and stems from the inherent stochasticity of the data generation process (e.g., sensor noise), and epistemic uncertainty, which is reducible and arises from a lack of knowledge about the model parameters or the true structure of the data. In the context of scientific engineering and safety-critical AI, distinguishing between these two is paramount. A self-driving car must distinguish between the ‘‘noise’’ of a rainy sensor (aleatoric) and an object it has never been trained to recognize (epistemic).

Illustration of the difference between aleatoric and epistemic uncertainty. While aleatoric uncertainty arises from inherent data noise, epistemic uncertainty can be reduced by acquiring more samples. This is illustrated by the narrowing confidence interval around the samples.

While methods such as Bayesian Neural Networks (BNNs) and Monte Carlo (MC) Dropout have attempted to retrofit uncertainty estimates onto deep networks, they often rely on approximate inference techniques that introduce their own variance and computational overhead. They provide approximations of the posterior, not exact evaluations.

This report posits that Probabilistic Circuits offer a principled framework to address this epistemic challenge. Unlike standard neural networks, PCs are designed not merely to predict, but to represent the joint probability distribution of the data as a computational graph. Crucially, they do so while guaranteeing tractability. Through strict structural properties, i.e., smoothness, decomposability, and determinism, PCs enable the exact computation of marginals, conditionals, and moments in polynomial time. This capability provides a rigorous alternative to empirical UQ methods, moving from approximate estimates of uncertainty to exact, mathematically guaranteed derivations.

The following sections explore the theoretical mechanics that enable tractability, the architectural advancements that have allowed PCs to scale to high-dimensional data, and the application of PCs to complex problems.

PCs are no longer just a theoretical curiosity but a valuable component of the next generation of trustworthy AI.

The Grammar of Tractability

To understand the unique value of PCs for UQ, one must first appreciate the “grammar” of their construction. A PC is not simply a neural network with probabilistic outputs; it is a Directed Acyclic Graph (DAG) that encodes a probability density function (PDF) or probability mass function (PMF) through a hierarchy of specific computational units.

The Computational Graph

At the fundamental level, a PC represents a joint distribution $P(\mathbf{X})$ over a set of random variables $\mathbf{X}$. The graph is composed of three primary types of nodes, each serving a distinct probabilistic function. Input Units (Leaves) are the building blocks of the circuit, representing simple, tractable distributions over a single variable or a small subset of variables. Common choices include Gaussian distributions for continuous data, Bernoulli or Categorical distributions for discrete data, or even piecewise polynomials. Sum Units ($\oplus$) compute a weighted sum of their children’s outputs. In the probabilistic interpretation, a sum node represents a mixture model. It introduces a latent variable $Z$ that selects which branch of the mixture is active, thereby allowing the circuit to model multimodality and complex dependencies. Product Units ($\otimes$) compute the product of their children’s outputs. Probabilistically, product nodes represent factorizations, encoding independence assumptions between subsets of variables. The value computed at the root of the PC for a given input configuration $\mathbf{x}$ corresponds to the (possibly unnormalized) probability density $P(\mathbf{x})$.
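
The three unit types described above can be sketched in a few lines of Python. This is a minimal illustration rather than the API of any PC library: the helper names `gaussian_leaf`, `product_unit`, and `sum_unit`, as well as the mixture weights and Gaussian parameters, are assumptions chosen for the example.

```python
import math

def gaussian_leaf(var, mean, std):
    """Input unit: univariate Gaussian density over the variable `var`."""
    def evaluate(x):
        z = (x[var] - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return evaluate

def product_unit(*children):
    """Product unit: factorization over children with disjoint scopes."""
    def evaluate(x):
        value = 1.0
        for child in children:
            value *= child(x)
        return value
    return evaluate

def sum_unit(weights, children):
    """Sum unit: mixture of children that all share the same scope."""
    def evaluate(x):
        return sum(w * child(x) for w, child in zip(weights, children))
    return evaluate

# A two-component mixture over (X1, X2); each component factorizes.
circuit = sum_unit(
    [0.3, 0.7],
    [
        product_unit(gaussian_leaf("X1", -2.0, 1.0), gaussian_leaf("X2", -2.0, 1.0)),
        product_unit(gaussian_leaf("X1", 2.0, 1.0), gaussian_leaf("X2", 2.0, 1.0)),
    ],
)

# The root value is the joint density at the given configuration.
density = circuit({"X1": 2.0, "X2": 2.0})
print(f"p(x1=2, x2=2) = {density:.4f}")
```

A single bottom-up pass evaluates every unit once, so the cost of a density query is linear in the number of edges of the graph.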

Example of a Probabilistic Circuit that is not tractable. The variables at the leaves indicate the random variables. This PC lacks the necessary structural properties to guarantee tractable inference, as the mixture (red) does not fulfill the smoothness property. Illustration taken from.

Structural Constraints for Tractability

The distinguishing feature of PCs is the imposition of structural constraints on the graph topology. In generic graphical models like Bayesian Networks or Markov Random Fields, inference is often #P-hard, requiring exponential time in the worst case. PCs circumvent this by enforcing properties that ensure integrals and maximizations commute with the sum and product operations.

However, tractability is not a ‘‘universal property’’, as Choi et al. explain. It always depends on the pairing of a query class with a model class that satisfies the required structural properties. Choi et al. define tractability as follows: A class of queries $\mathcal{Q}$ is tractable for a class of models $\mathcal{M}$ if any query $q \in \mathcal{Q}$ on any model $m \in \mathcal{M}$ can be computed in time polynomial in the size of the model, i.e., $O(\text{poly}(|m|))$. In that case, $\mathcal{M}$ is called a tractable representation for $\mathcal{Q}$.

The tractability of different probabilistic queries relies on the structural properties of the PC, as summarized in the table below:

Structural Property	Enabled Query	Mathematical Operation	UQ Application
Smoothness	Evidence	$P(e)$	Anomaly Detection, OOD Detection
Decomposability (+ Smoothness)	Marginals	$\int P(x_{\text{obs}}, x_{\text{miss}}) \, dx_{\text{miss}}$	Missing Data Imputation, Partial Evidence
Decomposability + Smoothness	Conditionals	$P(q \mid e) = \frac{P(q,e)}{P(e)}$	Counterfactuals, Conditional Forecasting
Determinism	MAP Inference	$\underset{x}{\mathrm{argmax}} \, P(x)$	Image Reconstruction, Most Likely Explanation
Structured Decomposability	Circuit Multiplication	$P(x) \cdot Q(x)$	KL Divergence, Bayesian Updating, Ensemble Merging

Smoothness

A sum node is defined as smooth (or complete) if all of its children define distributions over the exact same set of variables, known as the scope.

Smoothness ensures that the sum node represents a valid mixture distribution where the weights sum to unity (or a normalizing constant). If a sum node were non-smooth, meaning one child covered variables $\{X_1, X_2\}$ and another covered only $\{X_1\}$, the resulting function would not integrate to a consistent value, as the missing variable $X_2$ in the second branch is unaccounted for. Smoothness guarantees that when we perform marginalization (integrating out a variable), the integral distributes linearly over the sum. This property is what allows PCs to handle missing data naturally: the probability mass of the missing variables integrates to 1 in every branch of a smooth sum node, effectively vanishing from the computation without disrupting the validity of the distribution over the observed variables.

Decomposability

A product node is decomposable if its children define distributions over disjoint sets of variables.

Decomposability is the structural encoding of conditional independence. It allows high-dimensional integrals to break down into products of lower-dimensional integrals. Mathematically, if a function $f(\mathbf{x})$ decomposes into $g(\mathbf{y})h(\mathbf{z})$ where $\mathbf{y}$ and $\mathbf{z}$ are disjoint, then the integral $\int f(\mathbf{x}) d\mathbf{x} = (\int g(\mathbf{y}) d\mathbf{y}) (\int h(\mathbf{z}) d\mathbf{z})$. Without decomposability, the integral would require evaluating the full high-dimensional joint space, which is computationally intractable. This property ensures that marginal inference in a PC is linear in the size of the circuit, providing a distinct advantage over Normalizing Flows or VAEs where marginals are often intractable.
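
The following sketch illustrates how smoothness and decomposability turn marginalization into a single forward pass: every leaf over a missing variable simply returns 1. It uses binary variables so that the result can be checked against brute-force summation. The helper names and all parameters are assumptions made for this example.

```python
from itertools import product as cartesian

def bernoulli_leaf(var, p):
    """Leaf over one binary variable. A missing value (None) is summed out,
    so the leaf returns its total probability mass of 1."""
    def evaluate(x):
        value = x.get(var)
        if value is None:
            return 1.0
        return p if value == 1 else 1.0 - p
    return evaluate

def product_unit(*children):
    """Decomposable product: children cover disjoint variables."""
    def evaluate(x):
        out = 1.0
        for child in children:
            out *= child(x)
        return out
    return evaluate

def sum_unit(weights, children):
    """Smooth sum: all children cover the same variables."""
    def evaluate(x):
        return sum(w * child(x) for w, child in zip(weights, children))
    return evaluate

circuit = sum_unit(
    [0.4, 0.6],
    [
        product_unit(bernoulli_leaf("A", 0.9), bernoulli_leaf("B", 0.2), bernoulli_leaf("C", 0.5)),
        product_unit(bernoulli_leaf("A", 0.1), bernoulli_leaf("B", 0.7), bernoulli_leaf("C", 0.3)),
    ],
)

# P(A=1) from one pass with B and C marked as missing.
marginal = circuit({"A": 1, "B": None, "C": None})

# Brute force for comparison: sum the joint over all states of B and C.
brute_force = sum(
    circuit({"A": 1, "B": b, "C": c}) for b, c in cartesian([0, 1], repeat=2)
)
print(marginal, brute_force)
```

The brute-force sum grows exponentially with the number of missing variables, while the circuit pass always costs one evaluation.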

Determinism

A sum node is deterministic if, for any complete input configuration, at most one of its children evaluates to a non-zero value.

While smoothness and decomposability are sufficient for marginal inference, determinism unlocks tractable Maximum A Posteriori (MAP) inference. The MAP query asks for the most probable configuration of variables given evidence: $\text{argmax}_{\mathbf{x}} P(\mathbf{x} \mid \mathbf{e})$. In general, maximizing a sum of functions (a mixture) is hard because the mode could lie anywhere between the modes of the components. However, if the sum is deterministic, only one component is non-zero for any $\mathbf{x}$. Consequently, the maximum of the sum equals the largest of the individual component maxima: $\max_{\mathbf{x}} \sum_i f_i(\mathbf{x}) = \max_i \max_{\mathbf{x}} f_i(\mathbf{x})$. This allows the $\max$ operator to push down through sum nodes, with each sum replaced by a maximum over its children, just as integrals push down through smooth sum nodes. This property is critical for tasks like image inpainting or finding the most likely explanation for a medical diagnosis.
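
As a minimal sketch of this max-propagation, the toy circuit below makes its sum node deterministic by attaching indicator leaves on a variable $A$ with disjoint support to the two branches. Each unit returns its maximum value together with the maximizing assignment. The helper names and the probabilities are illustrative assumptions, and the result is compared against an explicitly tabulated joint.

```python
def indicator(var, value):
    """Leaf that is 1 when var == value and 0 otherwise; its maximum is 1."""
    return lambda: (1.0, {var: value})

def bernoulli(var, p):
    """Bernoulli leaf: returns its largest probability and the maximizing state."""
    return lambda: (p, {var: 1}) if p >= 0.5 else (1.0 - p, {var: 0})

def max_product(*children):
    """Decomposable product: children are maximized independently."""
    def solve():
        value, assignment = 1.0, {}
        for child in children:
            v, a = child()
            value *= v
            assignment.update(a)
        return value, assignment
    return solve

def max_sum(weights, children):
    """Deterministic sum: the maximum of the sum is the best weighted child."""
    def solve():
        best = (0.0, {})
        for w, child in zip(weights, children):
            v, a = child()
            if w * v > best[0]:
                best = (w * v, a)
        return best
    return solve

# P(A, B) = 0.4 [A=1] Bern(B; 0.2) + 0.6 [A=0] Bern(B; 0.7)
circuit = max_sum(
    [0.4, 0.6],
    [
        max_product(indicator("A", 1), bernoulli("B", 0.2)),
        max_product(indicator("A", 0), bernoulli("B", 0.7)),
    ],
)
map_value, map_assignment = circuit()

# Explicit joint table for comparison, keyed by (A, B).
joint = {(1, 1): 0.4 * 0.2, (1, 0): 0.4 * 0.8, (0, 1): 0.6 * 0.7, (0, 0): 0.6 * 0.3}
brute_assignment = max(joint, key=joint.get)
print(map_assignment, map_value, brute_assignment)
```

Without the indicator leaves, both branches could be non-zero for the same configuration, and replacing the sum by a maximum would in general only yield a lower bound on the true maximum.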

Structured Decomposability and Vtrees

Recent research has emphasized a more rigorous constraint known as structured decomposability. A PC is structured-decomposable if the decomposition of variables at every product node follows a hierarchical tree structure over the variables, known as a vtree.

A vtree is a static binary tree where leaves correspond to random variables. A structured PC respects this vtree if every product node in the circuit corresponds to a node in the vtree, partitioning the variables exactly as the vtree does. This property is profound because it enables operations beyond simple inference, such as the efficient multiplication of two circuits. If two PCs respect the same vtree, their product (which represents the product of their densities) remains a structured PC. This algebra of circuits allows for computing KL divergences, merging expert models, and performing Bayesian updates in polynomial time.

Connecting Probabilistic Circuits to Uncertainty Quantification

The theoretical properties of PCs translate directly into capabilities that solve fundamental challenges in UQ. While deep learning models often struggle to distinguish between low-probability events and model errors, PCs provide exact probabilistic metrics.

Arbitrary Conditioning

The ability to compute exact conditional probabilities $P(\mathbf{Q} \mid \mathbf{E})$ for any disjoint subsets of query variables $\mathbf{Q}$ and evidence variables $\mathbf{E}$ is perhaps the most significant operational advantage of PCs.

In standard deep learning, UQ is typically tied to a specific prediction task. A model is trained to predict $Y$ given $X$. If the user suddenly needs to know the probability of a specific input feature $X_1$ given the output $Y$ (an inverse query), the model cannot provide it without retraining or complex inversion techniques. PCs, by modeling the full joint distribution, are agnostic to the direction of inference.

This capability is particularly vital for handling missing data. In real-world scientific engineering, sensor failure is common. When a standard neural network encounters missing inputs, it usually requires imputation, guessing the missing values, before processing. This imputation introduces a point estimate that ignores the uncertainty of the missing value. A PC, conversely, handles missing data by integrating out the missing variables analytically. The resulting marginal distribution over the observed variables reflects the true uncertainty: the probability density becomes ‘‘flatter’’ or more diffuse, accurately capturing the loss of information. This is not an approximation but the belief given partial evidence.
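
To make the direction-agnostic nature of these queries concrete, the sketch below answers a forward query with a missing feature and an inverse query from the same small circuit over $(X_1, X_2, Y)$, each as a ratio of two marginal evaluations. The circuit, its parameters, and the helper `conditional` are assumptions made for illustration.

```python
def bernoulli_leaf(var, p):
    """Binary leaf; an unobserved variable is summed out and contributes 1."""
    def evaluate(x):
        value = x.get(var)
        if value is None:
            return 1.0
        return p if value == 1 else 1.0 - p
    return evaluate

def product_unit(*children):
    def evaluate(x):
        out = 1.0
        for child in children:
            out *= child(x)
        return out
    return evaluate

def sum_unit(weights, children):
    def evaluate(x):
        return sum(w * child(x) for w, child in zip(weights, children))
    return evaluate

# Joint distribution over two features (X1, X2) and a label Y.
circuit = sum_unit(
    [0.5, 0.5],
    [
        product_unit(bernoulli_leaf("X1", 0.9), bernoulli_leaf("X2", 0.8), bernoulli_leaf("Y", 0.9)),
        product_unit(bernoulli_leaf("X1", 0.2), bernoulli_leaf("X2", 0.3), bernoulli_leaf("Y", 0.1)),
    ],
)

def conditional(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return circuit({**evidence, **query}) / circuit(evidence)

# Forward query with X2 missing: P(Y=1 | X1=1).
p_forward = conditional({"Y": 1}, {"X1": 1})
# Inverse query on the same model, no retraining: P(X1=1 | Y=1).
p_inverse = conditional({"X1": 1}, {"Y": 1})
print(p_forward, p_inverse)
```

No imputation step is involved: the unobserved $X_2$ never receives a value, it is simply left out of the evidence.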

Tractable Dropout Inference

Recent work has bridged the gap between the popular UQ technique of MC Dropout and the tractable framework of PCs. MC Dropout estimates uncertainty in neural networks by randomly dropping units during inference and measuring the variance of the predictions. While effective, it is computationally expensive (requiring multiple forward passes) and yields only an empirical approximation.

Comparison of forward passes for a standard PC (a), a PC with MC Dropout sampling (b), and a PC using TDI via variance propagation (c). MCD requires multiple forward passes, each sampling one instantiation of a possible subgraph. In contrast, TDI analytically propagates variances through the graph in a single pass. Illustration taken from.

Ventola et al. have introduced Tractable Dropout Inference (TDI) for PCs. Because PCs track the propagation of moments exactly, it is possible to derive the analytical moments of the output distribution under the dropout noise model in a single forward pass. Instead of sampling dropout masks, TDI propagates the first and second moments (mean and variance) through the sum and product nodes.

For a sum node, the mean is the weighted sum of children’s means. The variance computation involves the variances of children plus a term accounting for the variance of the gating weights themselves (if they are stochastic). For product nodes, due to independence (decomposability), the mean is the product of means, and the variance follows standard variance-of-product rules.
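
The moment rules sketched in the previous paragraph can be written down for a single node as follows. This is a simplified illustration and not the exact update rules of Ventola et al.: it assumes that the children of a node are independent of each other and that each sum edge is kept independently with probability `keep_prob` (a Bernoulli gate), both of which are assumptions of this sketch.

```python
def product_moments(children):
    """Mean and variance of a product of independent children.

    Each child is a (mean, variance) pair. Uses E[X^2] = Var[X] + E[X]^2
    and the fact that moments of independent factors multiply.
    """
    mean, second_moment = 1.0, 1.0
    for m, v in children:
        mean *= m
        second_moment *= v + m * m
    return mean, second_moment - mean * mean

def sum_moments(weights, children, keep_prob=1.0):
    """Mean and variance of a weighted sum whose edges are dropped at random.

    Each term is w * gate * X with gate ~ Bernoulli(keep_prob), independent
    of X. Children are treated as independent (an assumption of this sketch).
    """
    mean, var = 0.0, 0.0
    for w, (m, v) in zip(weights, children):
        term_mean = w * keep_prob * m
        term_second = w * w * keep_prob * (v + m * m)  # E[gate^2] = keep_prob
        mean += term_mean
        var += term_second - term_mean * term_mean
    return mean, var

prod_mean, prod_var = product_moments([(2.0, 1.0), (3.0, 4.0)])
sum_mean, sum_var = sum_moments([0.5, 0.5], [(2.0, 1.0), (4.0, 3.0)], keep_prob=0.5)
print(prod_mean, prod_var, sum_mean, sum_var)
```

Applying such rules bottom-up yields the output mean and variance in one pass, which is the contrast to sampling many dropout masks.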

This allows PCs to provide ‘‘dropout-based’’ uncertainty estimates that are theoretically sound and computationally efficient, eliminating the sampling noise inherent in standard MC Dropout. This technique has been shown to significantly improve the robustness of PCs to distribution shifts and OOD data.

In-Distribution (ID) vs. OOD detection precision difference for PCs (dashed) and PCs with TDI (solid) across different thresholds on different tasks. TDI improves both absolute performance and the ID/OOD trade-off. Without TDI, PCs perform poorly and peak at counterintuitively low thresholds. Illustration obtained from.

Sequential Uncertainty in Time Series

In temporal domains, uncertainty accumulates over time. A forecasting model should become less confident the further it predicts into the future. Standard Recurrent Neural Networks (RNNs) often fail to capture this diverging uncertainty.

Integrating PCs with recurrent architectures, such as in Recurrent Conditional Whittle Networks (RECOWN), provides a mechanism to quantify this temporal uncertainty. These models use a PC, i.e. a Conditional Whittle SPN, to model the distribution of the spectral coefficients of the time series.

The figure shows the architecture of the RECOWN model. The weights of the Conditional Whittle SPN (CWSPN) are determined by a neural network using a short-time Fourier transform (STFT) on the context $X$. The same STFT is fed to the gated recurrent units of an RNN, which predicts the Fourier coefficients $D^{\prime i}$. The CWSPN computes the conditional Whittle log-likelihood based on these Fourier coefficients. Illustration obtained from.

This integration allows for the computation of the Log-Likelihood Ratio Score (LLRS), a dynamic uncertainty metric. When the model encounters a sequence that deviates from the learned temporal dynamics, such as a sudden frequency shift in power grid data, the conditional likelihood computed by the PC drops sharply. This score allows operators to distinguish between a ‘‘hard to predict’’ stochastic sequence (high aleatoric uncertainty) and a ‘‘structurally novel’’ sequence (high epistemic uncertainty), enabling trustworthy anomaly detection in time-series data.

Probabilistic Calibrated Circuits

While PCs offer exact and efficient inference, a critical problem of calibration arises when applying them to UQ. The standard training objective for PCs is to approximate the joint distribution $P(X,Y)$ by minimizing the negative log-likelihood, $\mathcal{L}_{\text{NLL}} = - \mathbb{E}_{\text{data}} [\log P_{\text{PC}}(x, y)]$. However, accurate UQ, in the various forms discussed earlier, requires well-calibrated conditional distributions, which are computed only as a secondary step in PCs. Since the training objective prioritizes the joint likelihood, it does not guarantee that the conditionals are calibrated. This misalignment can lead to systematic errors in uncertainty estimates: the model might capture the global density well but fail to accurately reflect the confidence in specific predictions.

The authors’ current work addresses this miscalibration by answering two key questions: how can the systematic calibration error be quantified, and how can it be mitigated? In response, they introduce Probabilistic Calibrated Circuits (PCCs), a novel post-hoc recalibration technique that provably retains the structure and tractability of PCs while reducing miscalibration. This section provides a high-level overview of the core concepts, while a more comprehensive treatment will be presented in an upcoming publication.

Quantifying the Miscalibration

To diagnose the extent of miscalibration, the Probability Integral Transform (PIT) can be utilized. For a continuous random variable $X$ with cumulative distribution function (CDF) $F_X$, the random variable $Z = F_X(X)$ is uniformly distributed, $Z \sim \mathcal{U}(0, 1)$.

The calibration of the conditional distribution $p(x\mid y)$ can be checked by calculating the PIT values for a recalibration dataset $(x_i, y_i),\ i = 1,\dots,N$:

\[z_i = F_{\text{PC}}(x_i \mid y_i) = \int_{-\infty}^{x_i} p_{\text{PC}}(x' \mid y_i) dx',\ i = 1, \dots, N\]

Thanks to the properties of PCs (marginalization and integration), the calculation of $F_{\text{PC}}$ is exact and efficient. The resulting histogram of PIT values reveals the nature of the error:

Uniform Distribution: The model is perfectly calibrated.
U-Shape: The model is under-dispersed (distribution too narrow, uncertainty underestimated).
Inverted U-Shape: The model is over-dispersed (distribution too broad, uncertainty overestimated).
Systematic Shift: The mean of the prediction is systematically incorrect.

To formally quantify the deviation from uniformity, we can calculate the distance between the empirical CDF of the PIT values, $\hat{F}_{\text{cal}}$, and the ideal uniform CDF. This calibration error, $E_{\text{cal}} = d(\hat{F}_{\text{cal}}(z),\mathcal{U}(0,1))$, can serve as the objective function for optimization in PCCs.
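
A minimal sketch of this diagnostic is given below. It makes two assumptions that are not specified in the text: the distance $d$ is taken to be the Kolmogorov-Smirnov distance, and a one-dimensional Gaussian stands in for the conditional CDF $F_{\text{PC}}$ that a real PC would provide. The data are deterministic quantiles of $\mathcal{N}(0, 0.5^2)$, so the example is reproducible.

```python
from statistics import NormalDist

def pit_values(samples, model):
    """PIT values z_i = F_model(x_i)."""
    return [model.cdf(x) for x in samples]

def ks_distance_to_uniform(z_values):
    """Kolmogorov-Smirnov distance between the empirical CDF of z and U(0, 1)."""
    z = sorted(z_values)
    n = len(z)
    return max(max((i + 1) / n - zi, zi - i / n) for i, zi in enumerate(z))

N = 200
# Recalibration data: evenly spaced quantiles of the "true" distribution.
true_dist = NormalDist(0.0, 0.5)
samples = [true_dist.inv_cdf((i + 0.5) / N) for i in range(N)]

# A model that matches the data versus one that is too broad (over-dispersed).
err_calibrated = ks_distance_to_uniform(pit_values(samples, NormalDist(0.0, 0.5)))
err_overdispersed = ks_distance_to_uniform(pit_values(samples, NormalDist(0.0, 1.0)))
print(err_calibrated, err_overdispersed)
```

For the over-dispersed model the PIT values pile up around 0.5, which corresponds to the inverted U-shape in the histogram and to a clearly larger distance.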

Post-hoc Input Recalibration

Leveraging the calibration error metric, we apply an input recalibration function $g$ to the variable $x$, ensuring that the PIT values of the transformed input are uniformly distributed. The resulting recalibrated density is derived via the change of variables formula

\[p_{\text{cal}}(x, y) = p_{\text{PC}}(g(x), y) \cdot |g'(x)|\]

It can be proven that, provided $g$ satisfies specific properties, this recalibrated density $p_{\text{cal}}$ can be exactly represented by a new, valid Probabilistic Circuit.
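
The change of variables can be checked numerically in one dimension. In the sketch below the variable $y$ is dropped for brevity, a standard Gaussian stands in for the PC density, and $g(x) = 2x$ is a hand-picked increasing map that happens to fit data from $\mathcal{N}(0, 0.5^2)$. All three choices are assumptions for illustration; in practice $g$ would be fitted by minimizing the calibration error.

```python
from statistics import NormalDist

model = NormalDist(0.0, 1.0)  # stand-in for the (miscalibrated) PC density

def g(x):
    """Recalibration map; must be monotonic and differentiable."""
    return 2.0 * x

def g_prime(x):
    return 2.0

def p_cal(x):
    """Recalibrated density via the change of variables formula."""
    return model.pdf(g(x)) * abs(g_prime(x))

def cdf_cal(x):
    """Recalibrated CDF; valid in this form because g is increasing."""
    return model.cdf(g(x))

# Trapezoidal rule on [-5, 5]: the recalibrated density still integrates to 1.
xs = [-5.0 + 10.0 * i / 2000 for i in range(2001)]
total = sum(0.5 * (p_cal(a) + p_cal(b)) * (b - a) for a, b in zip(xs, xs[1:]))

# The result coincides with the density the data were assumed to follow.
target = NormalDist(0.0, 0.5)
max_gap = max(abs(p_cal(x) - target.pdf(x)) for x in xs)
print(total, max_gap)
```

The Jacobian factor $|g'(x)|$ is what keeps the transformed function a normalized density.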

Scaling Circuit Architectures to High Dimensions

For years, a major criticism of Probabilistic Circuits (PCs) was their lack of scalability. Although they provided exact inference, they struggled to model the complex, high-dimensional dependencies found in data like images or natural language, a domain where Deep Neural Networks (DNNs) excelled. This led to the prevailing belief in a rigid trade-off: one could have tractability (PCs) or expressiveness (DNNs), but not both. However, architectural innovations and new computational frameworks developed in recent research (2023–2025) have largely dismantled this dichotomy, enabling PCs to scale massively.

Scaling via Vectorization

The traditional implementation of PCs involved sparse, irregular graph structures that were essentially pointer-chasing operations. This is efficient on CPUs for small models but not suited for modern GPUs, which rely on dense matrix multiplications and coherent memory access.

Einsum Networks (EiNets) represent a paradigm shift in PC implementation. The core insight of EiNets is to reformulate the execution of sum and product layers using the Einstein summation (einsum) convention, a standard operation in tensor algebra libraries like PyTorch and TensorFlow.

An illustration of grouping nodes with the same topological depth into disjoint subsets. The forward- and backward-pass can be carried out independently on nodes within the same layer / subset. Illustration obtained from.

Instead of processing nodes individually, EiNets organize nodes into layers. A ‘‘product layer’’ combines the outputs of its child layers through element-wise multiplication and reshaping of large tensors. A ‘‘sum layer’’, which performs the actual mixing, becomes a tensor contraction (matrix multiplication).

By combining these monolithic tensor operations, EiNets allow PCs to utilize the massive parallelism of GPUs. This vectorization enables the training of PCs with millions of parameters and hundreds of layers, achieving density estimation performance on benchmarks like ImageNet that rivals intractable deep generative models.
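
The layer-wise view can be sketched with NumPy's `einsum`. The tensor shapes, index names, and random weights below are assumptions for illustration, and the sketch works in the linear domain for readability, whereas practical implementations typically operate in log-space for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_nodes, k_in, k_out = 4, 3, 5, 2

# Outputs of the left and right child layers: one vector of size k_in per node.
left = rng.random((batch, n_nodes, k_in))
right = rng.random((batch, n_nodes, k_in))

# Product layer: all pairwise products of the two child vectors.
prod = np.einsum("bni,bnj->bnij", left, right)

# Sum layer: nonnegative weights, normalized over the k_in * k_in inputs.
weights = rng.random((n_nodes, k_out, k_in, k_in))
weights /= weights.sum(axis=(2, 3), keepdims=True)
out = np.einsum("nkij,bnij->bnk", weights, prod)

# Product and sum fused into a single contraction.
fused = np.einsum("nkij,bni,bnj->bnk", weights, left, right)

# Node-by-node reference, mimicking the pointer-chasing implementation.
ref = np.zeros((batch, n_nodes, k_out))
for b in range(batch):
    for n in range(n_nodes):
        for k in range(k_out):
            ref[b, n, k] = np.sum(weights[n, k] * np.outer(left[b, n], right[b, n]))

print(np.allclose(out, ref), np.allclose(fused, ref))
```

The fused contraction avoids materializing the intermediate product tensor, and the whole layer maps onto one dense operation that a GPU can parallelize.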

Scaling via Sparse Monarch Matrices

While EiNets solved the computation speed, parameter efficiency remained a challenge. Fully dense sum layers imply that every latent component connects to every child, leading to quadratic parameter growth.

Recent work has introduced Monarch Matrices to parameterize the sum blocks in PCs. Monarch matrices are a class of structured sparse matrices that are highly expressive (capable of representing permutations and Fast Fourier Transforms) yet computationally efficient.

By replacing dense weight matrices in sum layers with products of sparse Monarch factors, Zhang et al. have reduced the memory and computation footprint of PCs significantly.
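
The parameter saving can be illustrated with a small NumPy sketch. It assumes the common Monarch form $M = P L P^\top R$, with block-diagonal factors $L$ and $R$ and a fixed permutation $P$ that transposes an $m \times m$ grid of indices. How exactly such factors are embedded in the sum layers of a PC, including the normalization needed for valid mixture weights, is not covered here.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4          # number of blocks, and size of each block
n = m * m      # width of the layer

# Nonnegative block-diagonal factors, stored as (block, out, in).
L = rng.random((m, m, m))
R = rng.random((m, m, m))

def monarch_apply(x):
    """Compute P L P^T R x using two batched block products and two transposes."""
    y = np.einsum("kij,kj->ki", R, x.reshape(m, m))  # block-diagonal R
    y = y.T                                          # permutation P^T
    y = np.einsum("kij,kj->ki", L, y)                # block-diagonal L
    return y.T.reshape(-1)                           # permutation P

def block_diag(blocks):
    """Dense n x n matrix with the given m x m blocks on its diagonal."""
    out = np.zeros((n, n))
    for k in range(m):
        out[k * m:(k + 1) * m, k * m:(k + 1) * m] = blocks[k]
    return out

# Dense reference built from explicit permutation and block-diagonal matrices.
perm = np.arange(n).reshape(m, m).T.reshape(-1)
P = np.eye(n)[perm]
dense = P @ block_diag(L) @ P.T @ block_diag(R)

x = rng.random(n)
structured_params = L.size + R.size   # 2 * m**3, i.e. 2 * n**1.5
dense_params = n * n
print(np.allclose(monarch_apply(x), dense @ x), structured_params, dense_params)
```

The gap widens with the layer width: the structured form grows as $2n^{1.5}$ while a dense matrix grows as $n^2$.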

The figure shows the bits-per-dimension (BPD) as a function of the training FLOPs per pixel. The Hidden Chow-Liu Tree (HCLT) serves as the base architecture. As the computational budget increases, the Monarch version of the HCLT is more efficient than the default version. Illustration obtained from.

This Monarch Parameterization approach has enabled ‘‘unprecedented scaling’’, allowing PCs to achieve state-of-the-art generative modeling performance on challenging benchmarks like Text8 and ImageNet 32x32, demonstrating superior scaling laws (better performance for fewer FLOPs) compared to traditional dense parameterizations.

Restructuring of Fitted Circuits

A historical limitation of PCs was ‘‘structure lock-in’’. Once a PC was trained with a specific vtree (variable decomposition), it was difficult to perform operations with other circuits having different structures.

New algorithms for restructuring PCs have emerged. These algorithms allow a structured-decomposable PC to be transformed into a new PC that respects a target vtree while representing the same distribution.

The restructuring process involves converting the original PC into an equivalent Bayesian Network with latent variables, identifying the conditional independencies required by the target vtree, and then recursively constructing the new circuit layers.

This breakthrough allows for dynamic inference optimization. A large, complex PC trained for high expressiveness can be ‘‘compiled’’ or restructured into a shallower, optimized circuit for faster inference on edge devices. It also enables the multiplication of circuits with different structures, which is essential for ensemble methods where different models might learn different structural dependencies.

Applications of Probabilistic Circuits for UQ

To conclude this post, we want to discuss some recent applications of PCs for UQ. The frontier of PC research is no longer about replacing neural networks but integrating with them. The concept of Probabilistic Neural Circuits (PNCs) has emerged, blending the learnable features of deep learning with the tractable reasoning of circuits.

Probabilistic Flow Circuits

A promising hybrid architecture involves integrating Probabilistic Circuits with Normalizing Flows (NFs). This synergy addresses the complementary limitations of both architectures: Normalizing Flows are highly effective at modeling continuous, local correlations through diffeomorphic transformations but lack mechanisms for handling global discrete structure and efficient marginalization. Conversely, PCs excel at capturing global structure via mixture models and performing exact marginalization but can be inefficient at modeling complex local continuous manifolds, often requiring an excessive number of mixture components to approximate them.

The illustration shows a comparison between a PC and a PFC on a donut-shaped task. The blue and red colors depict the distributions captured by the leaf nodes, while the green dots illustrate the joint distribution. The PFC captures the target data better due to its multi-modal leaf densities. Illustration obtained from.

Probabilistic Flow Circuits (PFCs) bridge this gap by replacing the standard univariate leaf distributions of a PC with flexible, invertible flow transformations. This integration yields a powerful dual-structure model. The PC backbone manages the multimodal, discrete structure of the data (such as distinct object categories in an image), while the flow-based leaves model the continuous manifold of variations (such as pixel intensities) within each category. A naive integration of flows would violate the decomposability required for tractable inference, as flows inherently couple variables. To preserve the circuit’s marginalization guarantees, recent theoretical work introduces structural constraints such as $\tau$-decomposability. These conditions ensure that flow transformations are applied only to disjoint subsets of variables in a way that does not entangle the global independence structure maintained by the circuit. This architecture enables PFCs to improve density estimation performance while retaining the capability to answer complex probabilistic queries, such as marginals and conditionals, that are typically intractable for standalone Normalizing Flows.

Interestingly, despite originating from distinct research motivations, the authors’ work on Probabilistic Calibrated Circuits (discussed above) can be viewed as a specialized subclass of Probabilistic Flow Circuits. While PFCs generally employ flows to enhance the flexibility of density estimation, PCCs utilize specific monotonic transformations at the leaves to minimize calibration error. Thus, the PCC framework effectively instantiates a Probabilistic Flow Circuit where the flow transformations are constrained by the objective of post-hoc uncertainty calibration.

Multi-Token Prediction with Probabilistic Circuits

Another interesting and highly relevant application of PCs in 2025 involves their integration into the training and inference of Large Language Models (LLMs), specifically in the domain of Multi-Token Prediction (MTP).

Different dependency structures over sequences of tokens. The input tokens are grouped into the coloured layers. Illustration taken from.

Standard LLMs are autoregressive, that is, they predict the next token $x_{t+1}$ given the history $x_{1:t}$. Generating a sequence of length $L$ requires $L$ sequential forward passes, a process bound by memory bandwidth and thus inherently slow. Speculative Decoding addresses this by employing a lightweight “draft” model to predict a chunk of $K$ tokens, which are subsequently verified in parallel by the large target model. However, conventional draft models often prioritize speed over expressiveness by assuming independence among the $K$ predicted tokens. This approximation sacrifices accuracy, leading to lower acceptance rates and reduced speedups.

To bridge this gap, recent work introduces MTP with PCs (MTPC). This framework utilizes a PC to model the joint distribution of the next $K$ tokens, $P(x_{t+1}, \dots, x_{t+K} \mid x_{1:t})$.

PCs are uniquely suited for this task because they offer a tractable mechanism to model complex dependencies between future tokens (e.g., capturing that “San” strongly implies “Francisco”) without the computational overhead of a full Transformer. A PC can evaluate the likelihood of candidate sequences with high computational efficiency.

The authors explored the trade-off between latency and expressiveness for different MTP designs across several PC architectures (FF, CP, HMM, BTree), as well as the effect of sharing layers between the draft and verifying models in self-speculative decoding. Illustration taken from.

The MTPC framework systematically explores a spectrum of PC architectures to balance expressiveness and latency:

Fully Factorized (FF): The baseline approach that assumes independence between tokens. While fast, it suffers from low acceptance rates due to its inability to model correlations.
Canonical Polyadic (CP): Introduces a single latent variable $Z$ with $r$ states to model the joint distribution as a mixture of independent components. This captures shared global context through the mixture weights while keeping the structure shallow for fast inference.
Hidden Markov Models (HMM): A deeper PC structure that models sequential dependencies explicitly via latent state transitions $z_1 \to z_2 \to \dots \to z_K$. This offers higher expressiveness by capturing local dependencies between adjacent tokens but incurs increased latency due to the sequential summation.
Binary Tree Factorizations (BTree): A novel hierarchical structure that recursively splits the token window. This architecture strikes a favorable balance between depth and width, efficiently capturing long-range dependencies among the $K$ tokens while enabling parallel sampling.
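
The difference between the first two designs in the list above can be made concrete for $K = 2$ tokens. The sketch below compares a CP-style mixture with a fully factorized model that has exactly the same per-token marginals. The four-word vocabulary and all probabilities are assumptions invented for this example.

```python
from itertools import product as cartesian

FIRST = ["San", "New"]
SECOND = ["Francisco", "York"]

# CP: P(x1, x2) = sum_z P(z) P(x1 | z) P(x2 | z) with r = 2 latent states.
p_z = [0.5, 0.5]
p_x1_given_z = [{"San": 0.9, "New": 0.1}, {"San": 0.1, "New": 0.9}]
p_x2_given_z = [{"Francisco": 0.9, "York": 0.1}, {"Francisco": 0.1, "York": 0.9}]

def cp_prob(x1, x2):
    """Joint probability under the CP (single latent variable) circuit."""
    return sum(
        pz * p1[x1] * p2[x2]
        for pz, p1, p2 in zip(p_z, p_x1_given_z, p_x2_given_z)
    )

def marginal_x1(x1):
    return sum(cp_prob(x1, x2) for x2 in SECOND)

def marginal_x2(x2):
    return sum(cp_prob(x1, x2) for x1 in FIRST)

def ff_prob(x1, x2):
    """Fully factorized model with the same marginals as the CP model."""
    return marginal_x1(x1) * marginal_x2(x2)

total = sum(cp_prob(x1, x2) for x1, x2 in cartesian(FIRST, SECOND))
for x1, x2 in cartesian(FIRST, SECOND):
    print(f"{x1} {x2}: CP={cp_prob(x1, x2):.2f}  FF={ff_prob(x1, x2):.2f}")
```

The factorized drafter assigns the same probability to every pair, whereas the mixture concentrates mass on the coherent continuations. This is the kind of correlation that is expected to raise acceptance rates in speculative decoding.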

Empirical evaluations retrofitting EvaByte, a byte-level LLM, with MTPC demonstrate the efficacy of this approach. MTPC increases generation throughput by a factor of $5.47$ compared to standard autoregressive generation and achieves a $1.22\times$ speedup over MTP baselines that rely on independence assumptions. This performance gain stems from the fact that the PC-based draft model is sufficiently expressive to generate high-quality drafts, which are accepted by the verifier model more often than those from independent drafters. These results establish PCs as a viable, real-time accelerator for foundation models, capable of efficiently capturing the local correlations of language and byte sequences.

SPN-Guided Latent Space Manipulation

In the domain of medical AI, “explainability” is not merely a desirable feature but a crucial safety requirement. Clinicians need to understand the rationale behind a model’s decision, for instance, why a specific scan was diagnosed as containing a tumor. A powerful technique for providing such insights is the generation of counterfactual explanations: answering the question, “What would this scan look like if the patient were healthy?”

However, generating plausible counterfactuals is challenging; standard methods often produce unrealistic images that trick the classifier but are medically meaningless. To address this, Siekiera et al. propose SPN-Guided Latent Space Manipulation, utilizing the rigorous density estimation capabilities of PCs to generate high-fidelity counterfactuals.

The illustration shows the use of an SPN to model the latent distribution of a VAE. Based on the counterfactual frontier ($\mathbf{z}_{cf}$), plausible counterfactual embeddings are found. Propagating these embeddings through the decoder generates counterfactual examples. Illustration obtained from.

The approach is built upon a hybrid architecture that integrates an SPN into the latent space of a semi-supervised Variational Autoencoder (VAE). First, a VAE is trained to compress high-dimensional medical images into a lower-dimensional latent space $\mathbf{z}$. Unlike standard VAEs, which assume a simple Gaussian prior and often fail to capture complex data manifolds, this method employs an SPN to model the true, complex distribution of latent vectors, $P_{\text{SPN}}(\mathbf{z} \mid y)$, where $y$ denotes the class label (e.g., “Healthy” or “Sick”). The SPN serves a dual purpose: it acts as a flexible prior describing the latent topology and as a classifier $P(y \mid \mathbf{z})$, ensuring the latent space is structured according to the diagnostic classes.
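
The dual role of the class-conditional density can be sketched in a few lines. The example below is a simplification under stated assumptions: a one-dimensional latent $z$, and a small Gaussian mixture per class standing in for the SPN (a mixture is a shallow circuit with one sum node over leaf distributions). All parameter values and function names are hypothetical.

```python
import math


def gauss_pdf(z, mean, std):
    """Density of a univariate Gaussian leaf."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


# Hypothetical parameters: each class has a list of (weight, mean, std) leaves.
CLASS_PRIOR = {"Healthy": 0.5, "Sick": 0.5}
COMPONENTS = {
    "Healthy": [(0.7, -2.0, 0.5), (0.3, -1.0, 0.8)],
    "Sick": [(0.6, 1.5, 0.6), (0.4, 2.5, 0.7)],
}


def density_given_class(z, y):
    """P(z | y): the flexible, class-conditional prior over the latent space."""
    return sum(w * gauss_pdf(z, m, s) for w, m, s in COMPONENTS[y])


def density(z):
    """P(z) = sum_y P(y) P(z | y): marginal density of a latent vector."""
    return sum(CLASS_PRIOR[y] * density_given_class(z, y) for y in CLASS_PRIOR)


def posterior(y, z):
    """P(y | z) by Bayes' rule: the same model used as a classifier."""
    return CLASS_PRIOR[y] * density_given_class(z, y) / density(z)


for z in (-2.0, 0.0, 2.0):
    print(z, density(z), posterior("Healthy", z), posterior("Sick", z))
```

Because the density and the classifier come from one joint model, a latent point can be assessed both for its predicted class and for how typical it is of the training data, which is what the counterfactual search relies on.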

To generate a counterfactual for a patient diagnosed as “Sick” ($y_{\text{orig}}$), the model optimizes a new latent vector $\mathbf{z}_{cf}$ that flips the classification to “Healthy” ($y_{\text{target}}$). This optimization is governed by three competing objectives: validity, proximity, and plausibility. First, the model maximizes the SPN’s predicted probability for the target class, $P_{\text{SPN}}(y_{\text{target}} \mid \mathbf{z}_{cf})$, to ensure the diagnosis changes. Second, it minimizes the distance $\lVert \mathbf{z}_{cf} - \mathbf{z}_{\text{orig}} \rVert$ to guarantee that the counterfactual remains semantically similar to the original patient scan. Finally, and crucially, the optimization maximizes the likelihood of the vector under the SPN prior, $P_{\text{SPN}}(\mathbf{z}_{cf})$. This constraint forces the search into the high-density regions of the “Healthy” distribution, effectively preventing the generation of OOD or hallucinated samples.
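
The three objectives can be combined into a single score and optimized by gradient ascent. The sketch below is a toy illustration rather than the authors' procedure: it assumes a one-dimensional latent space, one Gaussian per class in place of the SPN, a weighted sum of the three terms, and finite-difference gradients. The weights `LAMBDA_PROX` and `LAMBDA_PLAUS`, the step size, and all other values are hypothetical.

```python
import math

# Hypothetical stand-in for the SPN: one unit-variance Gaussian per class.
MEANS = {"Healthy": -2.0, "Sick": 2.0}
PRIOR = {"Healthy": 0.5, "Sick": 0.5}
Z_ORIG = 1.5          # latent code of the original "Sick" scan (assumed)
LAMBDA_PROX = 0.1     # weight of the proximity term (assumed)
LAMBDA_PLAUS = 1.0    # weight of the plausibility term (assumed)


def class_density(z, y):
    """P(z | y) for a unit-variance Gaussian centered at the class mean."""
    return math.exp(-0.5 * (z - MEANS[y]) ** 2) / math.sqrt(2 * math.pi)


def density(z):
    """P(z): marginal density under the stand-in model."""
    return sum(PRIOR[y] * class_density(z, y) for y in PRIOR)


def posterior_healthy(z):
    """P(Healthy | z) via Bayes' rule."""
    return PRIOR["Healthy"] * class_density(z, "Healthy") / density(z)


def objective(z):
    validity = math.log(posterior_healthy(z))       # flip the diagnosis
    proximity = -LAMBDA_PROX * (z - Z_ORIG) ** 2     # stay near the original
    plausibility = LAMBDA_PLAUS * math.log(density(z))  # stay in dense regions
    return validity + proximity + plausibility


def optimize(z_start, steps=300, lr=0.05, eps=1e-5):
    """Plain gradient ascent with a central finite-difference gradient."""
    z = z_start
    for _ in range(steps):
        grad = (objective(z + eps) - objective(z - eps)) / (2 * eps)
        z += lr * grad
    return z


z_cf = optimize(Z_ORIG)
print(z_cf, posterior_healthy(z_cf))
```

In this toy setting the search settles between the original code and the center of the “Healthy” density: the proximity term pulls it back toward `Z_ORIG`, while the validity and plausibility terms pull it toward the target class. In practice the latent space is multi-dimensional and gradients would come from automatic differentiation rather than finite differences.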

Experiments on the CheXpert dataset demonstrate that this SPN-guided approach produces anatomically plausible alterations, such as the specific removal of lung opacities, that are far more stable and interpretable than those generated by baseline DNN methods. While simple MLP classifiers require aggressive regularization to avoid adversarial noise, the SPN’s robust density modeling naturally guides the generation toward semantically meaningful counterfactuals without such fragile tuning.

A Personal Note

To conclude, we have shown how recent advancements have positioned PCs as a scalable and powerful solution for UQ. These models not only provide a robust, tractable method for assessing uncertainty but also enable innovative applications across diverse fields, as the examples above illustrate. We anticipate a particularly transformative role for PCs in the future, specifically their integration into live systems to boost reliability and provide deeper interpretability. We hope this overview stimulates further research and practical deployment of PCs for UQ. Please note that this post is far from a comprehensive survey of all work in this rapidly evolving field. It is merely a selection of key ideas and applications that we found particularly interesting and illustrative.
