Deep learning models struggle with epistemic uncertainty quantification, often exhibiting blind confidence on out-of-distribution data. This work reviews Probabilistic Circuits (PCs) as a versatile framework for rigorous, tractable reasoning. PCs model the joint probability distribution and, by enforcing structural constraints (specifically smoothness, decomposability, and determinism), allow for the exact computation of marginals, conditionals, and moments in polynomial time without retraining. We discuss the suitability of PCs for Uncertainty Quantification (UQ), describing their advantages and highlighting their potential for tractable UQ in high-dimensional problems.
The trajectory of artificial intelligence over the last decade has been defined by a strong focus on predictive accuracy, driven largely by the scaling of deep neural networks. From Large Language Models (LLMs) to generative diffusion systems, the capacity of these models to approximate complex functions is undeniable. However, as these systems migrate from controlled academic benchmarks to high-stakes deployment in healthcare, autonomous navigation, and climate modeling, a significant limitation has emerged: the inability to reliably quantify what the model does not know.
Standard deterministic deep learning architectures, despite their expressiveness, often suffer from overconfidence when evaluated on out-of-distribution (OOD) data.
Uncertainty is generally categorized into two distinct forms: aleatoric uncertainty, which is irreducible and stems from the inherent stochasticity of the data generation process (e.g., sensor noise), and epistemic uncertainty, which is reducible and arises from a lack of knowledge about the model parameters or the true structure of the data.
While methods such as Bayesian Neural Networks (BNNs) and Monte Carlo (MC) Dropout have attempted to retrofit uncertainty estimates onto deep networks, they often rely on approximate inference techniques that introduce their own variance and computational overhead.
This report posits that Probabilistic Circuits offer a principled framework to address this epistemic challenge. Unlike standard neural networks, PCs are designed not merely to predict, but to represent the joint probability distribution of the data as a computational graph. Crucially, they do so while guaranteeing tractability. Through strict structural properties, i.e., smoothness, decomposability, and determinism, PCs enable the exact computation of marginals, conditionals, and moments in polynomial time.
The following sections explore the theoretical mechanics that enable tractability, the architectural advancements that have allowed PCs to scale to high-dimensional data, and the application of PCs to complex problems.
PCs are no longer just a theoretical curiosity but a valuable component of the next generation of trustworthy AI.
To understand the unique value of PCs for UQ, one must first appreciate the “grammar” of their construction. A PC is not simply a neural network with probabilistic outputs; it is a Directed Acyclic Graph (DAG) that encodes a probability density function (PDF) or probability mass function (PMF) through a hierarchy of specific computational units.
At the fundamental level, a PC represents a joint distribution $P(\mathbf{X})$ over a set of random variables $\mathbf{X}$. The graph is composed of three primary types of nodes, each serving a distinct probabilistic function. Input Units (Leaves) are the building blocks of the circuit, representing simple, tractable distributions over a single variable or a small subset of variables. Common choices include Gaussian distributions for continuous data, Bernoulli or Categorical distributions for discrete data, or even piecewise polynomials. Sum Units ($\oplus$) compute a weighted sum of their children’s outputs. In the probabilistic interpretation, a sum node represents a mixture model. Product Units ($\otimes$) compute the product of their children’s outputs. In the probabilistic interpretation, a product node represents a factorization over independent sets of variables.
The distinguishing feature of PCs is the imposition of structural constraints on the graph topology. In generic graphical models like Bayesian Networks or Markov Random Fields, inference is often #P-hard, requiring exponential time in the worst case. PCs circumvent this by enforcing properties that ensure integrals and maximizations commute with the sum and product operations.
However, tractability is no “universal property”, as Choi et al. explain
The tractability of different probabilistic queries relies on the structural properties of the PC, as summarized in the table below
| Structural Property | Enabled Query | Mathematical Operation | UQ Application |
|---|---|---|---|
| Smoothness | Evidence | $P(e)$ | Anomaly Detection, OOD Detection |
| Decomposability (+ Smoothness) | Marginals | $\int P(x_{\text{obs}}, x_{\text{miss}}) \, dx_{\text{miss}}$ | Missing Data Imputation, Partial Evidence |
| Decomposability + Smoothness | Conditionals | $P(q \mid e) = \frac{P(q,e)}{P(e)}$ | Counterfactuals, Conditional Forecasting |
| Determinism | MAP Inference | $\underset{x}{\mathrm{argmax}} \, P(x)$ | Image Reconstruction, Most Likely Explanation |
| Structured Decomposability | Circuit Multiplication | $P(x) \cdot Q(x)$ | KL Divergence, Bayesian Updating, Ensemble Merging |
A sum node is defined as smooth (or complete) if all of its children define distributions over the exact same set of variables, known as the scope.
Smoothness ensures that the sum node represents a valid mixture distribution where the weights sum to unity (or a normalizing constant). If a sum node were non-smooth, meaning one child covered variables $\{X_1, X_2\}$ and another covered only $\{X_1\}$, the resulting function would not integrate to a consistent value, as the missing variable $X_2$ in the second branch is unaccounted for. Smoothness guarantees that when we perform marginalization (integrating out a variable), the integral distributes linearly over the sum. This property is what allows PCs to handle missing data naturally: the probability mass of the missing variables integrates to 1 in every branch of a smooth sum node, effectively vanishing from the computation without disrupting the validity of the distribution over the observed variables.
A product node is decomposable if its children define distributions over disjoint sets of variables.
Decomposability is the structural encoding of conditional independence. It allows high-dimensional integrals to break down into products of lower-dimensional integrals. Mathematically, if a function $f(\mathbf{x})$ decomposes into $g(\mathbf{y})h(\mathbf{z})$ where $\mathbf{y}$ and $\mathbf{z}$ are disjoint, then the integral $\int f(\mathbf{x}) d\mathbf{x} = (\int g(\mathbf{y}) d\mathbf{y}) (\int h(\mathbf{z}) d\mathbf{z})$. Without decomposability, the integral would require evaluating the full high-dimensional joint space, which is computationally intractable. This property ensures that marginal inference in a PC is linear in the size of the circuit, providing a distinct advantage over Normalizing Flows or VAEs where marginals are often intractable
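To make the interplay of smoothness and decomposability concrete, here is a minimal sketch of a toy circuit over two binary variables. The weights, Bernoulli parameters, and function names are hypothetical and chosen only for illustration. Marginalizing $X_2$ amounts to setting its leaves to 1 and running a single bottom-up pass, which we compare against brute-force summation.

```python
from itertools import product

# Hypothetical toy circuit over two binary variables X1, X2:
#   P(x1, x2) = 0.3 * B(x1; 0.9) * B(x2; 0.2) + 0.7 * B(x1; 0.1) * B(x2; 0.6)
# The sum node is smooth (both children cover {X1, X2}) and each product
# node is decomposable (its leaves cover disjoint variables).
WEIGHTS = [0.3, 0.7]
LEAF_PARAMS = [(0.9, 0.2), (0.1, 0.6)]  # P(X=1) for (X1, X2) in each branch


def bernoulli_leaf(value, p_one):
    """Leaf unit: P(X = value), or 1.0 if X is marginalized (value is None)."""
    if value is None:
        return 1.0  # a normalized leaf integrates to one
    return p_one if value == 1 else 1.0 - p_one


def evaluate(x1, x2):
    """One bottom-up pass: product nodes multiply, the sum node mixes."""
    total = 0.0
    for weight, (p1, p2) in zip(WEIGHTS, LEAF_PARAMS):
        total += weight * bernoulli_leaf(x1, p1) * bernoulli_leaf(x2, p2)
    return total


# P(X1 = 1) from a single pass with X2 marginalized at the leaves ...
marginal_pass = evaluate(1, None)
# ... versus explicit summation over all values of X2.
marginal_brute = sum(evaluate(1, x2) for x2 in (0, 1))
# Sanity check: the circuit is normalized.
normalizer = sum(evaluate(a, b) for a, b in product((0, 1), repeat=2))
```

The single pass touches each node once, so the cost is linear in the circuit size, whereas the brute-force sum grows exponentially with the number of marginalized variables.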
A sum node is deterministic if, for any complete input configuration, at most one of its children evaluates to a non-zero value.
While smoothness and decomposability are sufficient for marginal inference, determinism unlocks tractable Maximum A Posteriori (MAP) inference. The MAP query asks for the most probable configuration of variables given evidence: $\text{argmax}_{\mathbf{x}} P(\mathbf{x} \mid \mathbf{e})$. In general, maximizing a sum of functions (a mixture) is hard because the mode could lie anywhere between the modes of the components. However, if the sum is deterministic, only one component is non-zero for any $\mathbf{x}$. Consequently, the maximum of the sum equals the largest of the component-wise maxima: $\max_{\mathbf{x}} \sum_i f_i(\mathbf{x}) = \max_i \max_{\mathbf{x}} f_i(\mathbf{x})$, where each $f_i$ denotes a weighted child of the sum node. This allows the $\max$ operator to push down through sum nodes (each sum is replaced by a maximization over its children), just as integrals push down through smooth sum nodes. This property is critical for tasks like image inpainting or finding the most likely explanation for a medical diagnosis.
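The following sketch illustrates MAP inference on a deterministic toy circuit. The circuit, its parameters, and the helper names are hypothetical. Each branch of the sum node is gated by an indicator on $X_1$, so at most one branch is non-zero for any complete assignment. Replacing the sum by a maximum and each leaf by its mode then recovers the same answer as exhaustive search.

```python
from itertools import product

# Hypothetical deterministic circuit over binary X1, X2:
#   P(x1, x2) = 0.4 * [x1 = 0] * B(x2; 0.7) + 0.6 * [x1 = 1] * B(x2; 0.2)
# The indicators on X1 make the sum node deterministic.
BRANCHES = [
    {"weight": 0.4, "x1": 0, "p_x2_one": 0.7},
    {"weight": 0.6, "x1": 1, "p_x2_one": 0.2},
]


def joint(x1, x2):
    """Ordinary bottom-up evaluation of the circuit."""
    total = 0.0
    for branch in BRANCHES:
        indicator = 1.0 if x1 == branch["x1"] else 0.0
        p_x2 = branch["p_x2_one"] if x2 == 1 else 1.0 - branch["p_x2_one"]
        total += branch["weight"] * indicator * p_x2
    return total


def map_max_product():
    """Max-product pass: the sum becomes a max, each leaf returns its mode."""
    best_value, best_state = -1.0, None
    for branch in BRANCHES:
        x2_star = 1 if branch["p_x2_one"] >= 0.5 else 0
        p_star = max(branch["p_x2_one"], 1.0 - branch["p_x2_one"])
        value = branch["weight"] * p_star  # the indicator leaf contributes 1.0
        if value > best_value:
            best_value, best_state = value, (branch["x1"], x2_star)
    return best_state, best_value


map_state, map_value = map_max_product()
# Exhaustive search over all four assignments, for comparison.
brute_state = max(product((0, 1), repeat=2), key=lambda s: joint(*s))
```

Without determinism, several branches could contribute to the same assignment, and the per-branch maximum would no longer equal the maximum of the mixture.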
Recent research has emphasized a more rigorous constraint known as structured decomposability. A PC is structured-decomposable if the decomposition of variables at every product node follows a hierarchical tree structure over the variables, known as a vtree
A vtree is a static binary tree where leaves correspond to random variables. A structured PC respects this vtree if every product node in the circuit corresponds to a node in the vtree, partitioning the variables exactly as the vtree does. This property is profound because it enables operations beyond simple inference, such as the efficient multiplication of two circuits. If two PCs respect the same vtree, their product (which represents the product of their densities) remains a structured PC. This algebra of circuits allows for computing KL divergences, merging expert models, and performing Bayesian updates in polynomial time
The theoretical properties of PCs translate directly into capabilities that solve fundamental challenges in UQ. While deep learning models often struggle to distinguish between low-probability events and model errors, PCs provide exact probabilistic metrics.
The ability to compute exact conditional probabilities $P(\mathbf{Q} \mid \mathbf{E})$ for any disjoint subsets of query variables $\mathbf{Q}$ and evidence variables $\mathbf{E}$ is perhaps the most significant operational advantage of PCs.
In standard deep learning, UQ is typically tied to a specific prediction task. A model is trained to predict $Y$ given $X$. If the user suddenly needs to know the probability of a specific input feature $X_1$ given the output $Y$ (an inverse query), the model cannot provide it without retraining or complex inversion techniques. PCs, by modeling the full joint distribution, are agnostic to the direction of inference.
This capability is particularly vital for handling missing data. In real-world scientific engineering, sensor failure is common. When a standard neural network encounters missing inputs, it usually requires imputation, guessing the missing values, before processing. This imputation introduces a point estimate that ignores the uncertainty of the missing value. A PC, conversely, handles missing data by integrating out the missing variables analytically
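As a minimal sketch of direction-agnostic inference with missing data, consider a mixture of two fully factorized Gaussians over three variables. All parameters and function names here are hypothetical. The same `density` routine answers a forward query, an inverse query, and handles the unobserved $X_3$ by simply skipping its leaf, which is equivalent to integrating it out.

```python
import math

# Hypothetical circuit: a sum node over two product nodes, each a product of
# univariate Gaussian leaves for (X1, X2, X3).
COMPONENTS = [
    {"weight": 0.5, "means": (0.0, 0.0, 0.0), "stds": (1.0, 1.0, 1.0)},
    {"weight": 0.5, "means": (4.0, 4.0, 4.0), "stds": (1.0, 1.0, 1.0)},
]


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def density(values):
    """Marginal density; entries set to None are integrated out analytically."""
    total = 0.0
    for comp in COMPONENTS:
        term = comp["weight"]
        for v, mean, std in zip(values, comp["means"], comp["stds"]):
            if v is not None:  # a marginalized leaf contributes a factor of 1
                term *= gaussian_pdf(v, mean, std)
        total += term
    return total


def conditional(query, evidence):
    """p(query | evidence) = p(query, evidence) / p(evidence)."""
    joint = tuple(q if q is not None else e for q, e in zip(query, evidence))
    return density(joint) / density(evidence)


# Forward query p(x1 = 4 | x2 = 4) with X3 missing ...
forward = conditional((4.0, None, None), (None, 4.0, None))
# ... and the inverse query p(x2 = 4 | x1 = 4) from the same model.
inverse = conditional((None, 4.0, None), (4.0, None, None))
# Observing x2 = 4 makes x1 = 0 far less plausible than x1 = 4.
unlikely = conditional((0.0, None, None), (None, 4.0, None))
```

No retraining or imputation is involved: each query is two evaluations of the same circuit.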
Recent work has bridged the gap between the popular UQ technique of MC Dropout and the tractable framework of PCs. MC Dropout estimates uncertainty in neural networks by randomly dropping units during inference and measuring the variance of the predictions. While effective, it is computationally expensive (requiring multiple forward passes) and yields only an empirical approximation.
Ventola et al. have introduced Tractable Dropout Inference (TDI) for PCs
For a sum node, the mean is the weighted sum of children’s means. The variance computation involves the variances of children plus a term accounting for the variance of the gating weights themselves (if they are stochastic). For product nodes, due to independence (decomposability), the mean is the product of means, and the variance follows standard variance-of-product rules.
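The product-node rule can be checked numerically. The sketch below is not the TDI implementation; it only verifies the standard identity for independent quantities, $\mathrm{Var}[AB] = \mathrm{Var}[A]\mathrm{Var}[B] + \mathrm{Var}[A]\,\mathbb{E}[B]^2 + \mathrm{Var}[B]\,\mathbb{E}[A]^2$, against exact enumeration. The two discrete child distributions are hypothetical, and the sum-node variance with stochastic weights is deliberately left out.

```python
from itertools import product


def moments(dist):
    """Mean and variance of a discrete distribution {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    var = sum((v - mean) ** 2 * p for v, p in dist.items())
    return mean, var


def product_node_moments(mean_a, var_a, mean_b, var_b):
    """Closed-form moments of A * B for independent A and B."""
    mean = mean_a * mean_b
    var = var_a * var_b + var_a * mean_b ** 2 + var_b * mean_a ** 2
    return mean, var


def sum_node_mean(weights, child_means):
    """Mean of a sum node with fixed weights: weighted sum of child means."""
    return sum(w * m for w, m in zip(weights, child_means))


# Hypothetical random outputs of two children with disjoint scopes.
child_a = {0.2: 0.5, 0.6: 0.5}
child_b = {0.1: 0.25, 0.5: 0.75}

mean_a, var_a = moments(child_a)
mean_b, var_b = moments(child_b)
closed_form = product_node_moments(mean_a, var_a, mean_b, var_b)

# Exact distribution of the product under independence, by enumeration.
joint = {}
for (va, pa), (vb, pb) in product(child_a.items(), child_b.items()):
    joint[va * vb] = joint.get(va * vb, 0.0) + pa * pb
enumerated = moments(joint)
```

Because the moments are propagated in closed form, a single pass replaces the repeated stochastic forward passes of MC Dropout.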
This allows PCs to provide “dropout-based” uncertainty estimates that are theoretically sound and computationally efficient, eliminating the sampling noise inherent in standard MC Dropout. This technique has been shown to significantly improve the robustness of PCs to distribution shifts and OOD data.
In temporal domains, uncertainty accumulates over time. A forecasting model should become less confident the further it predicts into the future. Standard Recurrent Neural Networks (RNNs) often fail to capture this diverging uncertainty.
Integrating PCs with recurrent architectures, such as in Recurrent Conditional Whittle Networks (RECOWN), provides a mechanism to quantify this temporal uncertainty
This integration allows for the computation of the Log-Likelihood Ratio Score (LLRS), a dynamic uncertainty metric. When the model encounters a sequence that deviates from the learned temporal dynamics, such as a sudden frequency shift in power grid data, the conditional likelihood computed by the PC drops sharply. This score allows operators to distinguish between a “hard to predict” stochastic sequence (high aleatoric uncertainty) and a “structurally novel” sequence (high epistemic uncertainty), enabling trustworthy anomaly detection in time-series data.
While PCs offer exact and efficient inference, a critical problem of calibration arises when applying them to UQ. The standard training objective for PCs is to approximate the joint distribution $P(X,Y)$ by minimizing the negative log-likelihood, $\mathcal{L}_{\text{NLL}} = - \mathbb{E}_{\text{data}} [\log P_{\text{PC}}(x, y)]$. However, accurate UQ, in the various forms discussed earlier, requires well-calibrated conditional distributions, which are computed only as a secondary step in PCs. Since the training objective prioritizes the joint likelihood, it does not guarantee that the conditionals are calibrated. This misalignment can lead to systematic errors in uncertainty estimates: the model might capture the global density well but fail to accurately reflect the confidence in specific predictions.
The authors’ current work addresses this miscalibration by answering two key questions: how can the systematic calibration error be quantified, and how can it be mitigated? In response, they introduce Probabilistic Calibrated Circuits (PCCs), a novel post-hoc recalibration technique that provably retains the structure and tractability of PCs while reducing miscalibration. This section provides a high-level overview of the core concepts, while a more comprehensive treatment will be presented in an upcoming publication.
To diagnose the extent of miscalibration, Probability Integral Transform (PIT) can be utilized
The calibration of the conditional distribution $p(x\mid y)$ can be checked by calculating the PIT values for a recalibration dataset $(x_i, y_i),\ i = 1,\dots,N$
\[z_i = F_{\text{PC}}(x_i \mid y_i) = \int_{-\infty}^{x_i} p_{\text{PC}}(x' \mid y_i) \, dx', \quad i = 1, \dots, N\]
Thanks to the properties of PCs (marginalization and integration), the calculation of $F_{\text{PC}}$ is exact and efficient. The resulting histogram of PIT values reveals the nature of the error
To formally quantify the deviation from uniformity, we can calculate the distance between the empirical CDF of the PIT values, $\hat{F}_{\text{cal}}$, and the ideal uniform CDF. This calibration error, $E_{\text{cal}} = d(\hat{F}_{\text{cal}}(z),\mathcal{U}(0,1))$, can serve as the objective function for optimization in PCCs.
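A minimal sketch of this diagnostic is shown below. It makes two assumptions that are not taken from the PCC work: the distance $d$ is chosen as the Kolmogorov-Smirnov distance, and a univariate Gaussian stands in for the conditional CDF $F_{\text{PC}}(x \mid y)$ of a circuit. The data are deterministic quantiles of a standard normal, so the example is reproducible.

```python
from statistics import NormalDist


def ks_distance_to_uniform(pit):
    """Kolmogorov-Smirnov distance between the empirical CDF of the PIT
    values and the CDF of U(0, 1)."""
    z = sorted(pit)
    n = len(z)
    return max(max((i + 1) / n - zi, zi - i / n) for i, zi in enumerate(z))


# Hypothetical recalibration set: quantiles of the true distribution N(0, 1).
N = 200
true_dist = NormalDist(mu=0.0, sigma=1.0)
xs = [true_dist.inv_cdf((i + 0.5) / N) for i in range(N)]

# Stand-ins for the model CDF: one matches the data, one is too narrow.
calibrated_model = NormalDist(mu=0.0, sigma=1.0)
overconfident_model = NormalDist(mu=0.0, sigma=0.5)

pit_calibrated = [calibrated_model.cdf(x) for x in xs]
pit_overconfident = [overconfident_model.cdf(x) for x in xs]

error_calibrated = ks_distance_to_uniform(pit_calibrated)
error_overconfident = ks_distance_to_uniform(pit_overconfident)
```

The overconfident model pushes PIT values toward 0 and 1, which shows up as a markedly larger distance to the uniform CDF.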
Leveraging the calibration error metric, we apply an input recalibration function $g$ to the variable $x$, ensuring that the PIT values of the transformed input are uniformly distributed. The resulting recalibrated density is derived via the change of variables formula
\[p_{\text{cal}}(x, y) = p_{\text{PC}}(g(x), y) \cdot |g'(x)|\]
It can be proven that, provided $g$ satisfies specific properties, this recalibrated density $p_{\text{cal}}$ can be exactly represented by a new, valid Probabilistic Circuit.
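The change of variables can be illustrated on a single leaf. In this sketch, which is a simplified univariate stand-in and not the PCC algorithm, a hypothetical overconfident Gaussian leaf with standard deviation 0.5 is recalibrated by the linear map $g(x) = x/2$. The result coincides with the standard normal that we assume generated the data.

```python
from statistics import NormalDist

# Hypothetical overconfident leaf of a circuit (too narrow for the data).
base_model = NormalDist(mu=0.0, sigma=0.5)


def g(x):
    """Hypothetical monotonically increasing recalibration map."""
    return 0.5 * x


def g_prime(x):
    """Derivative of g, needed for the change of variables."""
    return 0.5


def recalibrated_pdf(x):
    """p_cal(x) = p(g(x)) * |g'(x)|."""
    return base_model.pdf(g(x)) * abs(g_prime(x))


def recalibrated_cdf(x):
    """For an increasing g, the recalibrated CDF is F(g(x))."""
    return base_model.cdf(g(x))


# Assumed data-generating distribution for this toy example.
target = NormalDist(mu=0.0, sigma=1.0)
grid = [-3.0 + 0.01 * i for i in range(601)]
max_pdf_gap = max(abs(recalibrated_pdf(x) - target.pdf(x)) for x in grid)
# Riemann-sum check that the recalibrated density still carries unit mass
# (up to the tails outside [-3, 3]).
mass = sum(recalibrated_pdf(x) * 0.01 for x in grid)
```

The Jacobian factor $|g'(x)|$ is what keeps the transformed function a normalized density, so the recalibrated leaf can again serve as an input unit.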
For years, a major criticism of Probabilistic Circuits (PCs) was their lack of scalability. Although they provided exact inference, they struggled to model the complex, high-dimensional dependencies found in data like images or natural language, a domain where Deep Neural Networks (DNNs) excelled. This led to the prevailing belief in a rigid trade-off: one could have tractability (PCs) or expressiveness (DNNs), but not both. However, architectural innovations and new computational frameworks developed in recent research (2023–2025) have largely dismantled this dichotomy, enabling PCs to scale massively.
The traditional implementation of PCs involved sparse, irregular graph structures whose evaluation essentially amounted to pointer chasing. This is efficient on CPUs for small models but poorly suited to modern GPUs, which rely on dense matrix multiplications and coherent memory access.
Einsum Networks (EiNets) represent a paradigm shift in PC implementation
Instead of processing nodes individually, EiNets organize nodes into layers. A “product layer” can be viewed as a mixing operation that can be computed via element-wise multiplication and reshaping of large tensors. A “sum layer” becomes a tensor contraction (matrix multiplication).
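The layer-wise view can be sketched in a few lines. This is a single-sample, pure-Python illustration under simplifying assumptions, not the EiNet implementation, which operates on batched tensors. The product layer becomes an element-wise addition in the log-domain, and the sum layer becomes a matrix-vector product that is made numerically stable by shifting with the largest input. All names and weights are hypothetical.

```python
import math


def product_layer(log_left, log_right):
    """All product nodes of a layer at once: element-wise multiplication of
    paired children, i.e. addition in the log-domain."""
    return [a + b for a, b in zip(log_left, log_right)]


def sum_layer(weights, log_inputs):
    """All sum nodes of a layer at once: a matrix-vector product in the
    linear domain, computed stably by factoring out the largest input."""
    shift = max(log_inputs)
    shifted = [math.exp(v - shift) for v in log_inputs]
    return [
        shift + math.log(sum(w * s for w, s in zip(row, shifted)))
        for row in weights
    ]


# Hypothetical weights: each row belongs to one sum node and sums to one.
WEIGHTS = [[0.5, 0.5], [0.9, 0.1]]

# Moderate values: the result matches the linear-domain computation.
log_left = [math.log(0.2), math.log(0.5)]
log_right = [math.log(0.3), math.log(0.4)]
log_outputs = sum_layer(WEIGHTS, product_layer(log_left, log_right))

# Tiny likelihoods: exp(-801) underflows to 0.0 in the linear domain, but the
# shifted computation stays finite.
tiny_outputs = sum_layer(WEIGHTS, product_layer([-800.0, -801.0], [-1.0, -2.0]))
```

Working in the log-domain matters for deep circuits, where products of many small densities would otherwise underflow.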
By combining these monolithic tensor operations, EiNets allow PCs to utilize the massive parallelism of GPUs. This vectorization enables the training of PCs with millions of parameters and hundreds of layers, achieving density estimation performance on benchmarks like ImageNet that rivals intractable deep generative models
While EiNets addressed computation speed, parameter efficiency remained a challenge. Fully dense sum layers imply that every latent component connects to every child, leading to quadratic parameter growth.
Recent work has introduced Monarch Matrices to parameterize the sum blocks in PCs
By replacing dense weight matrices in sum layers with products of sparse Monarch factors, Zhang et al. have reduced the memory and computation footprint of PCs significantly.
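To get a feeling for the savings, the sketch below compares parameter counts. It assumes the commonly described square Monarch form, in which an $n \times n$ matrix is factored into two block-diagonal matrices with $\sqrt{n}$ blocks of size $\sqrt{n} \times \sqrt{n}$ each, interleaved with fixed permutations. The exact parameterization used for sum layers in the PC work may differ.

```python
import math


def dense_param_count(n):
    """Parameters of a dense n x n weight matrix."""
    return n * n


def monarch_param_count(n):
    """Parameters of two block-diagonal factors, each with sqrt(n) blocks of
    size sqrt(n) x sqrt(n). The permutations are fixed and parameter-free."""
    m = math.isqrt(n)
    if m * m != n:
        raise ValueError("this sketch assumes n is a perfect square")
    return 2 * m * (m * m)


# Dense versus Monarch counts for a few hypothetical layer widths.
counts = {n: (dense_param_count(n), monarch_param_count(n)) for n in (64, 256, 1024)}
savings_at_1024 = counts[1024][0] / counts[1024][1]
```

Under this assumption the count grows as $2n^{1.5}$ instead of $n^2$, so the relative savings increase with layer width.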
This Monarch Parameterization approach has enabled “unprecedented scaling”, allowing PCs to achieve state-of-the-art generative modeling performance on challenging benchmarks like Text8 and ImageNet 32x32, demonstrating superior scaling laws (better performance for fewer FLOPs) compared to traditional dense parameterizations.
A historical limitation of PCs was “structure lock-in”. Once a PC was trained with a specific vtree (variable decomposition), it was difficult to perform operations with other circuits having different structures.
New algorithms for restructuring PCs have emerged
The restructuring process involves converting the original PC into an equivalent Bayesian Network with latent variables, identifying the conditional independencies required by the target vtree, and then recursively constructing the new circuit layers.
This breakthrough allows for dynamic inference optimization. A large, complex PC trained for high expressiveness can be “compiled” or restructured into a shallower, optimized circuit for faster inference on edge devices. It also enables the multiplication of circuits with different structures, which is essential for ensemble methods where different models might learn different structural dependencies.
Before concluding this post, we want to discuss some recent applications of PCs for UQ. The frontier of PC research is no longer about replacing neural networks but integrating with them. The concept of Probabilistic Neural Circuits (PNCs) has emerged, blending the learnable features of deep learning with the tractable reasoning of circuits.
A promising hybrid architecture involves integrating Probabilistic Circuits with Normalizing Flows (NFs)
Probabilistic Flow Circuits (PFCs) bridge this gap by replacing the standard univariate leaf distributions of a PC with flexible, invertible flow transformations. This integration yields a powerful dual-structure model. The PC backbone manages the multimodal, discrete structure of the data (such as distinct object categories in an image), while the flow-based leaves model the continuous manifold of variations (such as pixel intensities) within each category. A naive integration of flows would violate the decomposability required for tractable inference, as flows inherently couple variables. To preserve the circuit’s marginalization guarantees, recent theoretical work introduces structural constraints such as $\tau$-decomposability
Interestingly, despite originating from distinct research motivations, the authors’ work on Probabilistic Calibrated Circuits (discussed above) can be viewed as a specialized subclass of Probabilistic Flow Circuits. While PFCs generally employ flows to enhance the flexibility of density estimation, PCCs utilize specific monotonic transformations at the leaves to minimize calibration error. Thus, the PCC framework effectively instantiates a Probabilistic Flow Circuit where the flow transformations are constrained by the objective of post-hoc uncertainty calibration.
Another interesting and highly relevant application of PCs in 2025 involves their integration into the training and inference of Large Language Models (LLMs), specifically in the domain of Multi-Token Prediction (MTP).
Standard LLMs are autoregressive, that is, they predict the next token $x_{t+1}$ given the history $x_{1:t}$. Generating a sequence of length $L$ requires $L$ sequential forward passes, a process bound by memory bandwidth and thus inherently slow. Speculative Decoding addresses this by employing a lightweight “draft” model to predict a chunk of $K$ tokens, which are subsequently verified in parallel by the large target model. However, conventional draft models often prioritize speed over expressiveness by assuming independence among the $K$ predicted tokens. This approximation sacrifices accuracy, leading to lower acceptance rates and reduced speedups.
To bridge this gap, recent work introduces MTP with PCs (MTPC)
PCs are uniquely suited for this task because they offer a tractable mechanism to model complex dependencies between future tokens (e.g., capturing that “San” strongly implies “Francisco”) without the computational overhead of a full Transformer. A PC can evaluate the likelihood of candidate sequences with high computational efficiency.
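The cost of the independence assumption is easy to see on a two-token toy example. The sketch below is not one of the MTPC architectures. It uses a hypothetical mixture of two factorized components (a sum node over two product nodes) and compares it with the fully factorized draft obtained from its own marginals.

```python
# Hypothetical circuit over two future tokens: a sum node over two product
# nodes, each a product of categorical leaves.
COMPONENTS = [
    (0.5, {"San": 0.95, "New": 0.05}, {"Francisco": 0.95, "York": 0.05}),
    (0.5, {"San": 0.05, "New": 0.95}, {"Francisco": 0.05, "York": 0.95}),
]


def circuit_joint(t1, t2):
    """Joint probability of the token pair under the mixture."""
    return sum(w * p1[t1] * p2[t2] for w, p1, p2 in COMPONENTS)


def marginal_first(t1):
    return sum(w * p1[t1] for w, p1, _ in COMPONENTS)


def marginal_second(t2):
    return sum(w * p2[t2] for w, _, p2 in COMPONENTS)


def independent_joint(t1, t2):
    """Draft model that assumes the two tokens are independent."""
    return marginal_first(t1) * marginal_second(t2)


def conditional_second(t2, t1):
    """P(second token | first token) under the mixture."""
    return circuit_joint(t1, t2) / marginal_first(t1)


p_dependent = conditional_second("Francisco", "San")
p_good_pair = circuit_joint("San", "Francisco")
p_bad_pair = circuit_joint("San", "York")
p_independent = independent_joint("San", "York")
```

The mixture assigns a high conditional probability to “Francisco” after “San”, while the independent draft rates “San York” just as likely as “San Francisco”, which is the kind of proposal a verifier would reject.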
The MTPC framework systematically explores a spectrum of PC architectures to balance expressiveness and latency:
Empirical evaluations retrofitting EvaByte, a byte-level LLM, with MTPC demonstrate the efficacy of this approach
In the domain of medical AI, “explainability” is not merely a desirable feature but a crucial safety requirement. Clinicians need to understand the rationale behind a model’s decision, for instance, why a specific scan was diagnosed as containing a tumor. A powerful technique for providing such insights is the generation of counterfactual explanations: answering the question, “What would this scan look like if the patient were healthy?”
However, generating plausible counterfactuals is challenging; standard methods often produce unrealistic images that trick the classifier but are medically meaningless. To address this, Siekiera et al. propose SPN-Guided Latent Space Manipulation, utilizing the rigorous density estimation capabilities of PCs to generate high-fidelity counterfactuals
The approach is built upon a hybrid architecture that integrates an SPN into the latent space of a semi-supervised Variational Autoencoder (VAE). First, a VAE is trained to compress high-dimensional medical images into a lower-dimensional latent space $\mathbf{z}$. Unlike standard VAEs, which assume a simple Gaussian prior and often fail to capture complex data manifolds, this method employs an SPN to model the true, complex distribution of latent vectors, $P_{\text{SPN}}(\mathbf{z} \mid y)$, where $y$ denotes the class label (e.g., “Healthy” or “Sick”). The SPN serves a dual purpose: it acts as a flexible prior describing the latent topology and as a classifier $P(y \mid \mathbf{z})$, ensuring the latent space is structured according to the diagnostic classes.
To generate a counterfactual for a patient diagnosed as “Sick” ($y_{\text{orig}}$), the model optimizes a new latent vector $\mathbf{z}_{\text{cf}}$ that flips the classification to “Healthy” ($y_{\text{target}}$). This optimization is governed by three competing objectives: validity, proximity, and plausibility. First, the model maximizes the SPN’s predicted probability for the target class, $P_{\text{SPN}}(y_{\text{target}} \mid \mathbf{z}_{\text{cf}})$, to ensure the diagnosis changes. Second, it minimizes the distance $\lVert \mathbf{z}_{\text{cf}} - \mathbf{z}_{\text{orig}} \rVert$ to guarantee that the counterfactual remains semantically similar to the original patient scan. Finally, and crucially, the optimization maximizes the likelihood of the vector under the SPN prior, $P_{\text{SPN}}(\mathbf{z}_{\text{cf}})$. This constraint forces the search into the high-density regions of the “Healthy” distribution, effectively preventing the generation of OOD or hallucinated samples.
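One way to combine the three objectives is a single weighted loss. The form below is an illustrative assumption, not necessarily the objective used by Siekiera et al.: the trade-off weights $\lambda_{\text{prox}}$ and $\lambda_{\text{plaus}}$, as well as the use of negative log-probabilities, are hypothetical choices.

\[\mathbf{z}_{\text{cf}}^{\ast} = \underset{\mathbf{z}}{\mathrm{argmin}} \; \Big[ \underbrace{-\log P_{\text{SPN}}(y_{\text{target}} \mid \mathbf{z})}_{\text{validity}} + \lambda_{\text{prox}} \underbrace{\lVert \mathbf{z} - \mathbf{z}_{\text{orig}} \rVert}_{\text{proximity}} - \lambda_{\text{plaus}} \underbrace{\log P_{\text{SPN}}(\mathbf{z})}_{\text{plausibility}} \Big]\]

Because the SPN is tractable, both the class posterior and the latent density in this expression are exact quantities, and the search can be run with standard gradient-based optimization starting from $\mathbf{z}_{\text{orig}}$.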
Experiments on the CheXpert dataset demonstrate that this SPN-guided approach produces anatomically plausible alterations, such as the specific removal of lung opacities, that are far more stable and interpretable than those generated by baseline DNN methods. While simple MLP classifiers require aggressive regularization to avoid adversarial noise, the SPN’s robust density modeling naturally guides the generation toward semantically meaningful counterfactuals without such fragile tuning.
To conclude, we have shown how recent advancements have positioned PCs as a scalable and powerful solution for UQ. These models not only represent a robust, tractable method for assessing uncertainty but also enable innovative applications across diverse fields, as highlighted by the examples above. We anticipate a particularly transformative role for PCs in the future, specifically their integration into live systems to boost reliability and provide deeper interpretability. We hope this overview stimulates further research and practical deployment of PCs for UQ. Please note that this post is far from a comprehensive survey of all works in this rapidly evolving field. It is merely a selection of key ideas and applications that we found particularly interesting and illustrative.