Bridging the Data Processing Inequality and Function-Space Variational Inference

This blog post explores the interplay between the Data Processing Inequality (DPI), a cornerstone concept in information theory, and Function-Space Variational Inference (FSVI) within the context of Bayesian deep learning. The DPI governs the transformation and flow of information through stochastic processes, and its unique connection to FSVI is employed to highlight FSVI's focus on Bayesian predictive posteriors over parameter space. The post examines various forms of the DPI, including the KL divergence based DPI, and provides intuitive examples and detailed proofs. It also explores the equality case of the DPI to gain a deeper understanding. The connection between DPI and FSVI is then established, showing how FSVI can measure a predictive divergence independent of parameter symmetries. The post relates FSVI to knowledge distillation and label entropy regularization, highlighting the practical relevance of the theoretical concepts. Throughout the post, theoretical concepts are intertwined with intuitive explanations and mathematical rigor, offering a comprehensive understanding of these complex topics. By examining these concepts in depth, the post provides valuable insights for both theory and practice in machine learning.

Introduction

In information theory, the data processing inequality (DPI) expresses a fundamental idea: processing data (stochastically) cannot increase information. The DPI provides us with a powerful intuition about what information processing systems can do and what the limitations of data processing are.

In this blog post, we first study the DPI, developing intuition through vivid examples and detailed proofs—especially the equality case, which is arguably the best way to understand inequalities. We will consider classic forms of the DPI as well as DPIs relating probability distributions more broadly. Then, we explore the intriguing connection between DPI and function-space variational inference (FSVI), a modern Bayesian deep learning technique that focuses on the Bayesian predictive posterior rather than the parameter space. Exploring this connection is important because it can provide new insights into FSVI on a fundamental level. We apply the DPI to recover several interesting results from the literature in a simple form and build intuitions for the relationship between parameter and functional priors.

Most importantly, we consider how FSVI can measure a predictive divergence between the approximate and true posterior which is independent of parameter symmetries. (With parameter symmetries, I refer to different parameters that yield the same predictions, which is very common in over-parameterized neural networks: think of parameter symmetries like different paths leading to the same destination; they might look different but end up at the same predictionsThanks to ChatGPT for this analogy! 🤗.) Explaining this connection is one of the main goals of this article and will help you understand the relationships between DPI, FSVI, and other deep learning methods. As a concrete example and application, we relate FSVI to training with knowledge distillation and label entropy regularization: potentially more meaningful priors than the ones usually used in Bayesian neural networksIn many papers, an isotropic Gaussian is used because of its simplicity. Indeed, there are better alternatives, see Fortuin et al (2022) and Fortuin (2022).. This connection highlights the practical relevance of the theoretical concepts discussed in this post and will hopefully inspire the reader to view Bayesian deep learning from a new point of view.

TL;DR

The following sections summarize the key takeaways of this blog post. If they don’t make sense, don’t worry: they will after reading this post.

Data Processing Inequality

The data processing inequality examines how information cannot increase due to processing. In information theory, it is usually stated based on a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and their mutual information. We will look at different data processing inequalities that relate different distributions instead of different random variables. However, the blog posts in particular looks at the DPI when formulated using Kullback-Leibler (KL) divergences between distributions. I will use “🥬 divergence” in headings to add a bit of color. 😊

Concretely, this KL DPI states that processing data stochastically can only reduce information. More formally:

That is, the KL divergence between \(\qof{Y}\) and \(\pof{Y}\) cannot be larger than the one between the original \(\qof{\W}\) and \(\pof{\W}\). Intuitively, the stochastic mapping \(\opf\) induces a bottleneck that reduces how well we can distinguish between \(\opp\) and \(\opq\). Finally we have equality when \(\Kale{\qof{\W \given Y}}{\pof{\W \given Y}} = 0\).

The paper “Understanding Variational Inference in Function-Space” by Burt et al. (2021) succinctly summarizes the DPI as follows:

The data processing inequality states that if two random variables are transformed in this way, they cannot become easier to tell apart.

Function-Space Variational Inference

Generally, variational inference is a powerful technique for approximating complex Bayesian posteriors with simpler distributions. In its usual form, it optimizes an approximate, variational distribution to match the Bayesian parameter posterior as closely as possible. This way, it transforms the problem of Bayesian inference into an optimization problem.

However, especially for deep neural networks, obtaining a good approximation of the parameter space can be difficult. One reason is the sheer size of the parameter space. Additionally, the parameterization of a neural network often contains many symmetries—different parameter configurations can lead to the same predictions of the model—that are not taken into account either.

Here, Function-space variational inference (FSVI) side-steps some of these restrictions by only requiring that the variational distribution matches the Bayesian predictive posterior:

If we define equivalence classes of model parameters according to the model predictions, FSVI aims to approximate the posterior distribution over equivalence classes of parameters \(\pof{[\w] \given \Dany}\) rather than the posterior over raw parameters \(\pof{\w \given \Dany}\). This is because \(\pof{[\w] \given \Dany}\) captures the meaningful aspects of the posterior, as it is invariant to the symmetries and equivalences in the parameter space that leave the likelihood unchanged.

To achieve this, FSVI employs an implicit variational distribution \(\qof{[\w]}\) and minimizes an objective that is equivalent to regular variational inference on the equivalence classes \(\Kale{\qof{[\W]}}{\pof{[\W] \given \Dany}}\).

By using an implicit variational distribution and leveraging the DPI, FSVI sidesteps the need to explicitly define the equivalence classes or specify a model that operates on them directly. Instead, FSVI relies on the fact that matching the predictive prior distributions in the limit of infinite data is sufficient to match the posteriors over equivalence classes overall. This leads to approximations we can connect to label entropy regularization and knowledge distillation.

Background: Information-Theoretic Notation

Information theory deals with the communication of informationSee the excellent "Visual Information Theory" by Chris Olah for a visual introduction to information theory.. In this blog post, we use a unified information-theoretic notation to express various quantities related to probability distributions and their relationshipsIt largely follows "A Practical & Unified Notation for Information-Theoretic Quantities in ML".. Here are some key concepts we will use:

The information content of an event \(x\) is denoted as \(\Hof{x}\) and is defined as \(-\log \pof{x}\). It represents the minimum amount of information needed to describe the occurrence of \(x\) given an underlying probability distribution. In machine learning, this information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset.

The entropy \(\Hof{X}\) of a random variable \(X\) is the expectation of its information content:

\[\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.\]

The entropy measures the average amount of information needed to describe the random variable \(X\). It provides a measure of uncertainty or randomness associated with \(X\). We can similarly define the entropy of a conditional distribution \(\Hof{X \given Y}\) and the joint entropy \(\Hof{X, Y}\).

The mutual information \(\MIof{X;Y}\) between two random variables \(X\) and \(Y\) is a measure of the amount of information that one random variable contains about the other. It is defined as:

\[\begin{aligned} \MIof{X;Y} & \triangleq \Hof{X} - \Hof{X \given Y} \\ &= \Hof{Y} - \Hof{Y \given X} \\ &= \Hof{X} + \Hof{Y} - \Hof{X, Y}. \end{aligned}\]

We will also use the Kullback-Leibler divergence \(\Kale{\pof{X}}{\qof{X}}\) and the cross-entropy \(\CrossEntropy{\pof{X}}{\qof{X}}\):

\[\begin{aligned} \CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\ \Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X} \end{aligned}\]

The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution \(\pof{X}\) using a different distribution \(\qof{X}\). The Kullback-Leibler divergence is a measure of the difference between two probability distributions and captures the additional bits needed to encode samples from \(\pof{X}\) compared to encoding them using the true distribution \(\qof{X}\).

Now that we have covered the notation, let’s delve into the data processing inequality.

Data Processing Inequality

The data processing inequality (DPI) is a fundamental inequality in information theory that states the mutual information between two random variables cannot increase through processing. The original DPI is typically stated for a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and relates the mutual information terms as follows:

\[\MIof{X;Y} \ge \MIof{X;Z}.\]

We can view \(\rightarrow\) as a processing or transition step that maps \(X\) to \(Y\) and \(Y\) to \(Z\), whereas the mapping can be deterministic or stochastic. The inequality tells us that processing the random variable \(X\) to obtain \(Y\) and further processing \(Y\) to obtain \(Z\) cannot increase the mutual information between \(X\) and \(Z\) compared to the mutual information between \(X\) and \(Y\).

The following three scenarios illustrate the data processing inequality using different mappings:

Example: Image Processing Pipeline

Consider an image processing pipeline with the following steps. Let:

\(X\) be the original image data;
\(Y\) be a compressed version of the image; and
\(Z\) be \(Y\) after adding blur and pixelation.

In this case, \(X\) has more mutual information with \(Y\) than with \(Z\). The compression reduces information, but the image is still recognizable. However, after the additional processing of blurring and pixelating, the mutual information between \(X\) and \(Z\) is further reduced. This gives an intuitive example of how additional processing on data reduces the mutual information with the original data. Each processing step results in some loss of information.

Example: Supervised Learning

Consider a supervised learning pipeline with the following steps. Let

\(X\) be the input features;
\(Y\) be the intermediate representations learned by the model; and
\(Z\) be the model predictions.

Here, \(X \rightarrow Y \rightarrow Z\) forms a Markov chain. The data processing inequality tells us that the mutual information between the inputs \(X\) and predictions \(Z\) cannot exceed the mutual information between the inputs \(X\) and intermediate representations \(Y\):

\[\MIof{X; Y} \geq \MIof{X; Z}.\]

This makes intuitive sense—the intermediate representations \(Y\) are obtained by processing the raw inputs \(X\), so they cannot contain more information about \(X\) than \(X\) itself. The predictions \(Z\) are obtained by further processing \(Y\), so additional information may be lost, reducing the mutual information with the original inputs \(X\).

As a more concrete example, consider an image classification model. Let:

\(X\) be the input images;
\(Y\) be the activations of the convolutional layers; and
\(Z\) be predicted image labels.

The convolutional layers will extract features from the input images, but cannot extract more information than present in the original images. The predicted labels are obtained by further processing these convolutional features, so may lose some fine-grained information about the original inputs.

Example: Autoencoders

An autoencoder compresses the input \(X\) into a latent code \(Y\) and then tries to reconstruct the original input from the code, producing \(\hat{X}\). Let:

\(X\) be the input;
\(Y\) be the latent code; and
\(\hat{X}\) be the reconstruction;

The data processing inequality tells us again:

\[\MIof{X; Y} \geq \MIof{X; \hat{X}}.\]

The latent code \(Y\) is obtained by compressing \(X\), so cannot contain more information. The reconstruction \(\hat{X}\) tries to recover \(X\) from \(Y\), but some information may be lost, reducing the mutual information with \(X\).

Intuitively, autoencoders try to preserve as much mutual information between inputs \(X\) and reconstructions \(\hat{X}\) as possible by learning latent representations \(Y\) that compress inputs without losing too much information. The data processing inequality quantifies this information bottleneck.

Proof of the DPI

The proof is simple and connects the DPI to another important inequality.

First we note that the Markov Chain implies the following factorization of the joint distribution:

\[\pof{x, y, z} = \pof{x} \pof{y \given x} \pof{z \given y}.\]

Using this factorization, we can express the mutual information terms:

\[\begin{aligned} \MIof{X;Y} &= \Hof{X} - \Hof{X \given Y} \\ &\ge \Hof{X} - \Hof{X \given Z} \\ &= \MIof{X;Z}. \end{aligned}\]

This relies on \(\Hof{X \given Y} \le \Hof{X \given Z}\). Why is this true?

We have the following chain of inequalities:

\[\Hof{X \given Y} = \underbrace{\MIof{X ; Z \given Y}}_{\overset{(1)}{=}0} + \Hof{X \given Y, Z} \overset{(2)}{\le} \Hof{X \given Z}.\]

(1) follows from the Markov chain property: when \(X \rightarrow Y \rightarrow Z\), \(X\) does not depend on \(Z\) at all when conditioned on \(Y\); and (2) follows from the fact that conditioning reduces entropy, i.e. \(\Hof{A \given B} \le \Hof{A}.\)

The equality gap \(\Hof{X \given Y, Z} - \Hof{X \given Z}\) corresponds to the mutual information \(\MIof{X ; Y \given Z}\). This mutual information measures the extra information about \(X\) contained in \(Y\) that is not already conveyed by \(Z\). It is zero if and only if \(X \rightarrow Z \rightarrow Y\) forms a Markov chain, indicating that \(Z\) is a sufficient statistic for \(X\).

Proof of (2) "Conditioning Reduces Entropy":

We can easily show that conditioning reduces entropy by using the non-negative property of the mutual information:

\(\begin{aligned} 0 &\le \Kale{\pof{X,Y}}{\pof{X}\pof{Y}} \\ &= \MIof{X;Y} \\ &= \Hof{X} - \Hof{X \given Y} \\ \implies \Hof{X \given Y} &\le \Hof{X}. \end{aligned}\)

The fact that conditioning reduces entropy, \(\Hof{X} \ge \Hof{X \given Y}\), is an important property by itself and is reminiscent of the data processing inequality. The conditional entropy \(\Hof{X \given Y}\) quantifies the remaining uncertainty about \(X\) after observing \(Y\). If \(X\) and \(Y\) are independent, then \(\Hof{X} = \Hof{X \given Y}\), as knowing \(Y\) does not provide any information about \(X\). On the other hand, if \(Y\) completely determines \(X\), then \(\Hof{X \given Y} = 0\), as there is no remaining uncertainty about \(X\) once \(Y\) is known. In general, conditioning can only reduce the uncertainty about \(X\), but it does not necessarily reduce it to zero.

Let’s move on and consider the KL data processing inequality.

🥬 Data Processing Inequality

A similar DPI can be expressed for different distributions \(\pof{x}\) and \(\qof{x}\) of the same random variable and the KL divergence between them. This DPI states that if we evolve two distributions using the same transition function, they cannot become less similar. The KL divergence is sometimes also referred to as “relative entropy”, so we could also call this the “relative data processing inequality”.

This can be formalized for distributions \(\pof{x}\) and \(\qof{x}\) and a stochastic transition function \(X \overset{\fof{y \given x}}{\longrightarrow} Y\). Here, we use that such a stochastic mapping \(Y = \fof{X}\) is equivalent to having a probability (density) \(\fof{y \given x}\):

\[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\pof{Y}}{\qof{Y}},\]

where \(\pof{y \given x} = \fof{y \given x} = \qof{y \given x}\). The marginals after the transition are \(\pof{y} = \E{\pof{x}}{\fof{y \given x}}\) and \(\qof{y} = \E{\qof{x}}{\fof{y \given x}}\), so more explicitly:

\[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\E{\pof{x}}{\fof{Y \given x}}}{\E{\qof{x}}{\fof{Y \given x}}}.\]

In their book Elements of Information Theory, Thomas and Cover describe this as “relative entropy never increases” and relate it to the second law of thermodynamics.

Example: Comparing Image Distributions

As an example, let:

\(\pof{x}\) be the true distribution of images in a dataset;
\(\qof{x}\) be a generative model that tries to mimic \(\pof{x}\); and
\(\fof{y \given x}\) be a function that thresholds images \(x\) into bilevel black and white images \(y\).

Then \(\pof{y}\) and \(\qof{y}\) will be more difficult to distinguish after the thresholding operation than \(\pof{x}\) and \(\qof{x}\). Converting to black and white images has lost information that could help distinguish the real and generated distributions.

This provides some intuition for why the KL divergence between distributions decreases under a shared stochastic mapping, as formalized by the KL data processing inequality. Processing through \(\fof{y \given x}\) makes the distributions harder to tell apart.

Counter-Example: Bayesian Inference

It might be inviting to think that this data processing inequality also applies to Bayesian inference, that is updating the model parameters based on new evidence. Then, we could argue that if two agents start with different prior beliefs but update based on the same evidence, their posterior beliefs will become more similar. However, this intuition is flawed: the data processing inequality does not apply to Bayesian inference.

Let’s walk through why. Consider:

\(\pof{\w}\) be an agent’s prior belief;
\(\qof{\w}\) be another agent’s different prior;
\(\pof{\w\given x}\) is the posterior after observing data \(x\); and
\(\qof{\w\given x}\) is the other agent’s posterior.

The priors \(\pof{\w}\) and \(\qof{\w}\) may have large divergence, representing very different initial beliefs. However, when conditioning on the same data \(x\), the KL divergence between \(\pof{\w \given x}\) and \(\qof{\w \given x}\) could increase or decrease—the data processing inequality does not give us any guarantee.

This is because \(\pof{\w}\) and \(\qof{\w}\) are not evolving under the same stochastic mapping. Rather, each prior is mapped to its respective posterior via Bayes’ rule, which operates differently on \(\opp\) and \(\opq\):

\[\begin{aligned} \pof{\w \given x} &= \frac{\pof{x \given \w}}{\pof{x}} \, \pof{\w}\\ \qof{\w \given x} &= \frac{\qof{x \given \w}}{\qof{x}} \, \qof{\w}. \end{aligned}\]

Even assuming that both agents have the same internal model, that is they use the same likelihood \(\pof{x \given \w} = \qof{x \given \w}\), the priors \(\pof{\w}\) and \(\qof{\w}\) will still influence the posterior distributions differently because they lead to different evidence terms \(\pof{x}\) and \(\qof{x}\):

\[\begin{aligned} \pof{x} &= \E{\pof{\w}}{\pof{x \given \w}}\\ \qof{x} &= \E{\qof{\w}}{\qof{x \given \w}}. \end{aligned}\]

Thus, the correct intuition is that observing the same data \(x\) does not necessarily bring the posterior beliefs closer together—they depend on the interplay between their specific priors and likelihoods. The data processing inequality does not directly apply to this Bayesian updating scenario:

\[\Kale{\qof{\W}}{\pof{\W}} {\color{red}{\not\ge}} \Kale{\qof{\W \given \mathcal{D}}}{\pof{\W \given \mathcal{D}}},\]

This counterexample highlights the importance of precisely understanding the assumptions underlying conceptual principles like the DPI. While the DPI provides insight about information dynamics in many cases, it does not universally apply, as exemplified here by Bayesian updating under different priors. As always, bear in mind that:

As we currently also seem to experience a world of increasing polarization, this counterexample might also serve as a reminder that different priors can lead to different beliefs, even when observing the same evidence. This is a fundamental aspect of Bayesian inference and the scientific method.

Proofs of the 🥬 DPI

We will prove this inequality in two different ways. First, we will develop a “brute-force” proof, and then we will look at a more elegant proof that follows Thomas and Cover. Importantly, we will also consider the equality case in detail.

Brute-force Proof

If \(\opp\) does not have support in \(\opq\), the inequality is trivially true because then \(\Kale{\pof{Y}}{\qof{Y}}=\infty\).

Thus, let’s now assume that \(\opp\) has support in \(\opq\). Then, we can brute-force using the definitions, starting from the cross-entropy:

\[\begin{aligned} \CrossEntropy{\pof{Y}}{\qof{Y}}&=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\pof{Y \given x}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\frac{\pof{x \given Y}\pof{Y}}{\pof{x}}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\CrossEntropy{\pof{Y}}{\pof{Y}}\\ &\overset{(1)}{=}\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\xHof{\pof{Y}}\\ &\overset{(2)}{\le}\CrossEntropy{\pof{X, Y}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(3)}{=}\CrossEntropy{\pof{X}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(4)}{=}\Kale{\pof{X}}{\qof{X}}+\xHof{\pof{Y}}\\ \iff \Kale{\pof{Y}}{\qof{Y}}&\le\Kale{\pof{X}}{\qof{X}}, \end{aligned}\]

where we have used (1) that the cross-entropy of a distribution with itself is just the entropy, (2) that the cross-entropy is convex and we can apply Jensen’s inequality, (3) that the RHS side of the cross-entropy does not depend on \(Y\) and we can trivially marginalize it out, and (4) that the definition of the Kullback-Leibler divergence is equivalent an (unnormalized) cross-entropy over a fraction.

This makes it difficult to extract the case for equality, however.

Equality Case

We have only one inequality in above proof, and it stems from applying Jensen’s inequality. Remembering the equality case for Jensen’s inequality, we recall:

For (2), this is sadly slightly more complex than it might seem on first glance. Let’s unwrap the term:

\[\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}} = \E{\pof{y}}{-\log \E{\pof{x \given y}}{\frac{\qof{x}}{\pof{x}}}}.\]

We take an expectation over \(\pof{y}\), so we need to look at almost all \(\pof{x \given y} \not= 0\) for (almost all) \(\pof{y} \not= 0\) separately to consider equality. \(-\log x\) is strictly convex—and thus not linear—so we need \(f(x) = \frac{\qof{X}}{\pof{X}}\) to be constant for any fixed \(y\) with \(\pof{y} \not= 0\)—only then have we equality in Jensen’s inequality.

In the following, I will limit myself to the discrete case to avoid having to deal with measure theoryI currently don't have a good 'toolbox' to express simple ideas cleanly in measure theory. I'm working on it.. To obtain equality, for all \(y\) with \(\pof{y} \not= 0\) (i.e. we have support) and for all \(x_1, x_2\) with \(\pof{x_1 \given y}, \pof{x_2 \given y} \not= 0\), we need \(\frac{\qof{x_1}}{\pof{x_1}} = \frac{\qof{x_2}}{\pof{x_2}}\). Equivalently (for the reader, why is then \(\pof{x_1} \not= 0?\)):

\[\begin{aligned} \frac{\qof{x_1}}{\pof{x_1}} &= \frac{\qof{x_2}}{\pof{x_2}} \\ \iff \qof{x_1} &= \frac{\qof{x_2}}{\pof{x_2}} \, \pof{x_1} \\ \end{aligned}\]

This means that \(\qof{x} = C_y \pof{x}\) piecewise for all \(x\) for which \(\pof{x \given y} \not= 0\) for some fixed \(y\) with \(\pof{y} \not= 0\). That is if we keep \(y\) fixed, all the \(x\) for which \(\pof{x \given y} \not= 0\) have the same constant factor \(C_y\). Then for all \(y\) with \(\pof{y} \not= 0\), we have equality and overall equality in (2).

If for any \(x\) there are multiple \(y\), e.g. \(y_1, y_2\) for which \(\pof{x \given y} \not= 0\), then we have \(C_{y_1} = C_{y_2}\).

As an example, at the simplest, if this is the case for all \(y\), then \(C_y = 1\) constant.

As a side-note, this is a great reason why we often require full support for distributions as we then can avoid these piecewise constant factors (and the headaches they might cause).

Simpler Elegant Proof

Thomas and Cover provide a beautifully simple proof:

What does this mean? Whereas \(\fof{y \given x}\) is the ‘forward’ transition function, \(\pof{x \given y}\) and \(\qof{x \given y}\) are the ‘backward’ transition functions. We only have equality when the backward transition functions are equal (almost everywhere).

The statement on equality is not very informative yet though, so we have to put in a bit more work. Again, this is written for the discrete case.

This time we explicitly use Bayes’ rule to connect the forward and backward transition functions. First, we have to fix \(y\) such that \(\pof{y} \not= 0\) (i.e. \(y\) is in the support of \(\pof{y}\)) and then \(\qof{y} \not=0\). We have:

\[\begin{aligned} \pof{x \given y} &= \qof{x \given y} \\ \overset{\text{ass. }\pof{y} \not= 0}{\iff} \frac{\fof{y \given x}\pof{x}}{\pof{y}} &= \frac{\fof{y \given x}\qof{x}}{\qof{y}} \\ \overset{\text{ass. }\fof{y \given x}\not= 0}{\iff} \frac{\pof{x}}{\pof{y}} &= \frac{\qof{x}}{\qof{y}} \\ \iff \pof{x} &= \frac{\pof{y}}{\qof{y}} \, \qof{x}. \end{aligned}\]

For a given \(y\) with \(\pof{y} \not=0\), for the equality case, we see that for all \(x\) with \(\fof{y \given x} \not= 0\), \(\pof{x}\) and \(\qof{x}\) have to be coupled via piecewise constant factors.

As another example, if \(\fof{y \given x} \not=0\) (has full support) for all possible \(x\), for the equality case we have \(\pof{x} = \qof{x}\).

Compared to the previous equality case, we went a bit deeper and rewrote the conditions to consider the ratios between \(x\) and \(y\). Note we could have shown the same thing in the “brute-force” proof, too.

Altogether, we have see that both \(x\) and \(y\) are modulated by the same constant factor between \(\pof{\cdot}\) and \(\qof{\cdot}\). Essentially, this tells us that we could split our support into unconnected sub-domains and examine each individually for the equality case.

Overall Statement

We have the following overall statement:

(\(\pof{x} \ll \qof{x}\) means that \(\qof{x} > 0\) implies \(\pof{x} > 0\), so the KL divergence is not \(\infty\).) But more precisely, for \(\pof{x} \ll \qof{x}\), we have equality when:

\[\forall y, \pof{y} \not= 0 \exists C_y \in \mathbb{R}_{> 0} \forall x, \fof{y \given x}\not=0\colon \pof{x} = C_y \, \qof{x}.\]

Other Data Processing Inequalities

Now, we can use these ideas to derive a few additional results and even close the circle to the original data processing inequality.

Jensen-Shannon Divergence

The KL divergence is not a metric: the triangle inequality does not hold, and it is not symmetric.

However, we can symmetrize it to obtain the Jensen-Shannon divergence (JSD). The JSD is defined as the mean of the two KL divergences of the two distributions from their average. In essence, it makes the KL divergence symmetric:

\[\begin{aligned} \fof{x} &= \frac{\pof{x} + \qof{x}}{2}\\ \JSD{\pof{x}}{\qof{x}} &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}. \end{aligned}\]

Similar approaches can be used to “symmetrize” other concepts; for example matrices: \(\frac{1}{2} A + \frac{1}{2} A^T\) is also symmetric by construction for any matrix \(A\).

The JSD is still not a metric, but the square root of the Jensen-Shannon divergence is symmetric and satisfies the triangle inequality and gives us the Jensen-Shannon distance, a metric.

JSD-DPI

We can also obtain a data processing inequality for the Jensen-Shannon divergence and the Jensen-Shannon distance:

The proof uses the KL data processing inequality:

\[\begin{aligned} \JSD{\pof{X}}{\qof{X}} &= \frac{1}{2} \Kale{\pof{X}}{\fof{X}} + \frac{1}{2} \Kale{\qof{X}}{\fof{X}}\\ &\ge \frac{1}{2} \Kale{\pof{Y}}{\fof{Y}} + \frac{1}{2} \Kale{\qof{Y}}{\fof{Y}}\\ &= \JSD{\pof{Y}}{\qof{Y}}. \end{aligned}\]

We verify \(\fof{y} = \frac{\pof{y} + \qof{y}}{2}\) is the average of \(\pof{y}\) and \(\qof{y}\):

\[\begin{aligned} \fof{y} &= \E{\fof{x}}{\fof{y \given x}}\\ &= \E{\frac{\pof{x}+\qof{x}}{2}}{\fof{y \given x}}\\ &= \frac{1}{2} \E{\pof{x}}{\fof{y \given x}} + \frac{1}{2} \E{\qof{x}}{\fof{y \given x}}\\ &= \frac{1}{2} \pof{y} + \frac{1}{2} \qof{y}. \end{aligned}\]

Finally, \(\pof{x}, \qof{x} \ll \fof{x}\), and the equality condition of the KL data processing inequality gives us:

\[\begin{aligned} &\Kale{\pof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \land \quad &\Kale{\qof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \iff &\pof{x \given y} = \fof{x \given y} \land \qof{x \given y} = \fof{x \given y}& \forall x,y \\ \iff &\pof{x \given y} = \qof{x \given y}& \forall x,y. \end{aligned}\]

Mutual Information

The JSD can also be expressed as a mutual information. For \(\begin{aligned} Z &\sim \mathrm{Bernoulli}(\frac{1}{2}) = \fof{Z} \\ X \given Z = 0 &\sim \pof{x}\\ X \given Z = 1 &\sim \qof{x}, \end{aligned}\)

we have:

\[\JSD{\pof{X}}{\qof{X}} = \MIof{X;Z}.\]

This follows from rewriting the mutual information as a KL divergence:

\[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given Z = z}}{\fof{X}}}\\ &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}\\ &= \JSD{\pof{X}}{\qof{X}}. \end{aligned}\]

We can generalize this to the Markov chain \(Z \rightarrow X \rightarrow Y\) with \(\fof{z, x, y} = \fof{z} \fof{x \given z} \fof{y \given x}\) for any distribution \(\fof{z}\):

\[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given z}}{\fof{X}}}\\ &\overset{(1)}{\ge} \E{\fof{z}} {\Kale{\fof{Y \given z}}{\fof{Y}}}\\ &= \Kale{\fof{Y \given Z}}{\fof{Y}}\\ &= \MIof{Y;Z}, \end{aligned}\]

where \((1)\) follows from the KL data processing inequality.

This is just the data processing inequality we presented initially. We have gone full circle!

The equality gap (Jensen gap) is \(\Kale{\fof{X \given Y, Z}}{\fof{X \given Y}}\), and we have equality when:

\[\begin{aligned} \Kale{\fof{X \given Y, Z}}{\fof{X \given Y}} &= 0\\ \iff \MIof{X;Z \given Y} &= 0. \end{aligned}\]

This is exactly when \(X\) is independent of \(Z\) given \(Y\). (\(Y\) is a sufficient statistic in that case.)

Function-Space Variational Inference

So far we’ve explored the foundational aspects of the data processing inequality (DPI) and its extended forms, in particular the KL data processing inequality. Through detailed derivations and intuitive examples, we’ve demonstrated how these inequalities can be applied, emphasizing their significance and limitations. Specifically, we’ve shown how the KL data processing inequality relates to the reduction in information as data is processed. The examples and counterexample have hopefully demonstrated the nuances of applying these inequalities in different contexts.

This exploration sets the stage for diving into function-space variational inference and building up a robust understanding of it, leveraging the insights gained about the DPI and its implications in Bayesian deep learning.

Problem Setting & Notation

In the following, we will consider a classification task with cross-entropy loss, and we will use the following the random variables and distributions:

\(\y\) is the label,
\(\x\) is the input,
\(\qof{\y \given \x}\) is the predictive distribution we want to learn,
\(\pdata{\y \given \x}\) is the data distribution,
\(\Dany\) is the (training) dataset, and
\(C\) is the number of classes.

The probabilistic model is:

\[\pof{\y, \w \given \x} = \pof{\y \given \x, \w} \, \pof{\w}.\]

As before, I use upper-case letters for random variables, which we take an expectation over, e.g. in the KL divergence, and lower-case letters when I’m referring to specific observations or values that could be substituted (with the exception of \(\Dany\)).

Chain Rule of the 🥬 Divergence & DPI

An important property of the KL divergence is the chain rule:

\[\begin{aligned} &\Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}} \\ &\quad = \sum_{i=1}^n \Kale{\qof{\Y_i \given \Y_{i-1}, ..., \Y_1}}{\pof{\Y_i \given \Y_{i-1}, ..., \Y_1}}. \end{aligned}\]

The chain rule yields a chain inequality for the DPI as well:

\[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} &\ge \Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}}\\ &\ge \Kale{\qof{\Y_{n-1},...,\Y_1}}{\pof{\Y_{n-1},...,\Y_1}}\\ &\ge \Kale{\qof{\Y_1}}{\pof{\Y_1}}, \end{aligned}\]

where we start from the KL DPI and then apply the chain rule.

Deriving the Functional ELBO

The DPI has an intriguing connection to FSVI. Let’s say we want to approximate a Bayesian posterior \(\pof{\w \given \Dany}\) with a variational distribution \(\qof{\w}\). In standard VI, we would minimize \(\Kale{\qof{\W}}{\pof{\W \given \Dany}}\) to match the variational distribution to the Bayesian posterior. Specifically:

\[\begin{aligned} &\Kale{\qof{\W}}{\pof{\W \given \Dany}} =\\ &\quad = \underbrace{\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}}_{\text{Evidence}\ \text{Bound}} + \log \pof{\Dany} \ge 0 \\ &\iff \underbrace{-\log \pof{\Dany}}_{=\xHof{\pof{\Dany}}} \le \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}. \end{aligned}\]

This is an information-theoretic evidence (upper) bound on the information content \(-\log \pof{\Dany}\) of the data \(\Dany\) under the variational distribution \(\qof{\w}\), which we can minimize as an objective to approximiate \(\pof{\w \given \Dany}\) via \(\qof{\w}\).

In more probability-theory inspired literature, the negative of this bound is called the evidence lower bound (ELBO) and is maximized.

Both the ELBO and the information-theoretic evidence upper-bound are equivalent, and we can use either objective, but the information-theoretic perspective is obviously superior 🙃 I’ll refer to this as evidence bound from now on.

In FSVI (with a caveat I detail below), we apply the DPI to the prior KL divergence term and obtain a “functional” version of the evidence bound:

\[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}, \end{aligned}\]

where \(\Y... \given \x...\) are (finite or infinite) sets of samples. That is, we do not only optimize marginal distributions but also joint distributions.

The resulting objective:

\[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \end{aligned}\]

is equal to the (negative) functional ELBO (fELBO) in “Functional variational Bayesian neural networks” by Sun et al. (2019)—with caveats that we discuss below.

Choosing the “Coreset” \(\x...\)

One important detail is the question of how to choose the \(\x...\):

Ideally, we want to choose them such that the DPI inequality is as tight as possible.

Given the chain inequality, it is obvious that the larger the set \(\x...\), the tighter the inequality will be. Hence, if we could choose an infinite set of points well, we might be able to get the tightest possible inequality. However, this might not be tractable, and in practice, it is often not.

Some works take a supremum over finite subsets of a certain size, essentially building a core-set as an approximation (Rudner et al., 2022a/b); others take an expectation over finite sets of input samples (Sun et al., 2019), which is not necessarily yielding the tightest inequality but provides an unbiased estimate; while again other works focus on finite datasets for which the all points can be taken into account (Klarner et al., 2023).

We will discuss the tightness of the inequality and the implications in the data limit below.

Focusing on the most important aspect of FSVI, we observe:

Application to Continual Learning

When we directly optimize the KL divergence on a finite input dataset, for example, we align \(\opq\) with the prior of \(\opp\) where it matters most: on the predictions of the observed data.

This is of particular interest in continual learning, where the prior for the next task is chosen to be the posterior from the previous task. In this case, the functional ELBO can be used to approximate the posterior of the previous model while incorporating new data.

For two great papers that are very readable and provide further insights, see “Continual Learning via Sequential Function-Space Variational Inference“ and “Tractable Function-Space Variational Inference in Bayesian Neural Networks“, both by Rudner et al. (2022).

Comparison to FSVI in the Literature

An important detail is that in above FSVI papers, the DPI is applied to the logits not the probabilities. That is, we have a distribution over logits, which adds an extra layer of uncertainty. Unlike the class probabilities, which we can also obtain from a deterministic neural network, thanks to the semantic interpretation of the outputs as distribution \(\pof{\y \given \x, \w}\), a deterministic neural network only provides us with deterministic logits and not a distribution over them. Only a (more) Bayesian approach can provide this.

In general, we have the following inequality chain that relates the KL divergence between the logits \(\L\), which is a vector unlike \(\Y\), and the KL divergence between the output probabilities for \(\Y\) to the KL divergence between the parameters \(\W\):

\[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} &\ge \Kale{\qof{\L...\given \x...}}{\pof{\L...\given \x...}} \\ &\ge \Kale{\qof{\Y...\given \x...}}{\pof{\Y...\given \x...}}. \end{aligned}\]

This follows that from applying the DPI twice, as the logits are functions of the weights and the weights are functions of the logits.

In practice, Rudner et al. (2022) and its continual learning follow-up linearize the logitsThe logits are the final activations of the neural network before applying the softmax function (in multi-class classification). They are not to be confused with the pre-logits, e.g. embeddings before the final linear layer. (similar to a Laplace approximation) and use the DPI to show (in their notation):

\[\mathbb{D}_{\mathrm{KL}}\left(q_{f(\cdot ; \boldsymbol{\Theta})} \| p_{f(\cdot ; \boldsymbol{\Theta})}\right) \leq \mathbb{D}_{\mathrm{KL}}\left(q_{\Theta} \| p_{\Theta}\right)\]

which in my notation is equivalent to the first application of the DPI above:

\[\Kale{\qof{\L...\given \x...}}{\pof{\L...\given \x...}} \le \Kale{\qof{\W}}{\pof{\W}}.\]

They maximize the fELBO objective:

\[\begin{aligned} \mathcal{F}\left(q_{\boldsymbol{\Theta}}\right) &=\mathbb{E}_{q_{f\left(\mathbf{x}_{\mathcal{D}} ; \boldsymbol{\Theta}\right)}}\left[\log p_{\mathbf{y} \mid f(\mathbf{X} ; \boldsymbol{\Theta})}\left(\mathbf{y}_{\mathcal{D}} \mid f\left(\mathbf{X}_{\mathcal{D}} ; \boldsymbol{\theta}\right)\right)\right]\\ &\quad -\sup _{\mathbf{X} \in \mathcal{X}_{\mathbb{N}}} \mathbb{D}_{\mathrm{KL}}\left(q_{f(\mathbf{X} ; \boldsymbol{\Theta})} \| p_{f(\mathbf{X} ; \boldsymbol{\Theta})}\right), \end{aligned}\]

which is equivalent to minimizing the information-theoretic objective:

\[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}},\]

if we choose the \(\x...\) to tighten the DPI inequality as much as possible (i.e. by “finding” the supremum).

Using the inequality chain from above, we can sandwich their objective between a regular (negative) ELBO and the (negative) functional ELBO, we have derived above:

\[\begin{aligned} &\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}} \\ &\quad \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}. \end{aligned}\]

Why are they using logits instead of probabilities? In practice, using the probabilities instead of logits when performing linearization is often cumbersome due to the non-linearity of the softmax functions, which requires Monte-Carlo sampling of the logits to obtain an approximation of the final probabilities. Furthermore, I speculate that sampling the logits can be more benign given that we often use ReLUs in the underlying neural networks. (Don’t quote me too strongly on this, though.)

Conceptually, this explains the derivation of their ELBO objective and also relates them to the ‘purer’ and simpler functional evidence bound derived above, but this raises the question of how these inequalities are different and what the gap between them tells us. Let’s address this question next.

The Equality Case and Equivalence Classes

When do we have equality? That is, when do we have:

\[\Kale{\qof{\W}}{\pof{\W}} = \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}?\]

And what does it tell us?

As we have seen in the first part of this post, we have equality in the DPI if and only:

\(\Kale{\qof{\W \given \Y..., \x...}}{\pof{\W \given \Y..., \x...}}=0\).

Given that we are trying to approximate the Bayesian posterior \(\pof{\w \given \Y..., \x...}\) using \(\qof{\w}\), this equality condition tells us that we would have to find the exact posterior for equality. Hence, it is unlikely that we will have equality in practice. From this, the next question immediately follows: what does this predictive prior term

\[\Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}\]

provides us with?

Another way to think about the gap between the two KL divergences is that one is parameter-based and the other one is not. This points to a deeper truth about overparameterized models used in deep learning:

The functional KL divergences won’t be affected by this as they are parameter-free and do not take into account the parameters of the model but only the predictions. The regular parameter-based KL divergence, however, would be affected by this—depending on the prior \(\pof{\w}\), they might express differences between the parameter distributions that have no effect on the outputs.

In other words, if the prior assigns different probability to otherwise equivalent parameters, this obviously changes the parameter posterior, while the outputs are invariant to these changes if the overall assigned probability to a given output remains the same.

For example, the paper “Deep Ensembles: A Loss Landscape Perspective” by Fort et al. (2020) examines the similarity of the predictions of models trained from different initializations and shows that the prediction space has a multi-modal loss landspace. In the language of FSVI, this is similar to analyzing the function-space distances between different models.

Equivalence Classes

Unless there are other considerations, it makes sense to use priors that assign the same density to parameters that are equivalent. Hence, for a given function \(\fof{\x ; \w}\), which determines the likelihood \(\pof{\y \given \x, \w} \triangleq \pof{y \given \fof{\x ; \w}}\), we can define an equivalence relation such that \(\w \sim \w'\) if and only if \(\fof{\x; \w} = \fof{\x; \w'}\) for all \(\x\). This equivalence relation partitions the parameter space into equivalence classes:

\[[\w] \triangleq \{\w' : \fof{x ; \w} = \fof{x ; \w} \quad \forall x \}.\]

A prior \(\pof{\w}\) induces a prior \(\hpof{[\w]}\) over the equivalence classes:

\[\hpof{[\w]} \triangleq \sum_{\w' \in [\w]} \pof{\w'}.\]

—or \(\int_{[\w]} \pof{\w'} \, d \w'\) for continuous \(\w\)—with the corresponding model:

\[\begin{aligned} \hpof{\y, [\w] \given \x} &\triangleq \hpof{\y \given \x, [\w]} \, \hpof{[\w]} \\ &= \pof{\y \given \x, \w} \, \hpof{[\w]}. \end{aligned}\]

Consistency

Importantly, the definition of the equivalence classes above is consistent with Bayesian inference:

This is easy to show with using Bayes’ rule:

\[\begin{aligned} \hpof{[\w] \given \Dany} &= \hpof{\Dany \given [\w]} \, \hpof{[\w]} / \hpof{\Dany} \\ &= \pof{\Dany \given \w} \sum_{\w' \in [\w]} \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\Dany \given \w'} \, \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany} \, \pof{\Dany} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany}. \end{aligned}\]

The last step follows from \(\hpof{\Dany}=\pof{\Dany}\):

\[\begin{aligned} \hpof{\Dany} &= \sum_{[\w]} \hpof{\Dany, [\w]} \\ &= \sum_{[\w]} \sum_{\w' \in [\w]} \pof{\Dany, \w'} \\ &= \sum_{\w'} \pof{\Dany, \w} \\ &= \pof{\Dany}. \end{aligned}\]

This also tells us that, for any \(\x\) and \(\y\):

\(\pof{\y... \given \x...} = \hpof{\y... \given \x...}\).

Given this consistency, we don’t have to differentiate between \(\hat\opp\) and \(\opp\) and can use \(\opp\) interchangeably. The same holds for \(\opq\).

These simple results deserve careful appreciation to fully internalize them.

Equality & Symmetries

We can view \([\w]\) as a projection from \(\w\) to its equivalence class \([\w]\). The DPI then gives us:

\[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{[\W]}}{\pof{[\W]}}.\]

And again: what does the gap between the two terms tell us?

Let’s look at a few examples to get a better understanding of this.

1. Trivial Constant Case

Let \(\fof{\x ; \w} = 0\) independent of any \(f\). Then \([\w] = [\w']\) for any \(\w\), \(\w'\).

For any approximate distribution \(\qof{\w}\), the induced \(\Kale{\qof{[\W]}}{\pof{[\W]}}=0\), while \(\Kale{\qof{\W}}{\pof{\W}}\) also includes superfluous divergence.

2. Unused Parameter

Let \(\y \given (\w_1, \w_2) = \w_1\) deterministic but independent of \(\w_2\). Then \([(\w_1, \w_2)] = [(\w_1, {\w'}_2)]\) for any \({\w'}_2\) and \([(\w_1,*)]\not=[({\w'}_1, *)]\) for any \(\w_1 \not= \w'_1\).

\(\Kale{\qof{[\W]}}{\pof{[\W]}}=\Kale{\qof{\W_1}}{\pof{\W_1}}\) captures the meaningful divergence between approximate and true distribution, while \(\Kale{\qof{\W}}{\pof{\W}}\) also includes any divergence across \(\w_2\) that has no effect on the predictions.

3. Periodic Parameter Space

Finally, let’s assume that the predictions are periodic in some way. That is, for example \(\y = \sin \w\). We then have \([\w] = [\w + 2\pi]\).

Further, let \(\pof{\w} = \operatorname{U}(\w; [0,2\pi \, N))\) for some \(N\) that determines the number of periods. Then, if we introduce another random variable \(K\), that captures which period we are in, we can (again) use the chain rule to write:

\[\begin{aligned} &\Kale{\qof{\W}}{\pof{\W}} \\ &\quad= \Kale{\qof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &\quad= \Kale{\qof{[\W]}}{\pof{[\W]}} \\ &\quad\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}}. \end{aligned}\]

This follows from the setup of this specific example. Finally, we have:

\[\Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \le \log N.\]

So, if \(\opq\) only had support in a single period for example, the difference between \(\Kale{\qof{\W}}{\pof{\W}}\) and \(\Kale{\qof{[\W]}}{\pof{[\W]}}\) would be \(\log N\): the redundancy.

Predictive Prior

How does the predictive prior term fit into this? The DPI again yields the answer:

This tells us that the predictive prior term can at best measure the KL divergence between the equivalence classes of the parameters—and not between the parameters itself—but luckily, this is the more meaningful divergence anyway!

For the equality cases, we observe that:

we need a 1:1 mapping between parameters and equivalence classes for the first bound to be tight, and
we need \(\Kale{\qof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}}{\pof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}} \to 0\) for \(n \to \infty\) for the second bound to be tight, following the deductions in the first part of this post.

For 2.: as we know from the chain rule that

\[\Kale{\qof{\Y_n,...\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...\Y_1\given\x_n,...,\x_1}}\]

is monotonically increasing in \(n\), and it is bounded by \(\Kale{\qof{[\W]}}{\pof{[\W]}}\) from above, it must convergeIt is a bounded monotonically increasing sequence.. So, when does it close the gap?

To give intuition that it might do that, and without attempting to prove this formally, we can appeal to Bernstein von Mises theorem, which states that the posterior distribution of the parameters converges to a Gaussian distribution with mean and variance given by the maximum likelihood estimate (MLE) as the number of data points tends to infinity as long as the model parameters are identifiable, that is the true parameters we want to learn are unique, and that they have support.

For the evidence bound to be meaningful, we already know that we need support of the approximate distribution \(\opq\) in the prior \(\opp\)—otherwise, the LHS is \(\infty\). Moreover, realizing that we take an expectation over \(\qof{\Y_n ,..., \Y_1 \given \x_n ,..., \x_1}\), we can decompose the KL term for the gap as:

\[\begin{aligned} &\Kale{\qof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}}{\pof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}} \\ &\quad = \E{\qof{\y_n,...,\y_1\given\x_n,...,\x_1}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}} \\ &\quad = \simpleE{\qof{[\w']}}{\E{\qof{\y_n,..,.\y_1\given\x_n,...,\x_1, [\w']}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}}}. \end{aligned}\]

That is, we sample a \([\w'] \sim \qof{[\w']}\) and then sample \(\y_n,...\y_1\given\x_n,...,\x_1\) from the corresponding \(\qof{\y_n,...\y_1\given\x_n,...,\x_1, [\w']}\) and marginalize over these. Crucially, \([\w']\) are the true parameters of the data-generating process for the inner KL divergence term. We thus take an expectation over KL terms fulfilling the conditions of the Bernstein von Mises theorem:

\[\begin{aligned} \Kale{\qof{[\W] \given \y_n,\x_1...\y_1, \x_1}}{\pof{[\W] \given \y_n,\x_1...\y_1, \x_1}} \to 0. \end{aligned}\]

In other words:

For a given \([w']\), in the space of equivalence classes as defined previously, the equivalence class of all MLE solutions in the data limit, \([MLE]\), will be unique by definition—the model is identifiable—and match \([\w']\)This follows from the consistency of MLE estimators but also from Berstein von Mises with a flat/uninformative prior..
As the MLE is prior-independent once there is support for it, both \(\opq\) and \(\opp\) will converge to the MLE \([\w']\) with sufficient data. Taking the expectation, this yields \(\Kale{\qof{[\W]\given \Y,..., \x...}}{\pof{[\W] \given \Y,..., \x...}} \to 0\) for \(n \to \infty\), and thus, we have:

\[\begin{aligned} & \Kale{\qof{[\W]}}{\pof{[\W]}} = \\ &\quad = \sup_{n\in \mathbb{N}} \Kale{\qof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}. \end{aligned}\]

(Again, this is not a formal proof but an intuition for why the gap might close in the data limit.)

In my opinion, this is a great result. We have shown both that the predictive prior term converges given our assumptions and that it converges to the symmetry-free parameter-based divergence in the data limit. This is a strong argument for the predictive prior term being meaningful and not just a technical trick.

Let’s appreciate one more thing: the predictive prior can consist of infinitely many data points and still converge to a finite value.

The FSVI Objective

The true objective of FSVI is to perform variational inference in the space of equivalence classes, approximating the posterior \(\pof{[\w] \given \Dany}\) that captures the meaningful aspects of the model while being invariant to parameter symmetries. This perspective reveals FSVI as a powerful and principled framework for Bayesian inference in overparameterized models, providing both conceptual and practical advantages over standard variational inference in parameter space.

This is because \(\pof{[\w] \given \Dany}\) captures the meaningful aspects of the posterior, as it is invariant to the symmetries and equivalences in the parameter space that leave the likelihood unchanged.

To achieve this, FSVI employs an implicit variational distribution \(\qof{[\w]}\) and minimizes the following objective, which is regular variational inference on the equivalence classes \(\Kale{\qof{[\W]}}{\pof{[\W] \given \Dany}}\), decomposed using the insights discussed:

\[\begin{aligned} &\Hof{\Dany} \le \Hof{\Dany} + \underbrace{\Kale{\qof{[\W]}}{\pof{[\W] \given \Dany}}}_{\ge 0} \\ &\quad = \Hof{\Dany} + \E{\qof{[\w]}}{-\log \pof{\Dany \given [\w]}} + \Kale{\qof{[\W]}}{\pof{[\W]}} - \Hof{\Dany}\\ &\quad = \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{[\W]}}{\pof{[\W]}} \\ &\quad = \sup_n \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y_n... \given \x_n...}}{\pof{\Y_n... \given \x_n...}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y_n... \given \x_n...}}{\pof{\Y_n... \given \x_n...}} \quad \forall n. \end{aligned}\]

This objective differs from the standard variational inference objective in a subtle but crucial way: it operates in the space of equivalence classes \([\w]\) rather than the raw parameter space \(\w\). The likelihood term \(\pof{\Dany \given [\w]}\) is well-defined because all parameters \(\w\) within an equivalence class \([\w]\) induce the same likelihood by construction, allowing us to replace it with the likelihood term \(\pof{\Dany \given \w}\).

The key insight is that FSVI leverages the data processing inequality (DPI) to avoid explicitly constructing the equivalence classes for the prior distribution or specifying a model that operates on them directly. By matching the predictive distributions via the predictive KL term, FSVI implicitly matches the priors over equivalence classes in the limit of infinite data, as shown in the previous section.

This has advantages as overparameterized models are often simpler to optimize. The empirical evidence for this is the success of deep learning models, which are overparameterized and trained with stochastic gradient descent.

This highlights the elegance and conceptual clarity of FSVI:

Parameter Priors vs. Predictive Priors

In Bayesian deep learning, we often use parameter priors that are not meaningful and which also do not take parameter symmetries into account. For example, a unit Gaussian prior over the parameters of a neural network does not induce different predictions for different parameters necessarily. While this prior can be sensible from a parameter compression perspective (e.g. see Hinton and van Camp (1993)), this does not have to be the only consideration guiding us.

With function priors and predictive priors, we can specify more meaningful priors because we can focus on the predictions and ignore the parameters. More importantly, this connects Bayesian approaches to data augmentation and other regularization techniques as we will see next.

Given that priors over equivalence classes are difficult to express explicitly though, using the DPI to obtain a functional ELBO can be an easier way to express and approximate them.

Label Entropy Regularization

All this also helps us gain a new perspective on label entropy regularization. The functional evidence bound can be lower-bounded using the chain rule by:

\[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \\ \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}, \end{aligned}\]

where we can expand the term under the second expectation to:

\[\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}=\CrossEntropy{\qof{\Y \given \x}}{\pof{\Y \given \x}} - \xHof{\qof{\Y \given \x}}.\]

Assuming that our prior yields a uniform distribution over the labels, we can drop the cross entropy term because it is constant and obtain:

\[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} - \E{\pdata{\x}}{\xHof{\qof{\Y \given \x}}}.\]

This is the same as an MLE minimization objective with an additional entropy regularization term \(-\xHof{\qof{\Y \given \x}}\) for different \(\x\) that prevents the model from overfitting to the labels and collapsing to the one-hot encoding of the labels.

Thus, in the simplest approximation, the DPI and functional variational inference give us a new perspective on label entropy regularization.

Knowledge Distillation

Obviously, assuming non-uniform prior predictions, \(\E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}\) can be related to knowledge distillation in deep neural networks as introduced by Hinton et al. (2015).

The main technical difference is that knowledge distillation is using the reverse KL divergence instead of the forward KL divergence, while the conceptual difference is that we are not distilling the knowledge from a teacher model but from the prior that we downweigh while also training our model on the data itself. However, the connection between knowledge distillation and continual learning using informative priors is manifest.

Conclusion

In this blog post, we took a deep dive into the data processing inequality (DPI) and its surprisingly far-reaching implications for modern Bayesian deep learning. By carefully examining the assumptions, equality conditions, and chain rule of the DPI, we arrived at an intuitive understanding of why function-space variational inference (FSVI) can be such a powerful tool. The DPI perspective shows how FSVI side-steps issues with high-dimensional parameter spaces by focusing on matching Bayesian predictive posteriors.

Reasoning about parameter equivalence classes under the lens of the DPI, we saw how predictive KL divergences can capture meaningful differences between models while ignoring superficial discrepancies due to symmetries. This provides a fresh perspective on the advantages of predictive priors over standard parameter priors commonly used in Bayesian neural networks.

A key insight that emerged is that FSVI’s true objective is to perform variational inference in the space of predictive equivalence classes, approximating the posterior \(\pof{[\w] \given \Dany}\) that captures the meaningful aspects of the model while being invariant to parameter symmetries. By employing an implicit variational distribution and leveraging the DPI, FSVI sidesteps the need to explicitly define equivalence classes, relying instead on the fact that matching predictive distributions is sufficient to match the posteriors over equivalence classes in the infinite data limit. This reveals FSVI as a principled and elegant approach to Bayesian inference in overparameterized models.

While our treatment only scratched the surface of the full mathematical story, the intuitions we developed allowed us to re-derive key results from the literature and uncover deep connections between seemingly disparate methods like entropy regularization, continual learning, and knowledge distillation. The examples and proofs peppered throughout solidified the core concepts.

More than a bag of technical tricks, the DPI reveals itself to be a powerful conceptual tool for reasoning about models, objectives, and algorithms. I hope this post inspires the reader to seek the fundamental principles underpinning machine learning innovations and to use those principles as a guide for future research. With a solid grasp of foundational tools like the DPI, we can all contribute to demystifying and unifying the rapidly evolving field of Bayesian deep learning.

Acknowledgements. Many thanks to Freddie Bickford Smith for very helpful comments and feedback on this post and to Tim Rudner for additional pointers to relevant literature and feedback on the FSVI section in particular 🤗 GPT-4 and Claude 3 Opus were used for editing.