This blog post explores the interplay between the Data Processing Inequality (DPI), a cornerstone concept in information theory, and Function-Space Variational Inference (FSVI) within the context of Bayesian deep learning. The DPI governs how information can be transformed and degraded by stochastic processing, and its connection to FSVI is used to highlight FSVI's focus on the Bayesian predictive posterior rather than the parameter posterior. The post examines various forms of the DPI, including the KL-divergence-based DPI, and provides intuitive examples and detailed proofs. It also explores the equality case of the DPI to gain a deeper understanding. The connection between the DPI and FSVI is then established, showing how FSVI measures a predictive divergence that is independent of parameter symmetries. The post relates FSVI to knowledge distillation and label entropy regularization, highlighting the practical relevance of the theoretical concepts. Throughout, theoretical concepts are intertwined with intuitive explanations and mathematical rigor, offering a comprehensive understanding of these topics for both theory and practice in machine learning.
In information theory, the data processing inequality (DPI) expresses a fundamental idea: processing data (stochastically) cannot increase information. The DPI provides us with a powerful intuition about what information processing systems can do and what the limitations of data processing are.
In this blog post, we first study the DPI, developing intuition through vivid examples and detailed proofs, especially the equality case, which is arguably the best way to understand inequalities. We will consider classic forms of the DPI as well as DPIs relating probability distributions more broadly. Then, we explore the intriguing connection between DPI and function-space variational inference (FSVI), a modern Bayesian deep learning technique that focuses on the Bayesian predictive posterior rather than the parameter space. Exploring this connection is important because it can provide new insights into FSVI on a fundamental level. We apply the DPI to recover several interesting results from the literature in a simple form and build intuitions for the relationship between parameter and functional priors.
Most importantly, we consider how FSVI can measure a predictive divergence between the approximate and true posterior that is independent of parameter symmetries. (By parameter symmetries, I mean different parameters that yield the same predictions, which is very common in over-parameterized neural networks: think of parameter symmetries like different paths leading to the same destination; they might look different but end up at the same predictions.)
The following sections summarize the key takeaways of this blog post. If they don't make sense, don't worry: they will after reading this post.
The data processing inequality examines how information cannot increase due to processing. In information theory, it is usually stated for a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and their mutual information. We will look at different data processing inequalities that relate different distributions instead of different random variables. In particular, this blog post looks at the DPI when formulated using Kullback-Leibler (KL) divergences between distributions. I will use "🥬 divergence" in headings to add a bit of color. 😄
Concretely, this KL DPI states that processing data stochastically can only reduce information. More formally:
That is, the KL divergence between \(\qof{Y}\) and \(\pof{Y}\) cannot be larger than the one between the original \(\qof{\W}\) and \(\pof{\W}\). Intuitively, the stochastic mapping \(\opf\) induces a bottleneck that reduces how well we can distinguish between \(\opp\) and \(\opq\). Finally, we have equality when \(\Kale{\qof{\W \given Y}}{\pof{\W \given Y}} = 0\).
The paper "Understanding Variational Inference in Function-Space" by Burt et al. (2021)
The data processing inequality states that if two random variables are transformed in this way, they cannot become easier to tell apart.
Generally, variational inference is a powerful technique for approximating complex Bayesian posteriors with simpler distributions. In its usual form, it optimizes an approximate, variational distribution to match the Bayesian parameter posterior as closely as possible. This way, it transforms the problem of Bayesian inference into an optimization problem.
However, especially for deep neural networks, obtaining a good approximation of the parameter space can be difficult. One reason is the sheer size of the parameter space. Additionally, the parameterization of a neural network often contains many symmetries (different parameter configurations can lead to the same predictions of the model) that are not taken into account either.
Here, function-space variational inference (FSVI) side-steps some of these restrictions by only requiring that the variational distribution matches the Bayesian predictive posterior:
If we define equivalence classes of model parameters according to the model predictions, FSVI aims to approximate the posterior distribution over equivalence classes of parameters \(\pof{[\w] \given \Dany}\) rather than the posterior over raw parameters \(\pof{\w \given \Dany}\). This is because \(\pof{[\w] \given \Dany}\) captures the meaningful aspects of the posterior, as it is invariant to the symmetries and equivalences in the parameter space that leave the likelihood unchanged.
To achieve this, FSVI employs an implicit variational distribution \(\qof{[\w]}\) and minimizes an objective that is equivalent to regular variational inference on the equivalence classes \(\Kale{\qof{[\W]}}{\pof{[\W] \given \Dany}}\).
By using an implicit variational distribution and leveraging the DPI, FSVI sidesteps the need to explicitly define the equivalence classes or specify a model that operates on them directly. Instead, FSVI relies on the fact that matching the predictive prior distributions in the limit of infinite data is sufficient to match the posteriors over equivalence classes overall. This leads to approximations we can connect to label entropy regularization and knowledge distillation.
Information theory deals with the communication of information.
The information content of an event \(x\) is denoted as \(\Hof{x}\) and is defined as \(-\log \pof{x}\). It represents the minimum amount of information needed to describe the occurrence of \(x\) given an underlying probability distribution. In machine learning, this information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset.
The entropy \(\Hof{X}\) of a random variable \(X\) is the expectation of its information content:
\[\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.\]The entropy measures the average amount of information needed to describe the random variable \(X\). It provides a measure of uncertainty or randomness associated with \(X\). We can similarly define the entropy of a conditional distribution \(\Hof{X \given Y}\) and the joint entropy \(\Hof{X, Y}\).
The mutual information \(\MIof{X;Y}\) between two random variables \(X\) and \(Y\) is a measure of the amount of information that one random variable contains about the other. It is defined as:
\[\begin{aligned} \MIof{X;Y} & \triangleq \Hof{X} - \Hof{X \given Y} \\ &= \Hof{Y} - \Hof{Y \given X} \\ &= \Hof{X} + \Hof{Y} - \Hof{X, Y}. \end{aligned}\]We will also use the Kullback-Leibler divergence \(\Kale{\pof{X}}{\qof{X}}\) and the cross-entropy \(\CrossEntropy{\pof{X}}{\qof{X}}\):
\[\begin{aligned} \CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\ \Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X} \end{aligned}\]The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution \(\pof{X}\) using a different distribution \(\qof{X}\). The Kullback-Leibler divergence is a measure of the difference between two probability distributions: it captures the additional bits needed to encode samples from \(\pof{X}\) with a code optimized for \(\qof{X}\) instead of the true distribution \(\pof{X}\).
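To make these quantities concrete, here is a minimal numpy sketch (my own illustration, assuming discrete distributions represented as probability vectors and natural logarithms, so everything is measured in nats):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    return cross_entropy(p, q) - entropy(p)

def mutual_information(p_xy):
    # I(X;Y) = KL(p(x, y) || p(x) p(y)) for a joint given as a 2D array.
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl(p_xy.ravel(), (p_x * p_y).ravel())

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(entropy(p), cross_entropy(p, q), kl(p, q))  # kl(p, q) >= 0

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
print(mutual_information(p_xy))  # = H(X) - H(X|Y) >= 0
```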
Now that we have covered the notation, let's delve into the data processing inequality.
The data processing inequality (DPI) is a fundamental inequality in information theory that states the mutual information between two random variables cannot increase through processing. The original DPI is typically stated for a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and relates the mutual information terms as follows:
\[\MIof{X;Y} \ge \MIof{X;Z}.\]We can view \(\rightarrow\) as a processing or transition step that maps \(X\) to \(Y\) and \(Y\) to \(Z\), where the mapping can be deterministic or stochastic. The inequality tells us that processing the random variable \(X\) to obtain \(Y\) and further processing \(Y\) to obtain \(Z\) cannot increase the mutual information between \(X\) and \(Z\) compared to the mutual information between \(X\) and \(Y\).
The following three scenarios illustrate the data processing inequality using different mappings:
Consider an image processing pipeline with the following steps. Let:
In this case, \(X\) has more mutual information with \(Y\) than with \(Z\). The compression reduces information, but the image is still recognizable. However, after the additional processing of blurring and pixelating, the mutual information between \(X\) and \(Z\) is further reduced. This gives an intuitive example of how additional processing on data reduces the mutual information with the original data. Each processing step results in some loss of information.
Consider a supervised learning pipeline with the following steps. Let:
Here, \(X \rightarrow Y \rightarrow Z\) forms a Markov chain. The data processing inequality tells us that the mutual information between the inputs \(X\) and predictions \(Z\) cannot exceed the mutual information between the inputs \(X\) and intermediate representations \(Y\):
\[\MIof{X; Y} \geq \MIof{X; Z}.\]This makes intuitive sense: the intermediate representations \(Y\) are obtained by processing the raw inputs \(X\), so they cannot contain more information about \(X\) than \(X\) itself. The predictions \(Z\) are obtained by further processing \(Y\), so additional information may be lost, reducing the mutual information with the original inputs \(X\).
As a more concrete example, consider an image classification model. Let:
The convolutional layers will extract features from the input images, but cannot extract more information than is present in the original images. The predicted labels are obtained by further processing these convolutional features, so they may lose some fine-grained information about the original inputs.
An autoencoder compresses the input \(X\) into a latent code \(Y\) and then tries to reconstruct the original input from the code, producing \(\hat{X}\). Let:
The data processing inequality tells us again:
\[\MIof{X; Y} \geq \MIof{X; \hat{X}}.\]The latent code \(Y\) is obtained by compressing \(X\), so it cannot contain more information about \(X\) than \(X\) itself. The reconstruction \(\hat{X}\) tries to recover \(X\) from \(Y\), but some information may be lost, reducing the mutual information with \(X\).
Intuitively, autoencoders try to preserve as much mutual information between inputs \(X\) and reconstructions \(\hat{X}\) as possible by learning latent representations \(Y\) that compress inputs without losing too much information. The data processing inequality quantifies this information bottleneck.
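Before moving to the proof, here is a quick numerical sanity check of \(\MIof{X;Y} \ge \MIof{X;Z}\) for a small discrete Markov chain with randomly drawn stochastic transition matrices (a sketch of my own; the helper function is analogous to the ones above):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(rows, cols):
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)  # rows are conditional distributions

def mutual_information(p_joint):
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a * p_b)[mask]))

p_x = rng.dirichlet(np.ones(4))          # distribution over X
p_y_given_x = random_stochastic(4, 3)    # processing X -> Y
p_z_given_y = random_stochastic(3, 2)    # processing Y -> Z

p_xy = p_x[:, None] * p_y_given_x        # joint p(x, y)
p_xz = p_xy @ p_z_given_y                # joint p(x, z), since Z depends only on Y

print(mutual_information(p_xy), ">=", mutual_information(p_xz))
```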
The proof is simple and connects the DPI to another important inequality.
First, we note that the Markov chain implies the following factorization of the joint distribution:
\[\pof{x, y, z} = \pof{x} \pof{y \given x} \pof{z \given y}.\]Using this factorization, we can express the mutual information terms:
\[\begin{aligned} \MIof{X;Y} &= \Hof{X} - \Hof{X \given Y} \\ &\ge \Hof{X} - \Hof{X \given Z} \\ &= \MIof{X;Z}. \end{aligned}\]This relies on \(\Hof{X \given Y} \le \Hof{X \given Z}\). Why is this true?
We have the following chain of inequalities:
\[\Hof{X \given Y} = \underbrace{\MIof{X ; Z \given Y}}_{\overset{(1)}{=}0} + \Hof{X \given Y, Z} \overset{(2)}{\le} \Hof{X \given Z}.\](1) follows from the Markov chain property: when \(X \rightarrow Y \rightarrow Z\), \(X\) does not depend on \(Z\) at all when conditioned on \(Y\); and (2) follows from the fact that conditioning reduces entropy, i.e. \(\Hof{A \given B} \le \Hof{A}.\)
The equality gap \(\Hof{X \given Z} - \Hof{X \given Y, Z}\) corresponds to the mutual information \(\MIof{X ; Y \given Z}\). This mutual information measures the extra information about \(X\) contained in \(Y\) that is not already conveyed by \(Z\). It is zero if and only if \(X \rightarrow Z \rightarrow Y\) forms a Markov chain, indicating that \(Z\) is a sufficient statistic for \(X\).
We can easily show that conditioning reduces entropy by using the non-negativity of the mutual information:
\(\begin{aligned} 0 &\le \Kale{\pof{X,Y}}{\pof{X}\pof{Y}} \\ &= \MIof{X;Y} \\ &= \Hof{X} - \Hof{X \given Y} \\ \implies \Hof{X \given Y} &\le \Hof{X}. \end{aligned}\)
The fact that conditioning reduces entropy, \(\Hof{X} \ge \Hof{X \given Y}\), is an important property by itself and is reminiscent of the data processing inequality. The conditional entropy \(\Hof{X \given Y}\) quantifies the remaining uncertainty about \(X\) after observing \(Y\). If \(X\) and \(Y\) are independent, then \(\Hof{X} = \Hof{X \given Y}\), as knowing \(Y\) does not provide any information about \(X\). On the other hand, if \(Y\) completely determines \(X\), then \(\Hof{X \given Y} = 0\), as there is no remaining uncertainty about \(X\) once \(Y\) is known. In general, conditioning can only reduce the uncertainty about \(X\), but it does not necessarily reduce it to zero.
Let's move on and consider the KL data processing inequality.
A similar DPI can be expressed for different distributions \(\pof{x}\) and \(\qof{x}\) of the same random variable and the KL divergence between them. This DPI states that if we evolve two distributions using the same transition function, they cannot become less similar. The KL divergence is sometimes also referred to as "relative entropy", so we could also call this the "relative data processing inequality".
This can be formalized for distributions \(\pof{x}\) and \(\qof{x}\) and a stochastic transition function \(X \overset{\fof{y \given x}}{\longrightarrow} Y\). Here, we use that such a stochastic mapping \(Y = \fof{X}\) is equivalent to having a probability (density) \(\fof{y \given x}\):
\[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\pof{Y}}{\qof{Y}},\]where \(\pof{y \given x} = \fof{y \given x} = \qof{y \given x}\). The marginals after the transition are \(\pof{y} = \E{\pof{x}}{\fof{y \given x}}\) and \(\qof{y} = \E{\qof{x}}{\fof{y \given x}}\), so more explicitly:
\[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\E{\pof{x}}{\fof{Y \given x}}}{\E{\qof{x}}{\fof{Y \given x}}}.\]In their book Elements of Information Theory, Cover and Thomas describe this as "relative entropy never increases" and relate it to the second law of thermodynamics.
As an example, let:
Then \(\pof{y}\) and \(\qof{y}\) will be more difficult to distinguish after the thresholding operation than \(\pof{x}\) and \(\qof{x}\). Converting to black and white images has lost information that could help distinguish the real and generated distributions.
This provides some intuition for why the KL divergence between distributions cannot increase under a shared stochastic mapping, as formalized by the KL data processing inequality. Processing through \(\fof{y \given x}\) makes the distributions harder to tell apart.
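As a small numerical illustration of this (my own sketch, assuming discrete distributions), we can push two random distributions through the same stochastic channel and check that their KL divergence does not increase:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p_x = rng.dirichlet(np.ones(5))
q_x = rng.dirichlet(np.ones(5))

f_y_given_x = rng.random((5, 3))
f_y_given_x /= f_y_given_x.sum(axis=1, keepdims=True)   # shared channel f(y | x)

p_y = p_x @ f_y_given_x   # p(y) = E_{p(x)}[f(y|x)]
q_y = q_x @ f_y_given_x   # q(y) = E_{q(x)}[f(y|x)]

print(kl(p_x, q_x), ">=", kl(p_y, q_y))
```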
It might be tempting to think that this data processing inequality also applies to Bayesian inference, that is, updating the model parameters based on new evidence. Then, we could argue that if two agents start with different prior beliefs but update based on the same evidence, their posterior beliefs will become more similar. However, this intuition is flawed: the data processing inequality does not apply to Bayesian inference.
Let's walk through why. Consider:
The priors \(\pof{\w}\) and \(\qof{\w}\) may have large divergence, representing very different initial beliefs. However, when conditioning on the same data \(x\), the KL divergence between \(\pof{\w \given x}\) and \(\qof{\w \given x}\) could increase or decreaseâthe data processing inequality does not give us any guarantee.
This is because \(\pof{\w}\) and \(\qof{\w}\) are not evolving under the same stochastic mapping. Rather, each prior is mapped to its respective posterior via Bayesâ rule, which operates differently on \(\opp\) and \(\opq\):
\[\begin{aligned} \pof{\w \given x} &= \frac{\pof{x \given \w}}{\pof{x}} \, \pof{\w}\\ \qof{\w \given x} &= \frac{\qof{x \given \w}}{\qof{x}} \, \qof{\w}. \end{aligned}\]Even assuming that both agents have the same internal model, that is they use the same likelihood \(\pof{x \given \w} = \qof{x \given \w}\), the priors \(\pof{\w}\) and \(\qof{\w}\) will still influence the posterior distributions differently because they lead to different evidence terms \(\pof{x}\) and \(\qof{x}\):
\[\begin{aligned} \pof{x} &= \E{\pof{\w}}{\pof{x \given \w}}\\ \qof{x} &= \E{\qof{\w}}{\qof{x \given \w}}. \end{aligned}\]Thus, the correct intuition is that observing the same data \(x\) does not necessarily bring the posterior beliefs closer togetherâthey depend on the interplay between their specific priors and likelihoods. The data processing inequality does not directly apply to this Bayesian updating scenario:
\[\Kale{\qof{\W}}{\pof{\W}} {\color{red}{\not\ge}} \Kale{\qof{\W \given \mathcal{D}}}{\pof{\W \given \mathcal{D}}}.\]This counterexample highlights the importance of precisely understanding the assumptions underlying conceptual principles like the DPI. While the DPI provides insight about information dynamics in many cases, it does not universally apply, as exemplified here by Bayesian updating under different priors. As always, bear in mind that:
As we currently also seem to experience a world of increasing polarization, this counterexample might also serve as a reminder that different priors can lead to different beliefs, even when observing the same evidence. This is a fundamental aspect of Bayesian inference and the scientific method.
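To make the counterexample concrete, here is a tiny numerical instance (of my own construction): two agents share the same likelihood but start from different priors, observe the same data point, and end up further apart in KL divergence than before:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

likelihood_x1 = np.array([0.5, 1.0])   # p(x=1 | w), shared by both agents

p_w = np.array([0.5, 0.5])             # agent p's prior over w in {0, 1}
q_w = np.array([0.9, 0.1])             # agent q's prior over w in {0, 1}

# Both agents observe x = 1 and apply Bayes' rule with the same likelihood.
p_w_post = p_w * likelihood_x1; p_w_post /= p_w_post.sum()
q_w_post = q_w * likelihood_x1; q_w_post /= q_w_post.sum()

print("prior KL(q || p):    ", kl(q_w, p_w))            # ~0.37 nats
print("posterior KL(q || p):", kl(q_w_post, p_w_post))  # ~0.50 nats -- larger!
```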
We will prove this inequality in two different ways. First, we will develop a "brute-force" proof, and then we will look at a more elegant proof that follows Cover and Thomas. Importantly, we will also consider the equality case in detail.
If \(\opp\) does not have support in \(\opq\), the inequality is trivially true because then \(\Kale{\pof{X}}{\qof{X}}=\infty\).
Thus, let's now assume that \(\opp\) has support in \(\opq\). Then, we can brute-force using the definitions, starting from the cross-entropy:
\[\begin{aligned} \CrossEntropy{\pof{Y}}{\qof{Y}}&=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\pof{Y \given x}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\frac{\pof{x \given Y}\pof{Y}}{\pof{x}}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\CrossEntropy{\pof{Y}}{\pof{Y}}\\ &\overset{(1)}{=}\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\xHof{\pof{Y}}\\ &\overset{(2)}{\le}\CrossEntropy{\pof{X, Y}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(3)}{=}\CrossEntropy{\pof{X}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(4)}{=}\Kale{\pof{X}}{\qof{X}}+\xHof{\pof{Y}}\\ \iff \Kale{\pof{Y}}{\qof{Y}}&\le\Kale{\pof{X}}{\qof{X}}, \end{aligned}\]where we have used (1) that the cross-entropy of a distribution with itself is just the entropy, (2) that the cross-entropy is convex and we can apply Jensen's inequality, (3) that the RHS of the cross-entropy does not depend on \(Y\) and we can trivially marginalize it out, and (4) that the definition of the Kullback-Leibler divergence is equivalent to an (unnormalized) cross-entropy over a fraction.
This makes it difficult to extract the case for equality, however.
We have only one inequality in the above proof, and it stems from applying Jensen's inequality. Remembering the equality case for Jensen's inequality, we recall:
For (2), this is sadly slightly more complex than it might seem at first glance. Let's unwrap the term:
\[\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}} = \E{\pof{y}}{-\log \E{\pof{x \given y}}{\frac{\qof{x}}{\pof{x}}}}.\]We take an expectation over \(\pof{y}\), so we need to consider equality separately for (almost) all \(y\) with \(\pof{y} \not= 0\) and the corresponding \(x\) with \(\pof{x \given y} \not= 0\). \(-\log x\) is strictly convex (and thus not linear), so we need the ratio \(\frac{\qof{x}}{\pof{x}}\) to be constant in \(x\) for any fixed \(y\) with \(\pof{y} \not= 0\); only then do we have equality in Jensen's inequality.
In the following, I will limit myself to the discrete case to avoid having to deal with measure theory.
This means that \(\qof{x} = C_y \pof{x}\) piecewise for all \(x\) for which \(\pof{x \given y} \not= 0\), for some fixed \(y\) with \(\pof{y} \not= 0\). That is, if we keep \(y\) fixed, all the \(x\) for which \(\pof{x \given y} \not= 0\) share the same constant factor \(C_y\). Then, for all \(y\) with \(\pof{y} \not= 0\), we have equality in Jensen's inequality and thus overall equality in (2).
If for any \(x\) there are multiple \(y\), e.g. \(y_1, y_2\) for which \(\pof{x \given y} \not= 0\), then we have \(C_{y_1} = C_{y_2}\).
As the simplest example, if the same constant applies for all \(y\), then \(C_y = 1\), since both distributions must sum to one.
As a side-note, this is a great reason why we often require full support for distributions as we then can avoid these piecewise constant factors (and the headaches they might cause).
Cover and Thomas provide a beautifully simple proof:
What does this mean? Whereas \(\fof{y \given x}\) is the âforwardâ transition function, \(\pof{x \given y}\) and \(\qof{x \given y}\) are the âbackwardâ transition functions. We only have equality when the backward transition functions are equal (almost everywhere).
The statement on equality is not very informative yet though, so we have to put in a bit more work. Again, this is written for the discrete case.
This time we explicitly use Bayes' rule to connect the forward and backward transition functions. First, we fix \(y\) such that \(\pof{y} \not= 0\) (i.e. \(y\) is in the support of \(\pof{y}\)); then also \(\qof{y} \not=0\). We have:
\[\begin{aligned} \pof{x \given y} &= \qof{x \given y} \\ \overset{\text{ass. }\pof{y} \not= 0}{\iff} \frac{\fof{y \given x}\pof{x}}{\pof{y}} &= \frac{\fof{y \given x}\qof{x}}{\qof{y}} \\ \overset{\text{ass. }\fof{y \given x}\not= 0}{\iff} \frac{\pof{x}}{\pof{y}} &= \frac{\qof{x}}{\qof{y}} \\ \iff \pof{x} &= \frac{\pof{y}}{\qof{y}} \, \qof{x}. \end{aligned}\]For a given \(y\) with \(\pof{y} \not=0\), the equality case thus requires that, for all \(x\) with \(\fof{y \given x} \not= 0\), \(\pof{x}\) and \(\qof{x}\) are coupled via the piecewise constant factor \(\frac{\pof{y}}{\qof{y}}\).
As another example, if \(\fof{y \given x} \not=0\) for all possible \(x\) (i.e. it has full support), the equality case requires \(\pof{x} = \qof{x}\).
Compared to the previous equality case, we went a bit deeper and rewrote the conditions to consider the ratios between \(x\) and \(y\). Note we could have shown the same thing in the "brute-force" proof, too.
Altogether, we have seen that both \(x\) and \(y\) are modulated by the same constant factor between \(\pof{\cdot}\) and \(\qof{\cdot}\). Essentially, this tells us that we could split our support into unconnected sub-domains and examine each individually for the equality case.
We have the following overall statement:
(\(\pof{x} \ll \qof{x}\) means that \(\pof{x} > 0\) implies \(\qof{x} > 0\), so the KL divergence is not \(\infty\).) But more precisely, for \(\pof{x} \ll \qof{x}\), we have equality when:
\[\forall y \text{ with } \pof{y} \not= 0 \; \exists C_y \in \mathbb{R}_{> 0} \; \forall x \text{ with } \fof{y \given x}\not=0\colon \pof{x} = C_y \, \qof{x}.\]Now, we can use these ideas to derive a few additional results and even close the circle to the original data processing inequality.
The KL divergence is not a metric: the triangle inequality does not hold, and it is not symmetric.
However, we can symmetrize it to obtain the Jensen-Shannon divergence (JSD). The JSD is defined as the mean of the two KL divergences of the two distributions from their average. In essence, it makes the KL divergence symmetric:
\[\begin{aligned} \fof{x} &= \frac{\pof{x} + \qof{x}}{2}\\ \JSD{\pof{x}}{\qof{x}} &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}. \end{aligned}\]Similar approaches can be used to "symmetrize" other concepts; for example matrices: \(\frac{1}{2} A + \frac{1}{2} A^T\) is also symmetric by construction for any matrix \(A\).
The JSD is still not a metric, but the square root of the Jensen-Shannon divergence is symmetric and satisfies the triangle inequality, giving us the Jensen-Shannon distance, a metric.
We can also obtain a data processing inequality for the Jensen-Shannon divergence and the Jensen-Shannon distance:
The proof uses the KL data processing inequality:
\[\begin{aligned} \JSD{\pof{X}}{\qof{X}} &= \frac{1}{2} \Kale{\pof{X}}{\fof{X}} + \frac{1}{2} \Kale{\qof{X}}{\fof{X}}\\ &\ge \frac{1}{2} \Kale{\pof{Y}}{\fof{Y}} + \frac{1}{2} \Kale{\qof{Y}}{\fof{Y}}\\ &= \JSD{\pof{Y}}{\qof{Y}}. \end{aligned}\]We verify \(\fof{y} = \frac{\pof{y} + \qof{y}}{2}\) is the average of \(\pof{y}\) and \(\qof{y}\):
\[\begin{aligned} \fof{y} &= \E{\fof{x}}{\fof{y \given x}}\\ &= \E{\frac{\pof{x}+\qof{x}}{2}}{\fof{y \given x}}\\ &= \frac{1}{2} \E{\pof{x}}{\fof{y \given x}} + \frac{1}{2} \E{\qof{x}}{\fof{y \given x}}\\ &= \frac{1}{2} \pof{y} + \frac{1}{2} \qof{y}. \end{aligned}\]Finally, \(\pof{x}, \qof{x} \ll \fof{x}\), and the equality condition of the KL data processing inequality gives us:
\[\begin{aligned} &\Kale{\pof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \land \quad &\Kale{\qof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \iff &\pof{x \given y} = \fof{x \given y} \land \qof{x \given y} = \fof{x \given y}& \forall x,y \\ \iff &\pof{x \given y} = \qof{x \given y}& \forall x,y. \end{aligned}\]The JSD can also be expressed as a mutual information. For \(\begin{aligned} Z &\sim \mathrm{Bernoulli}(\frac{1}{2}) = \fof{Z} \\ X \given Z = 0 &\sim \pof{x}\\ X \given Z = 1 &\sim \qof{x}, \end{aligned}\)
we have:
\[\JSD{\pof{X}}{\qof{X}} = \MIof{X;Z}.\]This follows from rewriting the mutual information as a KL divergence:
\[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given Z = z}}{\fof{X}}}\\ &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}\\ &= \JSD{\pof{X}}{\qof{X}}. \end{aligned}\]We can generalize this to the Markov chain \(Z \rightarrow X \rightarrow Y\) with \(\fof{z, x, y} = \fof{z} \fof{x \given z} \fof{y \given x}\) for any distribution \(\fof{z}\):
\[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given z}}{\fof{X}}}\\ &\overset{(1)}{\ge} \E{\fof{z}} {\Kale{\fof{Y \given z}}{\fof{Y}}}\\ &= \Kale{\fof{Y \given Z}}{\fof{Y}}\\ &= \MIof{Y;Z}, \end{aligned}\]where \((1)\) follows from the KL data processing inequality.
This is just the data processing inequality we presented initially. We have gone full circle!
The equality gap (Jensen gap) is \(\Kale{\fof{X \given Y, Z}}{\fof{X \given Y}}\), and we have equality when:
\[\begin{aligned} \Kale{\fof{X \given Y, Z}}{\fof{X \given Y}} &= 0\\ \iff \MIof{X;Z \given Y} &= 0. \end{aligned}\]This is exactly when \(X\) is independent of \(Z\) given \(Y\). (\(Y\) is a sufficient statistic in that case.)
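Here is a short numerical sketch (my own) checking both the JSD data processing inequality under a shared channel and the identity \(\JSD{\pof{X}}{\qof{X}} = \MIof{X;Z}\) for the Bernoulli mixture construction above:

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_x = rng.dirichlet(np.ones(4))
q_x = rng.dirichlet(np.ones(4))

f = rng.random((4, 3)); f /= f.sum(axis=1, keepdims=True)  # shared channel
print(jsd(p_x, q_x), ">=", jsd(p_x @ f, q_x @ f))           # JSD DPI

# JSD as mutual information: Z ~ Bernoulli(1/2), X|Z=0 ~ p, X|Z=1 ~ q.
joint_zx = np.stack([0.5 * p_x, 0.5 * q_x])                 # f(z, x)
marg_x = joint_zx.sum(axis=0)
mi = sum(0.5 * kl(joint_zx[z] / 0.5, marg_x) for z in range(2))
print(jsd(p_x, q_x), "==", mi)
```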
So far we've explored the foundational aspects of the data processing inequality (DPI) and its extended forms, in particular the KL data processing inequality. Through detailed derivations and intuitive examples, we've demonstrated how these inequalities can be applied, emphasizing their significance and limitations. Specifically, we've shown how the KL data processing inequality relates to the reduction in information as data is processed. The examples and counterexample have hopefully demonstrated the nuances of applying these inequalities in different contexts.
This exploration sets the stage for diving into function-space variational inference and building up a robust understanding of it, leveraging the insights gained about the DPI and its implications in Bayesian deep learning.
In the following, we will consider a classification task with cross-entropy loss, and we will use the following random variables and distributions:
The probabilistic model is:
\[\pof{\y, \w \given \x} = \pof{\y \given \x, \w} \, \pof{\w}.\]As before, I use upper-case letters for random variables, which we take an expectation over, e.g. in the KL divergence, and lower-case letters when I'm referring to specific observations or values that could be substituted (with the exception of \(\Dany\)).
An important property of the KL divergence is the chain rule:
\[\begin{aligned} &\Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}} \\ &\quad = \sum_{i=1}^n \Kale{\qof{\Y_i \given \Y_{i-1}, ..., \Y_1}}{\pof{\Y_i \given \Y_{i-1}, ..., \Y_1}}. \end{aligned}\]The chain rule yields a chain inequality for the DPI as well:
\[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} &\ge \Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}}\\ &\ge \Kale{\qof{\Y_{n-1},...,\Y_1}}{\pof{\Y_{n-1},...,\Y_1}}\\ &\ge \Kale{\qof{\Y_1}}{\pof{\Y_1}}, \end{aligned}\]where we start from the KL DPI and then apply the chain rule.
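The chain inequality can also be checked numerically. Below is a small discrete toy model (my own construction): a handful of parameter values \(\w\), a shared likelihood \(\pof{\y_i \given \x_i, \w}\) per input point, and two different distributions over \(\w\); the joint predictive KL grows with the number of points and stays below the parameter KL:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

n_w, n_y, n_points = 3, 2, 4
p_w = rng.dirichlet(np.ones(n_w))
q_w = rng.dirichlet(np.ones(n_w))
# likelihoods p(y_i | x_i, w), shared by p and q, one per input point x_i
lik = rng.dirichlet(np.ones(n_y), size=(n_points, n_w))   # shape (i, w, y)

def joint_predictive(weights, n):
    # marginal (y_1, ..., y_n | x_1, ..., x_n) = E_w[ prod_i p(y_i | x_i, w) ]
    probs = []
    for ys in product(range(n_y), repeat=n):
        per_w = np.prod([lik[i, :, y] for i, y in enumerate(ys)], axis=0)
        probs.append(np.sum(weights * per_w))
    return np.array(probs)

print("KL(q(W) || p(W)) =", kl(q_w, p_w))
for n in range(1, n_points + 1):
    print(f"KL over Y_1..Y_{n} =", kl(joint_predictive(q_w, n), joint_predictive(p_w, n)))
```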
The DPI has an intriguing connection to FSVI. Let's say we want to approximate a Bayesian posterior \(\pof{\w \given \Dany}\) with a variational distribution \(\qof{\w}\). In standard VI, we would minimize \(\Kale{\qof{\W}}{\pof{\W \given \Dany}}\) to match the variational distribution to the Bayesian posterior. Specifically:
\[\begin{aligned} &\Kale{\qof{\W}}{\pof{\W \given \Dany}} \\ &\quad = \underbrace{\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}}_{\text{Evidence}\ \text{Bound}} + \log \pof{\Dany} \ge 0 \\ &\iff \underbrace{-\log \pof{\Dany}}_{=\xHof{\pof{\Dany}}} \le \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}. \end{aligned}\]This is an information-theoretic evidence (upper) bound on the information content \(-\log \pof{\Dany}\) of the data \(\Dany\) under the variational distribution \(\qof{\w}\), which we can minimize as an objective to approximate \(\pof{\w \given \Dany}\) via \(\qof{\w}\).
In more probability-theory inspired literature, the negative of this bound is called the evidence lower bound (ELBO) and is maximized.
The ELBO and the information-theoretic evidence upper bound are equivalent, and we can use either objective, but the information-theoretic perspective is obviously superior 😄 I'll refer to this as the evidence bound from now on.
In FSVI (with a caveat I detail below), we apply the DPI to the prior KL divergence term and obtain a "functional" version of the evidence bound:
\[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}, \end{aligned}\]where \(\Y... \given \x...\) are (finite or infinite) sets of samples. That is, we do not only optimize marginal distributions but also joint distributions.
The resulting objective:
\[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \end{aligned}\]is equal to the (negative) functional ELBO (fELBO) in "Functional variational Bayesian neural networks" by Sun et al. (2019).
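To build intuition for the functional KL term, here is a fully closed-form toy (my own illustrative sketch, not the objective from the papers): 1D Bayesian linear regression, where both the parameter KL and the per-input predictive KL are KLs between Gaussians, and the DPI guarantees that the predictive KL never exceeds the parameter KL:

```python
import numpy as np

def kl_gauss(m_q, v_q, m_p, v_p):
    # KL( N(m_q, v_q) || N(m_p, v_p) ) for 1D Gaussians (v are variances)
    return 0.5 * (np.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)

m_q, v_q = 1.0, 0.2   # variational q(w)
m_p, v_p = 0.0, 1.0   # prior p(w)
noise = 0.1           # observation noise variance for y = w * x + eps

print("parameter KL:", kl_gauss(m_q, v_q, m_p, v_p))
for x in [0.1, 1.0, 3.0]:
    # predictive distribution of y | x under q and p: N(m * x, v * x^2 + noise)
    pred_kl = kl_gauss(m_q * x, v_q * x**2 + noise, m_p * x, v_p * x**2 + noise)
    print(f"predictive KL at x={x}:", pred_kl)
```

Note how inputs with larger \(|x|\) "see" more of the parameter divergence, which already hints at why the choice of evaluation points \(\x...\) matters, as discussed next.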
One important detail is the question of how to choose the \(\x...\):
Ideally, we want to choose them such that the DPI inequality is as tight as possible.
Given the chain inequality, it is obvious that the larger the set \(\x...\), the tighter the inequality will be. Hence, if we could choose an infinite set of points well, we might be able to get the tightest possible inequality. However, this might not be tractable, and in practice, it is often not.
Some works take a supremum over finite subsets of a certain size, essentially building a core-set as an approximation (Rudner et al., 2022a).
We will discuss the tightness of the inequality and the implications in the data limit below.
Focusing on the most important aspect of FSVI, we observe:
When we directly optimize the KL divergence on a finite input dataset, for example, we align \(\opq\) with the prior of \(\opp\) where it matters most: on the predictions of the observed data.
This is of particular interest in continual learning, where the prior for the next task is chosen to be the posterior from the previous task. In this case, the functional ELBO can be used to approximate the posterior of the previous model while incorporating new data.
For two great papers that are very readable and provide further insights, see "Continual Learning via Sequential Function-Space Variational Inference".
In practice, Rudner et al. (2022) and its continual learning follow-up linearize the logits,
which in my notation is equivalent to the first application of the DPI above:
\[\Kale{\qof{\L...\given \x...}}{\pof{\L...\given \x...}} \le \Kale{\qof{\W}}{\pof{\W}}.\]They maximize the fELBO objective:
\[\begin{aligned} \mathcal{F}\left(q_{\boldsymbol{\Theta}}\right) &=\mathbb{E}_{q_{f\left(\mathbf{x}_{\mathcal{D}} ; \boldsymbol{\Theta}\right)}}\left[\log p_{\mathbf{y} \mid f(\mathbf{X} ; \boldsymbol{\Theta})}\left(\mathbf{y}_{\mathcal{D}} \mid f\left(\mathbf{X}_{\mathcal{D}} ; \boldsymbol{\theta}\right)\right)\right]\\ &\quad -\sup _{\mathbf{X} \in \mathcal{X}_{\mathbb{N}}} \mathbb{D}_{\mathrm{KL}}\left(q_{f(\mathbf{X} ; \boldsymbol{\Theta})} \| p_{f(\mathbf{X} ; \boldsymbol{\Theta})}\right), \end{aligned}\]which is equivalent to minimizing the information-theoretic objective:
\[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}},\]if we choose the \(\x...\) to tighten the DPI inequality as much as possible (i.e. by âfindingâ the supremum).
Using the inequality chain from above, we can sandwich their objective between the regular (negative) ELBO and the (negative) functional ELBO we have derived above:
\[\begin{aligned} &\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}. \end{aligned}\]Why are they using logits instead of probabilities? In practice, using the probabilities instead of the logits when performing linearization is often cumbersome due to the non-linearity of the softmax function, which requires Monte-Carlo sampling of the logits to obtain an approximation of the final probabilities. Furthermore, I speculate that sampling the logits can be more benign given that we often use ReLUs in the underlying neural networks. (Don't quote me too strongly on this, though.)
Conceptually, this explains the derivation of their ELBO objective and also relates them to the "purer" and simpler functional evidence bound derived above, but this raises the question of how these inequalities are different and what the gap between them tells us. Let's address this question next.
When do we have equality? That is, when do we have:
\[\Kale{\qof{\W}}{\pof{\W}} = \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}?\]And what does it tell us?
As we have seen in the first part of this post, we have equality in the DPI if and only if:
\(\Kale{\qof{\W \given \Y..., \x...}}{\pof{\W \given \Y..., \x...}}=0\).
Given that we are trying to approximate the Bayesian posterior \(\pof{\w \given \Y..., \x...}\) using \(\qof{\w}\), this equality condition tells us that we would have to find the exact posterior for equality. Hence, it is unlikely that we will have equality in practice. From this, the next question immediately follows: what does this predictive prior term
\[\Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}\]provide us with?
Another way to think about the gap between the two KL divergences is that one is parameter-based and the other one is not. This points to a deeper truth about overparameterized models used in deep learning:
The functional KL divergences won't be affected by this as they are parameter-free: they do not take into account the parameters of the model but only the predictions. The regular parameter-based KL divergence, however, would be affected by this: depending on the prior \(\pof{\w}\), it might express differences between the parameter distributions that have no effect on the outputs.
In other words, if the prior assigns different probability to otherwise equivalent parameters, this obviously changes the parameter posterior, while the outputs are invariant to these changes if the overall assigned probability to a given output remains the same.
For example, the paper "Deep Ensembles: A Loss Landscape Perspective" by Fort et al. (2020)
Unless there are other considerations, it makes sense to use priors that assign the same density to parameters that are equivalent. Hence, for a given function \(\fof{\x ; \w}\), which determines the likelihood \(\pof{\y \given \x, \w} \triangleq \pof{\y \given \fof{\x ; \w}}\), we can define an equivalence relation such that \(\w \sim \w'\) if and only if \(\fof{\x; \w} = \fof{\x; \w'}\) for all \(\x\). This equivalence relation partitions the parameter space into equivalence classes:
\[[\w] \triangleq \{\w' : \fof{\x ; \w'} = \fof{\x ; \w} \quad \forall \x \}.\]A prior \(\pof{\w}\) induces a prior \(\hpof{[\w]}\) over the equivalence classes:
\[\hpof{[\w]} \triangleq \sum_{\w' \in [\w]} \pof{\w'},\]or \(\int_{[\w]} \pof{\w'} \, d \w'\) for continuous \(\w\), with the corresponding model:
\[\begin{aligned} \hpof{\y, [\w] \given \x} &\triangleq \hpof{\y \given \x, [\w]} \, \hpof{[\w]} \\ &= \pof{\y \given \x, \w} \, \hpof{[\w]}. \end{aligned}\]Importantly, the definition of the equivalence classes above is consistent with Bayesian inference:
This is easy to show using Bayes' rule:
\[\begin{aligned} \hpof{[\w] \given \Dany} &= \hpof{\Dany \given [\w]} \, \hpof{[\w]} / \hpof{\Dany} \\ &= \pof{\Dany \given \w} \sum_{\w' \in [\w]} \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\Dany \given \w'} \, \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany} \, \pof{\Dany} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany}. \end{aligned}\]The last step follows from \(\hpof{\Dany}=\pof{\Dany}\):
\[\begin{aligned} \hpof{\Dany} &= \sum_{[\w]} \hpof{\Dany, [\w]} \\ &= \sum_{[\w]} \sum_{\w' \in [\w]} \pof{\Dany, \w'} \\ &= \sum_{\w'} \pof{\Dany, \w'} \\ &= \pof{\Dany}. \end{aligned}\]This also tells us that, for any \(\x\) and \(\y\):
\(\pof{\y... \given \x...} = \hpof{\y... \given \x...}\).
Given this consistency, we don't have to differentiate between \(\hat\opp\) and \(\opp\) and can use them interchangeably. The same holds for \(\opq\).
These simple results deserve careful appreciation to fully internalize them.
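A tiny discrete example (my own) can help internalize this consistency: parameters with identical likelihoods are grouped into equivalence classes, and inference over classes agrees with summing the parameter posterior within each class:

```python
import numpy as np

p_w = np.array([0.1, 0.2, 0.3, 0.4])      # prior over 4 parameters
classes = np.array([0, 1, 0, 1])          # w=0,2 and w=1,3 predict identically
lik_class = np.array([0.7, 0.2])          # p(D | [w]) for each class
lik_w = lik_class[classes]                # p(D | w) depends only on [w]

# parameter-space posterior and evidence
evidence = np.sum(p_w * lik_w)
post_w = p_w * lik_w / evidence

# equivalence-class prior, evidence, and posterior
hp_class = np.array([p_w[classes == c].sum() for c in range(2)])
h_evidence = np.sum(hp_class * lik_class)
post_class = hp_class * lik_class / h_evidence

print(evidence, "==", h_evidence)
print(post_class, "==", [post_w[classes == c].sum() for c in range(2)])
```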
We can view \([\w]\) as a projection from \(\w\) to its equivalence class \([\w]\). The DPI then gives us:
\[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{[\W]}}{\pof{[\W]}}.\]And again: what does the gap between the two terms tell us?
Let's look at a few examples to get a better understanding of this.
Let \(\fof{\x ; \w} = 0\), independent of \(\w\). Then \([\w] = [\w']\) for any \(\w\), \(\w'\).
For any approximate distribution \(\qof{\w}\), the induced \(\Kale{\qof{[\W]}}{\pof{[\W]}}=0\), while \(\Kale{\qof{\W}}{\pof{\W}}\) also includes superfluous divergence.
Let \(\y \given (\w_1, \w_2) = \w_1\) deterministically, independent of \(\w_2\). Then \([(\w_1, \w_2)] = [(\w_1, {\w'}_2)]\) for any \({\w'}_2\) and \([(\w_1,*)]\not=[({\w'}_1, *)]\) for any \(\w_1 \not= \w'_1\).
\(\Kale{\qof{[\W]}}{\pof{[\W]}}=\Kale{\qof{\W_1}}{\pof{\W_1}}\) captures the meaningful divergence between approximate and true distribution, while \(\Kale{\qof{\W}}{\pof{\W}}\) also includes any divergence across \(\w_2\) that has no effect on the predictions.
Finally, let's assume that the predictions are periodic in some way. That is, for example, \(\y = \sin \w\). We then have \([\w] = [\w + 2\pi]\).
Further, let \(\pof{\w} = \operatorname{U}(\w; [0,2\pi \, N))\) for some \(N\) that determines the number of periods. Then, if we introduce another random variable \(K\), that captures which period we are in, we can (again) use the chain rule to write:
\[\begin{aligned} &\Kale{\qof{\W}}{\pof{\W}} \\ &\quad= \Kale{\qof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &\quad= \Kale{\qof{[\W]}}{\pof{[\W]}} \\ &\quad\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}}. \end{aligned}\]This follows from the setup of this specific example. Finally, we have:
\[\Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \le \log N.\]So, if \(\opq\) only had support in a single period for example, the difference between \(\Kale{\qof{\W}}{\pof{\W}}\) and \(\Kale{\qof{[\W]}}{\pof{[\W]}}\) would be \(\log N\): the redundancy.
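A discrete analogue of this periodic example (my own sketch) makes the \(\log N\) redundancy explicit: predictions depend on \(\w\) only through \(\w \bmod M\), the prior is uniform over \(N\) periods, and \(\opq\) has support in a single period:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

M, N = 5, 4                               # period length, number of periods
p_w = np.full(M * N, 1.0 / (M * N))       # uniform prior over all parameters
q_w = np.zeros(M * N)
q_w[:M] = np.random.default_rng(4).dirichlet(np.ones(M))  # support in one period

# distributions over equivalence classes [w] = w mod M
p_class = p_w.reshape(N, M).sum(axis=0)
q_class = q_w.reshape(N, M).sum(axis=0)

gap = kl(q_w, p_w) - kl(q_class, p_class)
print(gap, "==", np.log(N))
```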
How does the predictive prior term fit into this? The DPI again yields the answer:
This tells us that the predictive prior term can at best measure the KL divergence between the equivalence classes of the parameters, and not between the parameters themselves, but luckily, this is the more meaningful divergence anyway!
For the equality cases, we observe that:
For 2.: we know from the chain rule that
\[\Kale{\qof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}\]is monotonically non-decreasing in \(n\); since it is bounded from above by \(\Kale{\qof{[\W]}}{\pof{[\W]}}\), it must converge.
To give intuition that it might do so, and without attempting to prove this formally, we can appeal to the Bernstein-von Mises theorem, which states that the posterior distribution of the parameters converges to a Gaussian distribution centered at the maximum likelihood estimate (MLE) as the number of data points tends to infinity, as long as the model parameters are identifiable (that is, the true parameters we want to learn are unique) and have support under the prior.
For the evidence bound to be meaningful, we already know that we need support of the approximate distribution \(\opq\) in the prior \(\opp\); otherwise, the LHS is \(\infty\). Moreover, realizing that we take an expectation over \(\qof{\Y_n ,..., \Y_1 \given \x_n ,..., \x_1}\), we can decompose the KL term for the gap as:
\[\begin{aligned} &\Kale{\qof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}}{\pof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}} \\ &\quad = \E{\qof{\y_n,...,\y_1\given\x_n,...,\x_1}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}} \\ &\quad = \simpleE{\qof{[\w']}}{\E{\qof{\y_n,...,\y_1\given\x_n,...,\x_1, [\w']}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}}}. \end{aligned}\]That is, we sample a \([\w'] \sim \qof{[\w']}\) and then sample \(\y_n,...,\y_1\given\x_n,...,\x_1\) from the corresponding \(\qof{\y_n,...,\y_1\given\x_n,...,\x_1, [\w']}\) and marginalize over these. Crucially, \([\w']\) are the true parameters of the data-generating process for the inner KL divergence term. We thus take an expectation over KL terms fulfilling the conditions of the Bernstein-von Mises theorem:
\[\begin{aligned} \Kale{\qof{[\W] \given \y_n,\x_n,...,\y_1, \x_1}}{\pof{[\W] \given \y_n,\x_n,...,\y_1, \x_1}} \to 0. \end{aligned}\]In other words:
For a given \([\w']\), in the space of equivalence classes as defined previously, the equivalence class of all MLE solutions in the data limit, \([MLE]\), will be unique by definition (the model is identifiable) and match \([\w']\).
As the MLE is prior-independent once there is support for it, both \(\opq\) and \(\opp\) will converge to the MLE \([\w']\) with sufficient data. Taking the expectation, this yields \(\Kale{\qof{[\W]\given \Y..., \x...}}{\pof{[\W] \given \Y..., \x...}} \to 0\) for \(n \to \infty\), and thus, we have:
(Again, this is not a formal proof but an intuition for why the gap might close in the data limit.)
In my opinion, this is a great result. We have shown both that the predictive prior term converges given our assumptions and that it converges to the symmetry-free divergence over equivalence classes in the data limit. This is a strong argument for the predictive prior term being meaningful and not just a technical trick.
Let's appreciate one more thing: the predictive prior term can be evaluated on infinitely many data points and still converge to a finite value.
The true objective of FSVI is to perform variational inference in the space of equivalence classes, approximating the posterior \(\pof{[\w] \given \Dany}\) that captures the meaningful aspects of the model while being invariant to parameter symmetries. This perspective reveals FSVI as a powerful and principled framework for Bayesian inference in overparameterized models, providing both conceptual and practical advantages over standard variational inference in parameter space.
This is because \(\pof{[\w] \given \Dany}\) captures the meaningful aspects of the posterior, as it is invariant to the symmetries and equivalences in the parameter space that leave the likelihood unchanged.
To achieve this, FSVI employs an implicit variational distribution \(\qof{[\w]}\) and minimizes the following objective, which is regular variational inference on the equivalence classes \(\Kale{\qof{[\W]}}{\pof{[\W] \given \Dany}}\), decomposed using the insights discussed:
\[\begin{aligned} &\Hof{\Dany} \le \Hof{\Dany} + \underbrace{\Kale{\qof{[\W]}}{\pof{[\W] \given \Dany}}}_{\ge 0} \\ &\quad = \Hof{\Dany} + \E{\qof{[\w]}}{-\log \pof{\Dany \given [\w]}} + \Kale{\qof{[\W]}}{\pof{[\W]}} - \Hof{\Dany}\\ &\quad = \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{[\W]}}{\pof{[\W]}} \\ &\quad = \sup_n \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y_n... \given \x_n...}}{\pof{\Y_n... \given \x_n...}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y_n... \given \x_n...}}{\pof{\Y_n... \given \x_n...}} \quad \forall n. \end{aligned}\]This objective differs from the standard variational inference objective in a subtle but crucial way: it operates in the space of equivalence classes \([\w]\) rather than the raw parameter space \(\w\). The likelihood term \(\pof{\Dany \given [\w]}\) is well-defined because all parameters \(\w\) within an equivalence class \([\w]\) induce the same likelihood by construction, allowing us to replace it with the likelihood term \(\pof{\Dany \given \w}\).
The key insight is that FSVI leverages the data processing inequality (DPI) to avoid explicitly constructing the equivalence classes for the prior distribution or specifying a model that operates on them directly. By matching the predictive distributions via the predictive KL term, FSVI implicitly matches the priors over equivalence classes in the limit of infinite data, as shown in the previous section.
This has advantages as overparameterized models are often simpler to optimize. The empirical evidence for this is the success of deep learning models, which are overparameterized and trained with stochastic gradient descent.
This highlights the elegance and conceptual clarity of FSVI:
In Bayesian deep learning, we often use parameter priors that are not meaningful and which also do not take parameter symmetries into account. For example, a unit Gaussian prior over the parameters of a neural network does not necessarily induce different predictions for different parameters. While this prior can be sensible from a parameter compression perspective (e.g. see Hinton and van Camp (1993)), it does not necessarily translate into a meaningful prior over functions.
With function priors and predictive priors, we can specify more meaningful priors because we can focus on the predictions and ignore the parameters. More importantly, this connects Bayesian approaches to data augmentation and other regularization techniques as we will see next.
Given that priors over equivalence classes are difficult to express explicitly though, using the DPI to obtain a functional ELBO can be an easier way to express and approximate them.
All this also helps us gain a new perspective on label entropy regularization. The functional evidence bound can be lower-bounded using the chain rule by:
\[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \\ \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}, \end{aligned}\]where we can expand the term under the second expectation to:
\[\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}=\CrossEntropy{\qof{\Y \given \x}}{\pof{\Y \given \x}} - \xHof{\qof{\Y \given \x}}.\]Assuming that our prior yields a uniform distribution over the labels, we can drop the cross entropy term because it is constant and obtain:
\[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} - \E{\pdata{\x}}{\xHof{\qof{\Y \given \x}}}.\]This is the same as an MLE minimization objective with an additional entropy regularization term \(-\xHof{\qof{\Y \given \x}}\) for different \(\x\) that prevents the model from overfitting to the labels and collapsing to the one-hot encoding of the labels.
Thus, in the simplest approximation, the DPI and functional variational inference give us a new perspective on label entropy regularization.
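As a minimal sketch of what this objective looks like in code (my own illustration; the function names, the toy logits, and the regularization weight are all assumptions, not taken from any of the cited papers):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def objective(logits_labelled, labels, logits_unlabelled, reg_weight=1.0):
    # negative log-likelihood on the labelled data
    probs_l = softmax(logits_labelled)
    nll = -np.mean(np.log(probs_l[np.arange(len(labels)), labels]))
    # predictive entropy on (possibly unlabelled) inputs
    probs_u = softmax(logits_unlabelled)
    entropy = -np.mean(np.sum(probs_u * np.log(probs_u + 1e-12), axis=-1))
    return nll - reg_weight * entropy   # minimizing this keeps predictions from collapsing

rng = np.random.default_rng(5)
print(objective(rng.normal(size=(8, 3)), rng.integers(0, 3, size=8),
                rng.normal(size=(16, 3))))
```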
Obviously, assuming non-uniform prior predictions, \(\E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}\) can be related to knowledge distillation in deep neural networks as introduced by Hinton et al. (2015).
The main technical difference is that knowledge distillation uses the reverse KL divergence instead of the forward KL divergence, while the conceptual difference is that we are not distilling the knowledge from a teacher model but from the prior, which we down-weight while also training our model on the data itself. However, the connection between knowledge distillation and continual learning using informative priors is manifest.
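To make the direction of the KL divergences explicit, here is a tiny sketch (my own, with made-up predictive distributions) contrasting the forward KL term from the functional evidence bound with the reverse-direction term used in knowledge distillation:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q), axis=-1)

p_teacher = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])   # prior / teacher predictions per input
q_student = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.6, 0.2]])   # variational / student predictions

print("functional-evidence-bound-style term:", kl(q_student, p_teacher).mean())
print("distillation-style term:             ", kl(p_teacher, q_student).mean())
```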
In this blog post, we took a deep dive into the data processing inequality (DPI) and its surprisingly far-reaching implications for modern Bayesian deep learning. By carefully examining the assumptions, equality conditions, and chain rule of the DPI, we arrived at an intuitive understanding of why function-space variational inference (FSVI) can be such a powerful tool. The DPI perspective shows how FSVI side-steps issues with high-dimensional parameter spaces by focusing on matching Bayesian predictive posteriors.
Reasoning about parameter equivalence classes under the lens of the DPI, we saw how predictive KL divergences can capture meaningful differences between models while ignoring superficial discrepancies due to symmetries. This provides a fresh perspective on the advantages of predictive priors over standard parameter priors commonly used in Bayesian neural networks.
A key insight that emerged is that FSVIâs true objective is to perform variational inference in the space of predictive equivalence classes, approximating the posterior \(\pof{[\w] \given \Dany}\) that captures the meaningful aspects of the model while being invariant to parameter symmetries. By employing an implicit variational distribution and leveraging the DPI, FSVI sidesteps the need to explicitly define equivalence classes, relying instead on the fact that matching predictive distributions is sufficient to match the posteriors over equivalence classes in the infinite data limit. This reveals FSVI as a principled and elegant approach to Bayesian inference in overparameterized models.
While our treatment only scratched the surface of the full mathematical story, the intuitions we developed allowed us to re-derive key results from the literature and uncover deep connections between seemingly disparate methods like entropy regularization, continual learning, and knowledge distillation. The examples and proofs peppered throughout solidified the core concepts.
More than a bag of technical tricks, the DPI reveals itself to be a powerful conceptual tool for reasoning about models, objectives, and algorithms. I hope this post inspires the reader to seek the fundamental principles underpinning machine learning innovations and to use those principles as a guide for future research. With a solid grasp of foundational tools like the DPI, we can all contribute to demystifying and unifying the rapidly evolving field of Bayesian deep learning.
Acknowledgements. Many thanks to Freddie Bickford Smith for very helpful comments and feedback on this post and to Tim Rudner for additional pointers to relevant literature and feedback on the FSVI section in particular 🤗 GPT-4 and Claude 3 Opus were used for editing.