...in which we find a connection between meta-learning literature and a paper studying how well CNNs deal with nuisance transforms in a class-imbalanced setting. Closer inspection reveals a surprising amount of similarity - from meta-information to loss functions. This implies that the current conception of meta-learning might be too narrow.
At the last ICLR conference, Zhou et al. [2022] presented a paper studying how well CNNs deal with nuisance transformations in a class-imbalanced setting.
Here is a quick summary of their findings: If we train a Convolutional Neural Net (CNN) to classify animals on a set of randomly brightened and darkened images of cats and dogs, it will learn to ignore the scene’s brightness. We say that the CNN learned that classification is invariant to the nuisance transformation of randomly changing the brightness of an image. We now add leopards to the training data, but with fewer examples than we have of cats and dogs (leopards are hard to photograph), while keeping the same random transformations. The training set thus becomes class-imbalanced.
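As a concrete sketch of such a training setup, the snippet below builds a shared brightness transformation with torchvision; the class names and counts are illustrative assumptions, not Zhou et al.’s actual configuration:

```python
from torchvision import transforms

# Shared nuisance transformation: brightness is jittered identically for
# every class (the 0.5 factor range is a hypothetical choice).
nuisance = transforms.ColorJitter(brightness=0.5)

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    nuisance,                       # applied uniformly across all classes
    transforms.ToTensor(),
])

# Hypothetical class-imbalanced training set: leopards are rare.
class_sizes = {"cat": 5000, "dog": 5000, "leopard": 250}
```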
We might expect a sophisticated learner to look at the entire dataset, recognize the random brightness modifications across all species of animal, and henceforth ignore brightness when making predictions. If this applied to our experiment, the CNN would be similarly good at ignoring lighting variations for all animals. Furthermore, we would expect the CNN to become more competent at ignoring lighting variations in proportion to the total number of images, irrespective of which animal they depict.
Zhou et al. [2022] show that this is not what happens: the trained CNN is markedly less invariant to the nuisance transformation on classes with few training examples.
However, there is a solution: Zhou et al. [2022] propose Generative Invariance Transfer (GIT), which learns the nuisance transformation distribution with a generative model and thereby transfers invariance to the data-poor classes.
So why is this an interesting result?
In the field of machine learning, many have long dreamed of algorithms that learn how to learn, improving not just at a given task but at the process of learning itself.
Before we proceed to the main post, let’s clarify some definitions. If you are already familiar with the subject, you may skip this part. If you have only a vague notion of contemporary meta-learning, you will still be able to follow the article. However, if you want to know more, here is a gentle introduction to MAML, one of the most popular methods.
In many real-world classification datasets, the number of examples for each class varies. Class-imbalanced classification refers to classification on datasets where the frequencies of class labels vary significantly.
It is generally more difficult for a neural network to learn to classify classes with fewer examples than those with many.
Transformations are alterations of data. In the context of image classification, nuisance transformations are alterations that do not affect the class labels of the data. A model is said to be invariant to a nuisance transformation if it can successfully ignore the transformation when predicting a class label.
We can formally define a nuisance transformation $T(\cdot |x)$ as a distribution over transformation functions. An example of a nuisance transformation might be a distribution over rotation matrices of different angles, or lighting transformations with different exposure values. By definition, nuisance transformations have no impact on class labels $y$, only on data $x$. A perfectly transformation-invariant classifier would thus completely ignore them, i.e.,
$$ \hat{P}_w(y = j|x) = \hat{P}_w(y = j|x'), \; x' \sim T(\cdot |x). $$
(see Zhou et al. [2022]).
Let’s take a more detailed look at the experiment Zhou et al. [2022] conducted.
Zhou et al. [2022] train a CNN classifier on class-imbalanced image datasets to which a nuisance transformation, such as a random rotation or brightness change, has been uniformly applied.
To measure the invariance of the trained model to the applied transformation, Zhou et al. [2022] compute the expected KL divergence (eKLD) between the model’s predictions on original and transformed images.
If the learner is invariant to the transformation, the predicted probability distribution over class labels should be identical for the transformed and untransformed images. In that case, the KLD is zero; otherwise, it is greater than zero. The higher the expected KL divergence, the more the applied transformation impacts the network’s predictions.
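As a sketch of how such a metric could be computed (assuming a PyTorch classifier `model` and a stochastic `transform`; Zhou et al.’s exact estimator may differ in detail):

```python
import torch
import torch.nn.functional as F

def expected_kld(model, images, transform, n_samples=10):
    """Expected KL divergence between predictions on clean and transformed
    images, averaged over random draws of the transformation."""
    model.eval()
    with torch.no_grad():
        p = F.softmax(model(images), dim=-1)   # predictions on originals
        total = 0.0
        for _ in range(n_samples):
            q = F.softmax(model(transform(images)), dim=-1)
            # F.kl_div takes log-probs first and computes KL(target || input),
            # so this term is KL(p || q): zero iff the transform is ignored.
            total += F.kl_div(q.log(), p, reduction="batchmean")
    return total / n_samples

# Computing this per class reveals whether invariance depends on class size:
# scores = {c: expected_kld(model, images_by_class[c], nuisance) for c in classes}
```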
The result: eKLD falls as class size grows. This implies that the CNN does not learn that the same nuisance transformations apply to all images and therefore does not transfer this knowledge to the classes with less training data. A CNN learns invariance separately for each class (see also Zhou et al. [2022]).
You might think this is a cool experiment, but how is it related to meta-learning?
And, indeed, in contemporary literature meta-learning is often conceived of as learning multiple tasks. In a much-cited survey, Hospedales et al. [2021] write:
Meta-learning is most commonly understood as learning to learn; the process of improving a learning algorithm over multiple learning episodes. In contrast, conventional ML improves model predictions over multiple data instances.
In another popular survey, Vanschoren [2018] describes the meta-learning process as follows:
First, we need to collect meta-data that describe prior learning tasks and previously learned models. They comprise the exact algorithm configurations used to train the models, including hyperparameter settings, pipeline compositions and/or network architectures, the resulting model evaluations, such as accuracy and training time, the learned model parameters, such as the trained weights of a neural net, as well as measurable properties of the task itself, also known as meta-features.
Franceschi et al. [2018] essentially equate meta-learning (ML) with hyperparameter optimization (HO):
[…] both HO and ML essentially boil down to nesting two search problems: at the inner level we seek a good hypothesis (as in standard supervised learning) while at the outer level we seek a good configuration (including a good hypothesis space) where the inner search takes place.
This perspective on meta-learning seems to indicate that “true” meta-learning requires a rigid structure of multiple discrete tasks to optimize over. However, in the invariance transfer setting we have neither multiple learning episodes (we learn over the data instances of a single dataset) nor any “meta-features”. Also, adding a class to the dataset does not exactly constitute a new “task”, even though knowledge of the nuisance transform carries over.
So is Zhou et al.’s [2022] setting not a meta-learning problem after all?
Let’s look at one of the original works on meta-learning. In the 1998 book “Learning to Learn”, Sebastian Thrun and Lorien Pratt define an algorithm as capable of “learning to learn” if it improves its performance in proportion to the number of tasks it is exposed to:
an algorithm is said to learn to learn if its performance at each task improves with experience and with the number of tasks. Put differently, a learning algorithm whose performance does not depend on the number of learning tasks, which hence would not benefit from the presence of other learning tasks, is not said to learn to learn
Now this seems a much looser definition. How might it apply to the experiment just outlined? In the introduction, we thought about how a sophisticated learner might handle a dataset like the one described in the last section. We said that a sophisticated learner would learn that the nuisance transformations are applied uniformly to all classes. Therefore, if we added more classes to the dataset, the learner would become more invariant to the transformations, because we expose it to more examples of them. Since this invariance is part of the classification task for each class, the learner should, everything else being equal, become better at classification, especially on classes with few training examples. To see this, we must think of the multi-class classification task not as a single task but as a set of binary classification tasks: multiple mappings from image features to class activations that must each be learned. Thrun and Pratt continue:
For an algorithm to fit this definition, some kind of transfer must occur between multiple tasks that must have a positive impact on expected task-performance.
This transfer of invariance across classes is precisely what Zhou et al. [2022] set out to achieve with Generative Invariance Transfer (GIT).
Zhou et al. [2022] build GIT on MUNIT [Huang et al., 2018], an architecture for multimodal unsupervised image-to-image translation.
MUNIT networks are capable of performing image-to-image translation, which means that they can translate an image from one domain, such as pictures of leopards, into another domain, such as pictures of house cats. The translated image should look like a real house cat while still resembling the original leopard image. For instance, if the leopard in the original image has its eyes closed, the translated image should contain a house cat with closed eyes. Eye state is a feature present in both domains, so a good translator should not alter it. On the other hand, a leopard’s fur is yellow and spotted, while a house cat’s fur can be white, black, grey, or brown. To make the translated images indistinguishable from real house cats, the translator must thus replace leopard fur with house cat fur.
MUNIT networks learn to perform translations by correctly distinguishing the domain-agnostic features (such as eye state) from the domain-specific features (such as the distribution of fur color). They embed an image into two latent spaces: a content space that encodes the domain-agnostic features and a style space that encodes the domain-specific features (see figure above).
To transform a leopard into a house cat, we can encode the leopard into a content and a style code, discard the leopard-specific style code, randomly select a cat-specific style code, and assemble a house cat image that looks similar by combining the leopard’s content code with the randomly chosen cat style code (see figure below).
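A heavily simplified sketch of this translation pipeline (linear stand-ins for what are really large convolutional networks; all shapes are hypothetical):

```python
import torch
import torch.nn as nn

# Stand-ins for MUNIT's encoders and decoder, operating on flattened images.
content_enc = nn.Linear(784, 64)   # domain-agnostic content (pose, eye state)
style_enc   = nn.Linear(784, 8)    # domain-specific style (fur color, ...)
decoder     = nn.Linear(64 + 8, 784)

leopard = torch.randn(1, 784)              # a flattened leopard image
c = content_enc(leopard)                   # keep the leopard's content code
s_leopard = style_enc(leopard)             # computed, then discarded
s_cat = torch.randn(1, 8)                  # sample a random house-cat style
house_cat = decoder(torch.cat([c, s_cat], dim=-1))  # translated image
```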
Zhou et al. [2022], however, do not translate between two separate domains: they train MUNIT on the classification dataset as a single domain, so the translator maps the dataset onto itself.
However, if the domain included both house cats and apples, fur color would not be a valid style feature. If it were, the translator might translate fur color onto an apple and give it black fur, which would look suspiciously out of place. Whatever house cats and apples have in common - maybe their position or size in the frame - would be a valid style feature. We would expect an intra-domain translator on an apples-and-cats dataset to change the position and size of an apple but not to turn it into a cat (not even partially).
It turns out that on a dataset with uniformly applied nuisance transformations, the nuisance transformations are valid style features: The result of randomly rotating an apple cannot be discerned as artificial when images of all classes, house cats and apples, were previously randomly rotated.
Zhou et al. [2022] exploit this: after training MUNIT on the whole dataset, its style space captures the nuisance transformations, and sampling random style codes applies those transformations to images of the rare classes.
So MUNIT decomposes the example-specific information, e.g., whether something is an apple or a house cat, from the meta-information, i.e., the nuisance transformations applied to the entire dataset. When we add more classes, it has more data and can better learn the transformation distribution $T(\cdot |x)$. Does solving a meta-learning problem make MUNIT a meta-learner? Let’s look at the relationship MUNIT has with contemporary meta-learners.
To see how well MUNIT fits the definition of meta-learning, let’s see what the same survey papers we consulted earlier consider the structure of a meta-learning algorithm.
Hospedales et al. [2021] formalize the meta-learning objective as
$$ \underset{\omega}{\mathrm{min}} \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \; \mathcal{L}(\mathcal{D}, \omega), $$
where $\omega$ denotes parameters trained exclusively on the meta-level, i.e., the meta-knowledge learnable from the task distribution.
This meta-knowledge is what the meta-learner accumulates and transfers across the tasks. Collecting meta-knowledge allows the meta-learner to improve its expected task performance with the number of tasks. The meta-knowledge in the experiment of Zhou et al. [2022] is the nuisance transformation distribution $T(\cdot |x)$, which is shared by all classes.
The task-centered view of meta-learning brings us to a related issue: A meta-learner must discern and decompose task-specific knowledge from meta-knowledge. Contemporary meta-learners decompose meta-knowledge through the different objectives of their inner and outer loops and their respective loss terms. They store meta-knowledge in the outer loop’s parameter set $\omega$ but must not learn task-specific information there. Any unlearned meta-features lead to slower adaptation, negatively impacting performance: meta-underfitting. On the other hand, any learned task-specific features will not generalize to unseen tasks in the distribution, thus also negatively impacting performance: meta-overfitting.
We recall that, similarly, MUNIT must decompose example-specific content information from the shared style information, storing each in its own latent space.
Although the single-domain application of MUNIT explicitly learns a single task and scales “over multiple data instances” instead of “multiple learning episodes”, it accumulates and decomposes meta-knowledge just as a contemporary meta-learner does.
As we shall see, this is even visible when comparing their formalizations as optimization problems.
Franceschi et al. [2018] formalize meta-learning as a bi-level optimization problem:
$$ \bbox[5pt, border: 2px solid blue]{ \begin{align*} \omega^{*} = \underset{\omega}{\mathrm{argmin}} \sum_{i=1}^{M} \mathcal{L}^{meta}(\theta^{* \; (i)}(\omega), D^{val}_i), \end{align*} } $$
where $M$ describes the number of tasks in a batch, $\mathcal{L}^{meta}$ is the meta-loss function, and $D^{val}_i$ is the validation set of task $i$. $\omega$ represents the parameters exclusively updated in the outer loop. $\theta^{* \; (i)}$ represents an inner loop learning a task, which we can formally express as a sub-objective constraining the primary objective:
$$ \bbox[5pt, border: 2px solid red]{ \begin{align*} s.t. \; \theta^{* \; (i)} = \underset{\theta}{\mathrm{argmin}} \; \mathcal{L^{task}}(\theta, \omega, D^{tr}_i), \end{align*} } $$
where $\theta$ are the model parameters updated in the inner loop, $\mathcal{L}^{task}$ is the loss function by which they are updated, and $D^{tr}_i$ is the training set of task $i$.
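To make the nested structure concrete, here is a minimal runnable PyTorch sketch in the spirit of MAML (one instance of the equations above, not Franceschi et al.’s general formulation; all task details are hypothetical): the inner loop takes a gradient step on each task’s training set, and the outer loop updates the meta-initialization $\omega$ on the validation sets.

```python
import torch

def make_task():
    """A hypothetical 1-d linear regression task with train/val splits."""
    w = torch.randn(1)
    x_tr, x_val = torch.randn(16, 1), torch.randn(16, 1)
    return (x_tr, w * x_tr), (x_val, w * x_val)

omega = torch.zeros(1, requires_grad=True)     # meta-parameters
meta_opt = torch.optim.SGD([omega], lr=0.1)

for _ in range(100):                           # outer loop
    meta_loss = 0.0
    for _ in range(4):                         # M tasks per batch
        (x_tr, y_tr), (x_val, y_val) = make_task()
        # Inner loop (red box): one gradient step on L^task(theta, omega, D_tr).
        grad = torch.autograd.grad(((x_tr * omega - y_tr) ** 2).mean(),
                                   omega, create_graph=True)[0]
        theta = omega - 0.01 * grad
        # Outer objective (blue box): L^meta(theta*(omega), D_val).
        meta_loss = meta_loss + ((x_val * theta - y_val) ** 2).mean()
    meta_opt.zero_grad()
    meta_loss.backward()                       # differentiate through the inner step
    meta_opt.step()
```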
While MUNIT does not adhere to Franceschi et al.’s [2018] notion of a meta-learner as “nesting two search problems”, it turns out that its loss functions can be similarly decomposed:
MUNIT’s loss function consists of two adversarial (GAN) loss terms, one per domain, plus several reconstruction terms, which we summarize as $\mathcal{L}_{recon}$.
MUNIT’s GAN loss term for the second domain is
$$ \begin{align*} &\mathcal{L}^{x_{2}}_{GAN}(\theta_d, \theta_c, \theta_s) \\\\ =& \;\mathbb{E}_{c_{1} \sim p(c_{1}), s_{2} \sim p(s_{2})} \left[ \log (1 -D_ {2} (G_{2} (c_{1}, s_{2}, \theta_c, \theta_s), \theta_d)) \right] \\ +& \;\mathbb{E}_{x_{2} \sim p(x_{2})} \left[ \log(D_{2} (x_{2}, \theta_d)) \right], \end{align*} $$
where $\theta_d$ represents the parameters of the discriminator network, $p(x_2)$ is the data distribution of the second domain, $c_1$ is the content embedding of an image from the first domain to be translated, and $s_2$ is a random style code of the second domain. $D_2$ is the discriminator of the second domain, and $G_2$ is its generator. MUNIT’s full objective function is:
$$ \begin{align*} \underset{\theta_c, \theta_s}{\mathrm{argmin}} \; \underset{\theta_d}{\mathrm{argmax}}& \;\mathbb{E}_{c_{1} \sim p(c_{1}), s_{2} \sim p(s_{2})} \left[ \log (1 -D_ {2} (G_{2} (c_{1}, s_{2}, \theta_c, \theta_s), \theta_d)) \right] \\ +& \; \mathbb{E}_{x_{2} \sim p(x_{2})} \left[ \log(D_{2} (x_{2}, \theta_d)) \right] + \; \mathcal{L}^{x_{1}}_{GAN}(\theta_d, \theta_c, \theta_s) \\ +& \;\mathcal{L}_{recon}(\theta_c, \theta_s) \end{align*} $$
(compare Huang et al. [2018]). We can restate this objective in the bi-level form introduced above. The outer minimization problem becomes
$$ \bbox[5px, border: 2px solid blue]{ \begin{align*} \omega^{*} & = \{ \theta_c^*, \theta_s^* \} \\\\ & = \underset{\theta_c, \theta_s}{\mathrm{argmin}} \; \mathbb{E}_{c_{1} \sim p(c_{1}), s_{2} \sim p(s_{2})} \left[ \log (1 -D_ {2} (G_{2} (c_{1}, s_{2}, \theta_c, \theta_s), \theta_d^{*})) \right] \\ & + \mathcal{L}_{recon}(\theta_c, \theta_s), \end{align*} } $$
We then add a single constraint, a subsidiary maximization problem for the discriminator function:
$$ \bbox[5px, border: 2px solid red]{ \begin{align*} &s.t. \;\theta_d^{*} \\\\ & = \underset{\theta_d}{\mathrm{argmax}} \; \mathbb{E}_{c_{1} \sim p(c_{1}), s_{2} \sim p(s_{2})} \left[ \log (1 -D_ {2} (G_{2} (c_{1}, s_{2}, \theta_c, \theta_s), \theta_d)) \right] \\ & + \mathbb{E}_{x_{2} \sim p(x_{2})} \left[ \log(D_{2} (x_{2}, \theta_d)) \right] \end{align*} } $$
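As an illustration, the two boxed problems correspond to the generator and discriminator objectives one would compute in code. The snippet below uses tiny linear stand-ins with hypothetical shapes, omitting the reconstruction terms:

```python
import torch
import torch.nn as nn

# Tiny stand-ins for MUNIT's generator and domain-2 discriminator.
G2 = nn.Linear(64 + 8, 784)                       # decodes (content, style)
D2 = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())

c1, s2 = torch.randn(32, 64), torch.randn(32, 8)  # content / style codes
x2 = torch.randn(32, 784)                         # real domain-2 images
eps = 1e-6                                        # numerical safety for log

fake = G2(torch.cat([c1, s2], dim=-1))
# Inner (red) problem: the discriminator maximizes both expectations.
d_objective = (torch.log(1 - D2(fake) + eps).mean()
               + torch.log(D2(x2) + eps).mean())
# Outer (blue) problem: the encoders/generator minimize the first expectation
# (plus the reconstruction terms, omitted here).
g_objective = torch.log(1 - D2(fake) + eps).mean()
```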
Interestingly, this bi-level view does not only resemble a meta-learning procedure as expressed above; the bi-level optimization also facilitates a similar effect. Maximizing the discriminator’s performance in the constraint punishes style information that leaks into the content code: if style information is encoded as content information, the discriminator detects artifacts of the original domain in the translated image. Similarly, a meta-learner prevents meta-overfitting via its outer optimization loop.
However, MUNIT, while representable as a bi-level optimization problem, does not “essentially boil down to nesting two search problems”: the inner maximization is an adversarial constraint rather than a nested learning task.
So it appears that, while not conforming to any verbal definition of a contemporary meta-learner, MUNIT seems to:
a) adhere to multiple formalizations made in the very same publications to define meta-learning
b) solve a meta-learning problem via GIT when applied to a single domain (if you agree with the conclusion of the previous section)
We thus conclude:
When applied to a single domain, MUNIT does meta-learn, as it combines information from all classes to extract the transformation distribution. While it does not perform classification explicitly, the class information of an image is encoded in MUNIT’s content space. Since MUNIT is trained in an unsupervised way, this encoding is probably closer to a distance metric than to an actual class label. We might thus classify single-domain MUNIT as an unsupervised, generative meta-learner.
That invariance transfer and GIT are meta-learning, and that MUNIT is a meta-learner, matters. Granted, it is not especially hard to see that invariance transfer is a form of “learning to learn”, or that image-to-image translation is essentially a mechanism to decompose class-specific from general features.
However, because contemporary meta-learning has been narrowly cast as “improving a learning algorithm over multiple learning episodes”, this connection is easy to overlook.
In these authors’ opinion, this is not GIT’s fault, but a sign that meta-learning has recently been conceived of too narrowly. Zhou et al.’s [2022] work is a case in point.
A too-narrow conception goes further than obscuring some experiment’s significance, though: Meta-learning as a field has recently struggled to compete with less specialized architectures.
Zhou et al.’s [2022] experiment and solution suggest that meta-learning problems, and the architectures that solve them, need not come in the episodic, multi-task shape the field has focused on.
Using GIT, Zhou et al. [2022] substantially improve classification performance on the rare classes of imbalanced datasets.
Our discussion of Zhou et al.’s [2022] work thus suggests broadening our conception of meta-learning beyond the episodic, multi-task setting.