Towards Robust Foundation Models: Adversarial Contrastive Learning
Foundation models pre-trained on large-scale unlabelled datasets using self-supervision can generalize to a wide range of downstream tasks. Existing work has shown that adversarial attacks can effectively fool any downstream model fine-tuned from a pre-trained foundation model. The existence of such adversarial attacks necessitates the development of robust foundation models, which can yield both standard generalization and adversarial robustness in safety-critical downstream tasks. Currently, adversarial contrastive learning (ACL) is one of the most effective methods for producing a robust foundation model. ACL incorporates contrastive learning with adversarial data to learn a robust representation without requiring costly annotations. In this blog, we introduce two NeurIPS 2023 publications that enhance ACL's efficacy and efficiency, respectively. (1) This blog introduces Adversarial Invariant Regularization (AIR), a state-of-the-art ACL algorithm. A causal theoretical framework is built to interpret ACL, and the AIR algorithm is then derived from this causal framework to regularize and improve ACL. (2) This blog also introduces a Robustness-aware Coreset Selection (RCS) method to speed up ACL. RCS does not require label information and searches for an informative training subset that can maintain adversarial robustness. For the first time, RCS enables the application of ACL to the large-scale ImageNet-1K dataset.
Foundation Models
Foundation models are pre-trained on large-scale unlabelled datasets using self-supervised learning methods, and they are generalizable to a wide range of downstream tasks via fine-tuning. For example, GPT-3 has been successfully commercialized as a powerful text generation application. The vision transformer has been widely used in computer vision tasks such as object detection and medical image analysis. BLIP is a vision-language pre-trained model that can perform many vision-language tasks such as visual question answering. CLAP is a language-audio pre-trained model that can be used for understanding paired text and audio.
Contrastive Learning (CL)
To build foundation models, contrastive learning (CL) is one of the popular self-supervised learning methods. CL aims to maximize the agreement between different natural views of the original data.
Let \(f_\theta: \mathcal{X} \rightarrow \mathcal{Z}\) be a feature extractor parameterized by \(\theta\), \(g:\mathcal{Z} \rightarrow \mathcal{V}\) be a projection head that maps representations to the space where the contrastive loss is applied, and \(\tau_i, \tau_j: \mathcal{X} \rightarrow \mathcal{X}\) be two transformation operations randomly sampled from a pre-defined transformation set \(\mathcal{T}\). Given a mini-batch \(B \sim \mathcal{X}^\beta\) consisting of \(\beta\) samples, we denote the augmented minibatch \(B^\prime = \{ \tau_i(x_k), \tau_j(x_k) \mid \forall x_k \in B \}\) consisting of \(2\beta\) samples. We take \(h_\theta(\cdot) = g \circ f_\theta(\cdot)\) and \(x_k^u = \tau_u(x_k)\) for any \(x_k \sim \mathcal{X}\) and \(u \in \{i,j\}\). The contrastive loss between different natural views (i.e., \(x_k^i\) and \(x_k^j\)) is formulated as follows:
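A standard SimCLR-style (NT-Xent) form of this loss, written with the notation above and assuming a temperature parameter \(t > 0\) (the exact form in the referenced papers may differ slightly, e.g., in whether both views serve as anchors), is

\[
\ell_\mathrm{CL}(x_k^i, x_k^j; \theta) = - \log \frac{\exp\left( \mathrm{sim}\left( h_\theta(x_k^i), h_\theta(x_k^j) \right) / t \right)}{\sum_{x \in B^\prime \setminus \{x_k^i\}} \exp\left( \mathrm{sim}\left( h_\theta(x_k^i), h_\theta(x) \right) / t \right)},
\]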
where \(\mathrm{sim}(\cdot,\cdot)\) is the cosine similarity function.
Intuitively, CL aims to maximize the agreement between different natural views (the dashed blue lines).
How to implement CL at the pre-training stage in practice?
Click here to see the PyTorch code for calculating the contrastive loss. You can copy and paste it to conveniently calculate the contrastive loss. The code is copied from https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.
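For readers who just want a self-contained reference, below is a minimal sketch of an NT-Xent-style contrastive loss in PyTorch. It is written for illustration and is not the repository's exact implementation: the function name `nt_xent_loss`, the default temperature, and the batch layout (the first \(\beta\) rows and the last \(\beta\) rows hold the two augmented views) are assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(features, temperature=0.5):
    """NT-Xent contrastive loss (illustrative sketch).

    Args:
        features: tensor of shape (2 * beta, d); rows [0, beta) and
            [beta, 2 * beta) hold the projected features h_theta(x_k^i)
            and h_theta(x_k^j) of the two augmented views.
        temperature: softmax temperature t.
    """
    n = features.shape[0]                      # n = 2 * beta
    features = F.normalize(features, dim=1)    # cosine similarity via dot products
    sim = features @ features.t() / temperature

    # Mask out self-similarity so each anchor is compared with the other 2*beta - 1 samples.
    mask = torch.eye(n, dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(mask, float('-inf'))

    # The positive of row k is row k + beta (and vice versa).
    beta = n // 2
    pos_index = (torch.arange(n, device=features.device) + beta) % n

    # Cross-entropy against the index of the positive sample equals
    # -log softmax(sim)[k, positive(k)], i.e., the NT-Xent loss averaged over anchors.
    return F.cross_entropy(sim, pos_index)
```

For example, `loss = nt_xent_loss(h(torch.cat([x_i, x_j], dim=0)))`, where `h` denotes the composition of the feature extractor and the projection head.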
Besides, you can use the following script to conduct self-supervised pre-training via CL using ResNet-18 on CIFAR-10:
Robust Foundation Models
Existing work has shown that there exist adversarial attacks that can fool the foundation representations into outputting incorrect predictions by adding imperceptible adversarial perturbations to the original inputs of downstream tasks. The existence of adversarial attacks necessitates the development of robust foundation models for safety-critical downstream tasks.
The foundation representation is vulnerable to adversarial attacks, which cause the model to wrongly predict a car as 'NOT a car'.
Robust foundation models are pre-trained on large-scale datasets via robust self-supervised learning methods. Robust foundation models have the following two critical properties:
Robust foundation representations are generalizable to downstream tasks;
Fine-tuned robust foundation representations are adversarially robust against adversarial attacks in downstream tasks.
Adversarial Contrastive Learning (ACL)
To learn robust foundation representations, adversarial contrastive learning (ACL) is one of the most popular and effective robust self-supervised learning methods. ACL incorporates CL with adversarial data to build a robust foundation model without requiring costly annotations. ACL aims to maximize the agreement between different natural views as well as the agreement between different adversarial views. The adversarial contrastive loss given a data point \(x_k \in \mathcal{X}\) is formulated as follows:
Note that \(\omega \in [0,1]\) is a scalar and \(\mathcal{B}_\epsilon[x]\) is a constraint that ensures the adversarial data \(\tilde{x}\) is in the \(\epsilon\)-ball around data \(x\).
Intuitively, ACL aims to maximize the agreement between different natural views (the dashed blue lines) and the agreement between different adversarial views (the dashed red lines).
Here is the generation procedure of adversarial data via Projected Gradient Descent (PGD). Given an initial positive pair \((x_k^{i,(0)}, x_k^{j,(0)})\), the number of PGD steps \(T \in \mathbb{N}\), step size \(\rho > 0\), and adversarial budget \(\epsilon \geq 0\), PGD iteratively updates the pair of data from \(t=0\) to \(T-1\) as follows:
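One common instantiation of this update is a sign-gradient ascent on the contrastive loss between the two adversarial views (treat the exact loss being maximized as an assumption of this sketch):

\[
x_k^{u,(t+1)} = \Pi_{\mathcal{B}_\epsilon[x_k^{u,(0)}]}\left( x_k^{u,(t)} + \rho \cdot \mathrm{sign}\left( \nabla_{x_k^{u,(t)}} \ell_\mathrm{CL}\left(x_k^{i,(t)}, x_k^{j,(t)}; \theta\right) \right) \right), \quad u \in \{i,j\},
\]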
where \(\Pi_{\mathcal{B}_\epsilon[x]}\) projects the data into the \(\epsilon\)-ball around the initial point \(x\). Generating adversarial data requires \(T\) iterations of forwarding and back-propagations, which makes the training procedure extremely slow.
The generation procedure of adversarial data in ACL. The adversarial data $\tilde{x}_k^i$ and $\tilde{x}_k^j$ are updated from the low-loss region to the high-loss region step by step according to the loss gradient.
At each epoch, ACL conducts steps (1) and (2) alternately:
Step (1): generating adversarial data (i.e., \(\tilde{x}_k^i\) and \(\tilde{x}_k^j\)) via PGD;
Step (2): updating model parameters via minimizing adversarial contrastive loss to maximize agreements on the adversarial data and natural data.
How to implement ACL at the pre-training stage in practice?
Click here to see the PyTorch code for calculating the adversarial contrastive loss. You can copy and paste it to conveniently calculate the adversarial contrastive loss. The code is copied from https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.
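Below is a simplified sketch of how the adversarial views can be generated with PGD and combined with the contrastive loss from the earlier sketch. It is not the repository's implementation: the default number of steps, step size, budget, and the \((1+\omega)/(1-\omega)\) weighting between the adversarial and natural terms are illustrative assumptions.

```python
import torch

def pgd_contrastive_views(model, x_i, x_j, loss_fn, epsilon=8/255, rho=2/255, steps=5):
    """Generate adversarial views by maximizing the contrastive loss with PGD (sketch)."""
    adv_i = x_i.clone().detach()
    adv_j = x_j.clone().detach()
    for _ in range(steps):
        adv_i.requires_grad_(True)
        adv_j.requires_grad_(True)
        features = model(torch.cat([adv_i, adv_j], dim=0))
        loss = loss_fn(features)                       # e.g., nt_xent_loss from the sketch above
        grad_i, grad_j = torch.autograd.grad(loss, [adv_i, adv_j])
        with torch.no_grad():
            # Sign-gradient ascent step followed by projection onto the epsilon-ball.
            adv_i = adv_i + rho * grad_i.sign()
            adv_j = adv_j + rho * grad_j.sign()
            adv_i = x_i + torch.clamp(adv_i - x_i, -epsilon, epsilon)
            adv_j = x_j + torch.clamp(adv_j - x_j, -epsilon, epsilon)
            adv_i = adv_i.clamp(0.0, 1.0)
            adv_j = adv_j.clamp(0.0, 1.0)
    return adv_i.detach(), adv_j.detach()

def adversarial_contrastive_loss(model, x_i, x_j, loss_fn, omega=0.5):
    """Weighted sum of contrastive losses on adversarial and natural views (sketch).

    The (1 + omega) / (1 - omega) weighting is one convention used in the ACL
    literature; treat it as an assumption and match it to the paper you follow.
    """
    adv_i, adv_j = pgd_contrastive_views(model, x_i, x_j, loss_fn)
    adv_loss = loss_fn(model(torch.cat([adv_i, adv_j], dim=0)))
    nat_loss = loss_fn(model(torch.cat([x_i, x_j], dim=0)))
    return (1 + omega) * adv_loss + (1 - omega) * nat_loss
```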
Besides, you can use the following script to conduct robust self-supervised pre-training via ACL using ResNet-18 on CIFAR-10:
How to utilize robust foundation representations via fine-tuning in downstream tasks?
At the fine-tuning stage, a classifier is randomly initialized and appended to the pre-trained feature extractor for solving classification tasks. There are three types of fine-tuning modes (a PyTorch sketch of how they differ follows the list):
Standard linear fine-tuning (SLF): standardly fine-tuning only the classifier (i.e., on natural data) while freezing the feature extractor.
Adversarial linear fine-tuning (ALF): adversarially fine-tuning only the classifier (i.e., on adversarial data) while freezing the feature extractor.
Adversarial full fine-tuning (AFF): adversarially fine-tuning both the feature extractor and the classifier.
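As a rough illustration of how the three modes differ in which parameters are trainable and whether adversarial examples are used, here is a small PyTorch sketch (not the repository's finetuning.py); the classifier shape, the default optimizer, and the hyperparameters are placeholders.

```python
import torch.nn as nn
import torch.optim as optim

def build_finetune_setup(feature_extractor, num_classes, mode="SLF", feat_dim=512):
    """Return (model, optimizer, use_adversarial_examples) for a fine-tuning mode.

    SLF: train only the linear classifier on natural data.
    ALF: train only the linear classifier on adversarial data.
    AFF: adversarially train both the feature extractor and the classifier.
    """
    classifier = nn.Linear(feat_dim, num_classes)
    model = nn.Sequential(feature_extractor, classifier)

    if mode in ("SLF", "ALF"):
        # Freeze the pre-trained feature extractor; only the classifier is updated.
        for p in feature_extractor.parameters():
            p.requires_grad = False
        params = classifier.parameters()
    elif mode == "AFF":
        params = model.parameters()
    else:
        raise ValueError(mode)

    use_adversarial_examples = mode in ("ALF", "AFF")       # e.g., PGD-perturbed inputs
    optimizer = optim.SGD(params, lr=0.01, momentum=0.9)    # placeholder hyperparameters
    return model, optimizer, use_adversarial_examples
```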
You can use the following script to transfer an adversarially pre-trained ResNet-18 on CIFAR-10 to a downstream task CIFAR-100 via fine-tuning:
Note that MODE=ALL means that finetuning.py sequentially conducts fine-tuning in all three modes (i.e., SLF, ALF, and AFF) and outputs the result of each fine-tuning mode in the log file $FINETUNE_DIR/results/log.txt.
Enhancing ACL via Adversarial Invariant Regularization (AIR)
Here, we introduce the NeurIPS 2023 paper which proposes Adversarial Invariant Regularization (AIR), which regularizes both standard and robust representations to be style-independent based on a causal theoretical framework. Empirically, AIR yields state-of-the-art performance in terms of robustness against adversarial attacks and common corruptions, as well as standard generalization, in downstream tasks.
Causal View of ACL
AIR first introduces the causal graph of ACL, as shown in the following figure.
The causal graph of ACL.
During the data generation procedure:
\(c\) is the content variable, which can be regarded as the original data in the datasets.
\(s\) is the style factor, which can be regarded as the data transformation functions that modify the content while maintaining its semantic meaning. Note that the factors \(c\) and \(s\) are independent.
\(x\) is the natural data, which is decided by the content factor \(c\) and the style factor \(s\).
\(y_t \in \{ y_i \}_{i=1}^{T}\) is the label from an unknown downstream task. Note that \(y_t\) is only decided by the content factor \(c\).
\(y^R\) is the proxy label, which is a refinement of $y_t$. \(y^R\) is used for self-supervised learning without labels. As illustrated in the following figure, the label dog is refined into the proxy labels golden retriever with yellow hair and labrador retriever with black hair. Therefore, when there is no target label, we can train models by differentiating these two different pictures using the contrastive loss.
The illustration of the proxy label $y^R$ which is a refinement of the label $y_t$.
\(\tilde{x}\) is the adversarial data of $x$. Since the generation procedure of \(\tilde{x}\) in ACL does not use the labels, the adversarial data \(\tilde{x}\) is decided by the natural data \(x\) and the model parameter \(\theta\).
During the learning procedure, ACL optimizes the parameters \(\theta\) by maximizing both of the conditional probabilities \(p(y^R \mid x)\) and \(p(y^R \mid \tilde{x})\).
The Methodology of AIR
Style-invariant criterion.
From the causal view of ACL, the learning procedure should satisfy the style-invariant criterion. That is to say, the intervention on the style factor should not affect the conditional probability, i.e., \(p^{do(\tau_i)}(y^R \mid x) = p^{do(\tau_j)}(y^R \mid x)\), where \(do(\tau)\) is the intervention approximated by the data augmentation function $\tau \in \mathcal{T}$.
According to causal reasoning, the style factor $s$ should not affect $p(y^R \mid x)$.
Assuming that the path \(x \rightarrow \tilde{x} \rightarrow y^R\) in the causal graph satisfies the Markov condition, we can decompose the conditional probability \(p^{do(\tau_u)}(y^R \mid x)\) along this path.
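Concretely, writing the decomposition as a product (which assumes, as described above, that the adversarial data is a deterministic function of the natural data and the model parameters), the criterion factorizes as

\[
p^{do(\tau_u)}(y^R \mid x) = p^{do(\tau_u)}(y^R \mid \tilde{x}) \, p^{do(\tau_u)}(\tilde{x} \mid x), \quad u \in \{i, j\},
\]

so the style-invariant criterion can be enforced on the two factors defined next.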
The conditional probability \(p^{do(\tau_u)}(y^R \mid \tilde{x})\) for \(u \in \{i,j\}\) is calculated as the cosine similarity between the original data \(x\) and the adversarial data \(\tilde{x}^u\) normalized by the softmax function:
Note that \(y^R\) is only decided by the content factor \(c\). Empirically, the content factor \(c\) can be approximated by the original data \(x\) from the datasets.
The conditional probability \(p^{do(\tau_u)}(\tilde{x} \mid x)\) for \(u \in \{i,j\}\) is calculated as the cosine similarity between the natural data \(x^u\) and the adversarial data \(\tilde{x}^u\) normalized by the softmax function:
in which \(\epsilon \geq 0\) is the adversarial budget, \(B\) is a mini-batch, and \(\mathrm{KL}(p(x) \| q(x); B) = \sum_{x \in B} p(x) \log \frac{p(x)}{q(x)}\) denotes the Kullback–Leibler (KL) divergence.
We provide an illustration of AIR for ACL. AIR aims to maximize the agreement between the original data and the adversarial views (the dashed yellow lines) and the agreement between the natural views and the adversarial views (the dashed pink lines).
Intuitively, AIR aims to maximize the agreement among different natural views, different adversarial views, and the original data.
Learning objective of AIR-enhanced ACL.
The learning objective of AIR is formulated as follows:
Click here to see the PyTorch code for calculating the AIR loss. You can copy and paste it to conveniently calculate the AIR loss.
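Below is a rough sketch of the style-invariance regularizer described above: the two conditional probabilities are approximated by softmax-normalized cosine similarities over the mini-batch, and the KL divergence between the versions obtained under the two augmentations (interventions) is penalized. It is not the repository's implementation; the temperature, the weight `lambda_air`, the KL direction, and the exact way these terms are combined with the ACL loss are assumptions, so please refer to the AIR paper and code for the precise objective.

```python
import torch
import torch.nn.functional as F

def batch_softmax_similarity(a, b, temperature=0.5):
    """Row-wise softmax over cosine similarities between two feature batches.

    Row k approximates a conditional probability distribution over the mini-batch,
    as described in the text above.
    """
    a = F.normalize(a, dim=1)
    b = F.normalize(b, dim=1)
    return F.softmax(a @ b.t() / temperature, dim=1)

def air_regularizer(h_orig, h_nat_i, h_nat_j, h_adv_i, h_adv_j, lambda_air=0.5):
    """A sketch of an AIR-style invariance penalty.

    h_orig:    projected features of the original (untransformed) data x
    h_nat_i/j: projected features of the two natural views
    h_adv_i/j: projected features of the two adversarial views
    """
    # p^{do(tau_u)}(y^R | x~): similarity between the original data and the adversarial views.
    p_label_i = batch_softmax_similarity(h_orig, h_adv_i)
    p_label_j = batch_softmax_similarity(h_orig, h_adv_j)

    # p^{do(tau_u)}(x~ | x): similarity between the natural views and the adversarial views.
    p_adv_i = batch_softmax_similarity(h_nat_i, h_adv_i)
    p_adv_j = batch_softmax_similarity(h_nat_j, h_adv_j)

    # KL between the distributions obtained under the two interventions, summed over the
    # batch (matching the summed KL in the text); the direction is a convention of this sketch.
    kl_label = F.kl_div(p_label_j.log(), p_label_i, reduction="sum")
    kl_adv = F.kl_div(p_adv_j.log(), p_adv_i, reduction="sum")
    return lambda_air * (kl_label + kl_adv)
```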
Besides, you can use the following script to conduct robust self-supervised pre-training via AIR using ResNet-18 on CIFAR-10:
Empirical Results
AIR yields state-of-the-art cross-task robustness transferability against adversarial attacks.
\(\mathcal{D}_1 \rightarrow \mathcal{D}_2\) means that the model is pre-trained on dataset \(\mathcal{D}_1\) and fine-tuned on the downstream dataset \(\mathcal{D}_2\).
SA refers to the standard accuracy, calculated as the average accuracy on the natural test data in the downstream dataset \(\mathcal{D}_2\).
AA refers to the robust accuracy calculated as the average accuracy on the adversarial test data generated via adversarial attacks in the downstream dataset \(\mathcal{D}_2\).
AIR yields state-of-the-art cross-task robustness transferability against common corruptions.
CS-# refers to the average accuracy evaluated on the test data under common corruptions with corruption severity (CS) of # \(\in\) {1,3,5} in the downstream dataset \(\mathcal{D}_2\).
To reproduce the above results of the transferability from CIFAR-10 to CIFAR-100, you can use the following scripts.
At the pre-training stage, you can conduct AIR using ResNet-18 on CIFAR-10.
At the fine-tuning stage, you can fine-tune the pre-trained ResNet-18 to the downstream task CIFAR-100. The following script automatically conducts all three fine-tuning modes (i.e., SLF, ALF, and AFF). After the fine-tuning stage, you can check the standard accuracy as well as the robust accuracy under adversarial attacks and common corruptions for each fine-tuning mode in the log file at $FINETUNE_DIR/results/log.txt.
Robust Self-Supervised Learning (RobustSSL) Benchmark
The website of the RobustSSL Benchmark is at https://robustssl.github.io/.
A screenshot of the leaderboard shown in RobustSSL Benchmark.
Efficient ACL via Robustness-Aware Coreset Selection (RCS)
Here, we introduce the NeurIPS 2023 spotlight paper which proposes Robustness-Aware Coreset Selection (RCS), a method that selects an informative coreset without label annotations to speed up ACL. Theoretically, Xu et al. (2023) show that a greedy search algorithm can efficiently find the coreset. Empirically, RCS can speed up both ACL and supervised robust pre-training by a large margin on the CIFAR and ImageNet-1K datasets without significantly hurting robustness transferability. This paper provides, for the first time, a proof of concept for applying ACL to large-scale datasets.
Motivation—ACL is Inefficient
ACL is computationally prohibitive on large-scale datasets since generating adversarial data requires expensive computational overheads.
Empirically, ACL on the entire ImageNet-1K dataset (1,281,167 training data points) requires about 650 hours on RTX A5000 GPUs. Due to this inefficiency, ACL had not been applied to the ImageNet-1K dataset before RCS.
ACL is inefficient because $T$ PGD steps require expensive computational overheads.
The Methodology of RCS
Intuition of RCS.
To speed up ACL, RCS uses an intuitive idea: find an informative training subset (called a “coreset”). The coreset directly decreases the number of training samples, thus significantly accelerating ACL. Besides, since the coreset is informative, i.e., beneficial for improving \(f\)’s adversarial robustness, training on it should still enable ACL to output an effective robust foundation model.
RCS generates an informative coreset to make ACL efficiently obtain an effective robust foundation model. Image from https://medium.com/analytics-vidhya/sampling-statistical-approach-in-machine-learning-4903c40ebf86.
Representational Distance (RD) as a measurement of \(f\)’s adversarial robustness without labels.
RD of a data point \(\ell_\mathrm{RD}(x;\theta)\) is quantified by the representational distance between the natural data and its adversarial counterpart, i.e.,
in which the PGD method is used to generate the adversarial data \(\tilde{x}\) within the \(\epsilon\)-ball centered at \(x\), and \(d(\cdot, \cdot): \mathcal{V} \times \mathcal{V} \rightarrow \mathbb{R}\) is a distance function, such as the KL divergence. The smaller the RD, the less sensitive the representations are to adversarial perturbations, and thus the more adversarially robust they are.
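In symbols, and reusing \(h_\theta = g \circ f_\theta\) from the CL section (using the projected representation here is an assumption of this sketch), the RD of a data point can be written as

\[
\ell_\mathrm{RD}(x; \theta) = d\left( h_\theta(\tilde{x}), h_\theta(x) \right), \qquad \tilde{x} = \mathop{\arg\max}_{x^\prime \in \mathcal{B}_\epsilon[x]} d\left( h_\theta(x^\prime), h_\theta(x) \right),
\]

where the inner maximization is approximately solved by PGD.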
Objective function of RCS.
To realize the intuitive idea, RCS is formulated as follows:
in which \(S^*\) is the coreset, \(U\) is an unlabeled validation set, \(k \in (0,1]\) is the subset fraction that controls the size of the coreset, \(\mathcal{L}_{\mathrm{RD}}(U; \theta(S)) = \sum_{x \in U} \ell_\mathrm{RD}(x; \theta(S))\), and \(\mathcal{L}_\mathrm{ACL}(S; \theta) = \sum_{x \in S} \ell_\mathrm{ACL}(x; \theta)\).
Intuitively, given a coreset \(S^*\), after the model parameters are updated to \(\theta(S^{*})\) by minimizing the ACL loss on the coreset, \(\mathcal{L}_\mathrm{ACL}(S^*; \theta)\), the model achieves a minimized RD loss on the validation dataset, \(\mathcal{L}_{\mathrm{RD}}(U; \theta(S^*))\), and is thus adversarially robust.
Then, RCS can be converted into a problem of maximizing a set function subject to a cardinality constraint as follows:
where \(G:2^\mathcal{X} \rightarrow \mathbb{R}\) is a set function, \(\theta(S)\) is estimated using the one-step approximation and \(\eta \in \mathbb{R}^+\) is the learning rate.
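Putting these pieces together, one way to write the resulting problem (the sign convention of \(G\) and the form of the cardinality constraint are assumptions of this sketch) is

\[
S^* = \mathop{\arg\max}_{S \subseteq X, \, \lvert S \rvert / \lvert X \rvert \leq k} G_\theta(S), \qquad G_\theta(S) \triangleq - \mathcal{L}_\mathrm{RD}\left( U; \, \theta - \eta \nabla_\theta \mathcal{L}_\mathrm{ACL}(S; \theta) \right),
\]

so maximizing \(G_\theta(S)\) amounts to minimizing the RD loss on the validation set after a single gradient step on the candidate coreset.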
RCS via Greedy Search.
The vanilla solution of traversing all subsets and selecting the subset that has the largest \(G_\theta(S)\) is intractable. Xu et al. (2023) show that the set function \(G_\theta(S)\) satisfies the following two critical properties, which motivates a greedy search to efficiently search for the coreset.
The set function \(G_\theta(S)\) is proved to be submodular (strictly speaking, the authors of RCS rigorously proved that a proxy set function is weakly submodular and that, based on this proxy, the greedy search algorithm provides a guaranteed lower bound for the proposed set-function maximization problem; for more details, please refer to the RCS paper). Submodularity means \(G_\theta(S)\) satisfies the following two properties:
Monotonicity: As more data is added to the set, the representation becomes better. Formally, \(G_\theta(x \mid S) = G_\theta(S \cup \{x\}) - G_\theta(S) \geq 0\) for any \(S \subseteq X\) and \(x \in X \setminus S\).
Diminishing returns: As the set has more data, the marginal gain of extra data for learning representations gradually diminishes. Formally, \(G_\theta(x \mid A) \geq G_\theta(x \mid B)\) for any \(A \subseteq B \subseteq X\) and \(x \in X \setminus B\).
Therefore, RCS greedily searches for the data \(x\) that has the largest marginal gain and adds it to the coreset.
Pseudo-code of efficient ACL via RCS.
Step 1 (Warm-up): Warm up training on the entire training set to find a better starting point \(f_\theta\).
Step 2.1 (RCS): \(S \gets\emptyset\). \(\theta' \gets \theta\). Compute gradients \(Q \gets \{ q_k = \nabla_\theta \mathcal{L}_\mathrm{ACL}(x_k; \theta) \mid \forall x_k \in X \}\) on the unlabeled training dataset \(X\).
Step 2.2 (RCS): Greedily select the data whose gradient \(q_k\) yields the largest marginal gain, i.e., is most aligned with the validation gradient \(\nabla_\theta \mathcal{L}_\mathrm{RD}(U; \theta')\); add it to the coreset \(S\) and update \(\theta' \gets \theta' - \eta q_k\) via the one-step approximation. Repeat until the size of \(S\) reaches the budget set by the subset fraction \(k\).
Step 3 (ACL training): Update the model parameters \(\theta\) by minimizing the adversarial contrastive loss \(\mathcal{L}_\mathrm{ACL}(S; \theta)\) on the coreset \(S\).
Step 4: Every few epochs, go to Step 2.1 to generate a new coreset; otherwise, go to Step 3 to update the model parameters. The algorithm stops when it reaches the final training epoch.
A pipeline of efficient ACL via RCS. After the warm-up periods, the model is trained on the coreset. Thus, RCS makes the training procedure much more efficient by decreasing the number of training data.
Intuitively, RCS greedily selects and adds into the coreset the data \(x\) whose training loss gradient (i.e., \(\nabla_\theta\mathcal{L}_\mathrm{ACL}(\{x\}; \theta)\)) and validation loss gradient (i.e., \(\nabla_\theta\mathcal{L}_\mathrm{RD}(U; \theta(S))\)) are most similar. In this way, training on the data selected by RCS is most beneficial in optimizing the RD loss, which is thus most helpful for improving \(f\)’s adversarial robustness.
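The following is a highly simplified sketch of this greedy, gradient-matching selection rule: it scores each candidate by the inner product between its pre-computed training-loss gradient and the current validation RD gradient, and applies the one-step parameter update described above after each selection. Batch-wise selection, last-layer gradient approximations, and other efficiency tricks used in practice are omitted, and all function and variable names are illustrative rather than the repository's API.

```python
import torch

def greedy_coreset_selection(train_grads, rd_grad_fn, theta, budget, eta=0.01):
    """Greedy robustness-aware coreset selection (simplified sketch).

    train_grads: dict mapping example index -> flattened gradient of the ACL
        loss at the current parameters (the set Q in Step 2.1).
    rd_grad_fn:  callable returning the flattened gradient of the RD loss on
        the unlabeled validation set U at the given parameters.
    theta:       flattened current model parameters.
    budget:      number of examples to select (determined by the fraction k).
    eta:         learning rate of the one-step approximation.
    """
    selected = []
    remaining = set(train_grads.keys())
    theta_prime = theta.clone()

    while len(selected) < budget and remaining:
        # Gradient of the validation RD loss at the current "virtual" parameters.
        v = rd_grad_fn(theta_prime)

        # The marginal gain of each candidate is approximated by the alignment
        # between its training gradient and the validation gradient.
        best = max(remaining, key=lambda idx: torch.dot(train_grads[idx], v).item())

        selected.append(best)
        remaining.remove(best)

        # One-step approximation: pretend we trained on the selected example.
        theta_prime = theta_prime - eta * train_grads[best]

    return selected
```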
The term speed-up ratio refers to the ratio of the time consumption of pre-training on the entire training set to the time consumption of pre-training on the training subset. Thus, the larger the speed-up ratio, the more efficient the pre-training procedure.
The terms standard test accuracy and robust test accuracy refer to the average accuracy evaluated on natural test data and adversarial test data, respectively. Thus, the higher the line is, the more effective the pre-training method is.
The results obtained by RCS are located in the upper-right corner, which indicates that RCS makes pre-training both more efficient and more effective.
To reproduce the above results of the robustness transferability from CIFAR-10 to CIFAR-100, you can use the following scripts.
At the pre-training stage, you can conduct ACL via RCS using ResNet-18 on CIFAR-10.
At the fine-tuning stage, you can fine-tune the pre-trained ResNet-18 on CIFAR-100. The test accuracies are saved in $FINETUNE_DIR/results/log.txt.
For the first time, ACL was conducted efficiently on ImageNet-1K via RCS. The results demonstrate the feasibility of applying ACL to large-scale datasets. Here, SA refers to the standard test accuracy and RA refers to the robust test accuracy.
To reproduce the above results of the robustness transferability from ImageNet-1K to CIFAR-10, you can use the following scripts.
At the pre-training stage, you can conduct ACL via RCS using a Wide ResNet with width 10 and depth 28 (WRN-28-10) on ImageNet-1K at \(32 \times 32\) resolution.
At the fine-tuning stage, you can fine-tune the ImageNet-1K pre-trained models on CIFAR-10.
RCS can speed up Standard Adversarial Training (SAT) on ImageNet-1K. The results show that RCS is applicable to robust pre-training in the supervised setting.
To reproduce the above results of the robustness transferability from ImageNet-1K to CIFAR-10, you can use the following scripts.
At the pre-training stage, you can conduct SAT using WRN-28-10 on ImageNet-1K at \(32 \times 32\) resolution.
At the fine-tuning stage, you can fine-tune ImageNet-1K pre-trained WRN-28-10 on CIFAR-10.
For attribution in academic contexts, please cite this work as