-
Masked Language Model with ALiBi and CLAP head
As a new approach to positional encoding, Attention with Linear Biases (ALiBi) encodes positional information by adding linear biases to the attention weights, with the capability of extrapolating to longer context lengths. In their paper, however, Press et al. focus on the perplexity of autoregressive decoder-only language models, leaving open the question of downstream tasks and the applicability of ALiBi to encoder attention. In this blog post, we attempt to bridge that gap by testing masked language models (MLMs) with encoder-attention ALiBi and a prediction head similar to those of the original ALiBi models. We find that while a simplified prediction head may be beneficial, the performance of MLMs with encoder-attention ALiBi at a sequence length of 2048 starts to deteriorate at larger scales. We put our results in the context of related recent experiments and tentatively identify the circumstances that are more challenging for positional encoding designs. Finally, we open-source our MLMs, which achieve BERT-level performance at a context length of 2048.
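For readers unfamiliar with the mechanism, below is a minimal, illustrative sketch of how ALiBi-style linear biases can be added to attention scores; it is not the implementation used in the post. The geometric slopes follow Press et al., while the symmetric |i - j| distance for bidirectional encoder attention is one possible adaptation and an assumption of this sketch.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """ALiBi-style linear attention biases (sketch).

    Slopes follow the geometric sequence from Press et al. (power-of-two head
    counts); the symmetric |i - j| distance for encoder (bidirectional)
    attention is an assumption of this sketch.
    """
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    # Symmetric distance matrix |i - j| for bidirectional attention.
    distance = (positions[None, :] - positions[:, None]).abs()
    # (heads, seq, seq): each head's distances are scaled by its slope and negated.
    return -slopes[:, None, None] * distance[None, :, :]

# Usage: add the bias to the raw attention scores before the softmax.
scores = torch.randn(8, 128, 128)                       # (heads, query, key)
attn = (scores + alibi_bias(num_heads=8, seq_len=128)).softmax(dim=-1)
```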
-
On Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood
Bayesian model selection has long relied on the marginal likelihood and related quantities, often motivated by the principle of Occam's razor. Following the paper 'Bayesian Model Selection, the Marginal Likelihood, and Generalization' by Lotfi et al. (2022), this blog post critically examines the conventional focus on the marginal likelihood and related quantities for Bayesian model selection as a direct consequence of Occam's razor. We find that the suitability of these criteria depends on the specific context and goals of the modeling task. We revisit the log marginal likelihood (LML), cross-validation, and the recently introduced conditional log marginal likelihood (CLML), highlighting their connections and differences through an information-theoretic lens. Through thought experiments and empirical observations, we explore the behavior of these model selection criteria in different data regimes under model misspecification and prior-data conflict, finding that the conditional marginal cross-entropy, which is closely related to cross-validation, is often more reliable for optimizing generalization performance. We review relevant literature, compare the CLML and validation loss for deep neural networks, and, using a toy Bayesian linear regression, demonstrate that all of the discussed quantities can fail to reliably predict generalization. Our takeaways are that there is no one-size-fits-all solution; that the choice of model selection quantity depends on the specific context and goals; and that, in the future, we should also take model complexity into account rather than assume a uniform model prior. While this work leaves scope for more rigorous theoretical justification and wider-ranging empirical investigation (along with deeper engagement with the philosophical implications), it nevertheless provides grounds for questioning the primacy of the (conditional) log marginal likelihood and encourages critical thinking about its foundations, aiming for a more nuanced understanding of Bayesian model selection.
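As a quick reference for the quantities compared throughout the post, one standard way to write the LML and the CLML via the chain rule is sketched below; the notation (data D = {(x_i, y_i)} for i = 1..N, split index m) is ours, and the CLML of Lotfi et al. additionally averages over orderings/splits.

```latex
% Log marginal likelihood (LML), decomposed by the chain rule:
\log p(\mathcal{D} \mid \mathcal{M})
  = \log \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \mathcal{M})\, d\theta
  = \sum_{i=1}^{N} \log p(y_i \mid x_i, \mathcal{D}_{<i}, \mathcal{M})

% Conditional log marginal likelihood (CLML): condition on the first m-1 points
% and score only the remainder, discarding the small-data terms of the sum:
\log p(\mathcal{D}_{\ge m} \mid \mathcal{D}_{<m}, \mathcal{M})
  = \sum_{i=m}^{N} \log p(y_i \mid x_i, \mathcal{D}_{<i}, \mathcal{M})
```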
-
RLHF without RL - Direct Preference Optimization
We discuss the RL part of RLHF and its recent displacement by direct preference optimization (DPO). With DPO, a language model can be aligned with human preferences without sampling from the language model during training, which significantly simplifies the training process. By now, DPO has been implemented in many projects and seems to be here to stay.
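As a reference for the mechanism, here is a minimal sketch of the DPO objective on a batch of preference pairs; variable names and the beta value are illustrative rather than taken from any particular implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on a batch of preference pairs (sketch).

    Inputs are the summed log-probabilities of the chosen / rejected responses
    under the policy and the frozen reference model; beta controls the implicit
    KL strength.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen and rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```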
-
The Hidden Convex Optimization Landscape of Two-Layer ReLU Networks
In this article, we delve into the research paper titled 'The Hidden Convex Optimization Landscape of Regularized Two-Layer ReLU Networks'. We focus on the significance of this study and evaluate its relevance in the current landscape of machine learning theory. The paper describes how solving a convex problem can directly yield the solution to the highly non-convex problem of optimizing a two-layer ReLU network. After building intuition for the proof through a few examples, we examine the limits of this model, as we may not yet be able to discard the non-convex problem.
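For reference, here is a sketch, in our own notation rather than the paper's exact statement, of the two objectives involved: the regularized non-convex two-layer ReLU training problem and the equivalent convex (group-lasso) program, where the diagonal matrices D_i enumerate the finitely many ReLU activation patterns of the data.

```latex
% Non-convex training problem for a two-layer ReLU network with m hidden neurons:
\min_{W_1 \in \mathbb{R}^{d \times m},\, w_2 \in \mathbb{R}^{m}}
  \Big\| \sum_{j=1}^{m} (X w_{1j})_+ \, w_{2j} - y \Big\|_2^2
  + \lambda \sum_{j=1}^{m} \big( \|w_{1j}\|_2^2 + w_{2j}^2 \big)

% Equivalent convex program over activation patterns D_i = diag(1[X h_i \ge 0]):
\min_{\{v_i, u_i\}}
  \Big\| \sum_{i} D_i X (v_i - u_i) - y \Big\|_2^2
  + \lambda \sum_{i} \big( \|v_i\|_2 + \|u_i\|_2 \big)
\quad \text{s.t.} \quad (2D_i - I) X v_i \ge 0, \; (2D_i - I) X u_i \ge 0
```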
-
The N Implementation Details of RLHF with PPO
Reinforcement Learning from Human Feedback (RLHF) is pivotal in the modern application of language modeling, as exemplified by ChatGPT. This blog post offers an in-depth exploration of RLHF, attempting to reproduce the results of OpenAI's inaugural RLHF paper, published in 2019. Our detailed examination provides valuable insights into the implementation details of RLHF, which often go unnoticed.
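One such detail is how the per-token reward fed to PPO is shaped. The sketch below illustrates the common recipe of a per-token KL penalty against a frozen reference model, with the scalar reward-model score added on the final response token; shapes, names, and the coefficient are illustrative assumptions rather than the post's exact code.

```python
def kl_shaped_rewards(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Per-token rewards used in PPO-based RLHF (sketch).

    `policy_logprobs` and `ref_logprobs` are 1-D tensors of per-token
    log-probabilities of the sampled response under the policy and the frozen
    reference model; `rm_score` is the scalar reward-model score.
    """
    rewards = -kl_coef * (policy_logprobs - ref_logprobs)  # per-token KL penalty
    rewards[-1] = rewards[-1] + rm_score                    # RM score on the last token
    return rewards
```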
-
Towards Robust Foundation Models: Adversarial Contrastive Learning
Foundation models pre-trained on large-scale unlabelled datasets using self-supervision can generalize to a wide range of downstream tasks. Existing work has shown that adversarial attacks can effectively fool any downstream model fine-tuned from a pre-trained foundation model. The existence of such attacks necessitates robust foundation models that provide both standard generalization and adversarial robustness for safety-critical downstream tasks. Currently, adversarial contrastive learning (ACL) is one of the most effective methods for producing a robust foundation model: it combines contrastive learning with adversarially perturbed data to learn robust representations without requiring costly annotations. In this blog, we introduce two NeurIPS 2023 publications that enhance ACL's efficacy and efficiency, respectively. (1) Adversarial Invariant Regularization (AIR), a state-of-the-art ACL algorithm: a causal theoretical framework is built to interpret ACL, and the AIR regularizer is then derived from this framework to improve ACL. (2) Robustness-aware Coreset Selection (RCS), a method to speed up ACL: RCS does not require label information and searches for an informative training subset that maintains adversarial robustness. For the first time, RCS enables the application of ACL to the large-scale ImageNet-1K dataset.
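To make the setting concrete, the sketch below shows one way adversarial data can be generated for contrastive learning: a small PGD loop that perturbs one augmented view to maximize the contrastive loss against the other view. The encoder, loss function, and attack hyperparameters are placeholders, not the algorithms from the two papers above.

```python
import torch

def pgd_attack_contrastive(encoder, contrastive_loss, x1, x2,
                           eps=8 / 255, alpha=2 / 255, steps=5):
    """Generate an adversarial view for adversarial contrastive learning (sketch)."""
    delta = torch.zeros_like(x1).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = contrastive_loss(encoder(x1 + delta), encoder(x2))
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()   # ascend the contrastive loss
            delta.clamp_(-eps, eps)        # project back into the L-infinity ball
    return (x1 + delta).detach()
```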
-
Understanding gradient inversion attacks from the prior knowledge perspective
In this blog post, we survey multiple works on gradient inversion attacks (GIAs), point out the challenges that remain to be solved, and offer a prior-knowledge perspective for understanding the logic behind recent papers.
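To fix ideas, here is a minimal sketch of the gradient-matching formulation that most GIAs build on, in the spirit of 'Deep Leakage from Gradients': optimize a dummy input so that its gradients match the gradients shared by a client. Names and hyperparameters are illustrative; prior-knowledge-based attacks add regularizers (e.g., total variation or generative priors) on the dummy input.

```python
import torch

def gradient_inversion(model, loss_fn, target_grads, x_shape, y_dummy,
                       steps=1000, lr=0.1):
    """Basic gradient-matching attack (sketch)."""
    x_dummy = torch.randn(x_shape, requires_grad=True)
    opt = torch.optim.Adam([x_dummy], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        dummy_grads = torch.autograd.grad(
            loss_fn(model(x_dummy), y_dummy), model.parameters(), create_graph=True)
        # Distance between the dummy gradients and the client's shared gradients:
        # this is the objective that gradient inversion attacks minimize.
        grad_diff = sum(((dg - tg) ** 2).sum()
                        for dg, tg in zip(dummy_grads, target_grads))
        grad_diff.backward()
        opt.step()
    return x_dummy.detach()
```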
-
Understanding in-context learning in transformers
We propose a technical exploration of In-Context Learning (ICL) for linear regression tasks in transformer architectures. Focusing on the article 'Transformers Learn In-Context by Gradient Descent' by J. von Oswald et al., published at ICML 2023, we provide detailed explanations and illustrations of the mechanisms involved. We also contribute novel analyses of ICL, discuss recent developments, and point to open questions in this area of research.
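For concreteness, the sketch below shows how in-context linear regression tasks are typically constructed and what a single gradient-descent step from zero initialization on the in-context examples would predict for the query, which is the kind of construction von Oswald et al. relate to linear self-attention; shapes and the learning rate are illustrative assumptions.

```python
import torch

def icl_linear_regression_batch(n_tasks=32, n_points=10, dim=5, noise=0.0):
    """Construct in-context linear regression tasks (sketch).

    Each task has its own weight vector w; the model sees (x_i, y_i) pairs as
    context plus a query x and must predict w^T x for the query.
    """
    w = torch.randn(n_tasks, dim)
    x = torch.randn(n_tasks, n_points + 1, dim)            # last point is the query
    y = torch.einsum('tpd,td->tp', x, w) + noise * torch.randn(n_tasks, n_points + 1)
    return x[:, :-1], y[:, :-1], x[:, -1], y[:, -1]

def one_step_gd_prediction(context_x, context_y, query_x, lr=0.1):
    """Prediction after one gradient-descent step on the in-context squared loss,
    starting from w0 = 0."""
    # Gradient of 0.5 * sum_i (w^T x_i - y_i)^2 at w = 0 is -sum_i y_i x_i,
    # so the updated weights are w1 = lr * sum_i y_i x_i.
    w1 = lr * torch.einsum('tp,tpd->td', context_y, context_x)
    return (w1 * query_x).sum(-1)
```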
-
Unraveling The Impact of Training Samples
How do we quantify the influence of individual training samples? Recent work on data attribution methods sheds light on this problem. In this blog post, we introduce data attribution methods that leverage robust statistics and surrogate functions, and present their applications, such as distinguishing how learning algorithms differ in feature selection, detecting data leakage, and assessing model robustness.
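As a baseline for what these methods approximate, here is a brute-force leave-one-out sketch: retrain without one training sample and measure the change in a test example's loss. The classifier, helper names, and loss choice are hypothetical stand-ins, not a specific method from the post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def loo_influence(X_train, y_train, x_test, y_test, fit_fn):
    """Brute-force leave-one-out influence of each training sample (sketch)."""
    def test_loss(model):
        # Negative log-likelihood of the true test label.
        return -np.log(model.predict_proba(x_test.reshape(1, -1))[0, y_test])

    base = test_loss(fit_fn(X_train, y_train))
    scores = []
    for i in range(len(X_train)):
        mask = np.arange(len(X_train)) != i
        scores.append(test_loss(fit_fn(X_train[mask], y_train[mask])) - base)
    # Positive score: removing sample i increases the test loss (a helpful sample).
    return np.array(scores)

# Example surrogate model to attribute (hypothetical choice):
fit_fn = lambda X, y: LogisticRegression(max_iter=1000).fit(X, y)
```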
-
What exactly has TabPFN learned to do?
TabPFN [Hollmann et al., 2023], a Transformer model pretrained to perform in-context learning on fresh tabular classification problems, was presented at the last ICLR conference. To better understand its behavior, we treat it as a black-box function approximator generator and observe its generated function approximations on a varied selection of training datasets. Exploring its learned inductive biases in this manner, we observe behavior that is at turns either brilliant or baffling. We conclude this post with thoughts on how these results might inform the development, evaluation, and application of prior-data fitted networks (PFNs) in the future.
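Below is a minimal sketch of one way to probe TabPFN as a black-box function approximator, assuming the public tabpfn package and its scikit-learn-style interface; the synthetic one-feature dataset is purely illustrative.

```python
import numpy as np
from tabpfn import TabPFNClassifier  # assumes the public `tabpfn` package

# A tiny synthetic "training set": one feature, two classes split at x = 0.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 1))
y_train = (X_train[:, 0] > 0).astype(int)

# Fitting stores the data; prediction is a single in-context forward pass.
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

# Probe the resulting function approximation on a dense grid of inputs.
grid = np.linspace(-3, 3, 200).reshape(-1, 1)
probs = clf.predict_proba(grid)[:, 1]
print(probs[:5])
```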