
A New Alchemy: Language Model Development as a Subfield?
This blog post makes the case that the body of research on language models become sufficiently large and mature that we can start thinking about “language model development” as a new subfield. To support this claim, we sketch out the focuses and methodologies of this new subfield. In addition, we provide some personal reflections on what to do when your field of study gives birth to a new one.

Behavioral Differences in ModeSwitching Exploration for Reinforcement Learning
In 2022, researchers from Google DeepMind presented an initial study on modeswitching exploration, by which an agent separates its exploitation and exploration actions more coarsely throughout an episode by intermittently and significantly changing its behavior policy. We supplement their work in this blog post by showcasing some observed behavioral differences between modeswitching and monolithic exploration on the Atari suite and presenting illustrative examples of its benefits. This work aids practitioners and researchers by providing practical guidance and eliciting future research directions in modeswitching exploration.

Bridging the Data Processing Inequality and FunctionSpace Variational Inference
This blog post explores the interplay between the Data Processing Inequality (DPI), a cornerstone concept in information theory, and FunctionSpace Variational Inference (FSVI) within the context of Bayesian deep learning. The DPI governs the transformation and flow of information through stochastic processes, and its unique connection to FSVI is employed to highlight FSVI's focus on Bayesian predictive posteriors over parameter space. The post examines various forms of the DPI, including the KL divergence based DPI, and provides intuitive examples and detailed proofs. It also explores the equality case of the DPI to gain a deeper understanding. The connection between DPI and FSVI is then established, showing how FSVI can measure a predictive divergence independent of parameter symmetries. The post relates FSVI to knowledge distillation and label entropy regularization, highlighting the practical relevance of the theoretical concepts. Throughout the post, theoretical concepts are intertwined with intuitive explanations and mathematical rigor, offering a comprehensive understanding of these complex topics. By examining these concepts in depth, the post provides valuable insights for both theory and practice in machine learning.

Building Diffusion Model's theory from ground up
Diffusion Models, a new generative model family, have taken the world by storm after the seminal paper by Ho et al. [2020]. While diffusion models are often described as a probabilistic Markov Chains, their underlying principle is based on the decadeold theory of Stochastic Differential Equations (SDE), as found out later by Song et al. [2021]. In this article, we will go back and revisit the 'fundamental ingredients' behind the SDE formulation and show how the idea can be 'shaped' to get to the modern form of Scorebased Diffusion Models. We'll start from the very definition of the 'score', how it was used in the context of generative modeling, how we achieve the necessary theoretical guarantees and how the critical design choices were made to finally arrive at the more 'principled' framework of Scorebased Diffusion. Throughout this article, we provide several intuitive illustrations for ease of understanding.

Deep Equilibrium Models For Algorithmic Reasoning
In this blogpost we discuss the idea of teaching neural networks to reach fixed points when reasoning. Specifically, on the algorithmic reasoning benchmark CLRS the current neural networks are told the number of reasoning steps they need, which they shouldn't be given. While a quick fix is to add a termination network that predicts when to stop, a much more salient inductive bias is that the neural network shouldn't change its answer any further once the answer is correct, i.e. it should reach a fixed point. This is supported by denotational semantics, which tells us that while loops that terminate are the minimum fixed points of a function. We implement this idea with the help of deep equilibrium models and discuss several hurdles one encounters along the way. We show on several algorithms from the CLRS benchmark the partial success of this approach and the difficulty in making it work robustly across all algorithms.

Double Descent Demystified
Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle

Elaborating on the Value of Flow Matching for Density Estimation
The transfer of matchingbased training from Diffusion Models to Normalizing Flows allows to fit expressive continuous normalizing flows efficiently and therefore enables their usage for different kinds of density estimation tasks. One particularly interesting task is SimulationBased Inference, where Flow Matching enabled several improvements. The post shall focus on the discussion of Flow Matching for Continuous Normalizing Flows. To highlight the relevance and the practicality of the method, their use and advantages for SimulationBased Inference is elaborated.

Exploring Metalearned Curiosity Algorithms
This blog post delves into Alet et al.'s ICLR 2020 paper, Metalearning curiosity algorithms, which introduces a unique approach to metalearning curiosity algorithms. Instead of metalearning neural network weights, the focus is on metalearning pieces of code, allowing it to be interpretable by humans. The post explores the two metalearned algorithms, namely Fast Action Space Transition (FAST) and CycleConsistency Intrinsic Motivation (CCIM).

Fair ModelBased Reinforcement Learning Comparisons with Explicit and Consistent Update Frequency
Implicit update frequencies can introduce ambiguity in the interpretation of modelbased reinforcement learning benchmarks, obscuring the real objective of the evaluation. While the update frequency can sometimes be optimized to improve performance, realworld applications often impose constraints, allowing updates only between deployments on the actual system. This blog post emphasizes the need for evaluations using consistent update frequencies across different algorithms to provide researchers and practitioners with clearer comparisons under realistic constraints.

Fairness in AI: two philosophies or just one?
The topic of fairness in AI has garnered more attention over the last year, recently with the arrival of the EU's AI Act. This goal of achieving fairness in AI is often done in one of two ways, namely through counterfactual fairness or through group fairness. These research strands originate from two vastly differing ideologies. However, with the use of causal graphs, it is possible to show that they are related and even that satisfying a fairness group measure means satisfying counterfactual fairness.

How to compute Hessianvector products?
The product between the Hessian of a function and a vector, the Hessianvector product (HVP), is a fundamental quantity to study the variation of a function. It is ubiquitous in traditional optimization and machine learning. However, the computation of HVPs is often considered prohibitive in the context of deep learning, driving practitioners to use proxy quantities to evaluate the loss geometry. Standard automatic differentiation theory predicts that the computational complexity of an HVP is of the same order of magnitude as the complexity of computing a gradient. The goal of this blog post is to provide a practical counterpart to this theoretical result, showing that modern automatic differentiation frameworks, JAX and PyTorch, allow for efficient computation of these HVPs in standard deep learning cost functions.

It's Time to Move On: Primacy Bias and Why It Helps to Forget
'The Primacy Bias in Deep Reinforcement Learning' demonstrates how the first experiences of a deep learning model can cause catastrophic memorization and how this can be prevented. In this post we describe primacy bias, summarize the authors' key findings, and present a simple environment to experiment with primacy bias.