We discuss the RL part of RLHF and its recent displacement by direct preference optimization (DPO). With DPO, a language model can be aligned with human preferences without sampling from the model during training, which significantly simplifies the training process. By now, DPO has been implemented in many projects and seems to be here to stay.
Reinforcement learning from human feedback (RLHF) is an important technique for aligning (large) language models (LMs) with human preferences. It was introduced by Christiano et al.
RLHF for language models works roughly as follows (a schematic sketch of the resulting losses in code follows the steps):
1. Collect a dataset $\mathcal{D}$ of prompts $x$, sample several completions $y$ per prompt from the LM, and have human annotators rank the completions of each prompt.
2. Train a parameterized reward function $r_\phi$ (mapping pairs $(x,y)$ to scalars) on the collected preferences by minimizing the loss
\[\mathcal{L}(r) = - \mathbb{E}_{(x,\, y_{rank_1}, \dots,\, y_{rank_N}) \sim \mathcal{D}} \left[ \sum_{i=1}^{N} \log \frac{e^{r(x, y_{rank_i})}}{\sum_{j=i}^{N} e^{r(x, y_{rank_j})}} \right].\]This loss is inspired by the Bradley-Terry model (more precisely, by its Plackett-Luce generalization to rankings of $N$ completions).
3. Fine-tune the LM by viewing it as a policy $\pi_\theta$ and using RL with the learned reward function $r_\phi$ as the reward. For this step, a separate dataset of prompts $\mathcal{D}_{\text{RL}}$ is used to query the LM and collect completions. Since the reward is learned on a very limited subset of possible completions, it is unreliable on off-distribution data, and it would therefore be unwise to optimize it without any regularization.
The typical choice of regularization is the KL-divergence between the policy (i.e. the aligned/fine-tuned LM) and a reference policy $\pi_{\text{ref}}$ (usually the pretrained LM before fine-tuning). The RLHF objective then becomes
\[\tag{1} \label{eq:rlhf} J(\pi) = \mathbb{E}_{x \sim \mathcal{D}_\text{RL},\, y\sim \pi(y \mid x)} \left[ r_\phi(x, y) \right] - \beta\, \mathbb{E}_{x \sim \mathcal{D}_\text{RL}} \left[ D_{\text{KL}} \left( \pi(\cdot \mid x) \,\|\, \pi_\text{ref}(\cdot \mid x) \right) \right],\]which is then maximized over the parameterized policy $\pi_\theta$ by some optimization algorithm, typically a variant of proximal policy optimization (PPO).
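To make the training steps concrete, here is a minimal sketch of the corresponding losses in PyTorch for the common case of two ranked completions per prompt. The function names and batching are illustrative assumptions; the per-sequence rewards and log-probabilities are assumed to be produced elsewhere by the reward model and the two language models, and a real PPO-based implementation would additionally involve advantage estimation, clipping, and per-token KL penalties.

```python
import torch
import torch.nn.functional as F


def reward_model_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss for training the reward model (step 2).

    r_w, r_l: scalar rewards r_phi(x, y_w) and r_phi(x, y_l) for the preferred
    and rejected completions, each of shape (batch,).
    """
    return -F.logsigmoid(r_w - r_l).mean()


def rlhf_objective_estimate(rewards: torch.Tensor,
                            logp_policy: torch.Tensor,
                            logp_ref: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    """Monte-Carlo estimate of the KL-regularized objective (1) (step 3).

    rewards:     r_phi(x, y) for completions y sampled from pi_theta, shape (batch,)
    logp_policy: log pi_theta(y | x) for those completions, shape (batch,)
    logp_ref:    log pi_ref(y | x) for the same completions, shape (batch,)

    The log-ratio is a per-sample estimate of the KL term in (1).
    """
    kl_estimate = logp_policy - logp_ref
    return (rewards - beta * kl_estimate).mean()


if __name__ == "__main__":
    batch = 4
    # Stand-in tensors; in practice they come from the reward model and the LMs.
    r_w, r_l = torch.randn(batch), torch.randn(batch)
    rewards = torch.randn(batch)
    logp_policy, logp_ref = -10 * torch.rand(batch), -10 * torch.rand(batch)
    print("reward-model loss:", reward_model_loss(r_w, r_l).item())
    print("RLHF objective estimate:",
          rlhf_objective_estimate(rewards, logp_policy, logp_ref).item())
```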
The resulting LLMs are very powerful and so widely used that we don’t need to further elaborate on their performance here. Note, however, that the RLHF scheme involves considerable complexity when it comes to actually making it work in practice.
From the beginning, RLHF has sparked some controversy. Some regarded it as one of the prime applications of reinforcement learning (which may currently be perceived as “less hot” than LLMs, so a prominent application of RL within LLM training works in RL’s favor). At the same time, others were skeptical about whether RLHF is reinforcement learning at all.
Indeed, some crucial components of RL are missing in RLHF. First, the current forms of RLHF do not involve sequential decision-making (although there is some work in that direction, e.g., the ILQL algorithm).
Even more troubling than the non-sequential nature of RLHF may be its information flow. While the policy optimization of RLHF is framed as an online RL algorithm, the environment consists of the policy itself. Usually, in online RL an agent is able to extract new information from the environment. In RLHF, however, the information is not “new” in the sense that it is not extracted from something external to the agent itself. The only information not originally contained in the LM is in the preferences data (notably, not even in the completions themselves, but only in their rankings), and it is only used to fit a reward function. Thus, RLHF is more reminiscent of offline RL or supervised learning than of online RL.
Because of this one-step nature of RLHF, and because it involves training enormous models (which is unusual for RL), the majority of RLHF software is not set up to be compatible with gym(nasium) or other environment interfaces. Take, for example, the well-known trl and trlx libraries, which barely mention environments at all. A notable exception is the RL4LMs project by AllenAI, which unfortunately seems to be abandoned and is based on the deprecated gym instead of gymnasium. For practical RLHF, training in parallel on massive datasets is a necessity, which somewhat complicates the use of standard environment and training interfaces.
The view that RLHF is not “really” RL, or at least does not have to be, has become even more popular since the publication of the DPO algorithm.
The direct preference optimization (DPO) algorithm for aligning language models (LMs) by Rafailov et al. removes the RL step entirely: the policy is optimized directly on the preference data, without first fitting a reward model.
The mathematical derivation of DPO is short and insightful. It is based on the following observations:
The RLHF objective (\ref{eq:rlhf}) has an exact (non-parametric) solution for the optimal policy $\pi_r$:
\[\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp \left( \frac{1}{\beta} r(x, y) \right).\]This expression is well known in the RL literature and is sometimes referred to as the Boltzmann policy (note that in the one-step RL setting, the Q-function is given by the reward itself).
Similar results were proved in the REPS algorithm. Here, $Z(x)$ is the partition function normalizing the distribution; it is generally intractable. Rearranging the above expression, the reward can be written as a function of its optimal policy:
\[\tag{2} \label{eq:reward-as-function-of-policy} r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x).\]
For simplicity, let us assume that only two completions are collected per input, which are then ranked as $y_w$ and $y_l$ (for winning and losing). DPO can easily be extended to the case of more completions per input, but the notation becomes more cumbersome.
The reward $r_\phi$ is then learned by minimizing the loss:
\[\mathcal{L}_\phi = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \frac{ e ^ {r_\phi(x, y_w)}}{ e^{r_\phi(x, y_w)} + e^{r_\phi(x, y_l)}} \right],\]which is equivalent to
\[\tag{3} \label{eq:reward-loss-binary} \mathcal{L}_\phi = - \mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \left[ \log \sigma \left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right],\]where $\sigma$ is the sigmoid function. Note that only differences of rewards enter (\ref{eq:reward-loss-binary}).
After plugging the expression (\ref{eq:reward-as-function-of-policy}) for the reward into the loss (\ref{eq:reward-loss-binary}), the partition function $Z(x)$ cancels out. Replacing the optimal $\pi_r$ with the parameterized $\pi_\theta$, the DPO objective is obtained as
\[\mathcal{L}_{\text{DPO}}(\pi_\theta ; \pi_{\text{ref}}) := - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right].\]Thus, instead of first learning a reward and then finding the optimizing policy, one directly finds the optimal policy such that its reward as obtained from (\ref{eq:reward-as-function-of-policy}) corresponds to collected human preferences (i.e., a reward that optimizes (\ref{eq:reward-loss-binary})). Note that while the induced reward function itself is intractable, the differences of rewards remain tractable and can be computed using the learned policy. This should be sufficient for practical purposes, where rewards are mostly used to rank completions and, e.g., perform rejection sampling.
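Since only log-probability ratios enter the loss, DPO is straightforward to implement. Below is a minimal sketch in PyTorch, assuming the summed per-token log-probabilities of each completion under the trained policy and the frozen reference model have already been computed; the function and argument names are illustrative. It also makes explicit how the implicit reward differences from (\ref{eq:reward-as-function-of-policy}) instantiate the Bradley-Terry loss (\ref{eq:reward-loss-binary}).

```python
import torch
import torch.nn.functional as F


def dpo_loss(logp_policy_w: torch.Tensor,
             logp_policy_l: torch.Tensor,
             logp_ref_w: torch.Tensor,
             logp_ref_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Binary DPO loss.

    Each argument is a batch of sequence log-probabilities log pi(y | x) for the
    winning (w) and losing (l) completions under the trained policy (policy) and
    the frozen reference model (ref), each of shape (batch,).
    """
    # Implicit rewards, up to the intractable term beta * log Z(x), which
    # cancels in the difference below: r(x, y) = beta * log(pi_theta / pi_ref).
    reward_w = beta * (logp_policy_w - logp_ref_w)
    reward_l = beta * (logp_policy_l - logp_ref_l)
    # Bradley-Terry negative log-likelihood on the reward difference.
    return -F.logsigmoid(reward_w - reward_l).mean()


if __name__ == "__main__":
    batch = 4
    # Stand-in log-probabilities; in practice they are obtained by summing the
    # token log-probs of each completion under pi_theta and pi_ref.
    lp_pw, lp_pl = -50 * torch.rand(batch), -50 * torch.rand(batch)
    lp_rw, lp_rl = -50 * torch.rand(batch), -50 * torch.rand(batch)
    print("DPO loss:", dpo_loss(lp_pw, lp_pl, lp_rw, lp_rl).item())
```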
The paper includes some more details, an interpretation of the DPO update, and a detailed comparison to standard RLHF, but the essence of the method is captured by the above derivation.
The original experiments in the paper were conducted on small-scale models and datasets, and as such were not very convincing.
Fortunately, DPO’s simplicity has made it attractive to many researchers and engineers. By now, only a few months after the publication of the paper, it is already included in trl as well as the ray-based library OpenRLHF (which is notably not using rllib, but that’s a story for another day). Moreover, several large models have been trained with DPO, including Zephyr 7B and the 70B-parameter TÜLU 2. Here is what the authors of the latter had to say about DPO:
DPO training significantly improves AlpacaEval and MT-Bench performance. At all sizes, DPO training provides significant improvements in AlpacaEval, with our largest DPO-trained model significantly outperforming GPT-3.5-turbo-0314 (89.4 vs. 95.1) and is competitive with GPT-4 ... We also observe that DPO training provides a large boost in MT-Bench performance for the 13B and 70B size models, with TÜLU 2+DPO 70B being the best-performing open model compared to all other models on the MT-Bench leaderboard.
DPO training is stable at large scales. We find that DPO training scales without issues with 70B-size models, with DPO training still providing large benefits for open-ended generation (AlpacaEval) even at the 70B size. This suggests DPO is a promising path for training large models on human feedback without the engineering complexity required by PPO. To our knowledge, TÜLU 2+DPO 70B is the largest publicly-released DPO-trained model.
DPO does not dramatically harm most other metrics. We find that DPO training does not significantly change performance in most other metrics we measure, such as factual reasoning (MMLU) or reasoning (BBH, GSM8k), with the exception of multilinguality (which we discuss below). This suggests that DPO training does not significantly change model capabilities.

DPO training significantly drops multilingual capabilities. We find that DPO training significantly drops performance in TydiQA, which tests the multilingual capabilities of our model. However, we note that both our supervised finetuning and DPO data mixes do not explicitly contain multilingual data, and are majority English-language. As such, DPO training is likely to make multilingual outputs further out-of-distribution, and mixing in multilingual data at instruction tuning and DPO training stages may significantly improve these results.
DPO training increases model verbosity. As seen in Table 4, TÜLU 2+DPO models generally output answers of longer length than those trained without DPO. This is in line with prior work showing a bias toward verbosity from RLHF training. However, we note that our DPO-trained models appear dramatically less verbose than other open-weight models, which future work will investigate.
One may find it surprising that supervised learning is able to replace RL on a formal level: in RLHF, new data is sampled from the language model during optimization, whereas in DPO this is not the case.
However, after paying closer attention to the information flow of RLHF as described above, it may not be too surprising after all. The sampled data is not really new: it is created using the very same model that one is trying to optimize. The rewards for these samples are also not new; they are obtained by fitting a reward function to the preferences, and no new human preferences are retrieved during optimization. So from the information-flow perspective, supervised learning and RL are indeed equivalent in this particular case. Maybe François Chollet was not being too extreme when he suggested getting rid of deep RL altogether in his tweet (note that it predates DPO; personally, I don’t believe deep RL is completely futile, but for RLHF he was on point):
The answer to "when should I use deep RL" is that you shouldn't -- you should reframe your problem as a supervised learning problem, which is the only thing that curve-fitting can handle. In all likelihood this applies to RLHF for LLMs.
— François Chollet (@fchollet) February 27, 2023
Another surprising aspect of DPO is the question: why has nobody done this before? Hopefully, after reading this blog post, you will agree that the derivation of DPO is not particularly complicated, so why did it take almost four years after the introduction of RLHF, especially considering how tricky RLHF can be to implement? I don’t have an answer, though my intuition is that sometimes, as a community, we put too much effort into following a working solution instead of taking a step back and searching for a simpler path. We might have witnessed a large-scale instance of the region-beta paradox.
As a final note on community dynamics: supervised and self-supervised learning are now making more headlines than reinforcement learning, and DPO might have the effect of slowing down the complicated (but, I believe, necessary) marriage of RL and LLMs. I do think that planning and search should play some part in LLM training in the future, although only for settings in which there is an actual environment from which new information can be extracted (like tool use or robotics). For now, however, taking the RL out of RLHF seems like a good step forward. If DPO can be made beneficial for most LLM training runs, I believe that one can firmly answer the opening question of this blog as:
Is RLHF really (online) RL? No, it is not.