Are LLMs proof that the 'bitter lesson' holds for NLP? Perhaps the opposite is true: they work because of the scale of human data, and not just computation.
The ‘bitter lesson’ post by Rich Sutton can be read as making two claims: (a) a ‘soft’ one: approaches relying on human knowledge have historically been beaten by those that instead rely on computation; (b) a ‘hard’ one: scale is all you need.
(a) is ‘bitter’ in the sense that it implies that ML researchers may not have much to contribute intellectually to their work, since ‘almost no innovation is required beyond scale’. This is indeed depressing.
(b) is a much stronger claim, and I don’t think many NLP researchers are seriously committed to it. It is not hard to see why. The current LLMs look like a monument to scaling, and they often do work well, but the result is still unsatisfactory: there are clear and serious issues that we don’t know how to fix. This includes factuality.
These days, pointing out the problems with LLMs often evokes a strawman response along the lines of “you’re just hating LLMs, which many people do find useful”. Let me preemptively stress that the question of model utility is completely orthogonal to the ‘bitter lesson’ discussion. Much older and weaker models have been useful in practice. Utility depends on the match between a model and a specific application, how easy it is to identify errors, how critical and frequent they are, and the cost and robustness of human oversight for that application. Say, if we wanted to generate horoscopes, scale is absolutely all we need.
A long-standing critique of LLMs is that they lack a ‘world model’, defined as ‘a computational framework that a system (a machine, or a person or other animal) uses to track what is happening in the world’.
As always, the devil is in the details. Whose knowledge are we talking about? The original ‘bitter lesson’ post clearly refers to the researchers building ML systems:
Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. — R. Sutton, The 'Bitter Lesson'
But this leaves a giant loophole for handcrafted resources used for training. This includes not only expert linguistic resources, but also all the human-created training data. I do not think anybody in the ML community would seriously argue that we have overcome the ‘garbage in, garbage out’ problem: the better the data, the easier it is for any model to learn. But for models intended to represent human speech, humans are the only possible source of data (see below for a discussion of synthetic data). That introduces a paradox into the ‘bitter lesson’, which implies that the dependence on human knowledge is a weakness. Sutton himself stated the following more recently:
It’s an interesting question whether large language models are a case of the bitter lesson. They are clearly a way of using massive computation, things that will scale with computation up to the limits of the Internet. But they’re also a way of putting in lots of human knowledge. This is an interesting question. It’s a sociological or industry question. Will they reach the limits of the data and be superseded by things that can get more data just from experience rather than from people? — R. Sutton on Dwarkesh Podcast
I would argue that in LLMs, scaling has exacerbated rather than eliminated our dependence on handcrafted knowledge. The entire field is hyper-focused on benchmarks, and there are incredible incentives to win by any means necessary (even testing multiple model variants to pick the one that happens to work better in a particular setting).
Let us consider the FrontierMath controversy: OpenAI’s results came from a setup in which they had exclusive access to FrontierMath data.
The desire for access to FrontierMath data aligns with ML 101 and decades of engineering practice: the most effective way to get a model to work is usually to use data that is somehow similar to the test data. The logical conclusion is to optimize the whole training process for a benchmark, and this is exactly what we are doing with ‘data ablations’.
If dependence on human knowledge is problematic overall, why do LLMs work well in so many cases? Language data follows a Zipfian distribution, which basically guarantees that a large portion of the frequent phenomena will be covered no matter what samples we take. Luckily, the same is true for benchmarks, especially given that we mostly focus on a single language. This allows LLMs to cover a lot of patterns without anybody ever defining them, and it also makes evaluation difficult. But the same distribution also guarantees an awkward long tail, remaining forever out of our reach. And since we have no idea which patterns are missing, we also have no idea how to further improve the model, apart from the current strategy of whack-a-specific-mole. That is not satisfactory from either an engineering or a scientific perspective. It also leaves various minorities to find out the hard way that the model doesn’t work well for them (e.g. speakers of dialects may be systematically at a disadvantage when interacting with LLM-based systems).
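The Zipfian coverage effect can be illustrated with a small simulation (the vocabulary size, Zipf exponent, and sample size below are arbitrary illustrative choices, not real corpus statistics): even a modest sample covers most of the probability mass, yet leaves the vast majority of distinct types unseen.

```python
import itertools
import random

# Illustrative simulation: draw a "training corpus" from a Zipfian
# distribution over N distinct linguistic "phenomena" (types), then
# compare how much probability mass vs. how many distinct types the
# sample attests. All numbers here are arbitrary illustrative choices.
random.seed(0)
N = 100_000                                        # distinct types
weights = [1.0 / rank for rank in range(1, N + 1)] # Zipf, exponent 1
total = sum(weights)
p = [w / total for w in weights]
cum = list(itertools.accumulate(p))

# Draw a corpus of 50,000 tokens and record which types were attested.
corpus = random.choices(range(N), cum_weights=cum, k=50_000)
seen = set(corpus)

mass_covered = sum(p[i] for i in seen)   # share of token mass attested
types_covered = len(seen) / N            # share of distinct types attested

print(f"probability mass covered: {mass_covered:.1%}")  # high (the head)
print(f"distinct types covered:   {types_covered:.1%}") # low (the long tail)
```

The frequent head is attested almost immediately, while most types in the tail never appear in the sample at all; re-sampling more data from the same distribution shrinks the gap only logarithmically, which mirrors the whack-a-specific-mole problem described above.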
If the current LLM training is not the way, then what should we do? Sutton generally objects to token prediction as a valid training ‘goal’.
My take is that the dependence on human knowledge is unavoidable when human knowledge (rather than only world physics or abstract logic) is a big part of what is being modeled. LLMs are better conceptualized not as individual entities, but as a ‘social and cultural technology’.
If so, and we want models that better fit the human world, we will need better human data. And for that, we have to abandon the original ‘bitter lesson’ analogy with Go and chess, where information comes for free. Currently, LLM training corpora are created by exploiting historical data and established practices, but in the long term this will work only if people are incentivized to make high-quality contributions.
To see how far we would get without thinking about incentives in the information ecosphere, consider the sad state of ML conference reviewing, with which we are all too familiar. One of the obvious problems is the reliance on the volunteer effort of reviewers, who have no meaningful incentives to invest much effort in reviewing. Careers are advanced only by working on one’s own papers, not by writing careful reviews for others or by teaching junior researchers how to review well. Predictably, this often results in poor-quality or even simply generated reviews. And that is in an area where everyone has at least some intrinsic motivation (scientific goals) and a good deal of shared ‘language’ (core methodology).
To conclude: a chess-world-like ‘bitter lesson’ solution is simply impossible in NLP, which fundamentally depends on human knowledge. This means that, even apart from any ethical or socioeconomic considerations, pure self-interest requires the field to reward not only models, but also the high-quality knowledge contributions from humans that enable useful outputs. The current models do not support that kind of data attribution, so the research challenge for the ML community is either to develop new learning methods that do, or to invent attribution methods for LLMs that are sufficiently faithful, interpretable, and not prohibitively expensive in compute overhead. That would be a win-win situation: society would have a more sustainable information ecosphere, and model developers would be able to better debug their models.
Can we not make the same argument for multimodal language models, and thus also for the fields of vision or music? Yes, we can! I focus on language models, since this is my area of research, but the argument is based not on modality per se, but on the source of the data that the model needs. If it is about humans, then humans are the only possible source of such information. Say, vision models for abstract logic puzzles about a block world can be trained on generated images of that block world, but vision models for household robots need information about real human homes (and that kind of information is already being collected now). Music models could be trained on music generated with cellular automata, but not if the goal is to model the musical tastes of a specific group of humans at a given point in time.
What about generated/synthetic data? Augmentation with generated data often does work in practice; according to Sara Hooker, “the cost of generating synthetic data is now low enough that we can treat the data space as malleable and something which can be optimized”.
As for form, augmentation with generated text is clearly possible, at least to some extent. Perhaps the most common case is machine-translated text for lower-resource languages (e.g. this was done in the Aya dataset).
But we cannot generate content. We cannot conjure up either the facts in the generated examples of summaries, or the languages to translate into (not to mention the ever-changing dialectal subtleties). For that, the generating model still relies on the sample of human knowledge it was trained on, and no architectural improvements can overcome that dependence. If the generating model was itself distilled from some other model, the knowledge bottleneck simply shifts to the first model. And, according to the US Copyright Office
What about emergence? This term is used in at least four senses in the ML community. In the sense of ‘something the model was not explicitly trained for’, for language models we have neither a proof of that happening, nor even a methodology for obtaining such a proof.
Many thanks to the anonymous reviewers, and to Yoav Goldberg, Louis Jaburi, Bertram Højer and Nils Grünefeld for conversations and feedback. This work was supported by a research grant ([VIL60860]) from Villum Fonden.