Are LLMs a proof that the 'bitter lesson' holds for NLP? Perhaps the opposite is true: they work due to the scale of human data, and not just computation.
The ‘bitter lesson’ post by Rich Sutton admits two readings:
(a) ‘soft’: approaches relying on human knowledge have historically been beaten by those that instead rely on computation; (b) ‘hard’: scale is all you need.
(a) is ‘bitter’ in the sense that it implies that ML researchers may not have much to contribute intellectually to their work, since ‘almost no innovation is required beyond scale’. This is indeed depressing.
(b) is a much stronger claim, and I don’t think many NLP researchers are seriously committed to it. It is not hard to see why. The current LLMs look like a monument to scaling, and they often do work well, but the result is also unsatisfactory: there are clear and serious issues that we don’t know how to fix. This includes factuality.
These days pointing out the problems with LLMs often evokes a strawman response along the lines of “you’re just hating LLMs, which so many people find useful”. So let me preemptively stress that the question of model utility is completely orthogonal to the ‘bitter lesson’ discussion. Much older and weaker models have been useful in practice. Utility depends on the match between a model and a specific application, how easy it is to identify errors, how critical and frequent they are, and the cost and robustness of human oversight for that application. Say, if we just wanted to generate horoscopes, scale is absolutely all we need.
A long-standing critique of LLMs is that they lack a ‘world model’, defined as ‘a computational framework that a system (a machine, or a person or other animal) uses to track what is happening in the world’.
As always, the devil is in the details. Whose knowledge do we mean? The original post clearly talks about the researchers building ML systems:
Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. — R. Sutton, The 'Bitter Lesson'
But this leaves a giant loophole for handcrafted resources used for training. Nobody would argue that we overcame the ‘garbage in, garbage out’ principle: the higher-quality the data we have, the easier it is to learn with any model. But for language models, humans are the only possible source of high-quality data! That introduces a paradox into the ‘bitter lesson’: if it is correct overall, this dependence, even just for training, would also be a critical weakness. Just a few months ago, Sutton himself stated the following:
It’s an interesting question whether large language models are a case of the bitter lesson. They are clearly a way of using massive computation, things that will scale with computation up to the limits of the Internet. But they’re also a way of putting in lots of human knowledge. This is an interesting question. It’s a sociological or industry question. Will they reach the limits of the data and be superseded by things that can get more data just from experience rather than from people? — R. Sutton on Dwarkesh Podcast
I think the part where we went wrong in reading the original post is the analogy with chess and Go, from which it starts. In games, the data is ‘free’ to generate via simulation, we can choose where to ‘learn’, and the information we get is uniform in quality. Texts are exactly the opposite: someone has to have created them, we have to train on what we have rather than what we’d like to have, and the quality depends on the incentives of the creator of the text. And no, synthetic text doesn’t ‘solve’ this, since the generating model is itself reliant on the same human knowledge.
I would argue that in LLMs, scaling exacerbated rather than eliminated our dependence on handcrafted knowledge. And in the process, it also decreased the amount of control over it that we may have as researchers, while bringing in huge socioeconomic issues.
Let us consider the FrontierMath case: OpenAI’s results came from a setup in which they had exclusive access to FrontierMath data.
The desire for access to FrontierMath data aligns with ML 101 and decades of engineering practice: the most effective way to get a model to work is usually to find some data that is somehow similar to the test data. A telling recent trend is ‘data ablations’.
This approach does work well for many cases: the Zipfian distribution of language data basically guarantees that a large portion of the common cases will be covered no matter what samples we take, and so we don’t even need to bother defining them. But the same distribution also guarantees an awkward long tail, remaining forever out of our reach. And since we have no idea which patterns we covered and which we’re missing, we also have no idea how to further improve the model, apart from the current strategy of whack-a-specific-mole. That doesn’t feel satisfactory from either an engineering or a scientific perspective.
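The coverage argument above can be illustrated with a minimal simulation. The sketch below (vocabulary size, Zipf exponent, and sample sizes are all illustrative assumptions, not measurements) draws a ‘training’ sample and a ‘test’ sample from the same Zipfian distribution: most test tokens are covered by types already seen in training, yet a sizable fraction of the vocabulary is never observed at all.

```python
import random

# Illustrative assumptions: a 50k-type vocabulary with a Zipf exponent of 1.1,
# roughly in the range reported for natural language.
VOCAB = 50_000
S = 1.1

# Unnormalized Zipfian weights: P(rank r) proportional to 1 / r^S.
weights = [1.0 / (rank ** S) for rank in range(1, VOCAB + 1)]

rng = random.Random(0)
train = rng.choices(range(VOCAB), weights=weights, k=200_000)
test = rng.choices(range(VOCAB), weights=weights, k=50_000)

# Types observed in training vs. the share of test tokens they cover.
seen = set(train)
coverage = sum(tok in seen for tok in test) / len(test)
print(f"types observed in training: {len(seen)} of {VOCAB}")
print(f"share of test tokens covered: {coverage:.1%}")
```

Under these assumptions, token-level coverage comes out high even though many long-tail types are never seen in training, which is exactly the asymmetry in the paragraph above: common patterns come for free, but the tail stays out of reach no matter how the sample is drawn.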
If the current LLM training is not the way, then what should we do? Sutton generally objects to token prediction as a valid training ‘goal’.
My take is that the dependence on human knowledge is unavoidable when human knowledge (rather than only world physics or abstract logic) is a big part of what is being modeled (see Alison Gopnik’s discussion of LLMs as ‘social and cultural technologies’).
To see how far we’d get without thinking about incentives, consider e.g. the sad state of ML conference reviewing, with which we are all too familiar. One of the obvious problems is reliance on volunteer effort of reviewers, who are not acknowledged and do not have meaningful incentives to invest much effort in reviewing. Everyone’s careers are advanced only by working on their own papers, rather than writing careful reviews for others and teaching junior researchers how to review well. Predictably, this often results in poor-quality or even simply generated reviews. And that’s in an area where everyone has at least some intrinsic motivation (scientific goals) and a good deal of shared ‘language’ (core ML methodology).
If we accept that a chess-world-like ‘bitter lesson’ solution is impossible in NLP, then better models will always require not only clever engineering, but also better human knowledge. This would explain why in practice we have seen so much work on not only modeling, but also large-scale data curation (e.g.