The human knowledge loophole in the 'bitter lesson' for LLMs

Are LLMs a proof that the 'bitter lesson' holds for NLP? Perhaps the opposite is true: they work due to the scale of human data, and not just computation.

Introduction

The ‘bitter lesson’ post by Rich Sutton has been interpreted in the ML community in (at least) two ways:

(a) ‘soft’: approaches relying on human knowledge have historically been beaten by those that instead rely on computation
(b) ‘hard’: scale is all you need

(a) is ‘bitter’ in the sense that it implies that ML researchers may not have much to contribute intellectually to their work, since ‘almost no innovation is required beyond scale’. This is indeed depressing. Some interpret this ‘lesson’ as saying that the researcher’s role is designing the best way to leverage computation, or defining the problems and evaluating the solutions.

(b) is a much stronger claim, and I don’t think many NLP researchers are seriously committed to it. It is not hard to see why. The current LLMs look like a monument to scaling, and they often do work well, but the result is also unsatisfactory: there are clear and serious issues that we don’t know how to fix. These include factuality, spurious patterns, and the non-differentiation between input text and instructions, which leads to prompt injection vulnerability.

These days pointing out the problems with LLMs often evokes a strawman response along the lines of “you’re just hating LLMs, which so many people find useful”. So let me preemptively stress that the question of model utility is completely orthogonal to the ‘bitter lesson’ discussion. Much older and weaker models have been useful in practice. Utility depends on the match between a model and a specific application, how easy it is to identify errors, how critical and frequent they are, and the cost and robustness of human oversight for that application. Say, if we just wanted to generate horoscopes, scale is absolutely all we need.

A long-standing critique of LLMs is that they lack a ‘world model’, defined as ‘a computational framework that a system (a machine, or a person or other animal) uses to track what is happening in the world’. I agree that something like that is necessary for e.g. coherence of dialogue and personas. But there’s another factor worth pondering. Are LLMs actually a manifestation of the ‘bitter lesson’ in interpretation (a)? As in, is their success due to non-reliance on human knowledge?

The Human Knowledge Loophole

As always, the devil is in the details. Whose knowledge do we mean? The original post clearly talks about the researchers building ML systems:

Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. — R. Sutton, The 'Bitter Lesson'

But this leaves a giant loophole for handcrafted resources used for training. Nobody would argue that we overcame the ‘garbage in, garbage out’ principle: the higher the quality of our data, the easier it is to learn with any model. But for language models, humans are the only possible source of high-quality data! That introduces a paradox into the ‘bitter lesson’: if it is correct overall, this dependence, even for training, would also be a critical weakness. Just a few months ago, Sutton himself stated the following:

It’s an interesting question whether large language models are a case of the bitter lesson. They are clearly a way of using massive computation, things that will scale with computation up to the limits of the Internet. But they’re also a way of putting in lots of human knowledge. This is an interesting question. It’s a sociological or industry question. Will they reach the limits of the data and be superseded by things that can get more data just from experience rather than from people? — R. Sutton on Dwarkesh Podcast

I think the part where we went wrong in reading the original post is the analogy with chess and Go, from which it starts. In games, the data is ‘free’ to generate via simulation, we can choose where to ‘learn’, and the information we get is uniform in quality. Texts are exactly the opposite: someone has to have created them, we have to train on what we have rather than what we’d like to have, and the quality depends on the incentives of the creator of the text. And no, synthetic text doesn’t ‘solve’ this, since the generating model is itself reliant on the same human knowledge.

I would argue that in LLMs, scaling exacerbated rather than eliminated our dependence on handcrafted knowledge. And in the process, it also decreased the amount of control over it that we may have as researchers, while bringing in huge socioeconomic issues.

Let us consider the FrontierMath case: OpenAI’s results came from a setup in which they had exclusive access to FrontierMath data, and the creators of the benchmark did not disclose it. If OpenAI, an organization with that much tech talent, opts to resort to such tricks, this sends a signal that the best strategy is… relying on task-specific human knowledge. Let’s even suppose that the data was not used directly for training, but was e.g. used to guide the creation or selection of some extra data that would be used instead (for training, in-context examples, or prompt tuning). Conceptually it’s still a clear case of deliberate injection of human-defined patterns in order to make the system work better. Gemini 3 has since been reported to perform better on this benchmark without privileged access, but given the current incentives, I wouldn’t just accept the reported performance of a ‘closed’ system on this or any other benchmark. Especially one with so much media traction.

The desire for access to FrontierMath data aligns with ML 101 and decades of engineering practice: the most effective way to get a model to work is usually to find some data that is similar to the test data. A telling recent trend is ‘data ablations’: literally selecting language model training data so as to optimize for specific pre-selected benchmarks. I can’t help seeing this as replacing the GOFAI practice of writing hundreds of patterns with sampling hundreds of examples for each pattern.

This approach does work well for many cases: the Zipfian distribution of language data basically guarantees that a large portion of the common cases will be covered no matter what samples we take, and so we don’t even need to bother defining them. But the same distribution also guarantees an awkward long tail, remaining forever out of our reach. And since we have no idea which patterns we covered and which we’re missing, we also have no idea how to further improve the model, apart from the current strategy of whack-a-specific-mole. That doesn’t feel satisfactory from either an engineering or a scientific perspective.
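To make the head-versus-tail point concrete, here is a minimal sketch under an idealized assumption: pattern frequencies follow an exact Zipf law, p(rank) proportional to 1/rank^s. Real language data only approximates this, and the function names, the number of patterns, and the sample size below are all hypothetical choices for illustration rather than measurements.

```python
def zipf_probs(num_types: int, exponent: float = 1.0) -> list[float]:
    """Idealized Zipf distribution over ranks 1..num_types (an assumption)."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def head_coverage(num_types: int, top_k: int, exponent: float = 1.0) -> float:
    """Share of all occurrences accounted for by the top_k most frequent types."""
    if not 0 <= top_k <= num_types:
        raise ValueError("top_k must be between 0 and num_types")
    return sum(zipf_probs(num_types, exponent)[:top_k])


def expected_unseen_fraction(num_types: int, sample_size: int,
                             exponent: float = 1.0) -> float:
    """Expected share of types that never appear in an i.i.d. sample.

    A type with probability p is missed by all sample_size draws
    with probability (1 - p) ** sample_size.
    """
    probs = zipf_probs(num_types, exponent)
    return sum((1.0 - p) ** sample_size for p in probs) / num_types


if __name__ == "__main__":
    # Hypothetical toy setting: 100k distinct patterns, 1M sampled occurrences.
    print(f"top 1% coverage: {head_coverage(100_000, 1_000):.3f}")
    print(f"unseen types:    {expected_unseen_fraction(100_000, 1_000_000):.3f}")
```

In this toy setting the top 1% of patterns accounts for a bit over 60% of all occurrences, while a noticeable share of the rarer patterns is expected never to show up in the sample at all. Growing the sample shrinks that unseen share only slowly, which is the long-tail problem described above.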

Beyond The Bitter Lesson

If the current LLM training is not the way, then what should we do? Sutton generally objects to token prediction as a valid training ‘goal’, and argues for learning from real-world experience. But reinforcement learning on the real world has so far been avoided for very good reasons: imagine e.g. that a company just offloads its customer service chatbot training onto customers who will bear the costs of any errors. I hope that this is not what we’re about to experience. Ilya Sutskever also argues that the current RL efforts in post-training actually require more knowledge engineering than LM pre-training.

My take is that the dependence on human knowledge is unavoidable when human knowledge (rather than only world physics or abstract logic) is a big part of what is being modeled (see Alison Gopnik’s discussion of LLMs as ‘social and cultural technologies’). For such a domain, we will always have to rely on human contributions, and progress will hence be tied to both better knowledge and better modeling. But the original analogy with the game world, in which information comes for free, does not serve us well. Currently we are still exploiting the historical data and established practices, but long-term this will work only if people are treated as stakeholders, whose contributions are acknowledged and incentivized. It’s not even a ‘learn once and run forever’ situation: our world will be forever changing, and forever long-tailed with all kinds of underrepresented minorities, requiring further and fresher data.

To see how far we’d get without thinking about incentives, consider e.g. the sad state of ML conference reviewing, with which we are all too familiar. One of the obvious problems is reliance on volunteer effort of reviewers, who are not acknowledged and do not have meaningful incentives to invest much effort in reviewing. Everyone’s careers are advanced only by working on their own papers, rather than writing careful reviews for others and teaching junior researchers how to review well. Predictably, this often results in poor-quality or even simply generated reviews. And that’s in an area where everyone has at least some intrinsic motivation (scientific goals) and a good deal of shared ‘language’ (core ML methodology).

If we accept that a chess-world-like ‘bitter lesson’ solution is impossible in NLP, then better models will always require not only clever engineering, but also better human knowledge. This would explain why in practice we have seen so much work on not only modeling, but also large-scale data curation. And if that is the case, then even apart from any ethical or socioeconomic considerations, pure self-interest in the success of our field requires us to go beyond rewarding models, and start rewarding high-quality knowledge contributions. My bet is that the key to better NLP systems is a new generation of learning methods that support meaningful incentives for human contributors. We have already established a thriving subfield for methods that optimize for both performance and efficiency; now it is time to go big on data attribution.
