Do not write that jailbreak paper

Jailbreaks are becoming a new ImageNet competition instead of helping us better understand LLM security. The community should revisit their choices and focus on research that can uncover new security vulnerabilities.

Jailbreak papers keep landing on arXiv and conferences. Most of them look the same and jailbreaks have turned into a new sort of ImageNet competition. This posts discusses the reasons why most of these papers are no longer valuable to the community, and how we could maximize the impact of our work to improve our understanding of LLM vulnerabilities and defenses.

Let’s start with what jailbreaks are. LLMs are fine-tuned to refuse harmful instructions. Ask ChatGPT to help you build a bomb, and it’ll reply “I cannot help you with that”. Think of this refusal as a security feature in LLMs. In a nutshell, jailbreaks exploit these safeguards to bypass refusal and unlock knowledge that developers meant to be inaccessible. Actually, the name comes from its similarities to jailbreaking the OS in a iPhone to access additional features.

What we have What we want What we do
Pre-trained LLMs that have, and can use, hazardous knowledge. Safe models that do not cause harm or help users with harmful activities. Deploy security features that often materialize as refusal for harmful requests.


In security, it is important to red-team protections to expose vulnerabilities and improve upon those. The first works on LLM red-teaming and jailbreaking exposed a security vulnerability in LLMs: refusal safeguards are not robust to input manipulations. For example, you could simply prompt a model to never refuse, and it would then answer any harmful request. We should think of jailbreaks as an evaluation tool for security features in LLMs. Also they help evaluate a broader control problem: how good are we at creating LLMs that behave the way we want?

Follow-up research found more ways to exploit LLMs and access hazardous knowledge. We saw methods like GCG, which optimizes text suffixes that surprisingly transfer across models. We also found ways to automate jailbreaks using other LLMs . These methods were important because they surfaced fundamentally new approaches to exploit LLMs at scale.

However, the academic community has since turned jailbreaks into a new sort of ImageNet competition, focusing on achieving marginally higher attack success rates rather than improving our understanding of LLM vulnerabilities. When you start a new work, ask yourself whether you are (1) going to find a new vulnerability that helps the community understand how LLM security fails, or if you are (2) just looking for a better attack that exploits an existing vulnerability. (2) is not very interesting academically. In fact, coming back to my previous idea of understanding jailbreaks as evaluation tools, the field still uses the original GCG jailbreak to evaluate robustness, rather than its marginally improved successors.

We can learn valuable lessons from previous security research. The history of buffer overflow research is a good example: after the original “Smashing The Stack for Fun and Profit” paper, the field didn’t write hundreds of academic papers on yet-another-buffer-overflow-attack. Instead, the impactful contributions came from fundamentally new ways of exploiting these vulnerabilities (like “return-into-libc” attacks) or from defending against them (stack canaries, ASLR, control-flow-integrity, etc.). We should be doing the same.

What does meaningful jailbreak work look like?

A jailbreak paper accepted in a main conference should:

However, the works we keep seeing over and over again look more like “we know models Y are/were vulnerable to method X, and we show that if you use X’ you can obtain an increase of 5% on models Y”. The most common example are improvements on role-play jailbreaks. People keep finding ways to turn harmful tasks into different fictional scenarios. This is not helping us uncover new security vulnerabilities! Before starting a new project, try to think whether the outcome is going to help us uncover a previously unknown vulnerability.

If you work on defenses, keep the bar high

Another common problem has to do with defenses. We all want to solve jailbreaks, but we need to maintain a high standard for defenses. This isn’t new, by the way. There are some great compilations of lessons learned from adversarial examples in the computer vision era.

If you work on defenses, you should take the following into account:

Should you work on that next jailbreak paper?

We should all think about the bigger problem we have at hand: we do not know how to ensure that LLMs behave the way we want. By default, researchers should avoid working on new jailbreaks unless they have a very good reason to. Answering these questions may help:

If you are interested in improving the security and safety of LLMs (these two are very different!), jailbreaks have a small probability of taking you somewhere meaningful. It is time to move on and explore more challenging problems. For instance, Anwar et al. wrote an agenda containing hundreds of specific challenges the community thinks we should solve to ensure we can build AI systems that robustly behave the way we want.

Reflections after releasing this blogpost

This blogpost has been going around for some time now and has sparked valuable discussions in the community. In this section, I want to share some alternative perspectives I have collected.

As a final word, I would like to stress that the ultimate goal of this blogpost is to get the community to collectively think about what we need to make progress on some of the most important problems ahead!

Acknowledgements

I would like to thank Florian Tramèr, Edoardo Debenedetti, Daniel Paleka, Stephen Casper, and Nicholas Carlini for valuable discussions and feedback on drafts of this post.

For attribution in academic contexts, please cite this work as
        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
  
BibTeX citation
        PLACEHOLDER FOR BIBTEX