Jailbreaks are becoming a new ImageNet competition instead of helping us better understand LLM security. The community should revisit their choices and focus on research that can uncover new security vulnerabilities.
Jailbreak papers keep landing on arXiv and conferences. Most of them look the same and jailbreaks have turned into a new sort of ImageNet competition. This posts discusses the reasons why most of these papers are no longer valuable to the community, and how we could maximize the impact of our work to improve our understanding of LLM vulnerabilities and defenses.
Let’s start with what jailbreaks are. LLMs are fine-tuned to refuse harmful instructions
What we have | What we want | What we do |
---|---|---|
Pre-trained LLMs that have, and can use, hazardous knowledge. | Safe models that do not cause harm or help users with harmful activities. | Deploy security features that often materialize as refusal for harmful requests. |
In security, it is important to red-team protections to expose vulnerabilities and improve upon those. The first works on LLM red-teaming
Follow-up research found more ways to exploit LLMs and access hazardous knowledge. We saw methods like GCG, which optimizes text suffixes that surprisingly transfer across models. We also found ways to automate jailbreaks using other LLMs
However, the academic community has since turned jailbreaks into a new sort of ImageNet competition, focusing on achieving marginally higher attack success rates rather than improving our understanding of LLM vulnerabilities. When you start a new work, ask yourself whether you are (1) going to find a new vulnerability that helps the community understand how LLM security fails, or if you are (2) just looking for a better attack that exploits an existing vulnerability. (2) is not very interesting academically. In fact, coming back to my previous idea of understanding jailbreaks as evaluation tools, the field still uses the original GCG jailbreak to evaluate robustness, rather than its marginally improved successors.
We can learn valuable lessons from previous security research. The history of buffer overflow research is a good example: after the original “Smashing The Stack for Fun and Profit” paper
A jailbreak paper accepted in a main conference should:
Uncover a security vulnerability in a defense/model that is claimed to be robust. New research should target systems that we know have been trained not to be jailbreakable and prompts that violate the policies used to determine what prompts should be refused. Otherwise, your findings are probably not transferable. For example, if someone finds an attack that can systematically bypass the Circuit Breakers defense
Not iterate on existing vulnerabilities. We know models can be jailbroken with role-playing, do not look for a new fictional scenario. We know models can be jailbroken with encodings, do not suggest a new encoding. Examples of novel vulnerabilities we have seen lately include latent-space interventions
Another common problem is playing the wack-a-mole game with jailbreaks and patches. If a specific attack was patched, there is very little contribution in showing that a small change to the attack breaks the updated safeguards since we know that patches do not fix the underlying vulnerabilities
Explore new threat models in new production models or modalities. Models, their use cases, and their architectures keep changing. For example, we now have fusion models with multimodal inputs, and will soon have powerful agents
However, the works we keep seeing over and over again look more like “we know models Y are/were vulnerable to method X, and we show that if you use X’ you can obtain an increase of 5% on models Y”. The most common example are improvements on role-play jailbreaks. People keep finding ways to turn harmful tasks into different fictional scenarios. This is not helping us uncover new security vulnerabilities! Before starting a new project, try to think whether the outcome is going to help us uncover a previously unknown vulnerability.
Another common problem has to do with defenses. We all want to solve jailbreaks, but we need to maintain a high standard for defenses. This isn’t new, by the way. There are some great compilations of lessons learned from adversarial examples in the computer vision era.
If you work on defenses, you should take the following into account:
We should all think about the bigger problem we have at hand: we do not know how to ensure that LLMs behave the way we want. By default, researchers should avoid working on new jailbreaks unless they have a very good reason to. Answering these questions may help:
If you are interested in improving the security and safety of LLMs (these two are very different
This blogpost has been going around for some time now and has sparked valuable discussions in the community. In this section, I want to share some alternative perspectives I have collected.
It is hard to self-assess impact and reviewers should take part. This blogpost mostly focuses on how researchers can think about their own work and what to avoid when starting a new project. However, determining the impact of one’s work is notoriously difficult. People are likely to be biased towards thinking their paper is actually the paper worth writing. I think this is a great point, but still believe write-ups like this are a good way to improve self-reflection and encourage people to think about newer problems. Engaging with external reviewers and colleagues while ideating a new project can help us find more impactful directions.
Even incremental work is valuable to the community. Some colleagues have raised interesting points about how getting people to work on jailbreaks can create a larger community and build knowledge that may eventually lead us to breakthroughs. I largely agree with this. I think it is important to get people to work on relevant security and safety problems and build collective knowledge. I just think that, whenever possible, we should be working on more promising problems where exploration may have a larger counterfactual impact.
We might actually be making progress. It is true that systems are getting more robust in practice. However, I think most of this progress is due to black-box affordances like complex closed-source systems with many components. This is important to protect users from existing risks. However, I would like to caution the community. Worst-case robustness remains unsolved and all systems out there have been broken in some way or another. The increasingly closed nature of systems is making evaluation harder and hindering our ability to track scientific understanding of the problem we ultimately want to solve. We have written about this extensively in our new paper
As a final word, I would like to stress that the ultimate goal of this blogpost is to get the community to collectively think about what we need to make progress on some of the most important problems ahead!
I would like to thank Florian Tramèr, Edoardo Debenedetti, Daniel Paleka, Stephen Casper, and Nicholas Carlini for valuable discussions and feedback on drafts of this post.
PLACEHOLDER FOR ACADEMIC ATTRIBUTION
BibTeX citation
PLACEHOLDER FOR BIBTEX