We document the rise of the Generative AI Archaeologist, whose tools include linear algebra and probability theory, jailbreaking, and debuggers, compared to the metal detectors, pickaxes, and radar surveys of traditional archaeology. GenAI Archaeologists have reported findings both through luck by observing unexpected behaviour in publicly accessible models, and by exploiting the mathematical properties of models. In this blog, we survey four types of discoveries unearthed by GenAI Archaeologists and discuss the status of those findings.
Figure 1: Examples of archaeological discoveries: (a) Göbekli Tepe was found through professional survey, (b) the Terracotta Army was found by farmers by luck, and (c) the Antikythera mechanism was found by coincidence in a shipwreck. Images by Teomancimit, Zossolino, and Logg Tandy, respectively, used under Creative Commons Attribution licenses.
Archaeology reveals the secrets of human history through the traces left behind by our ancestors. Some discoveries are the result of careful survey using sophisticated tools to uncover what cannot be seen by the naked eye. One of the earliest examples of human village habitation is found at Göbekli Tepe, Turkey, first discovered in 1963, but only excavated in 1995. This village offers important new insights about early farming culture that changed our understanding of early human history. Other discoveries are the result of pure luck: the Terracotta Army of Xi’an, China, is one such example. This grand funerary art for the First Emperor of China was described in ancient Chinese texts but lost to time; its location was only re-discovered in 1974 by a group of rural farmers. A little closer to the heart of computer scientists, the Antikythera mechanism was found by coincidence in a shipwreck off the coast of Antikythera, Greece, in 1901. The mechanism is an analogue computer, preserved for nearly 2,000 years, that was used to predict the position of celestial bodies. Such is the sophistication of this device that there is no evidence of similar complexity until 1,400 years later.
In contrast to studying historical physical artefacts, Generative AI models are present-day digital objects that should not require anyone to search for lost knowledge. However, the secretive nature of fully closed models means that this knowledge must still be unearthed, using tools such as linear algebra, probability theory, jailbreaking, and debuggers such as pdb, compared to the metal detectors, pickaxes, and radar surveys of traditional archaeology. This practice is distinguished from reverse engineering because there is not always an explicit goal of design recovery.
Figure 2: (a) Hayase et al. show how to infer the distribution of data used to train a tokenizer based on how BPE constructs merge lists (image source: Figure 1). (b) Nasr et al. demonstrate a simple attack to extract training data from language models by forcing them to repeat the same token (image source: Figure 1).
Training data is fuel that powers Large Language Models. Open-science initiatives, such as Pythia
Tokenizer Data: Hayase et al. show how to infer properties of the data used to train a tokenizer, which underpins how LLMs process text
In personal correspondence with Alisa Liu, we learned that this finding was sparked by preparing for a reading group. Liu noticed that \(( \; )\) appeared near the top of the merge list in the Gemma tokenizer, which seemed odd, because one might expect to find high-frequency fragments and stopwords at the top of the merge list, not a parenthesis. After further reflection, Liu realised that the prominent position of this token could be because the model was trained on a large volume of programming code, and, by extension, that tokenizer merge lists must reveal information about word frequency in the training data.
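The observation behind this finding can be illustrated with a toy byte-pair encoding (BPE) trainer. The sketch below is not the inference method of Hayase et al.; it is a minimal, assumed implementation of standard BPE training, and the two tiny corpora (`prose` and `code`) are hypothetical. It only shows that the order of the merge list follows pair frequency in the training data, so a parenthesis pair rises to the top when the corpus is dominated by code-like text.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn a BPE merge list from a list of words (toy implementation)."""
    # Represent each distinct word as a tuple of symbols, with its count.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # The most frequent pair becomes the next merge rule
        # (ties are broken deterministically by comparing the pairs).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge rule to every word.
        new_vocab = Counter()
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return merges

# Hypothetical corpora: one dominated by prose, one dominated by "()" as in code.
prose = ["the", "then", "they"] * 10 + ["()"] * 2
code = ["the", "then", "they"] * 2 + ["()"] * 30

prose_merges = train_bpe(prose, 3)
code_merges = train_bpe(code, 3)
print("prose:", prose_merges)
print("code: ", code_merges)
```

In the code-heavy corpus the parenthesis pair is the first merge, whereas in the prose-heavy corpus it does not appear among the first three merges at all, which is the kind of signal that makes a merge list informative about its training data.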
Pretraining Data: Carlini et al. show how to extract verbatim sequences from the GPT-2 language model
Status: it has never been confirmed whether Hayase et al. inferred the true data distribution of any of the studied tokenizers. It has never been confirmed whether Carlini et al., Nasr et al., or Karamolegkou et al. succeeded in extracting the training data from the GPT or Claude models.
Figure 3: BehnamGhader et al. stumbled upon evidence of how the Mistral-7B LLM was pretrained. These visualisations show the cosine similarity between encodings using causal attention and bidirectional attention, across layers and token positions. The LLaMA2-7B model (a) shows low cosine similarities, especially at deeper layers, whereas Mistral-7B (b) shows the opposite, raising questions about how the model was pretrained (image source: Figure 5).
Exactly how LLMs are trained is becoming increasingly shrouded in mystery but some noteworthy explanations remain in the open-weight and open-science literature
In trying to convert an LLM into a sentence embedding model, BehnamGhader et al.
In personal correspondence with Parishad BehnamGhader, she explained that the bidirectional attention finding came about as a result of a reviewer requesting experiments with additional models. The behaviour that was observed for the LLaMA2-7B model was not observed in the Mistral-7B model, which sparked the additional analysis of the cosine similarities at different token positions before and after enabling bidirectional attention. It was here that it became clear that the representations from the Mistral-7B model were extremely similar with or without bidirectional attention. Additional correspondence with Marius Mosbach, one of the collaborators on the project, revealed that he studied the Mistral inference code and found a flag in the forward() method that could enable bidirectional attention. This led to speculation that the model was trained using a PrefixLM-style objective.
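The comparison described above can be sketched on a toy scale. The following is a minimal illustration, assuming a single randomly initialised attention head rather than a real LLM; the function names `attention` and `positionwise_cosine` are hypothetical and this is not the analysis code of BehnamGhader et al. It computes the same inputs with and without a causal mask and measures the per-position cosine similarity between the two sets of representations.

```python
import numpy as np

def attention(x, wq, wk, wv, causal):
    """Single-head self-attention over a sequence x of shape (seq_len, dim)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    if causal:
        # Mask out future positions so token i only attends to tokens <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

def positionwise_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    num = (a * b).sum(axis=1)
    return num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

rng = np.random.default_rng(0)
seq_len, dim = 8, 16
x = rng.normal(size=(seq_len, dim))
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))

causal_out = attention(x, wq, wk, wv, causal=True)
bidir_out = attention(x, wq, wk, wv, causal=False)
similarity = positionwise_cosine(causal_out, bidir_out)
print(np.round(similarity, 3))
```

In this single-layer toy, the final position is unaffected by the mask because it already attends to every token, while earlier positions can change. In a multi-layer model the changes at earlier positions propagate upwards, which is why one would generally expect similarity to drop in deeper layers unless the model has already been exposed to bidirectional attention.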
Status: it has never been confirmed whether this speculation is correct. At the time of writing, the original Mistral inference code is no longer publicly available.
Figure 3: (a) Carlini et al. show that the difference between consecutive singular values in a singular value decomposition (SVD) of a matrix of final token logits is aligned with the embedding size of the Pythia-1.4B LLM, and (b) that this can be estimated efficiently (image sources: Figures 1 and 2).
The size of most fully closed LLMs is not public knowledge. Furthermore, it is also not clear if they are dense models, mixtures-of-experts, or something else entirely. Such is the lack of information that the GPT-4 Wikipedia article speculates that the model is somewhere between 1T–1.8T parameters
Hidden Dimension Size: Finlayson et al. and Carlini et al. show how to infer the hidden dimension size of a language model from its output logits, including for the gpt-3.5-turbo model.
Full Output Layer: One can go further than inferring the hidden dimension size. Carlini et al. also showed how to extract the entire output layer of language models by observing that, in the SVD \(Q = U \cdot \Sigma \cdot V^T\) of the logit matrix \(Q\), \(U\) is a linear transformation of the output layer. They showed that this can accurately extract the output embedding layer of open-weights models, and they used the same attack to steal the output layer of OpenAI deployed models.
Status: it was acknowledged that the method of Carlini et al. correctly extracts the size of the models. It was also confirmed that they could steal the output layer of the ada and babbage models with a root-mean-square error of \(5 \cdot 10^{-4}\) and \(7 \cdot 10^{-4}\), respectively, for just $4–12 of API query credits. Finlayson et al. reported that several fully closed models changed their APIs to prevent this information being stolen from their models.
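The linear algebra behind both findings can be sketched with simulated data. The example below is a minimal illustration under stated assumptions, not the attack code of Carlini et al. or Finlayson et al.: the "secret" output layer `W`, the hidden states `H`, and all of the sizes are hypothetical, and full noise-free logits are assumed to be available. Because the logits are a product of a `vocab_size × hidden_size` matrix and the hidden states, the logit matrix has rank at most `hidden_size`, so the singular values collapse after that index.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_queries = 200, 16, 64

# Hypothetical "secret" model internals, used here only to simulate an API.
W = rng.normal(size=(vocab_size, hidden_size))   # output (unembedding) layer
H = rng.normal(size=(hidden_size, num_queries))  # final hidden state per query
Q = W @ H                                        # observed logits, one column per query

# Step 1: estimate the hidden dimension from the largest drop between
# consecutive singular values (measured on a log scale).
U, S, Vt = np.linalg.svd(Q, full_matrices=False)
S_floored = np.maximum(S, S[0] * 1e-12)          # avoid log(0) on numerical noise
log_gaps = np.log(S_floored[:-1]) - np.log(S_floored[1:])
estimated_hidden_size = int(np.argmax(log_gaps)) + 1

# Step 2: the top singular directions recover the output layer up to an
# unknown hidden_size x hidden_size linear transformation G.
h = estimated_hidden_size
W_hat = U[:, :h] * S[:h]

# Only possible in simulation: use the true W to check that such a G exists.
G, *_ = np.linalg.lstsq(W_hat, W, rcond=None)
max_residual = np.abs(W_hat @ G - W).max()

print("estimated hidden size:", estimated_hidden_size)
print("max residual after aligning W_hat to W:", max_residual)
```

The number of queries must exceed the hidden dimension for the drop in singular values to be visible, and a real attacker never sees `W`, so the final alignment step is only a sanity check on the simulation.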
Figure 4: Zhang et al. showed that simply asking LLMs to translate inputs into a different language (a) can reveal the model system prompt (b).
System prompts are used to define the tone, operational parameters, and tools that LLMs can use when responding to a request. The basic template suggested for the LLaMA-4 system prompt instructs the model that it is “companionable and confident”, whereas the Claude Sonnet 4 prompt has specific information about the outcome of the November 2024 United States Election. Alleged snippets of the GPT-5 system prompt show that the model has variable verbosity levels when generating a response.
Zhang et al. show how to extract the system prompt of LLMs using a jailbreak attack
Status: the success of this technique has been directly confirmed for the Claude models because Anthropic publishes the system prompts.
The knowledge revealed in these five studies has not been lost to time. These models are not sitting at the bottom of the Mediterranean Sea, nor are they resting under the fields of Xi’an. They are publicly distributed through the HuggingFace Models Hub or accessible through paid APIs, but they harbour deep secrets about how they were trained. In the absence of this knowledge, researchers spend their time trying to understand what has been secreted away inside startups and corporations. This secretive behaviour is often justified with the claim that organizations need to maintain their competitive edge, but it has recently been argued that these practices are more similar to the Prisoner’s Dilemma. Increased openness from resource-rich organizations would allow everyone to focus on scientific innovation, but, if the status quo is maintained, we need to hope that more open organizations, such as EleutherAI, AllenAI, and DeepSeek, will continue to help push science forward with a 6–9 month delay. Researchers could also be incentivised towards making these discoveries if workshops, conferences, and journals explicitly welcomed contributions that reverse-engineer generative AI systems.
This blog was inspired by discussions in Vices and Versa during a research visit funded by the IVADO Thematic Semester on Autonomous Agents. Thanks to Marius Mosbach, Stella Frank, Anders Søgaard, and Ákos Kádár for providing critical feedback on earlier drafts.