Computer Use Survey - A Visual Survey of Computer Use Agents

In recent years, AI systems operating on the web and in computer environments have become a major topic of interest for both academia and industry. The goal of this blog is to provide an interesting and interactive survey of historical and recent works on computer use agents. We define key terms used in the literature, catalogue the expansive list of environments and datasets, discuss the evolution of the methodologies, and assess both today’s landscape and possible paths forward.

Introduction

Over the last few years, and particularly in the past year, AI systems that operate on the web and in computer environments have become a significant topic of study for both academia and industry. Readers who skim the citations in this review will likely notice that many of the works are extremely recent, most from within the past year.

The goal of this survey is to organize much of the historical and recent work on computer use agents. First, we define exactly what we mean by “Computer Use” in What is Computer Use. We then define and categorize different environments, datasets and evaluations for computer use agents in Environments and Datasets. Next, we discuss the methodological work in this area in Models and Methods focusing in particular on the recent trend of “LLM Agents” and provide an accessible explanation of this class of systems and trends within this research. Finally, we discuss ongoing trends, areas for improvement and possible safety and ethical concerns brought about by this research in Discussion.

So what has shaped this sudden interest in computer use agents in the first place? What goals are academic and industry researchers trying to serve? The first explanation is that the advent of extremely capable language and vision-language models (LLMs / VLMs) has made the goal of autonomous computer use agents suddenly quite plausible. These models, now trained on trillions of tokens scraped from the web, are powerful few- and zero-shot learners: with little or no new training data, they are proficient at many language tasks, which makes them adaptable to many different problems.

Computer use is not only difficult in terms of being a long-horizon sequential decision-making problem, but also linguistically challenging and knowledge-laden. Before these large-scale models, language problems required models to be trained from scratch, language understanding was often limited and brittle and few-shot reasoning was almost unheard of. This made it hard to do these computer use tasks. Take a popular web use task example: booking an airline ticket. To solve the problem, you first have to be able to precisely understand a user’s request: what airline they prefer, when they want to travel, etc. The agent also must have quite a bit of both commonsense and specific knowledge about airline ticket purchasing to accomplish this. The agent needs to not only understand the request and know how to navigate the airline website, it also needs to understand things like “I should not book a $10k ticket.” With these capable LLMs, we not only have a backbone which has general capability that can start to make progress on the task, but also have these essential commonsense capabilities to make computer use agents possible.

While the precise timing of this work can be explained by the arrival of these capable models, we should also ask what people are trying to accomplish with computer use agents. Many papers point to automating routine, tedious or time-consuming tasks, or to enhancing user experiences, as a motivation. The other common motivation is accessibility: making computers and the Internet more generally accessible to those with disabilities. We elaborate on this further in Discussion.

What is Computer Use

First, we define what we mean here by a “Computer Use” AI system or computer use agent. By this, we mean an agent that interacts dynamically with a computer system or web interface. In the Markov decision process (MDP) formulation, an agent is a decision maker that observes an environment and takes actions.

As we discuss later in Environments and Datasets, a computer use environment can be defined in a number of different ways: as a simulation or virtual machine instance of an operating system; as the text of one or more websites which can be navigated; as an emulated browser that navigates and acts on a live or cached version of the web; or as any portion or simulation of a computer that can be operated like an environment. These environments allow for any number of tasks, from navigation to question answering. Under this definition, the works we care about are those which create agents or environments with the goal of building AI that can operate computer systems in a manner similar to the way humans do. This can incorporate many kinds of AI systems, but the most recent version, which we discuss in detail in Models and Methods, is the LLM or VLM agent operating in computer use environments.
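To make the agent/environment framing concrete, here is a minimal sketch of a text-only web environment and the loop that drives it. Everything in it (`Observation`, `ToyWebEnv`, `run_episode`, the toy pages and the hand-written policy) is a hypothetical illustration, not the interface of any benchmark discussed in this survey.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    url: str
    text: str                                   # text-only view of the page
    links: list = field(default_factory=list)   # urls reachable from this page


class ToyWebEnv:
    """Hypothetical environment: pages are states, following a link is an action."""

    def __init__(self, pages, start, goal):
        self.pages = pages      # url -> (page text, list of linked urls)
        self.start = start
        self.goal = goal
        self.url = start

    def reset(self):
        self.url = self.start
        return self._observe()

    def step(self, action):
        """Follow a link if the current page has it; return (obs, reward, done)."""
        _, links = self.pages[self.url]
        if action in links:
            self.url = action
        done = self.url == self.goal
        return self._observe(), (1.0 if done else 0.0), done

    def _observe(self):
        text, links = self.pages[self.url]
        return Observation(self.url, text, list(links))


def run_episode(env, policy, max_steps=10):
    """Generic loop: observe, choose an action, act, repeat until done."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total += reward
        if done:
            break
    return total, obs


PAGES = {
    "/home": ("Welcome.", ["/flights"]),
    "/flights": ("Search flights.", ["/checkout", "/home"]),
    "/checkout": ("Confirm booking.", []),
}


def first_link_policy(obs):
    # Trivial stand-in for an agent: always follow the first link on the page.
    return obs.links[0] if obs.links else ""
```

A real environment would expose far richer observations (screenshots, DOM trees) and actions (clicks, typing), but the observe/act loop has the same shape.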

Environments and Datasets

The landscape of computer use research spans diverse environments, from desktop operating systems to web browsers and mobile platforms. Each environment presents unique challenges and opportunities for developing autonomous agents. Below, we present an interactive exploration of the major datasets and benchmarks that have shaped this field.

We divide the datasets into several broad categories. Computer and OS Control describes environments which give more or less full simulated access to a computer or operating system. Web Control and Navigation describes environments where access is mostly to the web (live or statically defined pages). Text-Only Web Environments describe specifically versions of web environments with just text. Web QA and Classification are datasets which ask static questions or classifications on web pages as opposed to controllable environments. Finally we include Coding and Assistant Tasks for coding tasks and the broad category of assistant tasks using computer use elements.

We realize that the category “web browsing” is somewhat vague. It can include tasks that involve changing the settings of a browser such as Chrome as well as tasks that require finding information on the web. Similarly, what is meant by the web can vary widely: it can mean access to a few websites, as in VisualWebArena, or access to the entire web. This list is by no means complete, and likely by the time we hit publish there will be many more datasets and environments.

While going through all of these environments and datasets, we had a number of observations we thought were important to discuss.

First, unlike many large-scale datasets in other fields such as ImageNet, these datasets are often annotated entirely by the researchers themselves (usually graduate students). This makes scaling these datasets quite challenging and puts more of the burden of creating them on students. Specialized tools such as AgentNet have been developed to aid the annotation process, but the work is still highly specialized.

Second, we found it actually quite difficult to count the number of tasks versus number of instances. In many datasets, there is no real distinction made: each query to a web agent is considered a task and there is exactly one instance of that task. Some of these tasks can be similar or overlap, but there is no categorization of similar tasks or instances. In general, we try to report the size of the dataset in terms of the number of queries unless the dataset has a firm distinction between a task and instance, in which case we report both.

Finally, we note that, especially recently, there are an enormous number of datasets in this area, each with its own data, collection methodology, and input and action spaces. One challenge for the field going forward is maintaining consistent evaluations and deciding how to measure progress given this variety. Several papers have now discussed issues with evaluation and offered suggestions such as evaluating cost and eliminating agent shortcuts, standardizing evaluation harnesses, or building an arena to directly compare computer agents with humans.

Models and Methods

In this section we discuss the models and methodologies used in the field, from the pre-LLM era (Pre-LLM Work) to Computer Use LM Agents, where we discuss base models (Base Models for LLM Agents) and the wrappers or “Firmament” around them (Firmament), and finally proprietary systems and products (Proprietary Systems).

Pre-LLM Work

Inherent to this problem is taking repeated actions over time on some representation of a computer interface or website. Naturally, then, works have framed their methodology through the agent/environment paradigm discussed in What is Computer Use. Let’s journey through time to see how these approaches evolved:

Other papers explored ideas such as curriculum approaches, curiosity RL, and hierarchical RL.

Computer Use LM Agents

As LLMs and VLMs scaled on massive amounts of web data and thus became far more capable on language and vision tasks, approaches that directly use these models as a backbone began to dominate. This line of work is characterized by frameworks that exploit the LLM’s ability to call tools: the LLM calls external tools and capabilities (such as OCR modules or episodic memory modules) and, critically, calls itself as a tool.

These systems are often called “agents” or “agentic”, meaning that the system can take repeated actions over time. They do so through a finite state machine construction in which the system repeats a cycle of reading the task query, planning, interpreting the current input, and taking actions in the environment. This bears some resemblance to the sense-think-act cycle in robotics. To see this more clearly, see the figure below, which shows a typical control flow diagram for an LLM agent on computer or web control tasks.

Typical control flow diagram for an LLM agent showing how control passes between differently prompted versions of the LLM/VLM and external components.

We can think of the diagram as showing how control passes between differently prompted versions of the LLM or VLM (a planner or step planner, an input interpreter, an action decider as well as other kinds of modules such as a reflect module) and external/non LLM components. The user query (i.e. a question or goal the agent is asked to perform) is taken as input to an LLM prompted to act as a planning module. After an initial plan, control is passed to the LLM prompted to plan each step, which calls other modules to load and interpret the input and decide the action.

The input can either be a direct image of the environment, or some combination of the image observation, OCR outputs, parsed HTML code or DOM elements, or other representations from the environment. Then the action is decided (or no action is taken and the step is replanned) and passed to a module which interprets the LLM output string as an environment action. This can either be a direct API call (e.g. click(X, Y) or type(str)) or an action from a custom action space, which is then parsed and interpreted into a corresponding environment action. This loop continues until the planner decides that the agent has fulfilled the user query, at which point the agent stops acting. We call such agents Computer Use LM Agents.
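The control flow described above can be sketched as a small loop in which one model is called under different roles. This is a simplified illustration under stated assumptions: `fake_llm` is a scripted stand-in for a real LLM/VLM call, the role names and prompt format are invented for the example, and only the `click(X, Y)` and `type(str)` action forms come from the text (the `stop()` action is a hypothetical way to signal termination).

```python
import re

ACTION_RE = re.compile(r"^(click|type|stop)\((.*)\)$")


def fake_llm(role, prompt):
    """Scripted stand-in for a prompted LLM/VLM; a real agent would call a model."""
    if role == "planner":
        return "1. focus the search box 2. enter the query"
    if role == "interpreter":
        return "search box at (120, 40)"
    if role == "decider":
        # Decide the next action from the history embedded in the prompt.
        if "click(120, 40)" not in prompt:
            return "click(120, 40)"
        if "type(" not in prompt:
            return "type('cheap flights')"
        return "stop()"
    raise ValueError(f"unknown role: {role}")


def parse_action(text):
    """Interpret the model's output string as an environment action, or None."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        return None
    name, args = match.groups()
    try:
        if name == "click":
            x, y = (int(v) for v in args.split(","))
            return ("click", x, y)
    except ValueError:
        return None
    if name == "type":
        return ("type", args.strip().strip("'\""))
    return ("stop",)


def run_agent(query, env_step, llm=fake_llm, max_steps=10):
    plan = llm("planner", query)                 # initial plan from the user query
    history = []
    screen = "initial screen"
    for _ in range(max_steps):
        view = llm("interpreter", screen)        # interpret the current input
        raw = llm("decider", f"plan={plan}; view={view}; history={history}")
        action = parse_action(raw)
        if action is None:
            continue                             # no action taken; try the step again
        if action[0] == "stop":
            break                                # the model judges the query fulfilled
        screen = env_step(action)                # act in the environment
        history.append(raw)
    return history
```

Passing a recording function such as `lambda a: log.append(a) or "screen"` as `env_step` makes it easy to inspect which environment actions the loop emitted. Real systems differ mainly in how many roles they use and in what each prompt contains.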

What all of these agents have in common is:

An LLM or VLM backbone as well as any software that handles API calls if using an external service.
What we will here call the firmament which is the finite state machine that decides the control flow and the prompting used for different LLM components.
External tools or modules used.

Methods that use this framework then differ in several ways, including:

Differences in base models including whether it is a VLM or LLM using only text observations, and whether and how the base model is trained.
Differences in the firmament in the exact control flow used, which external modules are called, how the input and output spaces are represented.
Whether there is any test-time expansion of the base model and how that is handled.

These modules are neither all necessary nor exhaustive (some methods include reflect operations and others have a simplified control flow). Different methods add unique modules such as episodic memory, web knowledge, or even a code interpreter. What all of these agents have in common, however, is the finite state machine which allows the LLM to be called in a loop with different prompts for each step, the invocation of tools, and the interaction with the web environment through some representation of the input and actions.

Base Models for LLM Agents

Possibly the most important design decision is the choice of base model and whether it is fine-tuned for the task. This is usually determined by the general ability of the model as well as practical considerations such as cost or the ability to run or train it on the researchers’ available hardware.

Fixed Base Models

The first category of work does not train the LLM backbone, but relies on innovations in the firmament components or on adding new modules such as OCR. Much of the early LLM work on web agents uses prompted-only GPT-4(V), as at the time of its release it was one of the most capable models, was easily callable through an API, and accepted visual input. Later, much of the non-training work switched over to GPT-4o due to its greater affordability and performance. Claude-3, Claude-3.5 and GPT-4-Turbo are other popular choices.

Supervised Finetuning

Because base VLMs and LLMs are trained on typical web images and text, certain text and images may be out of distribution, and models will struggle on tasks that involve them. This is often the case for web tasks as well, so several works have trained models specifically for this task (e.g. xLAM). CogAgent, for example, adds a high-resolution image stack to CogVLM and finetunes on a large-scale dataset of GUI and OCR tasks. Similarly, OS-ATLAS creates a large GUI grounding corpus and finetunes Qwen2-VL and InternVL-2 to perform better on these tasks.

RL/Exploration Finetuning

Similarly, many works have looked at automatic exploration of web and OS environments for finetuning. These methods interact with the OS or web environment directly and use the resulting experience to update the model. In one approach, traces in the environment are created synthetically: a task is taken from the dataset training set or proposed by a prompted LLM, the base LLM generates actions in the environment, and another LLM then critiques the trace to determine whether the task was accomplished successfully. Another approach generates synthetic traces by starting with an environment trace and then labeling the task in hindsight. In all of these works, the synthetic datasets are used for supervised finetuning of a base LLM for the task.
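The two data-generation recipes above can be sketched as a filter-and-keep loop plus a relabeling step. This is a minimal illustration, not any paper's pipeline: the function names are hypothetical, and the `toy_*` functions are scripted stand-ins for the task-proposing LLM, the acting base model, and the critic LLM.

```python
import random


def collect_sft_data(propose_task, rollout, critic, n_tasks, seed=0):
    """Keep only the traces that the critic judges successful."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_tasks):
        task = propose_task(rng)        # from a training set or a prompted LLM
        trace = rollout(task, rng)      # base model acting in the environment
        if critic(task, trace):         # second model judging success
            dataset.append({"task": task, "trace": trace})
    return dataset


def relabel_in_hindsight(trace, describe):
    """Alternative recipe: start from a trace and name the task it accomplished."""
    return {"task": describe(trace), "trace": trace}


# Toy stand-ins so the sketch runs end to end.
TASKS = ["book a flight", "find store hours"]


def toy_propose(rng):
    return rng.choice(TASKS)


def toy_rollout(task, rng):
    steps = ["open_site", "search"]
    if rng.random() < 0.5:              # the toy agent only sometimes finishes
        steps.append("submit")
    return steps


def toy_critic(task, trace):
    return trace[-1] == "submit"


def toy_describe(trace):
    return "search and submit" if trace[-1] == "submit" else "run a search"


data = collect_sft_data(toy_propose, toy_rollout, toy_critic, n_tasks=20)
```

The resulting list of task/trace pairs is what would be handed to supervised finetuning. Note that hindsight relabeling can make use of traces that the first recipe would discard as failures.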

WebRL adopted a full RL training loop for exploring and fine-tuning base models for web agents. Like other works, it uses a base LLM to generate instructions, executes them using its current model, and uses a critic LLM to determine success or failure, which then serves as the reward for reinforcement learning. It also employs KL smoothing and a replay buffer that keeps only successful trajectories to avoid catastrophic forgetting. For datasets with direct environment rewards, other papers are able to finetune agents with RL directly.

Multi-task Agent Foundation Models

Most recently, a very popular approach has been to build new foundation models which include web agent data. One such work jointly trains on a number of tasks including computer use, simulated robotics and games. It finetunes a multimodal language model on embodied tasks, with both supervised fine-tuning and RL, making use of a multi-embodiment action tokenizer. Similarly, Magma is a foundation model trained on a variety of multimodal environments, including computer use, and specifically trained for planning in embodied settings. Others have trained foundation models exclusively for GUI agents by training across many different GUI environments.

Firmament

Another critical part of web agent architectures is the control flow state machine and the prompting techniques (which we here call the firmament but others might call scaffolding). This includes prompting strategies such as prompting agents to give responses in code, and added modules such as self-reflection, episodic memory of related examples or other information synthesized from past experience, hierarchical planning modules, online web search, and narrative planning.

One interesting area is incorporating test-time search into the agent. One work does this with an A*-like exploration of web actions, picking a state to expand at each step and considering the possible next states the agent could land in. Similarly, ExACT incorporates Reflective Monte Carlo Tree Search (R-MCTS), a variation on MCTS at test time that uses reflection to estimate state values. WebPilot likewise adopts an MCTS-like approach, using multiple LLMs acting in different capacities as Explorer, Verifier, Appraiser, and Controller. Another work uses model-based planning rather than a full search in the environment. Other works similarly employ explicit multi-agent systems to break down and execute different parts of a task.
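As a rough illustration of test-time search, the sketch below runs a plain best-first search over a toy page graph, always expanding the state with the highest estimated value. It is a simplification, not the algorithm of any cited work: `expand`, `value` and `is_goal` are hypothetical callables (in a real agent the value estimate would typically come from a prompted model), and the toy graph and scores are made up.

```python
import heapq
import itertools


def best_first_search(start, expand, value, is_goal, budget=50):
    """Repeatedly expand the most promising state seen so far."""
    tie = itertools.count()     # tie-breaker so states are never compared directly
    frontier = [(-value(start), next(tie), start, [])]
    seen = {start}
    while frontier and budget > 0:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        budget -= 1
        for action, nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(
                    frontier, (-value(nxt), next(tie), nxt, path + [action])
                )
    return None                 # budget exhausted or nothing left to expand


# Toy page graph: state -> list of (action, next state).
GRAPH = {
    "/home": [("click:deals", "/deals"), ("click:flights", "/flights")],
    "/deals": [("click:ad", "/ad")],
    "/flights": [("click:book", "/checkout")],
    "/ad": [],
    "/checkout": [],
}
# Made-up value estimates standing in for a learned or prompted evaluator.
SCORES = {"/home": 0.1, "/deals": 0.2, "/flights": 0.6, "/ad": 0.0, "/checkout": 1.0}

plan = best_first_search(
    "/home",
    expand=lambda s: GRAPH[s],
    value=lambda s: SCORES[s],
    is_goal=lambda s: s == "/checkout",
)
```

One practical caveat: searching like this assumes the agent can return to an earlier state, which is straightforward in a simulator but can be hard or impossible on a live website.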

Another important aspect of all of these frameworks is how they interact with the web or OS environment itself. For the representation of the input, there has been a great deal of experimentation: set-of-marks, where an id is overlaid on each UI element of interest; accessibility trees, built-in features of operating systems and browsers which tag UI elements with text; HTML, often reduced or filtered in some way such as by using a Document Object Model (DOM) tree; or even raw screenshots. Often several of these input types are combined.
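To give a feel for what a reduced HTML representation with element ids might look like, here is a minimal sketch using Python's standard `html.parser`. The choice of which tags count as interactive, the label fallback order, and the output format are all assumptions made for illustration; real systems use much more elaborate filtering.

```python
from html.parser import HTMLParser

INTERACTIVE = {"a", "button", "input", "select", "textarea"}
HAS_TEXT = {"a", "button"}      # tags whose inner text can serve as a label


class InteractiveElementExtractor(HTMLParser):
    """Keep only interactive elements and give each one a numeric id."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None       # element currently waiting for inner text

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE:
            return
        a = dict(attrs)
        label = a.get("aria-label") or a.get("placeholder") or a.get("value") or ""
        element = {"id": len(self.elements) + 1, "tag": tag, "label": label}
        self.elements.append(element)
        self._open = element if tag in HAS_TEXT else None

    def handle_data(self, data):
        # Fall back to the visible text when no attribute supplied a label.
        if self._open is not None and not self._open["label"]:
            self._open["label"] = data.strip()

    def handle_endtag(self, tag):
        if tag in HAS_TEXT:
            self._open = None


def simplify(html):
    parser = InteractiveElementExtractor()
    parser.feed(html)
    return "\n".join(f"[{e['id']}] <{e['tag']}> {e['label']}" for e in parser.elements)


PAGE = (
    '<div><h1>Flights</h1><input placeholder="From">'
    '<button>Search</button><a href="/help" aria-label="Help page">?</a>'
    "<p>ad text</p></div>"
)
summary = simplify(PAGE)
```

The resulting text lists one line per interactive element, each tagged with an id the model can refer to. This also shows why missing labels matter: an element with no attribute label and no inner text would appear with an empty description.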

The wide variety of inputs used by different models can often create issues with evaluations and benchmarking models. As shown in Table 5 of OSWorld (reproduced below), the same backbone model prompted differently and using different representations of the same input can have vastly different levels of performance. Ideally, comparisons would be made between models using the same assumptions about the input space.

Table 5 from OSWorld, showing the success rate of their baseline LLM and VLM agents on task categories of the OSWorld environment. Performance differs depending on the input type.

Similarly, methods deal with the action spaces of the environment in a wide variety of ways. In the earliest works, the action space was at the lowest level, made up of atomic mouse (x, y) positions and click actions, or click and typing actions. Some works such as UGround still use these action spaces, while other works have tried to stay grounded in the visual space without coordinates. For the most part, however, once language model agents became the dominant paradigm, the action space changed to better match the language output of LLMs. Works such as AutoWebGLM, Agent S and many others allow the language model to specify actions in templated language. In other words, they treat the computer action space as a tool. Clicks and typing are still allowed, but rather than specifying an (x, y) coordinate, the model specifies click(id) with the id of the UI element to click on. These works also define a set of pre-defined web actions such as jump_to(url, newtab), which navigates a particular tab to a specified url, so that some actions can be taken without specifying every mouse and keyboard action required.

With these custom action spaces, prompts tell the LLM the set of available actions, and some kind of parser or interpreter translates these language-specified actions into environment actions. The input and action spaces are often tightly linked. For instance, in some works the UI elements are labeled with ids in a way similar to set-of-marks, and those same ids are then used in the action space so the agent can interact with those same elements.
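A parser for such a templated action space might look like the sketch below. The `click(id)` and `jump_to(url, newtab)` forms come from the text; the two-argument `type(id, text)` form, the `Element` record, and the low-level operation names (`mouse_click`, `keyboard_type`, `navigate`) are assumptions invented for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    id: int
    label: str
    x: int      # screen position the id resolves to
    y: int


ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def unquote(s):
    return s.strip().strip("'\"")


def parse_action(text, elements):
    """Translate a templated action string into a low-level environment action."""
    by_id = {e.id: e for e in elements}
    match = ACTION_RE.match(text)
    if match is None:
        raise ValueError(f"not an action: {text!r}")
    name, raw = match.groups()
    if name == "click":
        el = by_id[int(raw)]
        return {"op": "mouse_click", "x": el.x, "y": el.y}
    if name == "type":
        target, typed = raw.split(",", 1)       # the text itself may contain commas
        el = by_id[int(target)]
        return {"op": "keyboard_type", "x": el.x, "y": el.y, "text": unquote(typed)}
    if name == "jump_to":
        url, new_tab = raw.rsplit(",", 1)
        return {"op": "navigate", "url": unquote(url),
                "new_tab": new_tab.strip().lower() == "true"}
    raise ValueError(f"unknown action: {name}")


ELEMENTS = [Element(1, "Search box", 120, 40), Element(2, "Submit", 300, 40)]
```

For example, `parse_action("click(2)", ELEMENTS)` resolves the id to the coordinates of the Submit element. The same ids shown to the model in its input are the ones it uses in its output, which is the tight input/action link described above.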

Proprietary Systems

It is also worth mentioning that web and OS control agents are not only an academic topic, but an emerging product area for AI companies. Google DeepMind has announced Project Mariner and later Gemini 3.0 with web browsing, OpenAI has announced Operator, Anthropic has announced Computer Use, and there are several others. Unfortunately, some of these models are in limited release, restricted either to trusted user groups or to premium subscriptions, making evaluation difficult.

Some of these projects have reported numbers on some popular computer use benchmarks, but none of these projects have released papers or detailed technical reports, so details about how the agents work and how the evaluations were conducted are not known. Based on the limited information released, it seems apparent that these agents correspond with the line of work described in Computer Use LM Agents, but other details such as how the base models are fine-tuned, data used for training etc are not known. Ultimately, these black-box releases are very relevant to the interest in computer use agents, but cannot be easily benchmarked or relied upon to contribute to the academic literature. This will likely continue to be a tension going forward, as it has been in the broader literature of LLMs.

Discussion

Areas for Improvement

In the next table, we (non-exhaustively) look at the performance of different methods on two popular computer/web agent tasks, WebArena and OSWorld. (Note that we do not include proprietary systems with unpublished methodologies.)

Model or Method	WebArena-Lite	WebArena	OSWorld
GPT-3.5 Turbo	–	6.2	–
GPT-4	–	14.4	12.24
GPT-4o	13.9	13.1	11.36
LLAMA2-7B	–	1.2	–
LLAMA2-70B	–	0.6	–
Llama3.1-Instruct	4.8	–	–
ScribeAgent + GPT-4o	53	–	–
AgentSymbiotic	52.1	–	–
WebRL	49.1	–	–
Learn-by-Interact	48	–	–
AgentOccam-Judge	45.7	–	–
NNetscape navigator	–	7.2	–
AutoWebGLM	–	18.2	–
AWM	–	35.5	–
WebPilot (GPT-4o)	–	37.2	–
ExACT	–	–	19.39
Agent S	–	–	20.58
OpenCUA	–	–	34.8
Seed1.5-VL	–	–	40

Progress on the popular benchmarks WebArena and OSWorld. Scores represent accuracy percentages.

One observation to make here is that great improvements have been made on all of these tasks, but accuracies are still quite low, all below 60%, and many of the higher numbers are quite recent! More recent benchmarks are even more difficult. So what might be holding models back?

Planning

One area where computer use agents often struggle is planning. A common issue is that LLM agents will get stuck in a state, repeatedly trying to perform some action unsuccessfully, or will be distracted by an irrelevant web element. An analysis in Figure 3 of WebRL, for instance, shows that this is one of the most common issues for many baseline agents.

Figure 3 from WebRL, showing the most common error types for different web agents, with planning-related issues being predominant.

As mentioned earlier, many works try to solve this issue by using an LLM as a “planner”, by relying on the LLM itself to recognize that it is stuck, or by using some test-time search to escape loops.
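As a point of comparison with these LLM-based remedies, getting stuck in the simplest sense (the same action retried in the same state) can also be flagged with a few lines of bookkeeping. The sketch below is a hypothetical heuristic, not a method from the works discussed; the window size and repeat threshold are arbitrary.

```python
from collections import deque


class LoopDetector:
    """Flag when the same (state, action) pair keeps recurring in recent steps."""

    def __init__(self, window=6, max_repeats=3):
        self.recent = deque(maxlen=window)   # only the last `window` steps count
        self.max_repeats = max_repeats

    def update(self, state_key, action):
        """Record one step; return True if the agent looks stuck."""
        step = (state_key, action)
        self.recent.append(step)
        return self.recent.count(step) >= self.max_repeats
```

A harness could call `update` after every step and trigger a replan or a search when it returns True. Here `state_key` is assumed to be some hashable summary of the page, such as its url plus a hash of the visible text. Such a check catches only exact repetition; it does not help with distraction by irrelevant elements.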

Input/Output Representation

Another issue is environment grounding, which can take many forms. For agents which use the raw pixel input of the environment, this is a problem of visual grounding: literally, can the model understand what is in an image and where. And while many methods use non-visual representations of the current web or computer state (see Firmament), they still have the same grounding problem, only in an input space of HTML (or another non-visual space).

A common qualitative issue mentioned in many papers is that important UI elements are not recognized by the system. This can be due to a specific failure in the tool representing the input (e.g. failures of the non-visual tools used to represent the input, such as Set-of-Marks), or to the source itself: accessibility trees or DOMs, for instance, sometimes contain errors or missing labels. Works which specifically use image inputs often have the problem that the VLMs were not adequately trained on web images, and out-of-domain image understanding is a well-known issue in VLMs. Ultimately, these issues may come down to basic issues of visual recognition, which some have used to argue for specific fine-tuning for better visual recognition of GUIs.

Lack of Training Data

One obvious issue (which is implicit in bad grounding and planning) may be that many base LLMs or VLMs are simply not well aligned to the task because they are inadequately trained on this distribution. Attempts have been made to create larger datasets for training, especially for visual finetuning. However, one major issue, seen generally in long-horizon RL problems, is that training can require a lot of data, especially for long trajectories, and that over long trajectories policies can suffer a distribution shift away from the training trajectories, leading to poor performance. As discussed in Base Models for LLM Agents, newer methods have often incorporated explicit exploration using RL or other exploration methods to gather more training trajectories. In particular, WebRL has had great success finetuning directly through reinforcement learning on the live environment with policy gradients.

Long-horizon Problems are Hard

Another answer is simply that many web or computer tasks are difficult because they are fundamentally a long-horizon action problem. We have a description of some task or problem in language and we expect agents to take many consecutive steps before being able to resolve the query. In the reinforcement learning literature, there is a well-known issue of “sparse rewards”: agents trained on such rewards find it difficult to explore well enough to reliably find them. A related problem is the credit assignment problem, which is well known in the RL literature and was described by Marvin Minsky: “In applying such methods to complex problems, one encounters a serious difficulty-in distributing credit for success of a complex strategy among the many decisions that were involved”. In a greater sense, to ask why complex multi-step web or computer tasks are difficult is asking the same questions that practitioners of AI have been asking since the founding of the field.
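A small worked example may help show why sparse rewards make credit assignment hard. The sketch below computes standard discounted returns for a ten-step episode whose only reward arrives at the end; the episode length and the discount factor of 0.9 are arbitrary choices for illustration.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


# Ten-step task with a single reward at the very end (task solved).
success = discounted_returns([0.0] * 9 + [1.0])
# The same ten steps when the task is not solved: no signal at all.
failure = discounted_returns([0.0] * 10)
```

In the successful episode every step receives a positive return that differs only by the discount, whether or not that particular step helped. In the failed episode every return is zero, so nothing distinguishes a nearly correct trajectory from a hopeless one.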

Safety/Ethical Concerns

The emergence of more and more capable computer use agents has raised a number of new ethical and safety issues. One prominent issue is the possibility of web agents being put to malicious use. Hard-coded web agents (often colloquially called “bots”) have already been cited as a major issue, with so-called “scalper bots” being used to buy limited-quantity items which are then resold at massive markups. With the possibility of even more sophisticated bots, these concerns become even more acute. For instance, more adaptable and intelligent bots could be able to defeat mechanisms meant to deter these behaviors, such as CAPTCHAs. And as web agents become more capable, they could help automate more web activities used for fraudulent or malicious purposes.

Another concern is privacy and security. Computer use agents acting on behalf of users have access to extremely sensitive information such as users’ profiles, passwords, financial transactions, social media messages and many other kinds of data. This information could accidentally or even maliciously be accessed by the agent and sent to others. Mistakes in actions could, for instance, cause web agents to accidentally send unencrypted passwords to others by email or chat websites. Web agents could also be deliberately designed to steal such information. For a more thorough discussion of the ethical issues involved in agents, the blog post “AI Agents Are Here. What Now?” provides a detailed breakdown.

Agentic Web

A recurring discussion point in the philosophy of computer use agents is that current interfaces are built for humans, not autonomous agents. This has motivated arguments for rearchitecting the web for agents rather than forcing agents to reverse-engineer the human-centric interfaces. Stemming from this view, the Agentic Web is framed as a new phase of the internet, in which autonomous agents transact, coordinate, and negotiate on behalf of users over infrastructures explicitly designed for agentic activities.

On a similar note, Agentic Web Interfaces (AWI) discusses an interaction layer optimized for perception and action with the web interface. Designing AWIs around principles of safety, efficiency, and standardization reframes core challenges of web agents, e.g., DOM parsing and UI understanding, as shared responsibilities of web developers, standards bodies, and agent designers. Yet the argument for building web agents still remains in the era of agentic web, as we may need web agents to navigate through the older pre-agentic websites.

Agentic Inference Cost

Agentic systems carry a much higher cost multiple than traditional zero-shot LLM/VLM calls, mainly due to iterative reasoning loops, multi-stage planning, and the “unreliability tax” of retries. The tax is due not solely to the extra actions, but also to the extra reflection as these agents reason, act, and reassess outcomes, as humans do. Estimates of tokens generated during agentic inference are unreliable, as research shows that performance scales with the number of test-time tokens.

While the unit price of tokens has decreased, with some benchmarks seeing performance-adjusted costs drop by 40x annually, the volume of tokens generated by “super agents” can be 5–25x higher than basic inferences. Current research points to the need for cost-controlled evaluation and to jointly optimize performance and cost. Recent optimization efforts, such as Agent Workflow Optimization (AWO), focus on “meta-tools” to coalesce structural tool calls, which can reduce total LLM calls by roughly 12%. Optimizing agentic inference could be the key to widespread real-world deployment of AI agents.
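The arithmetic behind these multiples is simple enough to sketch. In the example below, the 25x token multiple and the 12% reduction in calls echo the figures quoted above, while the token price, tokens per call, and success rate are hypothetical numbers chosen only for illustration. The retry model, which reruns the whole task until it succeeds and treats attempts as independent, is likewise an assumption.

```python
def call_cost(tokens, usd_per_million_tokens):
    """Cost in USD of a single model call."""
    return tokens * usd_per_million_tokens / 1_000_000


def agent_run_cost(n_calls, tokens_per_call, usd_per_million_tokens, success_rate=1.0):
    """Expected cost of one solved task when failed runs are retried from scratch.

    Assumes independent attempts, so the expected number of attempts
    is 1 / success_rate.
    """
    one_attempt = n_calls * call_cost(tokens_per_call, usd_per_million_tokens)
    return one_attempt / success_rate


PRICE = 10.0                                   # hypothetical USD per million tokens

single = call_cost(2_000, PRICE)               # one zero-shot call
agent = agent_run_cost(25, 2_000, PRICE)       # 25 calls, so 25x the tokens
fewer_calls = agent_run_cost(22, 2_000, PRICE)             # 12% fewer calls
unreliable = agent_run_cost(25, 2_000, PRICE, success_rate=0.5)
```

Under these assumptions, halving the success rate doubles the expected cost per solved task, a far larger effect than the saving from trimming 12% of the calls. This is one reason to report cost alongside accuracy.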

Figure 2 from AI Agents that Matter, showing joint optimization that maintains accuracy while reducing cost.

But Why Computer Use Agents?

As discussed in the Introduction, the two most common goals for computer use agents are automation and accessibility. The automation motivation is compelling. As envisioned in WebGPT: “Picture this scenario: You type in a task description, then relax and enjoy a cup of coffee while watching tasks like booking tickets online, conducting web searches, managing files, and creating PowerPoint presentations get completed automatically.” This vision of seamless task automation drives much of the current research effort.

The other motivation is accessibility, with researchers stating that this line of work will “make digital devices more accessible”. Several potential issues emerge, however. One concern is that many current agents rely implicitly or explicitly on accessibility features already built into operating systems or the web to function. Unfortunately, 95.9% of webpages contain accessibility errors, including missing alt text on images or missing form input labels, at around 57 errors per page. Incorrect web accessibility markup is already a problem for those who use these features directly, and it makes it more difficult for web agents to help these same people if the agents rely on the same features being correct.

There can often be a large disconnect between the use cases, datasets and models thought up by researchers (often mostly sighted people) and those actually important and useful to blind and visually impaired people. As noted in prior work which studied this in the context of Visual Question Answering, the kinds of questions actually asked by blind users differed significantly from those in prior VQA datasets. Similarly, we might expect that the use cases of the visually impaired for online agents or assistants might be significantly different from those imagined by current lines of research. Future work which seeks to improve accessibility must similarly use the actual data and requirements of blind users to be genuinely helpful as the overall capabilities of these systems improve.

Nevertheless, the potential certainly exists for these agents to be useful for accessibility in the future, not just for those with visual impairment, but also for people with severe neuropathy or motor control symptoms, as well as people with cognitive or memory issues which might make common web tasks difficult. Computer use agents will first have to drastically improve in performance, and researchers will have to work with health researchers, providers and affected people to develop truly useful accessibility features using this technology.

Author’s Note

This survey was adapted from the original interactive web version created by the authors. We tried to take advantage of the web as a medium, specifically to make the survey a bit different and more interactive for readers. In addition, certain parts of the survey, such as the list of datasets, would simply not have worked in a normal linear blog. Claude Sonnet 4 was used to help build the website. Special thanks to Dai-Jie Wu for providing valuable feedback and helping debug browser compatibility issues.

Author Contributions

Initial paper collection was done jointly by KM and AM. Most sections were written by KM, except for Environments and Datasets, which were split between KM and AM, and Agentic Web and Inference Cost, which were written by MFI. The original website was created with Claude Sonnet 4, and the transfer from the original website to the blog was done by MFI. Final editing was done by both KM and AM.
