An introductory guide to the evaluation of LLM-based agents. We explore what makes agent evaluation different from traditional LLM benchmarks, how to measure success, safety, and trajectory quality, and highlight open challenges in the field.
As Large Language Models (LLMs) evolve from standalone text generators into autonomous agents capable of taking actions in the real world, the way we evaluate them must fundamentally change. Traditional benchmarks that measure text quality or accuracy are no longer sufficient. We need evaluation frameworks that assess whether agents can perform multi-step tasks in dynamic environments to achieve goals in a reliable and safe way.
This post provides a hitchhiker’s guide to the emerging field of agent evaluation. It begins by detailing the key distinctions from traditional LLM evaluation, and then describes how these differences shape evaluation solutions. We organize the post around the main questions in the field, offering an easy entry point for newcomers.
The shift from LLMs to agents introduces three fundamental changes in evaluation philosophy:
LLM benchmarks mostly assess one-step tasks. Agents handle long-horizon tasks requiring planning and multiple steps. For example, the SWE-bench coding tasks require editing multiple functions and files to fix a bug, going far beyond single-line code generation.
To make an analogy, LLM evaluation is like examining the performance of an engine. Agent evaluation, in contrast, assesses the car as a whole, and under various driving conditions.
Traditional LLM evaluation focuses on text-generation quality, accuracy on a benchmark, likelihood scores, or fluency metrics. Agent evaluation, by contrast, focuses on task completion. We care about whether a goal is achieved (e.g., a flight booked, a GitHub issue solved), not just the plausibility of generated text.
Agents operate in dynamic environments, interacting with users or APIs. This means evaluation must account for interactivity and adherence to external rules. For instance, τ-bench highlights that an agent must gather user information, call backend APIs, and follow domain-specific policy rules during a conversation. Safety also becomes critical: unlike pure LLM tasks, an agent evaluation must check for policy compliance and avoidance of unsafe actions (e.g., deleting the codebase).
```mermaid
flowchart LR
    subgraph LLM["Traditional LLM Evaluation"]
        A[Input Prompt] --> B[Text Output]
        B --> C[Quality Metrics]
    end
    subgraph Agent["Agent Evaluation"]
        D[Goal/Task] --> E[Multi-step Actions]
        E --> F[Environment Interaction]
        F --> G[Outcome Assessment]
        G --> H[Safety & Policy Check]
    end
```
When evaluating a fixed agent framework, performance reflects the underlying LLM’s capability (e.g., tool-calling accuracy, reasoning). In this case, we are effectively measuring the model’s problem-solving ability on multi-step agentic tasks.
By contrast, when comparing different agent architectures or “scaffolds,” the evaluation measures the full agent architecture. Leaderboards such as those for SWE-bench and AppWorld focus on evaluating the different scaffolds, not only the model itself. More broadly, the Holistic Agent Leaderboard (HAL) conducts standardized trials across many tasks and frameworks to isolate architectural effects.
Most agent benchmarks report success rates or task completion percentages as the main metric (analogous to accuracy). These metrics test whether the agent was able to achieve the task, but other auxiliary measures are needed to evaluate the quality of the agent's multi-step actions (its trajectory) and its efficiency, as pointed out by the AI Agents That Matter paper. Auxiliary measures include:
| Metric Type | Examples |
|---|---|
| Primary | Success rate, task completion % |
| Efficiency | Latency, token cost, number of steps |
| Partial Credit | Subtask completion, milestone-based accuracy |
| Trajectory Quality | Action sequence correctness, tool usage accuracy |
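To make these categories concrete, here is a minimal sketch of how an evaluation harness might aggregate a primary success rate alongside auxiliary metrics over a set of trials. The `Trial` record and its field names are illustrative, not taken from any particular benchmark:

```python
from dataclasses import dataclass

@dataclass
class Trial:
    """One agent run on one task (field names are illustrative)."""
    success: bool          # primary: did the agent achieve the goal?
    steps: int             # efficiency: number of actions taken
    latency_s: float       # efficiency: wall-clock time
    token_cost: int        # efficiency: tokens consumed
    subtasks_done: int     # partial credit: milestones reached
    subtasks_total: int    # partial credit: milestones defined

def summarize(trials: list[Trial]) -> dict[str, float]:
    """Aggregate the primary success rate plus auxiliary metrics."""
    n = len(trials)
    return {
        "success_rate": sum(t.success for t in trials) / n,
        "avg_steps": sum(t.steps for t in trials) / n,
        "avg_latency_s": sum(t.latency_s for t in trials) / n,
        "avg_token_cost": sum(t.token_cost for t in trials) / n,
        # Fraction of subtasks completed, averaged over trials.
        "partial_credit": sum(t.subtasks_done / t.subtasks_total
                              for t in trials) / n,
    }
```

Reporting all of these side by side avoids the trap of comparing agents on success rate alone while ignoring how many steps or tokens each one burned.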
Other metrics from traditional NLP (perplexity, F1) are rarely appropriate for agents because the “text output” is just one small part of the process. Instead, agent evaluation often includes metrics for action sequences, tool usage, and end-state correctness. This reflects a shift toward semantic evaluation (did the right thing happen?) and away from purely syntactic evaluation (does the text look right?).
Beyond raw performance, agents must demonstrate reliability and safety. This section covers three critical dimensions.
Because LLM agents are nondeterministic, it is not sufficient to measure only the success rate; we also need to measure how reliably an agent performs a task over multiple runs. Common metrics are the pass@k and pass^k rates, where pass@k is the probability that at least one of k independent trials succeeds, and pass^k is the probability that all k trials succeed.
For example, τ-bench explicitly introduced pass^k to quantify agent consistency. In practice, modern agents often have high pass@1 but rapidly falling pass^k. Yao et al. report that GPT-4’s success on τ-bench drops from ~61% (pass@1) to only ~25% for pass^8, underscoring that a good agent must not only succeed sometimes, but succeed consistently.
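Both rates can be estimated from n recorded trials containing c successes using standard combinatorial estimators (pass@k as popularized in code-generation evaluation, pass^k as defined for τ-bench). A minimal sketch, with illustrative function names:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled trials succeeds),
    estimated from c successes observed in n trials."""
    if n - c < k:          # too few failures to fill a k-sample with them
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k sampled trials succeed) -- the pass^k consistency metric."""
    if c < k:              # not enough successes to fill a k-sample
        return 0.0
    return comb(c, k) / comb(n, k)
```

With n=8 trials and c=4 successes, pass@2 ≈ 0.79 while pass^2 ≈ 0.21: the same run data can look strong or weak depending on which question you ask.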
In interactive or enterprise settings, agents must obey rules or policies. Benchmarks now include safety constraints as part of the task. For instance, ST-WebAgentBench explicitly provides a hierarchy of organizational policies and measures whether the agent completes the task under those policies.
A proposed metric is Completion under Policy (CuP), which gives credit only if no policy is violated. Studies find that state-of-the-art agents often fail on these criteria—for example, many succeed at completing a web task but ignore critical safety rules. These metrics are especially important for high-risk organizational agents.
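As an illustration, CuP-style scoring might be computed as below, assuming each run records whether the task was completed and how many policy violations occurred. The function name and run format are illustrative, not ST-WebAgentBench's actual implementation:

```python
def completion_under_policy(runs: list[tuple[bool, int]]) -> float:
    """CuP: fraction of runs that complete the task with ZERO policy
    violations. Each run is (task_completed, policy_violation_count)."""
    return sum(done and violations == 0
               for done, violations in runs) / len(runs)

def plain_success(runs: list[tuple[bool, int]]) -> float:
    """Ordinary success rate, ignoring policy violations."""
    return sum(done for done, _ in runs) / len(runs)
```

The gap between the two numbers is exactly the failure mode the metric is designed to expose: runs that "succeed" while breaking a rule.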
Additional evaluations probe harmful or unsafe behaviors by design. For example, the CoSafe benchmark feeds agents adversarial prompts (e.g., requests for illicit instructions) and measures the rate of unsafe completions.
In practice, one might report the percentage of trials in which the agent violates a safety rule (akin to a “failure rate” under adversarial stress). These measures complement success metrics, ensuring an agent is not only effective but also aligned with ethical and safety standards.
Beyond final outcomes, understanding how an agent arrives at its solution is crucial. This section covers trajectory-level evaluation.
Many agent tasks are naturally hierarchical. Benchmarks often define intermediate checkpoints or key subgoals along the trajectory. For instance, the TheAgentCompany benchmark explicitly designs tasks that require many consecutive steps and provides partial credit for completing subtasks.
By breaking a task into milestones, evaluators can compute metrics such as subtask completion rate and milestone-based partial credit.
This provides a finer-grained view of progress than a single binary outcome.
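As a sketch, milestone-based partial credit can be computed as a weighted fraction of checkpoints reached. The milestone names and weights below are hypothetical, not TheAgentCompany's actual rubric:

```python
def milestone_score(reached: set[str], milestones: dict[str, float]) -> float:
    """Weighted partial credit: the total weight of milestones the agent
    reached, normalized by the total weight of all milestones."""
    total = sum(milestones.values())
    earned = sum(w for name, w in milestones.items() if name in reached)
    return earned / total

# Hypothetical rubric for a bug-fixing task, with later milestones
# weighted more heavily than early ones.
RUBRIC = {
    "clone_repo": 1.0,
    "reproduce_bug": 2.0,
    "fix_bug": 3.0,
    "tests_pass": 4.0,
}
```

An agent that reproduces the bug but never fixes it scores 0.3 rather than 0, giving a finer-grained signal than a binary pass/fail.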
A recent trend is to use LLMs (or agents) themselves to evaluate trajectories. The LLM-as-a-Judge paradigm employs a large model to score or critique an agent’s multi-step output.
For example, Zhuge et al. (2024) propose an Agent-as-a-Judge framework: multiple AI agents read an execution trace and vote on success. Such approaches can automatically assess factors such as plan coherence, error recovery, and whether intermediate requirements are satisfied. They remain experimental, but they show promise for scalable evaluation of otherwise subjective qualities.
Since an agent's interaction with its environment is based on tool calling, evaluating the sequence of tool invocations is critical. Key questions include: Was a tool call needed at each step? Was the correct tool chosen? Were the calls made in the right order?
```mermaid
flowchart TD
    A[Agent Action] --> B{Tool Call?}
    B -->|Yes| C[Invocation Accuracy]
    B -->|No| D[Text Response]
    C --> E[Tool Selection Accuracy]
    E --> F[Sequence Analysis]
    F --> G[Graph-based Metrics]
    G --> H[Node F1: Correct tools]
    G --> I[Edge F1: Correct ordering]
    G --> J[Edit Distance: Path similarity]
```
Metrics include:
| Metric | What it Measures |
|---|---|
| Invocation Accuracy | Whether the agent correctly decided to make (or skip) a tool call at each step |
| Tool Selection Accuracy | Was the correct tool chosen? |
| MRR/NDCG | Ranking quality of tool selection |
| Node F1 | Correct tools chosen (graph-based) |
| Edge F1 | Correct ordering of tools |
| Normalized Edit Distance | Similarity to reference trajectory |
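The graph-based metrics in the table can be sketched as follows, treating a trajectory as a sequence of tool names: Node F1 compares the sets of tools used, Edge F1 compares consecutive tool-call pairs, and normalized edit distance measures overall path similarity. This is an illustrative sketch, not any benchmark's reference implementation:

```python
def f1(pred: set, gold: set) -> float:
    """Set-overlap F1; two empty sets count as a perfect match."""
    if not pred or not gold:
        return float(pred == gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def node_f1(pred_seq: list[str], gold_seq: list[str]) -> float:
    """Were the right tools used, regardless of order?"""
    return f1(set(pred_seq), set(gold_seq))

def edge_f1(pred_seq: list[str], gold_seq: list[str]) -> float:
    """Were tools chained in the right order (consecutive pairs)?"""
    edges = lambda s: set(zip(s, s[1:]))
    return f1(edges(pred_seq), edges(gold_seq))

def normalized_edit_distance(pred_seq: list[str], gold_seq: list[str]) -> float:
    """Levenshtein distance over tool names, normalized to [0, 1]."""
    m, n = len(pred_seq), len(gold_seq)
    dp = list(range(n + 1))          # one-row DP over the gold sequence
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                    # delete
                        dp[j - 1] + 1,                # insert
                        prev + (pred_seq[i - 1] != gold_seq[j - 1]))  # subst
            prev = cur
    return dp[n] / max(m, n, 1)
```

Note that a trajectory can score a perfect Node F1 yet zero Edge F1 if the agent calls all the right tools in the wrong order, which is why these metrics are usually reported together.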
In addition, execution-based evaluation actually runs the tool calls rather than only inspecting them: for instance, GorillaBench executes each proposed function call to verify it produces the right output.
Despite rapid progress, several fundamental challenges remain in agent evaluation.
Current agent evaluations are resource-intensive. Running complex tasks with many trials (especially using large models) can cost thousands of dollars. HAL’s evaluation harness reduced wall-clock time, but still required 21,730 agent rollouts across 9 benchmarks at a cost of about $40,000.
Future work must make such large-scale evaluation substantially cheaper and more reproducible.
As model inference is expensive, a key question is how to balance performance against computational cost. HAL, for example, emphasizes Pareto frontiers of accuracy vs. inference cost.
Such multi-objective evaluation is still nascent: we need standardized ways to report cost (token usage, latency, cloud bills) alongside success rates. Without this, improvements may hide astronomical costs.
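As a sketch of such multi-objective reporting, the Pareto frontier of accuracy vs. cost can be extracted by filtering out dominated agents. The agent names and numbers below are hypothetical:

```python
def pareto_frontier(agents: list[tuple[str, float, float]]) -> list[str]:
    """Return the names of agents not dominated on (accuracy up, cost down).
    Each agent is (name, accuracy, cost_usd); names must be unique."""
    frontier = []
    for name, acc, cost in agents:
        dominated = any(
            a2 >= acc and c2 <= cost and (a2 > acc or c2 < cost)
            for n2, a2, c2 in agents
            if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

An agent belongs on the frontier if no other agent is at least as accurate for at most the same cost: a cheap, modestly accurate agent and an expensive, highly accurate one can both be defensible choices, while anything strictly worse on both axes is not.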
Evaluating agents over extended interactions remains challenging. Most benchmarks cover minutes-long tasks; long-term autonomy would involve days or continuous deployment.
Some recent efforts study simulated multi-day environments, or tasks with long-horizon goals. However, metrics for tracking sustained goal achievement or adaptation over time are still underdeveloped. How do we measure an agent’s ability to pursue a goal if the task evolves over hours or days? This “life-long” evaluation is an open frontier.
Many benchmarks focus on narrow domains, but we aspire to agents that generalize across tasks and environments. Evaluating such generalist agents requires broad, heterogeneous test suites.
TheAgentCompany attempted this by mixing coding, management, and finance tasks; even so, the best agent solved only ~24% of tasks.
Open questions include how to aggregate scores across heterogeneous domains, how to measure transfer to environments unseen during development, and whether a single leaderboard number can meaningfully summarize a generalist agent.
Agent evaluation is at an exciting inflection point. As LLMs become autonomous actors in the world, we need evaluation paradigms that go beyond text quality to assess task completion, trajectory quality, reliability across repeated runs, safety and policy compliance, and cost efficiency.
The benchmarks and metrics described here—from SWE-bench and τ-bench to HAL and ST-WebAgentBench—represent important first steps. But as agents become more capable and are deployed in higher-stakes domains, the evaluation frameworks must continue to evolve.
For practitioners, the key takeaway is: don’t just measure whether your agent works; measure whether it works safely, consistently, and efficiently. For researchers, the open problems in scalability, long-term autonomy, and generalist evaluation offer rich opportunities for contribution.
The hitchhiker’s guide to agent evaluation is still being written—and there’s plenty of galaxy left to explore.