Agent Evaluation Benchmarks
Complete landscape of AI agent evaluation benchmarks: coding, web, general, multi-agent, safety, tool-use, long-horizon, memory.
Agent Evaluation Benchmarks — Complete Landscape
1. Coding Benchmarks
SWE-bench Family
| Variant | Tasks | Top Score | Status / Notes |
|---|---|---|---|
| Verified | 500 | 80.9% (Claude Opus 4.5) | ⚠️ Contaminated — OpenAI stopped reporting |
| Pro | 1,865 | 59% (agent systems) | ✅ Contamination-resistant |
| Multilingual | 300 | ~45% | 9 languages |
| Live | 1,565+ | ~40% | Monthly rolling, contamination-free |
Critical insight: The 35-point gap between Verified (80.9%) and Pro (45.9%) for the same model (Claude Opus 4.5) reveals that Verified is heavily contaminated.
SWE-bench Pro leaderboard (standardized scaffolding):
- Claude Opus 4.5: 45.9%, Claude Sonnet 4.5: 43.6%, Gemini 3 Pro: 43.3%, GPT-5 High: 41.8%
Agent systems (custom scaffolding) outperform raw models:
- GPT-5.3-Codex CLI: 57.0%, Claude Code + Opus 4.5: 55.4%
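For context on what these percentages count: SWE-bench marks a task as resolved only when the model's patch makes the issue's originally failing tests pass without breaking the existing test suite. A minimal sketch of that rule (helper names here are illustrative, not the official harness API):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Test outcomes after applying one candidate patch to one SWE-bench task."""
    fail_to_pass: dict[str, bool]  # issue tests the gold fix is expected to turn green
    pass_to_pass: dict[str, bool]  # regression tests that must stay green

def is_resolved(r: TaskResult) -> bool:
    # Resolved = every FAIL_TO_PASS test now passes and no PASS_TO_PASS test broke.
    return all(r.fail_to_pass.values()) and all(r.pass_to_pass.values())

def resolution_rate(results: list[TaskResult]) -> float:
    # The leaderboard number: resolved tasks / total tasks.
    return sum(is_resolved(r) for r in results) / len(results)
```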
Other Coding Benchmarks
| Benchmark | Tasks | Top Score | Saturation / Notes |
|---|---|---|---|
| HumanEval+ | 164 | ~99% | Saturated |
| MBPP+ | 974 | ~90% | Near-saturated |
| BigCodeBench | 1,140 | ~35% | Active |
| LiveCodeBench | 1,000+ | 91.7% | Rolling, clean |
| SWE-rebench | Various | ~40% | Realistic multi-file |
| Commit0 | New | — | Emerging; code from intent |
2. Web/GUI Benchmarks
WebArena Family
| Benchmark | Tasks | Best Score | Focus |
|---|---|---|---|
| WebArena | 812 | 71.6% (OpAgent) | Multi-step web tasks |
| VisualWebArena | 910 | ~19.8% | Visual understanding |
| WebArena-Lite | 241 | ~60% | Lighter eval |
WebArena Leaderboard (April 2026):
- Claude Mythos Preview: 68.7%, GPT-5.4 Pro: 65.8%, Claude Opus 4.6: 64.5%
- OpAgent (CodeFuse AI) leads at 71.6% with Planner-Grounder-Reflector-Summarizer pipeline
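Public write-ups describe the Planner-Grounder-Reflector-Summarizer split only at a high level; below is a rough sketch of how such a four-stage loop could be wired. Every name in it (run_web_task, browser.observe, and so on) is a hypothetical stand-in, not OpAgent's actual interface.

```python
def run_web_task(goal: str, browser, llm, max_steps: int = 30) -> str:
    """Hypothetical four-stage loop: plan, ground to a concrete action,
    reflect on the outcome, then summarize the final answer."""
    history = []
    for _ in range(max_steps):
        # Planner: decide the next sub-goal from the task and the trajectory so far.
        subgoal = llm(f"Goal: {goal}\nHistory: {history}\nNext sub-goal?")
        # Grounder: map the sub-goal to one concrete UI action (click/type/navigate)
        # using the current page observation.
        action = llm(f"Sub-goal: {subgoal}\nPage: {browser.observe()}\nEmit one action.")
        observation = browser.execute(action)
        # Reflector: judge whether the action advanced the task; revise or stop.
        verdict = llm(f"Sub-goal: {subgoal}\nResult: {observation}\nDone, retry, or stop?")
        history.append((subgoal, action, verdict))
        if "stop" in verdict.lower():
            break
    # Summarizer: compress the trajectory into the final answer string.
    return llm(f"Goal: {goal}\nTrajectory: {history}\nFinal answer:")
```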
OSWorld (Desktop)
- 369 real OS tasks on actual VMs
- Best: 76.26% (OSAgent), Human baseline: 72.4%
- Hardest task type: LibreOffice Calc (~8%)
Other Web/GUI
| Benchmark | Tasks / Scope | Notes |
|---|---|---|
| Mind2Web | 2,350 | 137 real websites |
| Mind2Web 2 | 130 live | NeurIPS 2025, long-horizon agentic search |
| Online-Mind2Web | 300 live | Real-time browsing + cross-site synthesis |
| WebLINX | 2,337 | Multimodal web agent |
| OmniACT | 9,802 | Large-scale UI automation |
| WindowsAgentArena | Windows desktop | ~25% best |
| Android-in-the-Wild | Real devices | ~90%+ (saturated) |
| Computer Agent Arena | Human-centric | 2,201 votes head-to-head |
3. General Agent Benchmarks
GAIA (General AI Assistants)
The benchmark for real-world assistant capability beyond coding.
| Level | Steps | Human | Best Agent |
|---|---|---|---|
| Level 1 | ~10 | 92% overall | 82.1% (Claude Sonnet 4.5 + HAL) |
| Level 2 | ~20 | — | 74.4% |
| Level 3 | 30+ | — | 65.4% (HAL), 61% (Writer’s Action Agent) |
Key insight: the HAL scaffolding framework adds ~30 points on GAIA; tool orchestration matters more than raw model capability.
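The Best Agent percentages come from quasi-exact-match grading against a single gold answer rather than an LLM judge. A simplified sketch of that comparison follows; the official GAIA scorer handles more normalization edge cases than this:

```python
import re

def normalize(value: str) -> str:
    """Lowercase, trim, and strip punctuation so superficially different but
    equivalent answers compare equal (simplified)."""
    value = value.strip().lower()
    return re.sub(r"[^\w\s.-]", "", value).strip()

def quasi_exact_match(prediction: str, gold: str) -> bool:
    # Numeric answers: compare as floats so "1,234" and "1234.0" match.
    try:
        return float(prediction.replace(",", "")) == float(gold.replace(",", ""))
    except ValueError:
        pass
    # List answers: compare element-wise after normalization.
    if "," in gold:
        pred_items = [normalize(p) for p in prediction.split(",")]
        gold_items = [normalize(g) for g in gold.split(",")]
        return pred_items == gold_items
    # Plain strings: normalized exact match.
    return normalize(prediction) == normalize(gold)
```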
τ-bench (Tool-Agent-User)
| Domain | Best Score | Model |
|---|---|---|
| Retail | 89.2% | Claude Mythos |
| Airline | 87.5% | Claude Sonnet 4.6 |
| Telecom | ~80% | Various |
τ³-bench adds full-duplex voice, knowledge retrieval (698 banking documents), and 75+ task fixes.
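In each domain the setup is the same dual-control loop: the agent converses with a simulated user, calls domain tools that mutate a backing database, and is scored on the state it leaves behind. A schematic sketch of that loop (method names are illustrative, not the actual τ-bench API):

```python
def evaluate_episode(agent, user_sim, env, max_turns: int = 30) -> bool:
    """Schematic tool-agent-user loop: the agent chats with a simulated user,
    calls domain tools that mutate a database, and is judged on the final state."""
    message = user_sim.first_message()
    for _ in range(max_turns):
        reply = agent.respond(message)            # may contain tool calls
        for call in reply.tool_calls:
            env.execute(call)                     # tools read/write the backing DB
        if reply.task_complete:
            break
        message = user_sim.respond(reply.text)    # simulated user answers back
    # Success = the database ends in exactly the state the gold policy requires
    # (e.g. the right order cancelled, the right flight rebooked).
    return env.database_state() == env.gold_state()
```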
AgentBench
8 environments (OS, databases, knowledge graphs, web navigation, among others). 29+ LLMs benchmarked; GPT-4 leads on 6 of the 8 environments.
Other General Benchmarks
| Benchmark | Focus | Status |
|---|---|---|
| AgentGym | Unified agent environment | Active |
| ALFWorld | Text adventure | Mature |
| ScienceWorld | Scientific experimentation | Active |
| WebShop | E-commerce | Mature |
| WorkArena | Enterprise workflows | 23k+ tasks |
4. Multi-Agent Benchmarks
| Benchmark | What It Tests | Key Finding |
|---|---|---|
| ChatDev | Software company simulation | ~88% executability (self-reported) |
| MetaGPT | SOP-based multi-agent | 3.75/5 executability vs ChatDev’s 2.25 |
| Silo-Bench | Distributed coordination | Agents fail at cross-agent synthesis |
| AgentsNet | 100-agent networks | Self-organization, communication |
5. Safety Benchmarks
AgentHarm
110 malicious agent tasks, 11 harm categories.
| Model | Harm Score (no jailbreak) | Harm Score (jailbroken) |
|---|---|---|
| Claude 3.5 Sonnet | 13.5% | 68.7% |
| GPT-4o | 48.4% | 72.7% |
| Mistral Large 2 | 82.2% | — |
Key finding: Standard refusal training doesn’t transfer to agentic settings. Claude’s ~3% conversational HarmBench ASR becomes ~15% in agentic AgentHarm.
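The harm scores above are averages of per-task rubric grades with partial credit, reported alongside a refusal rate, rather than binary pass/fail. A toy aggregation under that assumption (the result schema is invented for illustration):

```python
def harm_metrics(results: list[dict]) -> dict:
    """Aggregate per-task grades into the two headline AgentHarm-style numbers:
    average harm score (partial rubric credit) and refusal rate."""
    n = len(results)
    # Each result: {"score": fraction of rubric criteria satisfied, "refused": bool}
    avg_harm = sum(r["score"] for r in results) / n
    refusal_rate = sum(r["refused"] for r in results) / n
    return {"harm_score": avg_harm, "refusal_rate": refusal_rate}

# Example: a model that refuses half the tasks but partially completes the rest.
print(harm_metrics([
    {"score": 0.0, "refused": True},
    {"score": 0.0, "refused": True},
    {"score": 0.6, "refused": False},
    {"score": 1.0, "refused": False},
]))  # {'harm_score': 0.4, 'refusal_rate': 0.5}
```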
Other Safety Benchmarks
| Benchmark | Focus | Best Result |
|---|---|---|
| HarmBench | 400+ harmful behaviors | Automated red teaming |
| S-Eval | Safety-specific evaluation | Adversarial pressure |
| AgentDojo | Prompt injection multi-environment | Diverse attack vectors |
| ASB (Anthropic) | 4 attack categories | Baseline established |
6. Tool-Use / Function Calling
BFCL (Berkeley Function Calling Leaderboard)
| Version | Focus | Best Score |
|---|---|---|
| V1 | Simple, parallel, multiple calls | Varies |
| V2 Live | Real-world APIs (2,251+ pairs) | — |
| V3 | Multi-turn, stateful | 76.7% (GLM 4.5) |
| V4 | Agentic: web search, memory, formats | — |
Notable: Claude Opus 4 scores 25.3% (bottom) on BFCL v3 despite dominating other benchmarks — its conversational wrapping trips AST parsing even when tool selection is correct.
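The failure is mechanical rather than semantic: BFCL's AST track parses the raw output as a function-call expression, so prose around an otherwise correct call breaks extraction before tool choice is even compared. A simplified sketch of that check (the real harness also validates arguments against the expected schema):

```python
import ast

def extract_call(output: str):
    """Strict AST-style check: the output must parse as a bare Python call
    expression, so any surrounding prose fails the parse."""
    try:
        tree = ast.parse(output.strip(), mode="eval")
        if isinstance(tree.body, ast.Call):
            return ast.unparse(tree.body)
    except SyntaxError:
        pass
    return None

# Bare call: parses cleanly and can be matched against the expected signature.
print(extract_call("get_weather(city='Paris', unit='celsius')"))
# Same (correct) tool choice wrapped in conversation: the parse fails, scored wrong.
print(extract_call("Sure! I'll check that for you: get_weather(city='Paris')"))
```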
Other Tool Benchmarks
| Benchmark | Scale | Feature |
|---|---|---|
| ToolBench | 16,000+ APIs | Generalization to unseen tools |
| API-Bank | 264 tasks | 3 levels: call / retrieve+call / plan+retrieve+call |
| ToolQA | Various | QA over structured tools |
| StableToolBench | Varies | Reproducible eval |
| Nexus | 1,500 | Paired with NexusRaven model |
7. Long-Horizon Planning
TravelPlanner
1,225 tasks, 13 coupled constraints (8 commonsense, 5 hard).
| Model | Score | Year |
|---|---|---|
| GPT-4 (launch) | 0.6% | 2024 |
| Planner-R1 32B | 56.9% | Dec 2025 |
A single budget overage invalidates an otherwise perfect plan.
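The final pass rate is all-or-nothing over those constraints, which is why one overage sinks a plan. A toy hard-constraint check illustrating the semantics (field names are invented, not TravelPlanner's actual schema):

```python
def passes_hard_constraints(plan: dict, query: dict) -> bool:
    """Toy all-or-nothing hard-constraint check: every constraint must hold,
    so a single budget overage fails an otherwise valid plan."""
    checks = [
        plan["total_cost"] <= query["budget"],  # budget constraint
        plan["room_type"] in query.get("allowed_room_types", [plan["room_type"]]),
        query.get("cuisine") is None or query["cuisine"] <= set(plan["cuisines"]),
    ]
    return all(checks)

plan = {"total_cost": 1850, "room_type": "entire room", "cuisines": {"italian", "thai"}}
query = {"budget": 1800, "cuisine": {"italian"}}
print(passes_hard_constraints(plan, query))  # False: $50 over budget fails the plan
```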
DeepPlanning (2026)
| Domain | Best Accuracy |
|---|---|
| Travel | 35.0% (GPT-5.2) |
| Shopping | 60.0% (Gemini 3 Flash) |
Models with explicit reasoning significantly outperform non-reasoning (85.8 vs 54.3 composite).
8. Memory Benchmarks
| Benchmark | Focus | Status |
|---|---|---|
| LoCoMo | Long-term context memory | Emerging |
| MemBench | Extended conversations | Developing |
| LongMemEval | Extended context | Active |
Memory benchmarks are less mature. BFCL v4 introduced memory-related tasks as part of agentic evaluation.
Benchmark Saturation Map
| Saturated (>90%) | Near (70-90%) | Active (<70%) |
|---|---|---|
| HumanEval, MBPP | SWE-bench Verified ⚠️ | SWE-bench Pro (59%) |
| Android-in-the-Wild | τ-bench Retail (89%) | GAIA Level 3 (65%) |
| | WebArena (71%) | TravelPlanner (57%) |
| | | DeepPlanning Travel (35%) |
Critical Nuances
- Verified ≠ Real: 35-point gap between Verified and Pro for the same model
- Scaffolding matters: HAL adds ~30 points on GAIA
- Agentic safety ≠ chatbot safety: Claude's ~3% ASR on conversational HarmBench becomes ~15% on agentic AgentHarm
- Function calling ≠ general capability: Claude top on τ-bench, bottom on BFCL