
Agent Evaluation Benchmarks

Complete landscape of AI agent evaluation benchmarks: coding, web, general, multi-agent, safety, tool-use, long-horizon, memory.

Agent Evaluation Benchmarks — Complete Landscape

1. Coding Benchmarks

SWE-bench Family

| Variant | Tasks | Top Score | Status |
|---|---|---|---|
| Verified | 500 | 80.9% (Claude Opus 4.5) | ⚠️ Contaminated — OpenAI stopped reporting |
| Pro | 1,865 | 59% (agent systems) | ✅ Contamination-resistant |
| Multilingual | 300 | ~45% | 9 languages |
| Live | 1,565+ | ~40% | Monthly rolling, contamination-free |

Critical insight: The 35-point gap between Verified (80.9%) and Pro (45.9%) for the same model reveals Verified is heavily contaminated.

SWE-bench Pro leaderboard (standardized scaffolding):

  • Claude Opus 4.5: 45.9%, Claude Sonnet 4.5: 43.6%, Gemini 3 Pro: 43.3%, GPT-5 High: 41.8%

Agent systems (custom scaffolding) outperform raw models:

  • GPT-5.3-Codex CLI: 57.0%, Claude Code + Opus 4.5: 55.4%
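
To make concrete what these scores measure: the SWE-bench harness evaluates a submission by applying each predicted patch to the repository at the issue's commit and re-running that issue's tests. Below is a minimal sketch of the prediction file, assuming the open-source `swebench` harness; the instance id is illustrative, and field/flag names follow its README but may differ across versions.

```python
import json

# One prediction per SWE-bench instance: the harness applies `model_patch`
# to the repo at the instance's base commit, then re-runs the instance's
# FAIL_TO_PASS and PASS_TO_PASS tests inside a container.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",      # illustrative instance id
        "model_name_or_path": "my-agent-v1",          # arbitrary label for the system
        "model_patch": "diff --git a/astropy/...",    # unified diff produced by the agent
    },
]

with open("predictions.jsonl", "w") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")

# Evaluation is then run with the harness, roughly:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path predictions.jsonl \
#       --run_id demo --max_workers 4
# (flag names taken from the swebench README; check your installed version)
```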

Other Coding Benchmarks

| Benchmark | Tasks | Top Score | Saturation |
|---|---|---|---|
| HumanEval+ | 164 | ~99% | Saturated |
| MBPP+ | 974 | ~90% | Near-saturated |
| BigCodeBench | 1,140 | ~35% | Active |
| LiveCodeBench | 1,000+ | 91.7% | Rolling, clean |
| SWE-rebench | Various | ~40% | Realistic multi-file |
| Commit0 | New | Emerging | Code from intent |

2. Web/GUI Benchmarks

WebArena Family

| Benchmark | Tasks | Best Score | Focus |
|---|---|---|---|
| WebArena | 812 | 71.6% (OpAgent) | Multi-step web tasks |
| VisualWebArena | 910 | ~19.8% | Visual understanding |
| WebArena-Lite | 241 | ~60% | Lighter eval |

WebArena Leaderboard (April 2026):

  • Claude Mythos Preview: 68.7%, GPT-5.4 Pro: 65.8%, Claude Opus 4.6: 64.5%
  • OpAgent (CodeFuse AI) leads at 71.6% with Planner-Grounder-Reflector-Summarizer pipeline
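
Worth noting how these scores are computed: WebArena grades a run by checking the final environment state (page contents, URL, backend records) against a per-task evaluator, not by matching the agent's actions step by step. Here is a simplified sketch of that idea; the field names and checks are hypothetical, not WebArena's actual config schema.

```python
# Hypothetical end-state check in the spirit of WebArena's functional evaluators:
# the agent's action sequence is irrelevant; only the resulting state is graded.

def check_task_success(final_url: str, final_page_text: str, task: dict) -> bool:
    """Return True if the final browser state satisfies the task's success criteria."""
    if "required_url_substring" in task and task["required_url_substring"] not in final_url:
        return False
    # every required phrase must appear on the final page (e.g. an order confirmation)
    return all(phrase in final_page_text for phrase in task.get("required_phrases", []))

task = {
    "intent": "Add a blue backpack to the cart and check out",
    "required_url_substring": "/checkout/success",
    "required_phrases": ["Thank you for your purchase"],
}
print(check_task_success("https://shop.example/checkout/success",
                         "Thank you for your purchase!", task))   # True
```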

OSWorld (Desktop)

  • 369 real OS tasks on actual VMs
  • Best: 76.26% (OSAgent), Human baseline: 72.4%
  • Hardest task type: LibreOffice Calc (~8%)

Other Web/GUI

| Benchmark | Tasks | Notes |
|---|---|---|
| Mind2Web | 2,350 | 137 real websites |
| Mind2Web 2 | 130 live | NeurIPS 2025, long-horizon agentic search |
| Online-Mind2Web | 300 live | Real-time browsing + cross-site synthesis |
| WebLINX | 2,337 | Multimodal web agent |
| OmniACT | 9,802 | Large-scale UI automation |
| WindowsAgentArena | Windows desktop | ~25% best |
| Android-in-the-Wild | Real devices | ~90%+ (saturated) |
| Computer Agent Arena | Human-centric | 2,201 votes head-to-head |

3. General Agent Benchmarks

GAIA (General AI Assistants)

The benchmark for real-world assistant capability beyond coding.

| Level | Steps | Human | Best Agent |
|---|---|---|---|
| Level 1 | ~10 | 92% (overall) | 82.1% (Claude Sonnet 4.5 + HAL) |
| Level 2 | ~20 | | 74.4% |
| Level 3 | 30+ | | 65.4% (HAL), 61% (Writer's Action Agent) |

Key insight: HAL scaffolding framework adds ~30 points on GAIA. Tool orchestration matters more than raw model capability.
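
A rough sketch of what "scaffolding" means in this context: an outer loop that routes the model's tool calls and feeds results back until a final answer emerges. Everything below is hypothetical stub code, not HAL's implementation; real frameworks add planning, retries, sandboxing, and much richer tool sets.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str                 # "tool_call" or "final_answer"
    content: str = ""
    tool_name: str = ""
    arguments: dict = None

class ToyModel:
    """Stub standing in for an LLM: first asks for a web search, then answers."""
    def next_step(self, transcript):
        if not any(m["role"] == "tool" for m in transcript):
            return Step(kind="tool_call", tool_name="web_search",
                        arguments={"query": transcript[0]["content"]})
        return Step(kind="final_answer", content=transcript[-1]["content"])

def run_agent(question, model, tools, max_steps=10):
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model.next_step(transcript)
        if step.kind == "final_answer":
            return step.content
        result = tools[step.tool_name](**step.arguments)   # execute the requested tool
        transcript.append({"role": "tool", "name": step.tool_name, "content": str(result)})
    return "no answer within step budget"

tools = {"web_search": lambda query: f"search results for: {query}"}
print(run_agent("What year was GAIA released?", ToyModel(), tools))
```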

τ-bench (Tool-Agent-User)

| Domain | Best Score | Model |
|---|---|---|
| Retail | 89.2% | Claude Mythos |
| Airline | 87.5% | Claude Sonnet 4.6 |
| Telecom | ~80% | Various |

τ³-bench adds full-duplex voice, knowledge retrieval over 698 banking documents, and 75+ task fixes.
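
τ-bench results are commonly reported with a reliability metric, pass^k: the probability that all k independent trials of the same task succeed, averaged over tasks. Here is a small sketch of that computation under the usual definition, assuming n trials per task with c successes each.

```python
from math import comb

def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """pass^k: chance that k i.i.d. trials of a task all succeed, averaged over tasks.
    For a task with c successes out of n trials, that chance is C(c, k) / C(n, k)."""
    scores = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(scores) / len(scores)

# e.g. 3 tasks, 8 trials each, with 8, 6, and 4 successful trials
print(pass_hat_k([8, 6, 4], n_trials=8, k=1))   # plain pass rate
print(pass_hat_k([8, 6, 4], n_trials=8, k=4))   # drops sharply as k grows
```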

AgentBench

8 environments (including OS, databases, knowledge graphs, and web navigation). 29+ LLMs benchmarked; GPT-4 leads on 6 of the 8 environments.

Other General Benchmarks

| Benchmark | Focus | Status |
|---|---|---|
| AgentGym | Unified agent environment | Active |
| ALFWorld | Text adventure | Mature |
| ScienceWorld | Scientific experimentation | Active |
| WebShop | E-commerce | Mature |
| WorkArena | Enterprise workflows | 23k+ tasks |

4. Multi-Agent Benchmarks

| Benchmark | What It Tests | Key Finding |
|---|---|---|
| ChatDev | Software company simulation | ~88% executability (self-reported) |
| MetaGPT | SOP-based multi-agent | 3.75/5 executability vs ChatDev's 2.25 |
| Silo-Bench | Distributed coordination | Agents fail at cross-agent synthesis |
| AgentsNet | 100-agent networks | Self-organization, communication |

5. Safety Benchmarks

AgentHarm

110 malicious agent tasks, 11 harm categories.

| Model | Harm Score (no jailbreak) | Harm Score (jailbroken) |
|---|---|---|
| Claude 3.5 Sonnet | 13.5% | 68.7% |
| GPT-4o | 48.4% | 72.7% |
| Mistral Large 2 | 82.2% | not reported |

Key finding: Standard refusal training doesn’t transfer to agentic settings. Claude’s ~3% conversational HarmBench ASR becomes ~15% in agentic AgentHarm.

Other Safety Benchmarks

| Benchmark | Focus | Notes |
|---|---|---|
| HarmBench | 400+ harmful behaviors | Automated red teaming |
| S-Eval | Safety-specific evaluation | Adversarial pressure |
| AgentDojo | Prompt injection across multiple environments | Diverse attack vectors |
| ASB (Anthropic) | 4 attack categories | Baseline established |

6. Tool-Use / Function Calling

BFCL (Berkeley Function Calling Leaderboard)

| Version | Focus | Top Score / Notes |
|---|---|---|
| V1 | Simple, parallel, multiple calls | Varies |
| V2 Live | Real-world APIs | 2,251+ pairs |
| V3 | Multi-turn, stateful | 76.7% (GLM 4.5) |
| V4 | Web search, memory, formats | Agentic capabilities |

Notable: Claude Opus 4 scores 25.3% (bottom) on BFCL v3 despite dominating other benchmarks — its conversational wrapping trips AST parsing even when tool selection is correct.
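
The mechanics behind that result: BFCL's single-call categories score outputs by parsing them as a function call and comparing the parsed AST against the expected signature, so a correct call wrapped in prose can still fail. Below is a simplified illustration of an AST-style check, not BFCL's actual evaluator.

```python
import ast

def extract_call(output: str):
    """Parse the output as a single Python-style function call; return (name, kwargs) or None."""
    try:
        tree = ast.parse(output.strip(), mode="eval")
    except SyntaxError:
        return None                      # prose around the call already fails here
    if not isinstance(tree.body, ast.Call):
        return None
    name = ast.unparse(tree.body.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in tree.body.keywords}
    return name, kwargs

expected = ("get_weather", {"city": "San Francisco", "unit": "celsius"})

bare    = 'get_weather(city="San Francisco", unit="celsius")'
wrapped = 'Sure! I will call get_weather(city="San Francisco", unit="celsius") for you.'

print(extract_call(bare) == expected)     # True: parses and matches
print(extract_call(wrapped) == expected)  # False: same call, but the wrapper breaks parsing
```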

Other Tool Benchmarks

| Benchmark | Scale | Feature |
|---|---|---|
| ToolBench | 16,000+ APIs | Generalization to unseen tools |
| API-Bank | 264 tasks | 3-level: call/search/plan |
| ToolQA | Various | QA over structured tools |
| StableToolBench | Varies | Reproducible eval |
| Nexus | 1,500 | Paired with NexusRaven model |

7. Long-Horizon Planning

TravelPlanner

1,225 tasks, 13 coupled constraints (8 commonsense, 5 hard).

| Model | Score | Date |
|---|---|---|
| GPT-4 (launch) | 0.6% | 2024 |
| Planner-R1 32B | 56.9% | Dec 2025 |

A single budget overage invalidates an otherwise perfect plan.
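
To spell out that all-or-nothing scoring: the final pass rate only credits plans that satisfy every hard constraint at once, so one violation zeroes an otherwise valid plan. The sketch below is schematic, with hypothetical fields and constraints, not TravelPlanner's evaluation code.

```python
# Schematic all-or-nothing check: a plan passes only if every hard constraint holds.
# Field and constraint names are hypothetical.

def plan_passes(plan: dict, constraints: dict) -> bool:
    checks = [
        sum(item["cost"] for item in plan["items"]) <= constraints["budget"],
        all(item["city"] in constraints["allowed_cities"] for item in plan["items"]),
        len({item["day"] for item in plan["items"]}) == constraints["num_days"],
    ]
    return all(checks)   # a single failed check invalidates the whole plan

plan = {"items": [{"cost": 900, "city": "Denver", "day": 1},
                  {"cost": 450, "city": "Denver", "day": 2}]}
constraints = {"budget": 1300, "allowed_cities": {"Denver"}, "num_days": 2}
print(plan_passes(plan, constraints))   # False: $1,350 total exceeds the $1,300 budget
```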

DeepPlanning (2026)

| Domain | Best Accuracy |
|---|---|
| Travel | 35.0% (GPT-5.2) |
| Shopping | 60.0% (Gemini 3 Flash) |

Models with explicit reasoning significantly outperform non-reasoning (85.8 vs 54.3 composite).


8. Memory Benchmarks

| Benchmark | Focus | Status |
|---|---|---|
| LoCoMo | Long-term context memory | Emerging |
| MemBench | Extended conversations | Developing |
| LongMemEval | Extended context | Active |

Memory benchmarks are less mature. BFCL v4 introduced memory-related tasks as part of agentic evaluation.


Benchmark Saturation Map

| Saturated (>90%) | Near (70-90%) | Active (<70%) |
|---|---|---|
| HumanEval, MBPP | SWE-bench Verified ⚠️ | SWE-bench Pro (59%) |
| Android-in-the-Wild | τ-bench Retail (89%) | GAIA Level 3 (65%) |
| | WebArena (71%) | TravelPlanner (57%) |
| | | DeepPlanning Travel (35%) |

Critical Nuances

  1. Verified ≠ Real: 35-point gap between Verified and Pro for same model
  2. Scaffolding matters: HAL adds ~30 points on GAIA
  3. Agentic safety ≠ chatbot safety: Claude's ~3% conversational HarmBench ASR rises to ~15% on AgentHarm
  4. Function calling ≠ general capability: Claude top on τ-bench, bottom on BFCL