I’ve been watching the AI evaluation cost problem creep up for a while, but the numbers coming out now are genuinely eye-watering. The Holistic Agent Leaderboard (HAL) recently dropped $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. That’s not a typo. A single GAIA run on a frontier model can hit $2,829 before any caching tricks. Exgentic’s $22,000 sweep across agent configurations found a 33× cost spread on identical tasks, and it turns out scaffold choice is the main culprit, not the model itself.
This isn’t just an agent problem either. UK-AISI recently scaled agentic steps into the millions just to study inference-time compute. In scientific ML, evaluating one new architecture on The Well costs about 960 H100-hours. A full four-baseline sweep? 3,840 H100-hours. That’s real GPU time that could have been used for training.
The static benchmark cost problem
This didn’t start with agents. When Stanford’s CRFM released HELM in 2022, the per-model accounting showed API costs ranging from $85 for OpenAI’s code-cushman-001 to $10,926 for AI21’s J1-Jumbo (178B). Open models weren’t free either: evaluating them took 540 to 4,200 GPU-hours, with BLOOM and OPT at the top end. IBM Research noted that putting Granite-13B through HELM could consume 1,000 GPU-hours. Across HELM’s 30 models and 42 scenarios, the aggregate cost hit roughly $100,000.
But the real shocker came from Perlitz et al.’s analysis of EleutherAI’s Pythia checkpoints. The Pythia suite includes 154 checkpoints for each of 16 models across 8 sizes, or 2,464 checkpoints in total. Running the LM Evaluation Harness on every one of them turns eval into a multiplier on training: each checkpoint you save is another full benchmark pass. Perlitz et al. noted that evaluation costs may even surpass pretraining when you evaluate every checkpoint. For small models, evaluation becomes the dominant compute line item across the whole development cycle.
The good news is that static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items. Perlitz et al. found that a 100× to 200× reduction in compute preserved nearly the same model ranking. Flash-HELM turned that into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error, and the Open LLM Leaderboard from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 87 language-model/prompt pairs on GLUE.
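To make the coarse-to-fine idea concrete, here is a minimal sketch. It is not Flash-HELM's actual procedure: `coarse_to_fine`, the `score_fn` callback, the stage budgets, and the keep fraction are all hypothetical names and values picked for illustration, and the toy accuracies are made up.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def coarse_to_fine(
    models: Sequence[str],
    score_fn: Callable[[str, int], float],
    budgets: Sequence[int] = (50, 500, 5000),
    keep_fraction: float = 0.5,
) -> Tuple[Dict[str, Tuple[float, int]], int]:
    """Score every model cheaply, then re-score only the leaders at higher resolution.

    score_fn(model, n_items) is a hypothetical callback that returns the
    model's score on an n_items subsample of the benchmark.
    Returns ({model: (latest_score, n_items_used)}, total_items_evaluated).
    """
    results: Dict[str, Tuple[float, int]] = {}
    survivors: List[str] = list(models)
    total_items = 0
    for stage, n_items in enumerate(budgets):
        # Evaluate the current survivors at this stage's resolution.
        scored = sorted(((score_fn(m, n_items), m) for m in survivors), reverse=True)
        total_items += n_items * len(scored)
        for score, model in scored:
            results[model] = (score, n_items)
        # Promote only the top fraction to the next, more expensive stage.
        if stage < len(budgets) - 1:
            keep = max(1, int(len(scored) * keep_fraction))
            survivors = [model for _, model in scored[:keep]]
    return results, total_items


# Toy usage with made-up, noise-free accuracies (illustrative only).
true_acc = {f"model-{i}": 0.40 + 0.05 * i for i in range(8)}
results, total_items = coarse_to_fine(list(true_acc), lambda m, n: true_acc[m])
full_sweep_items = 5000 * len(true_acc)
print(total_items, full_sweep_items)
```

In this toy run the staged schedule evaluates 12,400 items against 40,000 for the full sweep, and only the two leading models ever see the 5,000-item stage. With a real, noisy `score_fn` the cheap stages can misrank close competitors, so the keep fraction trades savings against the risk of dropping a strong model early.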
Agent evals are a different beast
That compression trick weakens sharply once you move from static predictions to agents. HAL’s headline cost of $40,000 for 21,730 rollouts tells part of the story. By April 2026, the leaderboard had grown to 26,597 rollouts. Ndzomga’s independent reproduction arrives at almost the same number: $46,000 across 242 agent runs.
Behind that aggregate, the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders within some individual benchmarks. Part of that is list pricing: Claude Opus 4.1 is priced at $15 per million input tokens and $75 per million output tokens, while Gemini 2.0 Flash is priced at $0.10 and $0.40, a two-order-of-magnitude spread on input alone. The rest comes from how many tokens a run burns. Agent benchmarks rarely benchmark “the model” in isolation. They benchmark a model × scaffold × token-budget product, and small scaffold choices can multiply costs 10×.
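A back-of-the-envelope sketch of that product, using the two list prices quoted above. The token counts and the `scaffold_multiplier` parameter are hypothetical, chosen only to show how the factors compound; real scaffolds change token usage in less uniform ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# List prices as quoted in the text above.
PRICES = {
    "claude-opus-4.1": Pricing(15.00, 75.00),
    "gemini-2.0-flash": Pricing(0.10, 0.40),
}


def run_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    scaffold_multiplier: float = 1.0,
) -> float:
    """Estimated USD cost of one run.

    scaffold_multiplier is a hypothetical knob: it scales token usage to
    mimic a scaffold that retries, re-reads context, or plans more verbosely.
    """
    p = PRICES[model]
    base = (input_tokens * p.input_per_mtok + output_tokens * p.output_per_mtok) / 1_000_000
    return base * scaffold_multiplier


# Hypothetical run: 2M input tokens and 200k output tokens in total.
opus = run_cost("claude-opus-4.1", 2_000_000, 200_000)
flash = run_cost("gemini-2.0-flash", 2_000_000, 200_000)
opus_heavy_scaffold = run_cost("claude-opus-4.1", 2_000_000, 200_000, scaffold_multiplier=10.0)

print(f"Opus 4.1: ${opus:.2f}  Flash: ${flash:.2f}  ratio: {opus / flash:.0f}x")
print(f"Opus 4.1 with a 10x-token scaffold: ${opus_heavy_scaffold:.2f}")
```

Under these assumed token counts the same run costs about $45 on one model and about $0.28 on the other, and a token-hungry scaffold multiplies whichever figure you started from.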
Worse, higher spend does not reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes “a 9× difference in cost despite just a two-percentage-point difference in accuracy.” On GAIA, the HAL Generalist agent with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. CLEAR finds across 6 SOTA agents on 300 enterprise tasks that “accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives” with comparable real-world performance.
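Here is a small sketch of what “Pareto-efficient” means for these numbers. The cost and accuracy figures are the ones quoted above; the `Run` type and `pareto_frontier` function are illustrative names of my own, not anything from HAL or CLEAR, and the second GAIA agent is left unnamed because the figures above don't name it.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    agent: str
    cost_usd: float
    accuracy_pct: float


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep runs that are not dominated: no run that costs the same or less
    is at least as accurate."""
    frontier: List[Run] = []
    best_acc = float("-inf")
    # Walk from cheapest to priciest; a run survives only if it improves accuracy.
    for run in sorted(runs, key=lambda r: (r.cost_usd, -r.accuracy_pct)):
        if run.accuracy_pct > best_acc:
            frontier.append(run)
            best_acc = run.accuracy_pct
    return frontier


# Compare within a benchmark only; costs across benchmarks are not comparable.
online_mind2web = [
    Run("Browser-Use + Claude Sonnet 4", 1577, 40.0),
    Run("SeeAct + GPT-5 Medium", 171, 42.0),
]
gaia = [
    Run("HAL Generalist + o3 Medium", 2828, 28.5),
    Run("other agent (unnamed above)", 1686, 57.6),
]

for name, runs in (("Online Mind2Web", online_mind2web), ("GAIA", gaia)):
    for run in pareto_frontier(runs):
        print(f"{name}: {run.agent} (${run.cost_usd:,.0f}, {run.accuracy_pct}%)")
```

In both pairs the cheaper run is also the more accurate one, so the expensive configuration falls off the frontier entirely. With more runs per benchmark the frontier would typically hold several points, each buying extra accuracy at extra cost.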
What this means
The cost threshold has shifted who can do serious evaluation. If you’re a startup or academic lab without deep pockets, you’re priced out of running comprehensive agent benchmarks. The field needs better compression techniques for agent evals, but agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability through repeated runs, costs multiply further.
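The repeated-runs point is worth putting in numbers. This is a rough sketch that assumes runs are independent and that run-to-run noise is the only thing you are averaging away. The run-to-run standard deviation below is a made-up value, and `repeated_run_budget` is a hypothetical helper; the single-run cost is the GAIA figure from the top of the post.

```python
import math
from typing import Tuple


def repeated_run_budget(
    single_run_cost: float, run_to_run_std: float, n_runs: int
) -> Tuple[float, float]:
    """Return (total_cost, standard_error_of_mean_accuracy) for n_runs repeats.

    Assumes independent runs, so the standard error shrinks as 1/sqrt(n_runs)
    while the bill grows linearly in n_runs.
    """
    total_cost = single_run_cost * n_runs
    std_error = run_to_run_std / math.sqrt(n_runs)
    return total_cost, std_error


SINGLE_RUN_COST = 2829.0  # USD, the single GAIA run figure quoted earlier
RUN_TO_RUN_STD = 3.0      # accuracy points; hypothetical value for illustration

for n in (1, 4, 16):
    cost, se = repeated_run_budget(SINGLE_RUN_COST, RUN_TO_RUN_STD, n)
    print(f"{n:>2} runs: ${cost:>9,.0f}  standard error ~ {se:.2f} points")
```

Halving the noise takes four times the runs, so tightening an error bar from 3 points to 0.75 under these assumptions means paying for sixteen runs instead of one.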
I don’t see a clean solution yet. Maybe we need standardized cheap scaffolds that everyone benchmarks against, or maybe we need to accept that evaluation will be a significant cost center for the foreseeable future. Either way, pretending this isn’t a bottleneck is wishful thinking.