VAKRA: A New Benchmark That Shows How Bad AI Agents Are at Real Work

I’ve been watching the AI agent space for a while now, and there’s been a lot of hand-waving about how agents are going to revolutionize enterprise workflows. But talk is cheap. What happens when you actually make these agents do real work?

That’s the question the team at IBM Research tried to answer with VAKRA, a benchmark that dropped a few weeks ago and has been making the rounds on Hugging Face. The results aren’t pretty.

What VAKRA Actually Tests

Most AI benchmarks are like test-driving a car in an empty parking lot. Sure, the car can accelerate and brake, but what happens when you hit rush hour traffic? VAKRA is rush hour traffic.

It’s a tool-grounded, executable benchmark. That means agents have to actually call APIs, retrieve data, and chain multiple steps together to complete tasks. No multiple choice. No static datasets. Real execution traces that show whether the agent finished the job or crashed and burned.

The setup is genuinely impressive: over 8,000 locally hosted APIs backed by real databases across 62 domains, plus document collections that align with those domains. Tasks require 3 to 7 reasoning steps, mixing structured API calls with unstructured document retrieval. The queries are written in natural language, so the agent has to figure out what tools to use and in what order.

Here’s a concrete example from the dataset:

“Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?”

To answer this, the agent needs to call get_data to initialize the dataset, then chain multiple select_data_equal_to calls to filter down to the right team. The answer is FC Barcelona. Simple for a human. Not so simple for most current models.

The Four Capabilities

VAKRA breaks down into four task categories, each testing a different weakness:

1. API Chaining with Business Intelligence APIs – 2,077 test instances across 54 domains. Agents have to chain 1-12 tool calls using generic data manipulation tools (think Tableau-style filtering and sorting) or more specialized ones. The catch? You must call get_data first to initialize the data source, and the full dataset stays server-side to avoid bloated transfers. The agent only gets a lightweight preview.

2. Tool Selection with Dashboard APIs – 1,597 instances across 17 domains. This one uses REST APIs wrapped in an MCP server. The domain-specific tool sets range from 6 to 328 tools per domain. That’s a lot of options, and the OpenAI API spec limits the tool list to 128 tools, so in the largest domains you can’t just dump everything in. The set has to be narrowed down first, and the agent still has to pick the right tool from what’s left.

3. Document Grounding – This is where things get interesting. Agents get a collection of domain-specific documents and have to answer questions that require reading and reasoning across multiple documents. No tool calls here, just retrieval and comprehension.

4. Mixed Tool Use – The hardest category. Agents have to combine API calls with document retrieval. They might need to pull data from an API, then cross-reference it with information from a PDF, then make another API call based on what they found. This is where most models fall apart.

The Results Are Brutal

I won’t sugarcoat it: the numbers are bad. Even the best models hover around 40-50% accuracy on the simpler tasks and drop to 20-30% on the mixed ones. That’s not production-ready by any stretch.

The researchers identified several failure modes:

Tool hallucination: Models call functions that don’t exist or pass invalid arguments. This is embarrassingly common.
Chain breakage: The agent completes step 2 correctly but forgets what it was doing by step 5. Context window management is still a joke.
API misselection: Given 100+ tools, models pick the wrong one even when the right one is obvious from the query.
Document grounding failure: Models can retrieve the right document but then fail to extract the specific information needed.
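
The first failure mode, tool hallucination, is at least cheap to catch before anything runs. Here is a minimal sketch of a pre-execution check, assuming a hypothetical registry format (parameter names mapped to Python types, plus a list of required parameters) and a hypothetical call shape with name and arguments keys. None of this comes from the VAKRA codebase.

```python
from typing import Any

# Hypothetical registry: tool name -> accepted parameter types and required names.
REGISTRY: dict[str, dict[str, Any]] = {
    "get_data": {
        "params": {"source": str},
        "required": ["source"],
    },
    "select_data_equal_to": {
        "params": {"handle": str, "column": str, "value": (int, float, str)},
        "required": ["handle", "column", "value"],
    },
}


def validate_tool_call(call: dict, registry: dict[str, dict[str, Any]]) -> list[str]:
    """Return a list of problems with a proposed tool call; empty means it looks valid."""
    name = call.get("name")
    if name not in registry:
        return [f"unknown tool: {name!r}"]

    spec = registry[name]
    args = call.get("arguments", {})
    errors = []

    # Required parameters that the model left out.
    for param in spec["required"]:
        if param not in args:
            errors.append(f"missing argument: {param}")

    # Parameters the tool does not accept, or values of the wrong type.
    for key, value in args.items():
        expected = spec["params"].get(key)
        if expected is None:
            errors.append(f"unexpected argument: {key}")
        elif not isinstance(value, expected):
            errors.append(f"bad type for {key}: {type(value).__name__}")

    return errors


print(validate_tool_call({"name": "filter_rows", "arguments": {}}, REGISTRY))
```

A check like this won’t fix a model that picks the wrong tool, but it turns a hallucinated call into an error message the agent can react to instead of a failed execution.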

What’s interesting is that these failures aren’t evenly distributed. Some models are great at API chaining but terrible at document grounding, and vice versa. No single approach dominates across all four capabilities.

Why This Matters

VAKRA isn’t just another benchmark for researchers to chase. It’s a reality check for anyone building enterprise agents. If your agent can’t handle a 5-step workflow with 3 API calls and a document lookup, it’s not ready for your CRM integration.

The dataset is public, the leaderboard is live, and the code is on GitHub. If you’re building agents, I’d recommend running your own models through VAKRA before claiming they’re enterprise-ready. The results might save you some embarrassment.

One thing I appreciate about this benchmark: it’s executable. You can’t cheat by memorizing answers because the environment is dynamic. The APIs return real data, and the documents are actual text. This is closer to what agents will face in production than any static benchmark I’ve seen.

That said, I wish the paper included more analysis on why certain models fail. Is it a training data issue? Architecture limitation? Prompt engineering failure? The dataset is there for anyone to dig into, but I’d love to see IBM’s own post-mortem on the top performers.

For now, VAKRA stands as a sobering reminder that we’re still early in the agent game. The demos are slick, but the benchmarks tell a different story.