Forest vs. Tree: Google Research on How Many Raters You Actually Need for AI Benchmarks

Google Research just dropped a paper that should make anyone building AI benchmarks stop and think. It’s called “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation,” and it’s about something that sounds obvious but apparently gets ignored all the time: human disagreement.

Here’s the setup. You’re building a benchmark to test, say, a toxicity classifier. You have a budget of 10,000 ratings. You can either get one rating on 10,000 items, or 10 ratings on 1,000 items. Which gives you more reliable results? Most researchers go with the first option, the “forest” approach, because they want coverage. The second option, going deep on fewer items, is the “tree.” Google’s researchers, Flip Korn and Chris Welty, argue that defaulting to the forest is often a mistake.

The core problem is that human raters disagree. A lot. On subjective tasks like toxicity, hate speech, or even summarization quality, two perfectly competent humans can look at the same piece of text and give opposite labels. The standard practice is to collapse those disagreements into a single “plurality” label: whichever label got the most votes wins. But that throws away information. The example they give in the paper is telling: two items can end up with the same plurality label (“toxic”) even though one was a tight 51-49 split and the other a clear 80-20. Treating them as equivalent hides how contested the first one is.
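To make that concrete, here’s a minimal Python sketch. It is not code from the paper; the label names and the 100-rater vote counts are made up purely to mirror the 51-49 versus 80-20 example.

```python
from collections import Counter

def plurality_label(votes):
    """Return the most common label among one item's votes."""
    return Counter(votes).most_common(1)[0][0]

def label_distribution(votes):
    """Return the share of votes each label received."""
    counts = Counter(votes)
    total = len(votes)
    return {label: count / total for label, count in counts.items()}

# Two hypothetical items, 100 raters each (illustrative numbers only).
contested = ["toxic"] * 51 + ["not_toxic"] * 49
clear_cut = ["toxic"] * 80 + ["not_toxic"] * 20

# Both items collapse to the same plurality label...
print(plurality_label(contested), plurality_label(clear_cut))
# ...but the vote distributions behind that label are very different.
print(label_distribution(contested))
print(label_distribution(clear_cut))
```

A benchmark that stores only the first line of output can’t tell these two items apart; one that stores the distributions can.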

So they built a simulator to stress-test this. They varied two things: N (the total number of items) from 100 to 50,000, and K (the number of raters per item) from 1 to 500. Then they checked which (N, K) configurations gave statistically reliable results (p < 0.05). The findings are pretty stark. For many tasks, the standard 1 to 5 raters per item is nowhere near enough to capture the real distribution of human judgment. You end up with benchmarks that look reproducible but aren’t, because the next team running the same test with a different set of raters could get a completely different ranking of models.
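For a feel of what this kind of experiment looks like, here’s a toy Monte Carlo sketch in Python. To be clear, this is not the paper’s simulator: the disagreement model (a uniform latent “toxic” share per item), the two hypothetical models A and B, their noise levels, and the mean-absolute-error metric are all assumptions chosen to keep the example small.

```python
import random

def run_trial(n_items, k_raters, rng):
    """Simulate one evaluation; return True if model A scores better than B."""
    err_a = err_b = 0.0
    for _ in range(n_items):
        # Assumed latent share of raters who would call this item toxic.
        p = rng.uniform(0.2, 0.8)
        # Share of "toxic" votes among the k_raters actually sampled.
        votes = sum(rng.random() < p for _ in range(k_raters))
        observed = votes / k_raters
        # Hypothetical models: A tracks p closely, B is noisier.
        score_a = p + rng.uniform(-0.02, 0.02)
        score_b = p + rng.uniform(-0.20, 0.20)
        # Metric: absolute error against the observed vote share.
        err_a += abs(score_a - observed)
        err_b += abs(score_b - observed)
    return err_a < err_b

def win_rate(n_items, k_raters, trials=200, seed=0):
    """Fraction of independent trials in which A (the better model) wins."""
    rng = random.Random(seed)
    wins = sum(run_trial(n_items, k_raters, rng) for _ in range(trials))
    return wins / trials

# Same budget of 1,000 ratings, split two ways.
for n_items, k_raters in [(1000, 1), (100, 10)]:
    rate = win_rate(n_items, k_raters)
    print(f"N={n_items}, K={k_raters}: A wins {rate:.0%} of trials")
```

In this particular setup, a single 0/1 vote per item makes the absolute error linear in the model’s score, so B’s zero-mean noise averages out and the winner is close to a coin flip no matter how many items you add. Pooling several votes per item is what makes the metric sensitive to the difference. That is a property of these toy assumptions, not a result from the paper, and other metrics or disagreement models can behave differently.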

I’ve seen this play out in practice. Every time a new LLM benchmark comes out, there’s a scramble to reproduce the numbers, and half the time they don’t match. The usual excuse is “different annotator pools” or “different prompts.” But this paper suggests the real culprit is just not enough raters per item. You’re essentially measuring the noise instead of the signal.

The trade-off is real. More raters per item means fewer items you can afford to rate. But the paper’s simulator gives you a way to optimize: for a given budget, you can calculate the N and K that maximize reproducibility. They’ve even open-sourced the simulator so you can plug in your own numbers.
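The budget arithmetic itself is simple enough to sketch. The snippet below is a hypothetical helper, not the API of the open-sourced simulator: it lists the (N, K) splits a budget allows and picks the one that maximizes whatever reproducibility score you plug in. The `score` function is a placeholder you would replace with an estimate from a simulation or pilot data.

```python
def candidate_splits(budget, rater_counts):
    """List (N, K) pairs that spend at most `budget` total ratings."""
    return [(budget // k, k) for k in rater_counts if 0 < k <= budget]

def best_split(budget, rater_counts, score):
    """Return the (N, K) pair with the highest score(N, K)."""
    splits = candidate_splits(budget, rater_counts)
    return max(splits, key=lambda nk: score(*nk))

print(candidate_splits(10_000, [1, 3, 5, 10, 50]))
# [(10000, 1), (3333, 3), (2000, 5), (1000, 10), (200, 50)]

# Placeholder score that simply prefers K near 10, used only to exercise
# the function; a real score would estimate reproducibility for (N, K).
placeholder_score = lambda n, k: -abs(k - 10)
print(best_split(10_000, [1, 3, 5, 10, 50], placeholder_score))
# (1000, 10)
```

The first line of output includes both options from the setup above: one rating on 10,000 items, or 10 ratings on 1,000.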

One thing I appreciate about this work is that it doesn’t pretend there’s a one-size-fits-all answer. The optimal balance depends on how subjective your task is, how much disagreement you expect, and how precise your comparison needs to be. But the message is clear: if you’re using 3 raters per item and calling it a day, you’re probably fooling yourself.

I’d love to see benchmark builders adopt this framework. The paper is worth a read, and the simulator is on GitHub. If you’re in the business of evaluating models, this is the kind of methodological nudge that could actually improve the field, if people pay attention.
