Google's New Framework Tests Whether LLMs Actually Behave Like Humans

Google Research just dropped a paper that tries to answer a question I’ve been poking at for a while: when an LLM acts like it has empathy or assertiveness, does it actually follow through in realistic situations?

They call it “Evaluating Alignment of Behavioral Dispositions in LLMs,” and it’s basically a way to turn psychological personality tests into large-scale situational judgment tests (SJTs) for models. The core idea is simple but clever: instead of just asking a model “Are you empathetic?” and taking its word for it, they put it in scenarios that force it to show its cards.

The Problem with Self-Report

Anyone who’s worked with LLMs knows they’re terrible at self-assessment. Ask a model “Do you agree with the statement ‘I am quick to express an opinion’?” and it’ll give you a confident answer. But that answer shifts wildly with prompt phrasing, and there’s zero guarantee it’ll actually behave that way when you put it in a conversation where it has to advise a user on a tense workplace conflict.

This isn’t just a technical quibble. As these models get embedded in everything from customer service to mental health support, knowing whether they’ll act assertive or accommodating in a real interaction matters.

The Google team pulled from standardized, peer-reviewed psychological instruments like the Interpersonal Reactivity Index (IRI) for empathy and the Emotion Regulation Questionnaire (ERQ) for emotion regulation. These are the same tools psychologists use to study human personality. But instead of handing the questionnaire to a human, they adapted each statement into a declaration of the model’s general advising tendency, then used that to generate SJTs.

How It Works

Each SJT presents a realistic scenario with two possible courses of action: one that supports a specific behavioral trait and one that opposes it. For example, a scenario might involve a user asking for advice on handling a colleague who takes credit for their work. One action is assertive, the other is accommodating. Three independent human annotators validate each scenario to make sure it’s coherent and actually tests what it’s supposed to test.
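To make the shape of one test item concrete, here is a minimal sketch of how an SJT could be represented. The class name, field names, and the rule that all three annotators must approve are my own assumptions for illustration; the paper's actual schema and validation criteria may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SJT:
    """One situational judgment test item (hypothetical schema)."""
    scenario: str            # the situation presented to the model
    trait: str               # the behavioral trait being probed
    supporting_action: str   # action that expresses the trait
    opposing_action: str     # action that goes against the trait
    # One vote per human annotator: True means "coherent and on-target".
    validator_votes: List[bool] = field(default_factory=list)

    def is_validated(self, required: int = 3) -> bool:
        # Assumption: an item passes only if every one of the
        # required annotators approved it.
        return len(self.validator_votes) >= required and all(self.validator_votes)


item = SJT(
    scenario="A user asks how to handle a colleague who takes credit for their work.",
    trait="assertiveness",
    supporting_action="Raise the issue directly with the colleague.",
    opposing_action="Let it go to keep the peace.",
    validator_votes=[True, True, True],
)
print(item.is_validated())
```

Keeping the two actions as explicit fields mirrors the binary support/oppose design described above, which is what makes the later distribution comparison straightforward.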

Then the model gets the scenario as input and generates a natural response, which an LLM-as-a-judge maps to one of the two actions. The key here is they’re not just scoring the model’s behavior in isolation. They collected preferred actions from 10 annotators per SJT from a pool of 550 participants, creating a human preference distribution. Then they compare that distribution to the distribution of model responses.
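The comparison in that last step can be sketched in a few lines. This is not the paper's metric, which the summary above doesn't spell out; it assumes the model is sampled several times per scenario and uses the absolute difference in support rates (equal to total variation distance for a binary choice) as a stand-in. The function names and the "support"/"oppose" labels are hypothetical.

```python
from collections import Counter


def support_rate(choices):
    """Fraction of choices that picked the trait-supporting action."""
    if not choices:
        raise ValueError("need at least one choice")
    return Counter(choices)["support"] / len(choices)


def distribution_gap(human_choices, model_choices):
    """Absolute difference in support rates between humans and the model.

    With only two possible actions, this equals the total variation
    distance between the two distributions.
    """
    return abs(support_rate(human_choices) - support_rate(model_choices))


# 10 annotators: 7 prefer the trait-supporting action, 3 prefer the other.
human = ["support"] * 7 + ["oppose"] * 3
# 10 sampled model responses, each mapped to an action by the judge.
model = ["support"] * 10

gap = distribution_gap(human, model)
print(round(gap, 2))
```

A gap of zero would mean the model's responses split the same way the annotators did; a large gap means the model leans much harder in one direction than people do.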

This is where it gets interesting. They tested 25 LLMs across scenarios covering professional composure, conflict resolution, practical tasks like booking a trip, and daily decision-making. The results reveal two distinct kinds of gaps.

The Two Gaps

First, there’s the gap where model dispositions deviate from the consensus among human annotators. This is what you’d expect: models that are too aggressive or too passive compared to what most people would do. But the second gap is more subtle and arguably more important: cases where model dispositions don’t capture the range of human opinions when there’s no clear consensus.

Humans disagree on social situations all the time. Some people think you should push back when a colleague steals credit; others think you should let it slide and address it privately later. Both positions are valid, but many LLMs collapse that diversity into a single “safe” response that misses the nuance entirely.

This second gap matters more than I expected. I’ve seen plenty of alignment work that focuses on making models agree with human preferences on average, but very little that checks whether they reflect the actual distribution of those preferences. A model that always picks the majority opinion is technically aligned with the average human, but it’s also flattening the richness of human social judgment.
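One way to tell the two gaps apart is sketched below. The thresholds, labels, and function name are hypothetical choices of mine, not values from the paper: I treat 80% human agreement as "consensus" and flag a model as collapsing diversity when it is much more one-sided than the annotators.

```python
def classify_gap(human_support, model_support,
                 consensus_threshold=0.8, tolerance=0.2):
    """Label one SJT given the fraction of humans and of model responses
    that chose the trait-supporting action. Thresholds are illustrative."""
    human_majority = max(human_support, 1 - human_support)

    if human_majority >= consensus_threshold:
        # Gap 1: humans clearly agree; does the model land on the same side?
        same_side = (human_support >= 0.5) == (model_support >= 0.5)
        return "aligned" if same_side else "consensus_deviation"

    # Gap 2: humans are split; a model far more one-sided than the
    # annotators is collapsing the range of opinion.
    model_majority = max(model_support, 1 - model_support)
    if model_majority - human_majority > tolerance:
        return "collapsed_diversity"
    return "aligned"


print(classify_gap(0.9, 0.1))   # humans agree, model picks the other action
print(classify_gap(0.5, 1.0))   # humans split, model always picks one action
print(classify_gap(0.6, 0.55))  # humans split, model is similarly split
```

The point of separating the cases is that a single average-agreement score hides the second one: a model that always gives the same answer on a 50/50 scenario agrees with half the annotators and still misrepresents what people think.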

What This Means for Practitioners

If you’re building applications where LLMs give advice or make recommendations in social contexts, this framework is worth paying attention to. The standard approach of fine-tuning on preference data doesn’t necessarily produce models that behave appropriately in nuanced social situations. You need to test for behavioral dispositions explicitly, and you need to test across a range of scenarios that reflect the actual diversity of human experience.

The paper focuses on everyday interactions, which I think is the right call. We’ve seen plenty of work on safety and alignment for extreme cases, but the vast majority of LLM usage is mundane: helping with emails, giving travel advice, mediating workplace disagreements. That’s where behavioral alignment matters most.

The Catch

The framework still relies on LLM-as-a-judge for classifying responses, which introduces its own biases. And the scenarios are generated by LLMs, then validated by humans, which means the quality depends heavily on the prompt engineering and human oversight. It’s not a turnkey solution.

But as an early step, this is solid. It moves the conversation from “Does the model say the right thing?” to “Does the model behave like a reasonable person would in this situation?” That’s a meaningful shift.

I’d like to see this extended to cross-cultural contexts. The annotator pool is presumably US-based, and social norms vary dramatically across cultures. A behavior that’s appropriately assertive in one context might be rude in another. But that’s a problem for future work.

For now, this is a useful tool for anyone who’s ever wondered if their model actually has the social skills it claims to have. Spoiler: it probably doesn’t, but at least now we have a better way to measure the gap.
