Can LLMs Actually Help Physicists Figure Out Superconductivity?

Can a large language model be a useful research partner for a physicist? That’s the question behind a new paper from Google Research and Cornell University, published in the Proceedings of the National Academy of Sciences. The authors used high-temperature superconductivity as their test case: a messy, open-ended field where competing theories have been battling for decades.

The researchers asked six LLMs to answer challenging questions about cuprates, a class of copper-based compounds that conduct electricity with zero resistance at temperatures up to about -140°C. That’s still cold, but far warmer than the temperatures conventional superconductors require. Understanding why these materials work the way they do could lead to room-temperature superconductors, which would be a game-changer for everything from power grids to maglev trains.

A panel of experts graded the responses on accuracy, comprehensiveness, and how well they reflected the current state of debate. The top performers were NotebookLM and a custom-built system, both of which pull from a closed ecosystem of curated, quality-controlled sources. That’s not surprising: when you’re dealing with nuanced scientific literature, having a filtered set of reliable references matters more than raw model size.

But here’s the thing: even the best models had clear weaknesses. They sometimes oversimplified competing theories or failed to flag where consensus is still shaky. For a field like superconductivity, where the leading hypotheses have been argued over for nearly 40 years, that’s a big deal. A student or junior researcher relying on these tools could easily walk away thinking there’s more certainty than there actually is.

This isn’t the first time Google has poked at this problem. Earlier work introduced CURIE, a benchmark for evaluating LLMs on analytical tasks across six scientific disciplines. Other projects have looked at using LLMs to interpret tables and figures, solve quantum mechanics equations, or even write expert-level scientific software. The idea is to move beyond the model as a glorified search engine and toward something that can act as a genuine thought partner.

I like that framing, “thought partner,” but it sets a high bar. A good thought partner doesn’t just regurgitate facts; they challenge your assumptions, point out where evidence is thin, and help you weigh competing ideas. That’s hard for any system, especially when the underlying data is a mess of conflicting studies, incomplete experiments, and shifting terminology.

The paper also highlights something I’ve been saying for a while: the value of curation. NotebookLM’s performance wasn’t just about having a good model—it was about having access to a vetted, focused knowledge base. That’s a lesson that extends well beyond physics. If you’re building an LLM for any specialized domain, garbage in really does mean garbage out.

For now, these tools are probably most useful for getting up to speed on a topic or exploring potential research directions, not for making final judgments. But as the field matures, I can see a future where a physicist regularly bounces ideas off an LLM before designing experiments. Just don’t expect it to replace peer review anytime soon.
