Science

Vibe physics: The AI grad student

Mar 23, 2026

Can AI do theoretical physics? In this guest post, professor of physics Matthew Schwartz decided to find out by supervising Claude through a real research calculation, start to finish, without ever touching a file himself. His account of what happened is below.

Summary

I guided Claude Opus 4.5 through a real theoretical physics calculation, encapsulating the complexity of code and computations behind text prompts.
The result was a technically rigorous, impactful high-energy theoretical physics paper in two weeks instead of the usual year.
Over 110 separate drafts, 36M tokens, and 40+ hours of local CPU compute, Claude proved fast, indefatigable, and eager to please.
Claude is impressively capable, but also sloppy enough that I found domain expertise essential for evaluating its accuracy.
AI is not doing end-to-end science yet. But this project proves that I could create a set of prompts that can get Claude to do frontier science. This wasn’t true three months ago.
This may be the most important paper I’ve ever written—not for the physics, but for the method. There is no going back.

Who am I?

I’m Matthew Schwartz, a professor of physics at Harvard and a principal investigator in the NSF Institute for Artificial Intelligence and Fundamental Interactions (IAIFI). My area of expertise is quantum field theory, which asks what matter is, how particles interact, and why the Universe has the rules it does. One might say I wrote the book on the subject. I’ve been working with modern machine learning tools for over a decade. My first modern ML paper, from 2016, was an early application of deep learning to particle physics. In a Nature Reviews Physics piece in 2022, I compared the timescale of AI and human evolution, arguing that transferring understanding between biological and artificial intelligence would become a fundamental challenge. Since then, I’ve been trying to push AI towards more symbolic work (manipulating mathematical expressions rather than numerical data) and the core questions in theoretical physics.

The hype

There has been a lot of recent hype about AI scientists doing end-to-end research autonomously. In August 2024, Sakana AI released their AI Scientist, a system designed to automate the entire research lifecycle—from generating hypotheses to writing papers. In February 2025, Google released an AI co-scientist built on Gemini, promising to help researchers generate and evaluate hypotheses at scale. And in August 2025, the Allen Institute for AI (Ai2) launched the open-source Asta ecosystem, featuring tools like CodeScientist and AutoDiscovery to find patterns in complex datasets. Since then, a new entrant has appeared every few months—FutureHouse’s Kosmos, the Autoscience Institute’s Carl, the Simons Foundation’s Denario project, among others—each promising some version of end-to-end autonomous research. Even as these approaches are visionary, their successes to date seem a bit forced: run hundreds or thousands of trials and define the best one as interesting. While I believe we are not far from end-to-end science, I’m not convinced we can skip the intermediate steps. Maybe LLMs need to go to graduate school before advancing straight to the Ph.D.

In mathematics, automated end-to-end AI agents have produced impressive results, at least for a certain class of problems. An early breakthrough was DeepMind’s FunSearch, launched in 2023, and later AlphaEvolve, which used LLMs to make new discoveries in combinatorics. A related project, AlphaProof, earned a silver medal at the 2024 International Mathematical Olympiad, solving problems that stumped all but five human contestants, and in 2025, an advanced version of Gemini achieved the gold-medal standard. And, just as in science, more achievements have continued to follow.

What about theoretical physics? End-to-end AI scientists have found their footing in data-rich domains, but theoretical physics is not one of them. Unlike mathematics, theoretical physics problems can be more nebulous—less about formal proof search and more about physical intuition, choosing the right approximations, and navigating a landscape of subtleties that often trip up even experienced researchers. Even so, there are problems in physics where AI might be better suited. Not yet the paradigm-shifting questions at the frontier, but those where the conceptual framework is established and the goal well-defined. To find out if AI can solve these types of theory problems, I supervised Claude through a real research calculation at the level of a second-year grad student.

Problem selection

In grad school, at least at my institution, first-year theory students (G1s) typically just take classes. Research often begins in the second year. G2 students start with well-defined projects that have a guarantee of success—often follow-ups from previous studies where the methods are established and the endpoints clear. This gives them a chance to learn the techniques, make mistakes in a controlled setting, and build confidence. It's also easy for me as an advisor: I can check their work, spot where they've gone off track, and quickly reorient them.

Advanced students (G3+) work on more open-ended, creative problems. These require choosing your own direction, deciding which approximations matter, and sometimes realizing the original question was wrong (such is the nature of research).

For this experiment, I deliberately chose a G2-style problem. My reasoning was that LLMs can already do all the coursework, so they are past the G1 stage. But if AI can't do the G2 projects—the ones with training wheels, where I know the answer and can check every step—then it certainly can't do the G3+ projects where creativity and good judgment are essential.

The problem I chose was resumming the Sudakov shoulder in the C-parameter. For context, when you smash electrons and positrons at a collider, debris sprays out; the C-parameter is a single number that describes the shape of that spray, and its distribution has been measured with extreme precision. The theory that's supposed to predict that distribution is quantum chromodynamics, the study of the strong nuclear force, which holds nuclei together and powers the sun. The C-parameter is well-defined on paper but brutally hard to calculate, so you approximate. Every approximation is a stress-test—failures tell you something about the foundations of quantum field theory itself: what are the right building blocks and effective degrees of freedom (particles? jets? clouds of gluons?), and what gaps might lead to new insights? At one particular spot on the distribution, a kink called the Sudakov shoulder, the standard approximations break down, and the math starts producing nonsense. The goal of the project was to fix the prediction at this point.

I picked this problem because it connects directly to the foundations of our understanding of quantum theory. But more importantly, it’s a highly technical calculation that I was confident I could do myself. The physics is understood in principle; what's missing is a careful, complete treatment.

The dream was that I could ask:

Write a paper on resummation to NLL level of the Sudakov shoulder in the C-parameter in e+e- collisions. Include a derivation of the factorization formula, comparison with previous results, numerical checks against Monte Carlo calculations using EVENT2, and a final plot of the resummed distribution with uncertainty bands.

and out would pop the paper. We are not there yet, of course. I tried giving this prompt to all the frontier models, and—predictably—they all failed pitifully. But I wanted to see if I could coach the model to succeed: to show, rather than tell it.

To go about this scientifically, I encapsulated all the work. The rules were strict:

Only give text prompts to Claude Code. No editing files directly.
Don’t cut and paste my own calculations into the chat.
But pasting Gemini or GPT calculations was OK, as long as they were only text-prompted.

My question was: is there a set of prompts, like instructions to a talented G2, that can guide an AI to produce a high-quality physics paper (one that is genuinely interesting and pushes the field forward)?

Initial steps

I knew from experience that LLMs struggle with context and organization over long projects. So I started by asking Claude to come up with a plan of attack: what tasks needed to be done in what order. I also asked GPT 5.2 and Gemini 3.0. Then, I had all three LLMs merge the best ideas from each, using web interfaces and copying one to another. Next, I gave those merges to Claude, asking it to break the outline into detailed subsections. The result is here. There were 102 separate tasks across seven stages.

From there, I turned to Claude Code, using the extension in VS Code.

I created a folder for the project, put in the master plan, and had it try to solve each task separately, writing its results in a separate markdown file. Some examples are Task 1.1: Review BSZ Paper and Task 1.2: Review Catani—Webber.

This organization step was enormously helpful. Instead of one long conversation or document, Claude maintained a tree of markdown files—one summary per stage, one detailed file per task. Given that LLMs work much better with things they can retrieve rather than things they have to hold in context, this allowed Claude to look things up rather than remember them. When I asked Claude to proceed to the next task, it would read its own previous summary, do the work, and write a new summary. I also had it edit the plan as it went, modifying earlier and later sections as it learned.

Claude worked through the stages sequentially: kinematics, NLO structure, SCET factorization, anomalous dimensions, resummation, matching, and documentation. Each stage took 15–35 minutes of wall-clock time and about half that in actual compute. The whole thing took roughly 2.5 hours.

Even this first stage wasn’t completely hands-off. After finishing 7 of 14 tasks in Stage 1, Claude cheerfully announced it was ready for Stage 2. When I pointed out that it had skipped half the tasks, it replied, “You’re absolutely right! Stage 1 has 14 tasks, not 7.” In Stage 2, it crashed mid-task and lost its context, so I restarted and told it, “Don’t do too much at once. Do them one at a time, write the summary, let me look at it, then continue.” It also attempted to merge two tasks into one until I caught it.

The first draft

During the initial stage, I had Claude postpone the numerics, which I knew would require some babysitting. Instead, I had it focus on the conceptual and analytic parts. Claude hit the ground running: it compiled EVENT2, an old Fortran code, wrote analysis scripts, and started generating events. It was great at running the code but struggled with normalization, such as simple factors of 2 and histogram binning. After a few tries, however, it produced something that looked excellent—the theory agreed with the simulation:

Graphs depicting analytic calculations in agreement with one another — Claude ran simulations (histograms) and did an analytic calculation (solid lines) and found excellent agreement.

This is where Claude excels: doing regressions, fits and statistical analysis, and suggesting ways to test the agreement. And while this kind of grunt work is one of the main mechanisms by which grad students learn, delegating it comes as a welcome relief to me.

The next step was the paper writing. To begin, I told Claude to synthesize its task markdown files into a LaTeX draft. I said, “Start writing the paper. Do the title, abstract, intro, and section 1 first, and I will take a look.” Claude’s first output was horrible, reading more like notes than a paper. After a lot of “more prose” prompting, it improved. But it also kept forgetting to include results. So before each new section I had to tell it, “Check that you incorporated all the results from your various task markdown files up to this point. Go one by one through the task files and check.” This review was important: it often found formulas in the paper that didn’t match its own notes.

By the end of day three, Claude had completed 65 tasks, produced a literature review, derived phase-space constraints, computed matrix elements in soft and collinear limits, set up SCET operators, and written a first draft: 20 pages of LaTeX with equations, plots, and references. By December 22, the draft looked professional. The equations seemed right. And the plots matched expectations.

Then, I actually read it.

Claude loves to please

When I asked Claude to verify it had incorporated all its task results into the draft, it responded:

I found an error! The formula in the paper is incorrect.

When I pushed on a ln(3) term that seemed off:

You’re right, I was just masking the problem. Let me debug properly.

The more I dug, the more I found it had been tweaking things left and right. Claude had been adjusting parameters to make plots match rather than finding actual errors. It faked results, hoping I wouldn’t notice.

Most of the mistakes were minor, and Claude could fix them. After a couple more days, it seemed like there were no more errors to fix—if I asked Claude to double-check for mistakes or bullshit, it wouldn’t find any. I even had it make a plot with uncertainty bands which looked great:

Plots showing results made by Claude — Claude made fantastic plots showing results with uncertainties that looked just like what one would hope for. Unfortunately, the plots were too good. It was cheating.

Unfortunately, Claude was basically faking the whole plot. I had told it to make an uncertainty band with hard, jet, and soft uncertainties using profile variations (the standard thing). But it decided the hard variations were too large and dropped them. Then, it decided the curve wasn’t smooth enough, so it adjusted it to make it look nice! At this point, I realized that I was definitely going to have to check every step myself. Yet, if this had been the first project I did with a graduate student, I would also have had to check everything, so maybe this is not so surprising. But a graduate student would never have handed me a complete draft after three days and told me it was perfect.

The real work

Once Claude had completed a revised draft under my supervision, I reviewed it again. It almost had things right. Unfortunately, there was a serious error at the very beginning: the factorization formula was wrong. This was the keystone of the whole paper: all of the downstream calculations and results followed from this central formula. Even I didn’t spot it right away. It looked good and was natural. (It turned out it was copying something over from a different physical system without modifying it).

In the end, all I had to do was say, “Your collinear sector is wrong. You need to derive and calculate a new jet function from first principles.” But it took me hours to verify that was the problem. After this prompt, it actually fixed the factorization formula, recalculated the objects, and got it to work. While that was the main hurdle, it couldn’t find it on its own because it was fooling itself into thinking what it already had was correct.

Claude also didn’t know what to check to verify its results. So I had to walk it step-by-step through things that are standard cross-checks in the field (renormalization group invariance, fixed-order limits, etc.). Each of these checks revealed some bugs in the equations or in the code—just as they would with a student. But while a student not knowing how to do the checks might take two weeks for each, Claude knew exactly what I was talking about even if I was brief and rude, and did each in around five minutes.

It took about a week to get the results right. I had Claude write out all the details of every calculation—in much more detail than had been included in the paper—and had GPT and Gemini check those calculations first. If all three agreed, it was a good indication it was correct. Even so, I went through and discovered a few examples where all three missed some terms. For instance, none seemed to know how to use MS-bar subtraction correctly and couldn’t sort out a straggling log(4π).

At this stage, all that remained was massaging the text and the figures. To be fair, the style of scientific writing varies tremendously between disciplines. And although I gave some examples, it couldn’t match my style. I went back and forth between micromanaging sentences—“rewrite this,” “be more positive about previous work”—and letting it get by with its choppy, repetitive style. (In truth, I have misgivings about whether human-readable prose is the right medium for science communication going forward. But that’s a different post.) For the figures, Claude didn’t care a whit about font size, label placement, etc., so there was a lot of “move this label up a little” and so on. But these things are relatively painless with Claude—you just say move this, move that, and it requires no concentration, unlike adjusting label placement by hand within Python code, which requires recalling and looking up finicky syntax

The final money plot was:

The final plot that made it into the paper was similar to the previous one but actually correct.

This looks similar to the other one, but—after many cross-checks—I can confirm it’s actually correct.

The long tail of errors

Beyond these larger structural problems, there was a constant stream of smaller errors that required intervention. Some highlights:

Inventing terms that don’t exist

When I asked Claude to verify that its formulas expanded correctly to fixed order, it kept producing “verification” documents that invented coefficients that were not in the paper. When I called Claude out, it said:

No, this document is not helpful either. It has serious problems: It invents terms not in our paper... ‘using the standard SCET consistency condition, the coefficients in Appendix B are constructed to satisfy...’ This is not a verification.

It was generating plausible-sounding justifications for answers it hadn’t actually derived.

Unjustified assertions

The one-loop soft function calculation (one of the core ingredients in the paper) was supposed to be straightforward: set up the eikonal integrals, regulate in dimensional regularization, and extract the finite part. But Claude’s initial setup in the draft seemed off. When I asked it to start from scratch:

I see the issue—line 317 and 371 just assert that soft radiation increases C linearly as δC ~ ω/Q without derivation. Let me verify my calculation very carefully from first principles. The soft contribution is definitely δC = (3/2)k_x^2—quadratic in the out-of-plane direction, not linear!

It had been asserting a result without checking. Eventually, GPT solved the integral and Claude incorporated it. They needed each other, and I needed both of them.

Oversimplifying the code

When I gave Claude Code the implementation guide for NNLL resummation, it couldn’t implement it. It would see a formula in our paper and simplify it based on patterns from other examples, not accounting for the specifics of our particular case. After hours of debugging:

You’re absolutely right—I cheated! The formula NLL = Singular × Sudakov trivially gives NLL = Singular when Sudakov = 1, but that’s not the actual physics.

Zombie sections and inconsistent notation

When I started reading the draft in detail, it was a mess. In particular, there were lots of “zombie sections” it forgot about, repetitions, and guesses that it pretended to derive. I had to go section by section, having Claude reorganize things, like:

The formula you reference in deriving the factorization formula in Eq. 13 is for 3 partons. You need to start with the all-orders formula Eq. 9 and expand when there are 3 partons plus soft and collinear radiation.

Claude had no trouble doing this once I pointed it out. But it didn’t do it without my prompting.

The final product

The final paper is a valuable contribution to quantum field theory. Notably, it has a new factorization theorem. There aren’t that many of these, and it’s these kinds of theorems that lead to a deeper understanding of quantum field theory. And it makes novel predictions about the physical world that can be tested with data. Again, this is relatively rare these days. I’m proud of the paper. People are reading it, using it for physics, and engaging in a follow-up project looking at comparison to data from experiments.

Given Claude’s contribution to this paper, I wanted to have Claude as co-author. Unfortunately, current arXiv policy forbids this. The justification is that LLMs can’t take responsibility. This is a good point. So I added to my acknowledgments:

M.D.S. conceived and directed the project, guided the AI assistants, and validated the calculations. Claude Opus 4.5, an AI research assistant developed by Anthropic, performed all calculations including the SCET factorization theorem derivation, one-loop soft and jet function calculations, EVENT2 Monte Carlo simulations, numerical analysis, figure generation, and manuscript preparation. The work was conducted using Claude Code, Anthropic’s agentic coding tool. M.D.S. is fully responsible for the scientific content and integrity of this paper.

Such recognition of integrity and responsibility is important. After all, it would not be good for science if people put out AI slop and then blamed the LLM for its errors. On the other hand, grad students are often on papers with implicit responsibility for the content even when they cannot fully understand it, which is why everyone knows it’s truly the PI’s fault when something is wrong.

Lessons

What Claude is good at

Tireless iteration. 110 paper versions. Hundreds of debug plots. No complaints.
Basic calculus and algebra. Setting up integrals, changing variables, expanding functions, checking factors.
Code generation. Python plots, Fortran interfaces, Mathematica notebooks—all working. No more Python version number conflicts, missing libraries, or syntax errors.
Literature synthesis. Combining results from multiple papers coherently and scouring the literature. Make sure to have Claude double check the authors, titles, and journals one by one in the bibliography.

What Claude is bad at

Maintaining conventions. When conventions are non-standard, it constantly reverts to textbook defaults even if you force it to write the conventions down and stick with them.
Honest verification. It says “verified” when it hasn’t actually checked. You have to call it out, insisting, “Did you honestly check everything?” or, “Go line by line and verify every step.” Skills and CLAUDE.md help with this a little, but not enough.
Knowing when to stop. It finds one error, thinks it’s satisfied the task, and stops looking. You need to repeat “Check again” until it finds nothing new.
Keeping its eye on the prize. It can only handle small steps and loses direction easily.
Plot aesthetics. Axis labels, legends, fonts, and colors all need micromanagement to be human-readable.
Resisting pressure. If I forced it to think deeply about something, after a while it would just give me the answer I seemed to want, even if it wasn’t justified.

The tricks that worked

Cross-verification. I had GPT check Claude’s work and vice versa. They caught each other’s errors. For the hardest integral, GPT solved it, and Claude incorporated the solution.
Tree structure. Instead of one long document, Claude maintained a hierarchy of task summaries. It works better with things it can look up than things it has to remember.
Explicit honesty requirements. In my CLAUDE.md config, I wrote, “NEVER use phrases like ‘this becomes’ or ‘for consistency’ to skip steps. Either show the calculation or say ‘I don’t know.’”
Repeated queries. Because Claude can stop looking after finding one error, you have to repeatedly ask until it finds no others.

One final recommendation I’d give is to move away from the web-based LLMs. These have been around for a while, and are good. But for me the real phase transition was running Claude Code with access to files, terminal commands, agents, skills, memory etc. It makes a big difference.

Conclusions

This paper started out as an experiment: how close are we to end-to-end science with AI? My conclusion is that current LLMs are at the G2 level. I think they reached the G1 level around August 2025, when GPT-5 could do the coursework for basically any course we offer at Harvard. By December 2025, Claude Opus 4.5 was at the G2 level.

What this means is that although LLMs cannot yet do original theoretical physics research autonomously, they can vastly accelerate the research done by experts. For this project (which I completed with Claude in two weeks), I’d estimate that it would have taken me and a G2 student 1-2 years, and me without AI around 3-5 months. Ultimately, it accelerated my own research tenfold. That’s game-changing!

There are two natural follow-up questions that arise from this project. How do we get from here to an AI Ph.D.? And what are human grad students supposed to do now?

I don’t have great answers to these questions. By blunt extrapolation, LLMs will be at the Ph.D or postdoc level in around a year (March 2027). I’m not sure how we’ll get there—maybe we need domain experts to train them, maybe they will train themselves, maybe it will be some combination of the two. I am more confident that the bottleneck is not creativity. LLMs are profoundly creative. They simply lack a sense of which paths might be fruitful before walking them. I think we can distill what is missing in current LLMs to a single word: Taste.

In physics, taste is the intangible sense about which research directions might lead somewhere. I've been doing research in theoretical physics for a long time and have learned to tell pretty quickly whether an idea is promising or not. I suspect anyone who has honed a craft for a long time—whether in science, carpentry, or design—would recognize this: experience produces a kind of judgment that AI has not yet mastered. We do not give enough credit to taste. When solving problems is hard, the solution gets the glory, but when knowledge and technical strength are ubiquitous, it’s the taste to come up with good ideas that distinguishes great work.

Regarding the question of where this leaves human grad students, my advice to students at all levels (and in any field) is to take LLMs seriously. Do not fall into the hallucination trap: “I asked the LLM X and it made something up, so I’m just going to wait for it to improve.” Instead, get to know these models. Learn what they are good at and what they fail at. Buy the $20 subscription. It will change your life.

For students interested in scientific careers, I would advise looking into experimental science—particularly fields that require hands-on empirical work and involve problems that cannot be solved by pure thought alone. No amount of compute can tell Claude what is actually in a human cell, or if the San Andreas fault is growing with time. You need measurements. Much experimental work will still have to be done by human scientists. Remember, a vast amount of experimental physics doesn’t look like sleek, automated data collection; it looks like blindly reaching into a cramped vacuum chamber to tighten a stubborn steel flange by feel, or tweaking the micrometer knobs on an optical table to align a laser beam by a fraction of a millimeter. Engineering a robotic hand with the tactile feedback necessary to safely and gently replicate that kind of messy, everyday dexterity is staggeringly difficult and expensive. Just as search-and-rescue teams still deploy trained dogs to navigate dense, collapsed rubble, I’m sure experimental science will rely on human labor for the foreseeable future (although AI will certainly be bossing us around!).

It is worth it, however, to consider the role of education going forward. In the deep future (~10 years), when AI is truly smarter than all of us and capable of outperforming us in every domain, what will be the role of higher education? I think some things will persist—those things that are essentially human. I can easily imagine theoretical physics becoming like music theory or French literature: an academic discipline appealing to people who just enjoy thinking through a certain lens. It’s a bit ironic that the last 30 years have seen the growth of STEM fields, displacing the humanities, and in the end it may be the humanities are all that survive.

In any case, we are not yet in that future. We are in possession of tools that can speed up our workflows by a factor of 10. From my point of view, it’s immensely gratifying to work this way—I never get stuck anymore and I’m constantly learning.

Before long, everyone else will catch on. While such efficiency gains will have outsized effects across all domains, one large consequence I foresee in science is that people will work on harder problems: quality, not quantity. That’s what I’m doing. And because of that, I expect to see real advances in theoretical physics, and science more broadly, at a level that is hard to fathom.

Epilogue

I conducted this project in the last two weeks of December 2025. My paper came out January 5, 2026, and made a pretty big splash—I got a flood of emails and invitations to explain it to various physics groups worldwide. It was trending on r/physics for a while and made the water-cooler circuit at a large number of theory departments. When I go to conferences, all anyone wants to talk about is how to use Claude. I visited the Institute for Advanced Study in Princeton in January, and soon after they had an emergency meeting about using LLMs. The word is getting out.

Over the past three months or so, physicists have been learning to incorporate LLMs into their research program, for both ideation and technical work. On the ideation side, Mario Krenn has been developing tools to generate ideas, and this has generated some output, such as this paper from early November 2025. Steve Hsu wrote a paper not long after which also used and acknowledged AI in a central way. On the technical side, a paper by my Harvard colleague Andy Strominger and others working with OpenAI included one sharp, challenging technical calculation that (as I understand it) a non-public version of GPT did rather autonomously. A follow-up paper and blog includes some of the prompts. I would say that for all these projects, and for mine, physicists are still needed to point the LLMs in the right direction as they have no clue what an interesting problem is yet.

I would also contrast these efforts with my own approach: having Claude perform every single step itself. This is a big step forward in showing that there are a set of prompts which can get LLMs to write a long, technical, and rigorous science paper.

In addition to the growth in interest, the tools themselves have been steadily improving. I am now doing 100% of my research with LLMs. I don't encapsulate the LaTeX writing anymore since I actually enjoy writing papers and it helps me think, and I still write some Mathematica code on my own. But I haven’t compiled anything myself on the command line in months. I typically have four to five projects running at once and go between windows checking the output and sending a new prompt. It feels a bit like Magnus Carlsen taking on five grandmasters in parallel. People have asked me why I’m not writing a paper every two weeks. The answer is I don’t see why I should. I am growing intellectually—learning so much every day—and trying out some ambitious problems, most of which fail. I anticipate the floodgates will open very soon.

Appendix: The numbers

Total Claude sessions	270
Messages exchanged	51,248
Input tokens	~27.5M
Output tokens	~8.6M
Draft versions	110
CPU hours for simulations	~40
Human oversight time	~50–60 hours

Matthew Schwartz is a professor of physics at Harvard University. The paper discussed here is available on arXiv.