Theorem proving as a frontier

Mathematics holds a singular place in our research agenda, and it is worth being precise about why. It is the one domain in which truth is mechanically verifiable. A proof, once expressed in a formal system such as a dependent-type-theory proof assistant, is either correct or it is not, and the verdict is rendered by a kernel of a few thousand lines of code that no rhetoric can persuade and no confidence can fool. There is no panel of referees to convince, no consensus to wait on, no margin in which a plausible-sounding step quietly fails. The check is exact, it is cheap, and it is final.

That property makes mathematics far more than a benchmark, though it is an exceptionally clean one. Mathematical capability is reasoning in its purest form — long-horizon, compositional, creative deduction with nowhere to hide — and progress here is a direct probe of whether our systems genuinely reason or merely interpolate over patterns they have seen. A model that proves a theorem it has never encountered, against a checker that cannot be charmed, has demonstrated something no leaderboard of multiple-choice questions can establish. For a laboratory whose mission is to build systems that advance science, that is the frontier we most want to stand on.

Why mathematics is an ideal frontier

Three features make the domain almost uniquely suited to machine reasoning. It is verifiable: every claim reduces to a check a small trusted kernel can perform. It is infinite: the space of true statements is unbounded, so a system can manufacture as much practice as it can afford, generating conjectures, attempting proofs, and learning from the verified successes with no human in the loop for the verification step. And it is deep: the problems range continuously from the trivial to the unsolved, which means there is always a rung just above the system's current reach, and always a ladder to climb toward it.

~7k

Lines of trusted kernel that decide every proof

∞

True statements available as self-generated practice

Hallucinated steps a verified proof can contain

Formal proof search inside the assistant

Our first track places a model directly inside a proof assistant, where it does not write prose that someone later audits but emits tactics that the kernel checks at every step. The search space is the set of valid tactic applications, and because each one is verified as it is applied, hallucinated reasoning is eliminated by construction: a step that does not typecheck simply does not happen. The model cannot bluff its way forward, and that constraint, far from limiting it, is precisely what makes the learning signal trustworthy.

The machinery is a guided tree search over proof states. A policy network proposes the tactics most likely to make progress; a value network estimates how close a given proof state sits to completion; and the search, refined by reinforcement learning against the verifier, explores the resulting tree. The verifier supplies the reward, the search supplies the exploration, and the networks compress the lessons of millions of attempts into intuitions about which lines are worth pursuing.

The persistent obstacle is data. The corpus of human-written formal proofs is tiny next to the vast informal literature, because formalization is laborious and most mathematics was never written for a machine to read. We attack the shortage from several directions at once.

Autoformalization — train models to translate informal natural-language statements and proofs into formal ones, converting the enormous existing literature into checkable training material the proof-search model can consume.
Expert iteration — bootstrap from easier auto-generated problems toward progressively harder ones, with each round's solved problems becoming the next round's training data, so the system pulls itself up by its own verified successes.
Synthetic self-play — exploit the infinitude of the domain to generate fresh problems on demand, calibrated to sit just beyond current ability, so that practice never runs out and never goes stale.

Autoformalization, the bridge

Between the informal mathematics that humans write and the formal mathematics a kernel can check lies a translation problem, and autoformalization is the program devoted to it. A model reads a theorem and its argument in ordinary mathematical English — with its elisions, its "clearly," its appeals to a reader's judgment — and renders it as a fully explicit formal statement and proof, supplying every step the human author left tacit.

The payoff is leverage. If informal mathematics can be turned into formal mathematics reliably, then the entire accumulated literature becomes a source of verified training data, and the data wall that constrains proof search begins to fall. Autoformalization is therefore not a side project but the hinge on which the whole formal program turns: it is how the fluency of natural-language mathematics is married to the rigor of the checker.

Two interlocking lattices, one fluid and one rigid, meeting along a luminous seam

Wherever a verifier exists, the model is free to be creative — even reckless — because a wrong answer is caught with certainty and costs nothing but compute.

Informal reasoning, conjecture, and discovery

The second track embraces the intuitive register in which working mathematicians actually think. Here we use large reasoning models that generate natural-language arguments and then verify their load-bearing steps against formal checkers or numerical evidence, combining the fluency of language with the rigor of verification. The aim is not to abandon formality but to deploy it surgically — to reason expansively and check exactly where the argument carries weight.

Beyond proving theorems that are handed to it, we are drawn to the harder and more genuinely scientific act of mathematical discovery: proposing the right conjecture, finding the illuminating construction, identifying which of infinitely many true statements is actually worth proving. This is where mathematics stops being a closed exam and becomes open inquiry. Machine-learning systems have already surfaced unexpected patterns in pure mathematics, detecting structure in data that guided human mathematicians toward new relationships in knot theory and representation theory — the model as a generator of hypotheses, the human as the supplier of judgment and final proof. We believe this collaborative mode is the near-term shape of AI-accelerated mathematics, and a microcosm of AI-accelerated science generally.

A scatter of points resolving into an unexpected geometric pattern, a conjecture taking shape — The generator–verifier loop: a model proposes freely while a cheap, sound, mechanical oracle for truth catches every error with certainty — the structure mathematics offers and most domains do not.

The lesson that travels beyond mathematics

There is a methodological insight here that bears on rigor across all of our scientific work. Mathematics offers something almost no other domain does: a cheap, sound, mechanical oracle for truth. Wherever such an oracle exists, the generator can be bold, because the verifier's reliable judgment costs nothing to consult and never errs in the model's favor. The art is to design the generator–verifier loop so that the verifier's verdicts train an ever-stronger generator — and we regard the construction of such verifiers, and the reformulation of scientific questions into forms where verification is possible, as among the highest-leverage activities in AI for science.

Seen this way, mathematics is the limiting case of a pattern that recurs everywhere we work. The protein folded and measured by cryo-EM, the material synthesized in the lab, the forecast scored against tomorrow's weather, the controller run on a real tokamak — each is an instance of the same idea, a candidate proposed cheaply and checked against ground truth. Mathematics is simply the case where the check is exact and free, which is why we treat it both as a frontier worth conquering for its own sake and as the cleanest laboratory we have for learning how verification should govern reasoning.

Open questions

Can autoformalization become reliable enough to convert the informal literature into formal training data at scale, or does the gap between a human's tacit reasoning and a kernel's demands resist automation?
How far does expert iteration carry a system before it stalls — is there a difficulty ceiling that self-generated curricula cannot push through, and what supplies the next rung when there is?
Can a model learn to choose the worthwhile conjecture, exercising the taste that distinguishes a fruitful problem from a true but inert one, rather than merely proving statements it is given?
Does fluent informal reasoning, checked only at its load-bearing steps, remain sound at scale, or do the unchecked steps accumulate error in ways the verifier cannot catch?
What is the right division of labor between machine and mathematician, and how do we build tools that amplify human mathematical judgment rather than supplant it?