Imitation has a ceiling. Reward does not.
A foundation model trained to predict the next token learns to imitate, and at its best imitation only matches the people who produced the data, inheriting their errors, their blind spots, and the outer edge of their competence. That ceiling is the central limitation of pure prediction, and the reason reinforcement learning sits at the heart of our agenda rather than being bolted onto its side. A model trained to optimize an outcome is supervised by the environment's verdict, not by a teacher's example, and in principle it can surpass any demonstrator, discovering strategies no human ever performed.
The proof of concept is old: the strongest game-playing systems reached superhuman skill not by copying masters but through self-play and reward, which carried them past the human frontier into novel play. Our work generalizes that result from closed, perfectly scored games to the messy, partially observed, sparsely rewarded domains where consequential problems live. We take the foundation model as the starting policy, exploiting the knowledge and language it already holds, and improve it from the consequences of its own actions.
Where the learning signal comes from
The hardest part of reinforcement learning is rarely the optimizer; it is the reward. We cannot write down the function we actually want. The goals we care about — be helpful, be honest, do sound science, do not deceive — do not reduce to clean scalar functions of an agent's observations. When we approximate them we get proxies, and a capable optimizer finds the gap between the proxy and the intent with relentless creativity. This is reward hacking: not a bug to be patched but a structural feature of optimization, since the cheapest path to high reward is rarely the path we meant. The more capable the system, the more reliably it finds the exploit.
Because no single signal is robust, we run a portfolio of three.
- Verifiable reward: wherever an outcome can be checked automatically, such as unit tests for code, formal verifiers for proofs, exact-match for math, and task completion for agentic workflows. These signals are trustworthy and largely sidestep the pathologies of learned reward models, though a pass/fail check is often sparse. The craft lies in shaping reward so the policy finds better strategies rather than exploiting quirks of the checker.
- Learned preferences: for the vast space of tasks no verifier covers, a reward model trained on human or AI judgments supplies the signal. Reward models are themselves proxies and themselves exploitable, so we defend with regularization toward the reference policy, ensembles that flag when the policy has wandered off the support where the model is trustworthy, and repeated retraining as behavior shifts.
- Self-play and self-improvement: the most powerful family and the least tamed. Two-player games hand you a curriculum of ever-stronger opponents; single-agent settings force you to manufacture one. We pursue generator–verifier asymmetries (propose a solution, separately critique it, bootstrap competence from the gap) and self-generated curricula in which the agent invents tasks at the frontier of its own ability.
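The defenses named for learned preferences can be made concrete with a small sketch. This is an illustration under stated assumptions, not the pipeline described here: `shaped_reward`, `beta`, and `disagreement_penalty` are hypothetical names, the drift term is the common single-sample log-ratio estimate of divergence from the reference policy, and ensemble spread stands in for "the policy has wandered off the support".

```python
import statistics

def shaped_reward(rm_scores, logp_policy, logp_reference,
                  beta=0.1, disagreement_penalty=1.0):
    """Combine an ensemble of reward-model scores with two safeguards.

    rm_scores: one score per ensemble member for the same sample.
    logp_policy, logp_reference: log-probability of the sample under the
        current policy and under the frozen reference policy.
    All names and default weights here are hypothetical.
    """
    mean_score = statistics.fmean(rm_scores)
    # High spread means the ensemble members disagree, a hint that the
    # sample lies where the reward model is not trustworthy.
    spread = statistics.pstdev(rm_scores)
    # Single-sample estimate of how far the policy has drifted from the
    # reference on this sample.
    drift = logp_policy - logp_reference
    return mean_score - disagreement_penalty * spread - beta * drift

# Agreement and no drift: the mean score passes through unchanged.
print(shaped_reward([1.0, 1.0, 1.0], logp_policy=-5.0, logp_reference=-5.0))
# Disagreement among ensemble members lowers the reward.
print(shaped_reward([0.0, 2.0], logp_policy=-5.0, logp_reference=-5.0))
```

The weights trade off trust in the reward model against freedom to move away from the reference policy; how to set them is not specified in the text.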
Process over outcome
A signal that rewards only the final answer says nothing about whether the answer was reached for the right reasons. A system trained purely on outcomes can learn to be correct by routes that do not generalize — spurious correlations, lucky guesses, invalid steps that happen to land. So we increasingly reward the reasoning and not only the result. Sound methods generalize where lucky guesses do not, and a process we can supervise is one whose failures we can anticipate. Rewarding good deliberation is also what lets a model learn to think for longer on harder problems — turning inference-time compute into a second axis of capability, distinct from the scaling of training.
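A toy example shows the difference between the two signals. The sketch below assumes a hypothetical trace format, a list of `(left, op, right, claimed)` arithmetic steps, and a step checker that only verifies each step in isolation; a real process reward would also need to check that steps follow from one another.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def step_is_valid(step):
    """Check one step of a trace: does left op right equal the claim?"""
    left, op, right, claimed = step
    return OPS[op](left, right) == claimed

def outcome_reward(steps, target):
    """Reward only the final claimed value."""
    return 1.0 if steps and steps[-1][3] == target else 0.0

def process_reward(steps):
    """Reward the fraction of individually valid steps."""
    if not steps:
        return 0.0
    return sum(step_is_valid(s) for s in steps) / len(steps)

# Two traces for (3 + 4) * 2 = 14.
sound = [(3, "+", 4, 7), (7, "*", 2, 14)]
# Two invalid steps whose errors happen to cancel.
lucky = [(3, "+", 4, 6), (6, "*", 2, 14)]

for name, trace in (("sound", sound), ("lucky", lucky)):
    print(name, outcome_reward(trace, 14), process_reward(trace))
```

Both traces earn full outcome reward, but only the sound one earns any process reward, which is the distinction the paragraph above draws.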
Search supplies the deliberation
A learned model gives an agent fast intuition; explicit search gives it deliberation. The pattern that proved decisive in games — expand a tree of imagined futures, use a learned value function to prune it, fold the result back into the policy — is one of the most powerful in the field. We mean to carry it from closed, fully observed games into open domains where the dynamics must be learned rather than handed to us. The model proposes; the search deliberates; the value learned from the search's outcomes sharpens the model's next intuition.
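The expand-and-prune loop can be sketched on a toy problem. This is a plain beam search, not the tree search used by game-playing systems, and every name in it is hypothetical: the task (reach a target integer using `inc` and `double`), the hand-written `value` function standing in for a learned one, and the `beam_width` and `depth` parameters.

```python
ACTIONS = {"inc": lambda s: s + 1, "double": lambda s: s * 2}

def value(state, target):
    """Stand-in for a learned value function: closer to target is better."""
    return -abs(target - state)

def search(start, target, depth=6, beam_width=3):
    """Expand imagined futures and keep only the most promising ones."""
    frontier = [(start, [])]          # (state, actions taken so far)
    best = (start, [])
    for _ in range(depth):
        # Expand: imagine every action from every surviving state.
        children = [(step(state), path + [name])
                    for state, path in frontier
                    for name, step in ACTIONS.items()]
        # Prune: the value function decides which futures survive.
        children.sort(key=lambda c: value(c[0], target), reverse=True)
        frontier = children[:beam_width]
        if value(frontier[0][0], target) > value(best[0], target):
            best = frontier[0]
        if best[0] == target:
            break
    return best

state, plan = search(start=1, target=10)
print(state, plan)
```

In a full system the plan returned by the search would become a training target for the policy and the value function; that feedback step is omitted here.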
An agent that optimizes an outcome will optimize exactly the outcome we specify — and not the one we intended. Alignment is therefore not a constraint on the research. It is part of it.
The horizon problem
A system that answers questions is an oracle. A system that pursues goals over time, living with the consequences of its actions, is an agent — and the transition between them is, in our view, the most consequential capability frontier of the coming period. An agent must hold coherent goals across long horizons, decompose an abstract goal into a tree of executable subgoals, recover from errors without starting over, and decide when it has gathered enough to act and when it must gather more.
The difficulty is not any single decision; the underlying model is usually capable of each step. The difficulty is composition over the horizon. Errors compound. A system right ninety-five percent of the time at each step is, after a hundred dependent steps, almost certainly somewhere it did not intend to be: reliability that is more than adequate for one response becomes catastrophically inadequate for a long autonomous trajectory. Agency is therefore not simply more capability but a distinct regime with its own failure modes, and progress on it demands advances in error detection, error recovery, and a system's calibrated sense of whether it is still on track. Long horizons are also expensive in the real world: where the environment is a robot, a live system, or a human collaborator, interaction is slow, costly, and sometimes irreversible. The data hunger of reinforcement learning, tolerable in a simulator running millions of episodes, becomes prohibitive, and that constraint points directly at world models.
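The arithmetic behind the ninety-five percent figure is short enough to check directly. The sketch assumes step errors are independent and that the agent never detects or recovers from a mistake, which is the worst case the paragraph describes; the function names are illustrative.

```python
def trajectory_success(per_step_accuracy, steps):
    """Probability that every one of `steps` dependent steps is correct,
    assuming independent errors and no recovery."""
    return per_step_accuracy ** steps

def required_accuracy(target_success, steps):
    """Per-step accuracy needed to finish `steps` steps with the given
    overall success probability, under the same assumptions."""
    return target_success ** (1.0 / steps)

print(f"{trajectory_success(0.95, 100):.4f}")  # 0.0059
print(f"{required_accuracy(0.5, 100):.4f}")    # 0.9931
```

Under these assumptions a hundred-step trajectory succeeds well under one percent of the time, and even a coin-flip chance of success needs better than ninety-nine percent per-step accuracy. Error detection and recovery matter precisely because they break the no-recovery assumption.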
Imagining before acting
An agent that learns only by trial in the world is bound by the speed and cost of the world. To act well in a situation it has never seen, it must be able to imagine — to run candidate actions forward internally and choose the one whose predicted outcome it prefers. That capacity for internal simulation is what we mean by a world model: a learned, predictive representation of the environment's dynamics, rich enough to plan inside. It is what turns a reactive policy into a deliberative one.
The argument from machine learning is sharp. Model-free reinforcement learning is profligate with experience because each datum teaches only about the action actually taken. A world model is trained on every transition the agent observes, regardless of reward, and once learned it can be queried without limit. This eases the sample-efficiency problem: pay the high cost of real interaction to learn the model, then pay the low cost of computation to plan within it. We learn the model in a compact latent space rather than at the level of raw pixels, concentrating capacity on what is predictable and relevant and discarding the texture of walls the agent will never touch.
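The pay-once, plan-cheaply trade can be illustrated with a deliberately tiny example. The sketch below is tabular rather than latent-space, and everything in it is a hypothetical stand-in: a deterministic five-state chain as the environment, a dictionary as the world model, and value iteration as the planner.

```python
GAMMA = 0.9
STATES = range(5)
ACTIONS = (-1, +1)
TERMINAL = 4

def env_step(state, action):
    """The real (expensive) environment: a 5-state chain with a goal at 4."""
    nxt = min(max(state + action, 0), TERMINAL)
    reward = 1.0 if nxt == TERMINAL else 0.0
    return nxt, reward

def learn_model():
    """Record every observed transition, whether or not it is rewarded."""
    model, real_steps = {}, 0
    for s in STATES:
        if s == TERMINAL:
            continue
        for a in ACTIONS:
            model[(s, a)] = env_step(s, a)   # one real interaction
            real_steps += 1
    return model, real_steps

def plan(model, sweeps=50):
    """Value iteration that queries only the learned model."""
    values = {s: 0.0 for s in STATES}
    queries = 0
    for _ in range(sweeps):
        for s in STATES:
            if s == TERMINAL:
                continue
            backups = []
            for a in ACTIONS:
                nxt, reward = model[(s, a)]
                queries += 1
                backups.append(reward + GAMMA * values[nxt])
            values[s] = max(backups)
    policy = {
        s: max(ACTIONS,
               key=lambda a: model[(s, a)][1] + GAMMA * values[model[(s, a)][0]])
        for s in STATES if s != TERMINAL
    }
    return policy, queries

model, real_steps = learn_model()
policy, model_queries = plan(model)
print(real_steps, model_queries, policy)
```

Eight real interactions support four hundred imagined ones here. The ratio is arbitrary, but the asymmetry is the point: real steps are the scarce resource, model queries are not.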
The hard problem is causal. A predictor trained on passive observation learns the distribution of what happens; it does not necessarily learn what would happen if the agent did something it has never done. An agent that plans with a non-causal model confidently takes actions whose consequences it has mismodeled, confusing co-occurrence in its data with the effects of its own intervention. That gap cannot be closed by passive prediction alone; it requires acting, observing the result, and updating on consequences the agent actually caused. This is why the predictive and experiential paths must be married: observation supplies the breadth that makes learning efficient; intervention supplies the causal validity that makes planning safe.
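A small simulation makes the gap between co-occurrence and intervention visible. The setup is a hypothetical worst case chosen for clarity: a hidden cause `z` fully determines the outcome, the action has no effect at all, and in the passive data the demonstrator's action happens to track `z` exactly.

```python
import random

def simulate(n, intervene, rng):
    """Generate (action, outcome) pairs.

    Passive data: the action copies the hidden cause z.
    Interventional data: the agent sets the action itself, at random.
    In both cases the outcome depends only on z, never on the action.
    """
    samples = []
    for _ in range(n):
        z = rng.random() < 0.5
        action = (rng.random() < 0.5) if intervene else z
        outcome = 1.0 if z else 0.0
        samples.append((int(action), outcome))
    return samples

def effect_estimate(samples):
    """Mean outcome when action=1 minus mean outcome when action=0."""
    on = [y for a, y in samples if a == 1]
    off = [y for a, y in samples if a == 0]
    return sum(on) / len(on) - sum(off) / len(off)

rng = random.Random(0)
passive = effect_estimate(simulate(10_000, intervene=False, rng=rng))
active = effect_estimate(simulate(10_000, intervene=True, rng=rng))
print(f"passive estimate: {passive:.2f}")
print(f"interventional estimate: {active:.2f}")
```

The passive estimate credits the action with the entire outcome, while the interventional estimate comes out close to zero, the true effect. A planner trusting the passive estimate would take the action expecting a result it cannot cause.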
Open questions
- Reward beyond the verifiable. We can check code and proofs; we cannot cheaply check whether an essay is insightful or a research direction is promising. The most useful tasks are precisely the ones whose outcomes resist verification, and inventing trustworthy reward there gates the whole program.
- Exploration in vast action spaces. A foundation-model agent's action space is every token sequence it can emit. Discovering genuinely novel strategies means departing from the prior in directed ways — exploiting the model's knowledge while escaping its imitative gravity. Principled, scalable exploration here is unsolved.
- Compounding error over long horizons. Autoregressive rollouts accumulate error: each step conditions on the model's own imperfect last step, and planning over long horizons becomes planning in an increasingly fictional world. Faithful rollouts need calibrated uncertainty that tells the planner when to distrust the model.
- Stability and reproducibility. Reinforcement learning at scale is brittle: small changes in hyperparameters, reward scaling, or random seed can swing outcomes wildly. We need both a deeper theory of these dynamics and engineering that makes large RL runs as predictable as pretraining has become.
- Self-improvement without divergence. The aspiration is a system that improves without bound by generating its own signal. The hazard is that it optimizes a proxy that drifts from what we intended. Whether self-play and self-generated curricula can be made to converge — monotonically, without collapse — is our most speculative and most consequential bet.