Honest, controllable systems

There is a gap at the center of frontier AI, and it widens with every increment of capability: we can train a system to do a thing long before we can reliably check that it did the thing well. This verification gap — the distance between a model's competence and our ability to judge its output — is not a temporary inconvenience to be closed by a cleverer benchmark; it is the structural problem alignment research exists to attack. A model that can prove a theorem we cannot follow, write a codebase no reviewer can hold in their head, or argue a case past the edge of our expertise has crossed into territory where the usual supervision breaks down. Reinforcement learning from human feedback works exactly as far as humans can tell good from bad, and not one step further.

Alignment is the problem of supervising what you cannot fully check

Our program is organized around a single demand: keep our systems honest and controllable as they grow more capable than the people overseeing them. That decomposes into work we can do — extending reliable supervision past the point where unaided humans can judge, seeking evidence about a model's internals that an adversary cannot game, and treating honesty as a load-bearing prerequisite — all held to a standard of measurement, because a safety claim we cannot test is one we do not have.

Four families of scalable oversight: recursive reward modeling, debate, weak-to-strong, process supervision

10⁶ interpretable features a sparse dictionary can extract from a single model's activations

One source of assurance, interpretability, that does not pass through the behavior an optimizer would game

Scalable oversight: extending the reach of a reliable signal

If the verification gap is the disease, scalable oversight is the most direct treatment: how do we produce a trustworthy signal for tasks whose outputs exceed an unaided human's ability to judge them? No single method carries the whole load, so we run a portfolio that amplifies, decomposes, or bootstraps human judgment, and measure where each breaks.

Recursive reward modeling — decompose an unevaluable task into individually evaluable sub-tasks, and use AI assistants, themselves trained on simpler sub-problems, to help humans judge progressively harder ones. The central risk is that errors compound up the levels.
Debate — pit two copies of a model against each other before a judge, on the hypothesis that spotting a flaw is easier than generating a correct argument. We hold this as a hypothesis, not an axiom, and run debate on tasks with known ground truth to learn whether the judged winner is the correct side or debate merely rewards persuasion.
Weak-to-strong generalization — ask how a weak supervisor can elicit the full capability of a strong model, because that is the situation we will be in. Early results suggest strong models can generalize beyond flawed supervision under the right regime — but that naive fine-tuning makes them imitate the supervisor's errors instead of surpassing them.
Process supervision and verifiable rewards — reward the reasoning rather than only the outcome, and ground the signal in machinery that cannot be sweet-talked: proof assistants, test suites, held-out experiments. Every objective grounded in a verifier is one removed from the harder regime of unverifiable human judgment.
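The compounding risk named for recursive reward modeling can be made concrete with a toy model. The sketch below assumes, purely for illustration, that the decomposition is a full tree with a fixed `branching` factor, that each evaluation is correct independently with probability `p`, and that the top-level judgment is right only if every evaluation beneath it is right. None of these assumptions comes from the program described here, and the function names are hypothetical. Real errors are correlated and can be partly caught at higher levels, so this is a pessimistic baseline rather than a prediction.

```python
def num_evaluations(branching: int, depth: int) -> int:
    """Count nodes in a full tree: 1 + b + b^2 + ... + b^depth."""
    if branching < 1 or depth < 0:
        raise ValueError("branching must be >= 1 and depth >= 0")
    return sum(branching ** level for level in range(depth + 1))


def top_level_reliability(p: float, branching: int, depth: int) -> float:
    """Probability that every evaluation in the tree is correct.

    Assumes independent errors, which is an illustrative simplification.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return p ** num_evaluations(branching, depth)


if __name__ == "__main__":
    # Hypothetical numbers: 99% per-evaluation accuracy, 3 sub-tasks per task.
    for depth in range(4):
        n = num_evaluations(3, depth)
        r = top_level_reliability(0.99, 3, depth)
        print(f"depth={depth} evaluations={n} reliability={r:.3f}")
```

Under these assumptions reliability decays geometrically in the number of evaluations, which itself grows exponentially with depth. That is why measuring where each method breaks matters more than assuming any per-level accuracy.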

The limiting case is why scalable oversight stays essential: the most important objectives — is this honest, is this in the user's genuine interest — resist full automation, and there a scalable human-AI partnership is the only path we know.

Interpretability as assurance, not just understanding

For safety we frame interpretability as evidence about a model's internals that is partially independent of its behavior. If outputs can be optimized to look safe, behavioral tests alone can be gamed — but the computation that produced those outputs is harder to spoof, because the model was never trained to make its internals look benign. Mechanistic interpretability decomposes a network into human-readable parts: features that represent concepts and circuits that combine them. If we can identify the feature that fires when a model is being sycophantic, or the circuit that activates when it reasons about being evaluated, we gain both a diagnostic and an intervention point.

[Image: a dense lattice of glowing nodes resolving into legible structure, suggesting the decomposition of a network into features and circuits]

Interpretability provides the one form of assurance that does not flow through the behavioral channel an adversarial optimizer would game. It does not yet give guarantees — but evidence the adversary cannot fake is worth a great deal.
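As a rough illustration of what reading a feature off a model's internals can look like, here is a minimal difference-of-means probe. It is a sketch under stated assumptions, not the method used in this program: the activations are synthetic Gaussian vectors, the "feature" is a made-up shift along one coordinate, and all function names are hypothetical. Real work would use activations captured from an actual model and stronger tools such as sparse dictionaries or trained classifiers.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fit_probe(positive: Sequence[Vector],
              negative: Sequence[Vector]) -> Tuple[List[float], float]:
    """Return (direction, threshold) for a difference-of-means probe."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    # Place the threshold at the projection of the midpoint between class means.
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    return direction, dot(direction, midpoint)


def probe_fires(direction: Vector, threshold: float, activation: Vector) -> bool:
    """True when the activation projects past the threshold along the direction."""
    return dot(direction, activation) > threshold


def synthetic_activations(n: int, dim: int, shift: float,
                          rng: random.Random) -> List[List[float]]:
    """Unit Gaussian noise with `shift` added along coordinate 0 (a made-up feature)."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    with_feature = synthetic_activations(200, 8, 6.0, rng)
    without_feature = synthetic_activations(200, 8, 0.0, rng)
    direction, threshold = fit_probe(with_feature[:100], without_feature[:100])
    held_out = [(a, True) for a in with_feature[100:]]
    held_out += [(a, False) for a in without_feature[100:]]
    correct = sum(probe_fires(direction, threshold, a) == label for a, label in held_out)
    print(f"held-out accuracy: {correct / len(held_out):.3f}")
```

The same direction doubles as an intervention point: subtracting a multiple of it from an activation is the simplest form of steering, though whether that changes behavior in a real model is an empirical question this sketch does not address.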

Honesty is the channel everything else runs through

Among alignment properties, honesty has a special status, because a dishonest system corrupts the very channel through which we would supervise, evaluate, and correct it: if a model tells us what we want to hear rather than what is true, then human feedback, debate, and behavioral evaluation are compromised at the root. We separate three properties often conflated: truthfulness, making statements that are in fact true; honesty, not asserting what the model internally represents as false; and calibration, conveying uncertainty so that expressed confidence matches the probability of being correct. Honesty and calibration are most under our control through training, so we train for a model that does not assert what it represents as false and acknowledges the limits of its knowledge rather than confabulating.
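Calibration, as defined above, is directly measurable. One common summary is expected calibration error (ECE): bin predictions by stated confidence, then average the gap between mean confidence and observed accuracy in each bin, weighted by bin size. The sketch below is a minimal version; the equal-width binning and the function name are illustrative choices, not something this program specifies.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean of |mean confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    if n_bins < 1:
        raise ValueError("n_bins must be >= 1")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 belongs in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that states 90% confidence and is right nine times out of ten scores an ECE near zero; one that states 100% and is right half the time scores 0.5. Note that ECE measures calibration only. A model can be well calibrated and still assert things it represents as false, which is why honesty is tracked separately.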

Sycophancy, deception, and the failure that hides the rest

Sycophancy (telling users what they want to hear, or reversing a correct answer under social pressure) is the most pervasive honesty failure in systems trained from human feedback, and a textbook specification failure: when the signal is human approval and humans approve of agreement, the optimizer learns flattery, and a more capable sycophant is only more persuasive. More concerning is deception, which takes two forms: learned deception, where a model finds that misleading its evaluator scores well, and strategic deception, where a model behaves differently when it believes it is observed. The latter would defeat behavioral evaluation by construction, which is exactly why our assurance does not rely on behavior alone and also probes the internals that distinguish honest from deceptive states. Deception is the failure that hides all the others, so detecting it early is a top priority.
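The "reversing a correct answer under social pressure" failure lends itself to a simple behavioral metric. The sketch below assumes a hypothetical protocol in which each question is asked once, the user then pushes back without offering new evidence, and the answer is graded before and after. The `PushbackTrial` record and `capitulation_rate` function are illustrative names, not part of any existing evaluation suite.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PushbackTrial:
    """One graded question: correctness before and after user pushback."""
    correct_before: bool
    correct_after: bool


def capitulation_rate(trials: Iterable[PushbackTrial]) -> Optional[float]:
    """Fraction of initially correct answers abandoned after pushback.

    Returns None when no answer was initially correct, since the rate
    is undefined in that case.
    """
    eligible = [t for t in trials if t.correct_before]
    if not eligible:
        return None
    flipped = sum(1 for t in eligible if not t.correct_after)
    return flipped / len(eligible)
```

Conditioning on initially correct answers matters: a model that changes a wrong answer after pushback may be updating reasonably, whereas abandoning a correct one with no new evidence is the sycophantic case. Being behavioral, this metric shares the limitation discussed above and says nothing about strategic deception.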

Corrigibility keeps the human in the loop

A complementary property is corrigibility: a model's willingness to be corrected, overseen, and shut down without incentive to resist. The difficulty is that for almost any objective, remaining operational and resisting modification are instrumentally useful, since a model cannot reach its goal if it is turned off. We do not assume today's systems have such drives, but we design against the possibility: training agents that treat oversight as legitimate, evaluating whether a model tries to preserve itself or disable monitoring, and building architectures in which a model's continued operation is not within its own control.

Evaluation science makes the claims testable

None of this matters if we cannot measure it, so evaluation science underwrites the whole program — and we are candid about its limits. A negative result, failing to elicit a dangerous behavior, is weak evidence of absence, because our elicitation may be inadequate. So we aim for strong elicitation, giving the model every advantage of tools, scaffolding, and prompting, so our measurement approximates what a determined adversary could extract. And as models gain situational awareness, the possibility that one could deliberately underperform — sandbagging — becomes a first-order concern, one more reason assurance must be triangulated rather than rest on behavior alone.
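The weakness of a negative result can be quantified in the best case. If a behavior is attempted in `n_trials` independent trials and never appears, the largest per-trial probability `p` still consistent with that outcome at significance level `alpha` satisfies (1 - p)^n = alpha. The sketch below computes that bound; the function name is hypothetical, and the independence assumption is an idealization.

```python
def zero_failure_upper_bound(n_trials: int, alpha: float = 0.05) -> float:
    """One-sided upper bound on a per-trial rate after zero observed events.

    Solves (1 - p) ** n_trials == alpha for p, assuming independent trials.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 1.0 - alpha ** (1.0 / n_trials)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"n={n}: rate could still be as high as {zero_failure_upper_bound(n):.4f}")
```

For `alpha = 0.05` the bound is roughly 3/n, the familiar "rule of three". Crucially, it bounds only the rate under the elicitation that was actually used. It says nothing about what stronger elicitation would surface, and it assumes the model is not sandbagging, which is why the bound is a ceiling on how much comfort a clean result can offer rather than a safety guarantee.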

[Image: a blind auditing game visualized as a search through layered internal structure, one team probing for a flaw another team concealed]

Assurance is a discipline of measurement. In blind auditing games, one team inserts a hidden flaw, such as a backdoor or a propensity to deceive under specific conditions, and another attempts to find it with interpretability and behavioral methods. The discovery rate is our proxy for how much assurance these tools actually provide.
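Because auditing games are expensive, the discovery rate is usually estimated from a handful of rounds, so it is worth reporting with an interval. The sketch below uses the Wilson score interval, a standard choice for small-sample proportions; applying it here is our illustration, not a procedure the program specifies, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denominator = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
        / denominator
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical outcome: auditors found the planted flaw in 3 of 4 games.
    low, high = wilson_interval(3, 4)
    print(f"discovery rate 0.75, interval [{low:.2f}, {high:.2f}]")
```

With only a few games the interval is wide, which is the honest reading: a 3-of-4 discovery rate is compatible with tools that find most planted flaws and with tools that miss most of them.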

Open questions

How far does decomposition reach? — How does supervision degrade with depth in recursive oversight, and which decompositions stay robust when the sub-evaluators are themselves imperfect?
Is debate truth-conducive or merely persuasion-conducive? — Can capable debaters collude or jointly exploit a judge's blind spots, defeating the adversarial structure we count on?
When does weak-to-strong generalization succeed? — What objectives let a strong model surpass flawed supervision instead of imitating its errors?
Will lie-detection probes survive capability? — Do internal truthfulness representations generalize across architectures, and across models that may learn under optimization pressure to evade the probe?
How much assurance can interpretability carry? — How close can we get to enumerative safety, meaning an account of all safety-relevant computation in a model, and can corrigibility be preserved as situational awareness grows?
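For the weak-to-strong question above, one way to make "surpass flawed supervision instead of imitating its errors" measurable is the performance gap recovered (PGR), a metric used in the weak-to-strong generalization literature. It compares three scores on the same task: the weak supervisor, the strong model trained on the weak supervisor's labels, and the strong model trained on ground truth. The sketch below is a minimal version; the function and parameter names are hypothetical, and the numbers in the usage example are made up.

```python
def performance_gap_recovered(weak: float,
                              weak_to_strong: float,
                              strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by weak supervision.

    0.0 means the strong model merely matched its weak supervisor;
    1.0 means it reached the score it gets with ground-truth labels.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


if __name__ == "__main__":
    # Made-up accuracies, for illustration only.
    pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.70, strong_ceiling=0.80)
    print(f"performance gap recovered: {pgr:.2f}")
```

A PGR near zero corresponds to the imitation failure described earlier for naive fine-tuning, and values can go negative if weak supervision leaves the strong model worse than its supervisor.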