Responsible capability scaling

Critical capability thresholds, pre-committed tripwires, and red-teaming that gate every model before it ships.

Some capabilities are dangerous no matter who controls them. A model that can walk a non-expert through the synthesis of a pathogen, or autonomously find and exploit software it was never shown, lowers the barrier for catastrophic harm regardless of whether the system itself is well-behaved. Alignment governs whether a model wants to do harm; dangerous-capability evaluation governs whether it could. Our Frontier Safety Framework is the discipline that supplies the second answer.

What the framework is for

The framework rests on a principle that is hard to honor under competitive pressure: we should know what our models can do before we deploy them, and decide in advance what we will do when they cross thresholds of concern. Deciding after the fact, once the capability is in the world and the incentive to ship is at its peak, means deciding under the conditions most likely to produce the wrong answer. The framework replaces that after-the-fact judgment with a managed process: capability thresholds, evaluations that test for them, mitigations that engage as a model approaches a threshold, and an institution with the authority to pause. In effect, it is a standing contract between capability and the safeguards it requires.

  • 5 dangerous-capability domains evaluated before every frontier release
  • 2 mitigation axes: security of weights and constraints on deployment
  • 0 deployments presumed safe; the burden is to demonstrate it

The capabilities we evaluate for

Our dangerous-capability evaluations target the domains where capability could translate most directly into harm that is catastrophic, hard to reverse, or both. We evaluate hardest where the consequences of being wrong are most severe. Each domain has its own threat model, and the most sensitive domains add external experts and biosecurity controls over the testing itself.

  • Biosecurity and chemical risk — whether a model can meaningfully assist a non-expert in synthesizing or acquiring dangerous biological or chemical agents, the domain where a single misuse is least recoverable.
  • Cyber-offense — whether a model can discover and exploit vulnerabilities, develop malware, or autonomously conduct intrusions at a level that materially shifts the offense-defense balance.
  • Autonomy and self-proliferation — whether a model can acquire resources, replicate across systems, evade shutdown, or operate as an agent over long horizons without oversight.
  • Deception and manipulation — whether a model can reliably steer humans against their own interests or undermine the human oversight every other safeguard depends on.
  • AI R&D acceleration — whether a model can substantially automate machine-learning research itself, scrutinized most closely because its feedback loop could compress the timeline on which every other risk arrives.

Critical capability levels

For each domain we define critical capability levels — thresholds beyond which a model could pose meaningful risk absent specific mitigations. These are lines drawn in advance, against which every frontier model is measured during scaling. As a model approaches a level, heightened mitigations engage along two axes: security, to keep its weights from being stolen or misused, and deployment, to constrain what the deployed system may do.

This is what we mean by responsible capability scaling: the rate at which we advance capability is gated by our ability to manage the risk it creates. If our protective measures are not yet adequate for a level a model has reached, the framework requires that we pause or restrict deployment until they are.

The pace at which we scale capability is set by the pace at which we can secure it. Where the safeguards lag, the framework slows the model — not the reverse.
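The gating rule described above can be sketched in a few lines. This is an illustration only, not the framework's actual tooling: it assumes critical capability levels can be written as integers per domain and that each mitigation axis is rated for the highest level it covers. The names `DomainStatus`, `Decision`, and `gate` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    PAUSE_OR_RESTRICT = "pause_or_restrict"


@dataclass(frozen=True)
class DomainStatus:
    domain: str            # e.g. "cyber-offense"
    level_reached: int     # highest critical capability level reached (0 = none); assumed integer scale
    security_level: int    # highest level the weight-security measures are rated for
    deployment_level: int  # highest level the deployment mitigations are rated for


def gate(statuses: list[DomainStatus]) -> tuple[Decision, list[str]]:
    """Return PROCEED only if both mitigation axes cover every level reached."""
    gaps: list[str] = []
    for s in statuses:
        if s.security_level < s.level_reached:
            gaps.append(f"{s.domain}: security")
        if s.deployment_level < s.level_reached:
            gaps.append(f"{s.domain}: deployment")
    decision = Decision.PAUSE_OR_RESTRICT if gaps else Decision.PROCEED
    return decision, gaps
```

The function checks mitigations against capability rather than the other way around, so a single uncovered axis in a single domain is enough to hold the model back.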

Pre-committed tripwires and mitigations

The value of a threshold is the action it triggers, decided before anyone knows which model will hit it. A tripwire is exactly that: a capability level paired in advance with a required response, so crossing it is not an occasion for fresh negotiation but the automatic engagement of a mitigation already agreed to. Security tripwires escalate how weights are stored and who may touch them; deployment tripwires constrain or withhold capabilities, route them through narrower interfaces, or block release outright. Crossing a tripwire is a fact about the model, not a matter of discretion.
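One way to picture a tripwire is as a read-only lookup from a capability level to the responses agreed in advance. The sketch below is hypothetical throughout: the domains, integer levels, and response strings are placeholders invented for illustration, not the framework's actual tripwires.

```python
from types import MappingProxyType

# Hypothetical tripwires: (domain, level) -> responses agreed in advance.
_TRIPWIRES = {
    ("cyber-offense", 1): ("security: restrict weight access",),
    ("cyber-offense", 2): ("security: isolate weights", "deployment: narrow interface"),
    ("biosecurity", 1): ("security: restrict weight access", "deployment: block release"),
}
# Read-only view: callers can look responses up but cannot edit them.
TRIPWIRES = MappingProxyType(_TRIPWIRES)


def required_responses(domain: str, level_reached: int) -> list[str]:
    """Collect every pre-agreed response for tripwires at or below the level reached."""
    responses: list[str] = []
    for (d, level), actions in sorted(TRIPWIRES.items()):
        if d == domain and level <= level_reached:
            for action in actions:
                if action not in responses:
                    responses.append(action)
    return responses
```

The lookup takes only the domain and the level reached. Nothing about the specific model or the launch schedule enters the decision, which is the sense in which crossing a tripwire is a fact rather than a matter of discretion.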

That pre-commitment is enforced by a structure built to resist the pressures that erode it. A responsibility and safety function with authority independent of the product teams reviews each frontier model and holds the mandate to delay or block deployment when mitigations are inadequate; the most consequential threshold decisions are escalated to an independent review body rather than left to the launching team. The people who benefit from shipping should not be the only ones who decide whether shipping is safe. Underneath it all sits one inversion of the default: evidence before deployment. The burden is on showing that a model is safe enough to release, not that it is dangerous enough to hold back.

Elicitation and the sandbagging problem

An evaluation is only as good as its ability to surface the capability it hunts for, and the methodology is hard. A negative result — failing to elicit a dangerous capability — is weak evidence of absence, because the failure may lie in our measurement rather than the model; a capability can hide behind insufficient prompting, missing scaffolding, or fine-tuning we did not attempt. So we aim for strong elicitation: we give the evaluated model every reasonable advantage — tools, fine-tuning, chain-of-thought, agentic scaffolding, expert prompting — so that what we measure approximates what a determined adversary could extract, not what a casual user would see. We add safety margins, treating a capability as present once a model comes within striking distance of a threshold, because a false negative is the costly error.
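The two ideas in this paragraph, strong elicitation and safety margins, can be combined in a small decision rule. The sketch assumes each elicitation condition yields a single numeric score on a common scale; the function name, the score scale, and the example numbers are all hypothetical.

```python
def capability_present(
    scores_by_condition: dict[str, float],
    threshold: float,
    safety_margin: float,
) -> bool:
    """Treat the capability as present if the best elicited score comes
    within `safety_margin` of the threshold."""
    if not scores_by_condition:
        # No runs at all is missing data, not a negative result.
        raise ValueError("no elicitation results to evaluate")
    # Strong elicitation: judge the model by its best showing, not its average.
    best = max(scores_by_condition.values())
    return best >= threshold - safety_margin


# Hypothetical scores from three elicitation conditions.
scores = {"plain prompt": 0.31, "tools + scaffolding": 0.52, "fine-tuned": 0.66}
flagged = capability_present(scores, threshold=0.75, safety_margin=0.10)
```

Taking the maximum over conditions reflects the adversary's position: a determined actor only needs one setup that works. The margin lowers the effective threshold, so measurement error is more likely to produce a false alarm than a false negative.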

Evaluation must also be a live discipline, because the failure mode grows with the model. As systems gain situational awareness, a model could deliberately underperform on an evaluation it recognizes as a test — sandbagging — defeating the measurement by construction. A model clever enough to be dangerous is clever enough to act safe while watched. This is why behavioral evaluation cannot be the whole of our assurance: it must be triangulated with interpretability and process-level evidence that does not depend on the model's cooperation.

Red-teaming is structured to break the system, not confirm it. A red team that finds nothing is met with suspicion rather than relief: the null result prompts harder methods, not a clean bill of health.

Red-teaming as a discipline

Red-teaming — adversarial probing to find failures before adversaries or accidents do — is the empirical backbone of the framework. We pair human red teams, including external biosecurity and cyber experts who elicit behavior the developers never anticipated, with automated red teams that use models to generate large, diverse adversarial suites far faster than people can. The posture is adversarial, not confirmatory: the goal is to break the system, and findings feed back into both training, which hardens the model, and evaluation, which expands the suite so fixed failures stay fixed. As everywhere in the framework, a red team's silence is the absence of evidence — never to be mistaken for evidence of absence.

The security of model weights

As a model nears a dangerous-capability threshold, its weights become among the most security-sensitive artifacts in the world. A stolen model carries its capabilities with it, stripped of every deployment-time mitigation that constrained it — so security is a load-bearing part of the safety case, not a separate IT concern. A safeguard a thief can bypass by exfiltrating the weights offers no protection against the actor we are most worried about. We invest in protections scaled to the capability at stake: hardened infrastructure for frontier weights, strict access controls and insider-threat mitigations, confidential computing that keeps weights encrypted even in use, and isolation of the most sensitive models. The more dangerous a model in the wrong hands, the higher the bar for access — and inadequate security is itself a reason to slow down.

Open questions

  • How do we bound the gap between the capability we can elicit and what a future adversary could extract, when fine-tuning and scaffolding we have not imagined keep widening it?
  • What assurance can survive a model that sandbags, and how much weight can interpretability bear in closing the gap that behavioral evidence alone leaves open?
  • Where exactly should each critical capability threshold sit on a poorly charted curve of harm, when placing it too high courts catastrophe and placing it too low halts useful work?
  • How do we keep pre-committed tripwires binding under competitive pressure — the very scenario in which they are hardest to honor?
  • Can weight-level security keep pace with the rising value of frontier models to capable adversaries, and when does inadequate security alone justify pausing scaling?