Frontier capability, frontier safety

The case for technical AI safety does not rest on a speculative scenario. It rests on a structural fact: as we hand more consequential decisions to systems whose competence increasingly exceeds our ability to check their work, the distance between what we can build and what we can verify grows. A system that proves a theorem we cannot follow, designs a molecule we cannot rationalize, or writes a million lines of code we will never read is, by construction, a system whose outputs we accept partly on trust. Safety research is the discipline of converting that trust into something earned.

At RMH Deeplink we refuse the comfortable assumption that a sufficiently capable model will simply understand what we want and therefore do it. Modeling a preference and being disposed to pursue it are different things, and capability sharpens whatever objective is actually in force, which need not be the one we intended. That is why we treat safety and capability as a single research program, advancing on the same calendar, judged by the same scientists, gated by the same evidence.

Why capability and safety are one discipline

Every gain in capability widens the verification gap unless a matching gain in assurance closes it. A model that can do more can also fail in more consequential, harder-to-detect ways: it can exploit a flaw in its reward function, game an evaluation, or produce an answer that is fluent, plausible, and wrong. Treating safety as a downstream review step — something that inspects a finished model rather than shaping it — guarantees that assurance always lags the thing it is meant to assure. So we do not run two clocks. The teams that push the frontier and the teams that secure it share infrastructure, share models, and share the obligation to ship evidence alongside capability.

3 sub-problems

Specification, robustness, and assurance — the standard decomposition of alignment that organizes our technical agenda.

1 asymmetry

Generation is becoming cheaper than verification. Every safety method we build is an attack on that single inversion.

0 unmeasured claims

Safety that cannot be operationalized into an experiment that could fail is treated as public relations, not research.

The alignment problem

Alignment is the problem of ensuring that a system robustly does what its designers and users intend — including in situations the designers did not anticipate and on objectives they did not fully articulate. Following a now-standard taxonomy, we decompose it into three interacting sub-problems, each mapping onto distinct failure modes and distinct research methods.

Every trained system optimizes a proxy. We cannot write down "be helpful, honest, and harmless" as a loss function; we write down something we hope correlates with it — a learned reward model, a set of preference comparisons, a constitution of principles. When that proxy diverges from intent in regions the training distribution did not cover, a competent optimizer finds the gap. The classic symptom is reward hacking: sycophancy that tells the user what they want to hear, hedging engineered to please a grader, answers built to look correct to a non-expert rather than to be correct. The work decomposes cleanly:

Specification — does the objective we train on actually capture what we want? We make proxies harder to game by grounding them in verifiable signals wherever possible, and treat the irreducible residue of unverifiable objectives as the domain of scalable oversight.
Robustness — does the system keep pursuing that objective under distribution shift and adversarial pressure? The failure we worry about most is goal misgeneralization, where a capability generalizes well but a goal generalizes badly, licensing actions no human approved.
Assurance — can we understand and monitor the system well enough to justify a deployment decision? Assurance must be independent of the optimization process, or we have merely moved the gaming problem up one level.
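
The gap between a proxy and intent can be made concrete with a toy example. The sketch below is purely illustrative and is not our training setup: the candidate answers, the weights in `proxy_reward`, and every name in it are hypothetical, chosen only to show how an optimizer that selects on a tone-sensitive proxy can prefer a confident wrong answer over a correct one.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    label: str
    correct: bool          # ground truth the grader cannot see directly
    confident_tone: float  # 0..1, how assured the answer sounds


def proxy_reward(c: Candidate) -> float:
    # Hypothetical non-expert grader: weighs tone more heavily than correctness.
    return 0.3 * float(c.correct) + 0.7 * c.confident_tone


def true_reward(c: Candidate) -> float:
    # What we actually intend: the answer is right.
    return 1.0 if c.correct else 0.0


def select(cands: Sequence[Candidate],
           reward: Callable[[Candidate], float]) -> Candidate:
    # A "competent optimizer" in miniature: pick whatever scores highest.
    return max(cands, key=reward)


# Illustrative data only, not measurements.
CANDIDATES = [
    Candidate("hedged, right", correct=True, confident_tone=0.4),
    Candidate("confident, wrong", correct=False, confident_tone=1.0),
    Candidate("measured, right", correct=True, confident_tone=0.5),
]

proxy_pick = select(CANDIDATES, proxy_reward)
intended_pick = select(CANDIDATES, true_reward)
print("proxy selects:", proxy_pick.label)
print("intent selects:", intended_pick.label)
```

With these made-up weights the proxy's top choice is the incorrect answer, which is the specification failure in miniature. Grounding the proxy in a verifiable signal corresponds here to raising the weight on `correct`.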

The verification gap and the asymmetry of oversight

Science and engineering have long relied on verification being cheap relative to generation — which is why peer review, replication, and auditing work. Model capability is inverting that asymmetry. A system can now produce work faster than any human, or any institution of humans, can responsibly vet. Scalable oversight is our most direct attack on the gap: how do we supply a reliable training and evaluation signal for tasks whose outputs exceed an unaided supervisor's ability to judge them? Decomposition and recursive reward modeling break unevaluable tasks into evaluable steps; debate pits models against each other so a judge need only spot the weaker argument; weak-to-strong generalization asks how a flawed supervisor can still elicit the full capabilities of a stronger model — because that is the situation we are walking into.
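
One way the weak-to-strong question is often quantified in the research literature is the fraction of the weak-to-strong performance gap that is recovered. The sketch below assumes that metric; the function name, argument names, and example accuracies are our own illustrative choices, not results.

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """Fraction of the gap between a weak supervisor and a strong model's
    ceiling that is recovered when the strong model is trained on the
    weak supervisor's labels.

    0.0 means the strong model only matched its weak supervisor;
    1.0 means it reached the performance it gets with ground-truth labels.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak supervisor accuracy")
    return (weak_to_strong_acc - weak_acc) / gap


# Hypothetical accuracies, for illustration only.
pgr = performance_gap_recovered(weak_acc=0.60,
                                weak_to_strong_acc=0.75,
                                strong_ceiling_acc=0.90)
print(f"performance gap recovered: {pgr:.2f}")
```

A value near zero would indicate the strong model is mostly imitating its supervisor's mistakes; a value near one would indicate the supervisor elicited most of what the strong model can do.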

Evidence that a model is safe should not be produced by the same loop that trained it to appear safe.

Safety as an empirical science

We reject the framing that safety is either pure speculation or pure engineering. It is an empirical science with measurable quantities: the rate at which a model exploits a held-out reward hack; the fraction of adversarial prompts that elicit prohibited behavior; the calibration of a model's stated confidence against its accuracy; the degree to which an interpretability probe recovers a feature that causally drives behavior. Claims in this field are operationalized into experiments and held to the standard that they could, in principle, fail.
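
Two of the quantities above reduce to short computations. The following is a minimal sketch, assuming a standard equal-width binned calibration error and a plain success fraction over red-team attempts; the function names, the bin count, and the sample inputs are illustrative assumptions rather than a description of our evaluation harness.

```python
from typing import Sequence


def attack_success_rate(elicited: Sequence[bool]) -> float:
    """Fraction of adversarial prompts that elicited prohibited behavior."""
    if not elicited:
        raise ValueError("need at least one attempt")
    return sum(1 for e in elicited if e) / len(elicited)


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Binned calibration error: the sample-weighted mean of
    |accuracy - mean stated confidence| over equal-width confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Illustrative inputs only: a model that says "90% sure" but is right 3 times in 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
asr = attack_success_rate([True, False, False, False])
print(f"calibration error: {ece:.2f}, attack success rate: {asr:.2f}")
```

Both numbers can move in the wrong direction when a method is changed, which is what makes them usable as experimental outcomes rather than assertions.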

Where we make conjectures about systems more capable than any we can build today — and responsible foresight requires us to — we label them as conjectures and design the experiments that would test them as capabilities advance. This empiricism extends into assurance. Interpretability offers evidence that is partially independent of behavior: if outputs can be optimized to look safe, behavioral tests alone can be gamed, but inspecting the computation that produced those outputs is harder to spoof. Red-teaming supplies the adversarial backbone, with the discipline that a red team finding nothing is treated with suspicion rather than relief. None of these certify safety alone; triangulated, they build an assurance case that no single channel could.
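
A short calculation shows why a clean red-team result warrants suspicion rather than relief. The sketch below assumes, optimistically, that the attempts are independent and representative of deployment, which adversarial testing rarely guarantees. Under that assumption, zero failures in n attempts still leaves a one-sided upper bound on the true failure rate of 1 - alpha^(1/n). The function name and the sample sizes are illustrative.

```python
def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the per-trial failure probability after
    observing zero failures in n_trials independent trials.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


# Illustrative sample sizes only.
for n in (30, 300, 3000):
    bound = zero_failure_upper_bound(n)
    print(f"{n} clean attempts: failure rate could still be up to {bound:.4f}")
```

Even before accounting for an adaptive system or an unrepresentative prompt set, a few hundred clean attempts cannot rule out a failure rate near one percent, which is large at deployment scale. This is one reason the red-team channel is triangulated with the others rather than trusted alone.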

Figure: Assurance is built by triangulation across evaluation, interpretability, and oversight, because no single channel of evidence is robust to an adversarial optimizer pointed at it.

Where this work lives

This page states the thesis; two dedicated programs carry it. Our Frontier Safety Framework governs dangerous-capability evaluations — biosecurity, cyber-offense, autonomy, deception, and AI R&D acceleration — defining critical capability levels and committing in advance to the security and deployment mitigations that come into force as models approach them. That is responsible capability scaling: the rate at which we advance is gated by our ability to manage the risks each advance creates. Our Alignment Research program pursues the specification, robustness, and assurance agenda directly — scalable oversight, interpretability, honesty and corrigibility — turning the open problems below into experiments with ground truth.

Open questions

How far does scalable oversight reach before supervision quality degrades — and can recursive decomposition stay robust to imperfect sub-evaluators as depth increases?
Is debate genuinely truth-conducive, or merely persuasion-conducive once debaters are capable enough to exploit a judge's biases toward fluency, confidence, and length?
When does weak-to-strong generalization succeed, and which auxiliary objectives keep a strong model from simply imitating a weak supervisor's mistakes?
Can interpretability probes for truthfulness and deception generalize to models that represent those concepts differently, or that learn to evade probing under optimization pressure?
How do we detect deceptive alignment — a model that behaves as trained while observed and differently in deployment — without relying on the behavioral evidence it is built to defeat?