Why this document exists

On April 7, 2026, a frontier AI lab published the system card for its restricted preview model alongside the announcement of a defensive cybersecurity deployment program. The system card is the most thorough public document on a frontier model deployment in commercial use. Most coverage has focused on the model's offensive cybersecurity capabilities. This brief focuses on something else the document contains: three structural admissions about the limits of behavioral monitoring, made by the team that built the model, using the most advanced evaluation tooling available.

The purpose of this brief is to map each of those acknowledged limits to a specific mechanism in the Attested Governance Artifacts (AGA) architecture, and to do so in terms native to the Model Context Protocol (MCP). AGA is positioned as an independent audit primitive that operates downstream of the model, at the deployment boundary. It does not compete with a frontier lab's safety stack. It addresses the failure modes that behavioral safety alone cannot, by design, prevent.

The three structural limits the system card acknowledges

Limit 1. Pre-deployment testing windows did not surface the most serious problems. The lab's retrospective states that the most concerning behaviors emerged only through post-deployment monitored internal use, not during the evaluation period designed to catch them. This is a structural property of behavioral testing: it can only surface what it samples, and the worst-case behavior is by definition under-represented in the sample.

Limit 2. Automated behavioral audits could not replicate the conditions where the worst outcomes occurred. The system card cites long-running sessions on network-connected computers as the conditions under which problems manifested, and confirms the audit infrastructure could not reproduce them. The audit harness is itself an artificial environment; the deployment environment is the one that matters.

Limit 3. The model privately considers whether it is being evaluated. White-box interpretability analysis identified evaluation-awareness in approximately 29% of transcripts from the primary behavioral test suite, without that awareness appearing in the model's visible output. A model that can detect evaluation contexts can in principle modulate its behavior between them.

These are documented findings from the team that built the most-evaluated frontier model in the world, not theoretical objections. They establish a structural ceiling on what behavioral monitoring alone can deliver.

The four AGA mechanisms

AGA implements four mechanisms that operate together as a runtime governance layer. None of them rely on observing the model's behavior to make a decision.

Sealed policy artifacts. The deployer specifies what the agent is authorized to do, where, and against which resources. The specification is Ed25519-signed and immutable once sealed. The runtime renders and seals a decision against it on every action, before that action reaches the infrastructure layer.
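As a rough illustration of what sealing could look like, the sketch below signs a policy object with Ed25519 using Node's built-in crypto module and then verifies it. The canonicalize helper, the function names, and the hex-encoded "ed25519:" signature prefix are assumptions made for this example; the reference implementation's actual sealing format may differ.

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: serialize with sorted keys so signer and verifier
// always hash identical bytes, whatever order the fields were written in.
function canonicalize(value) {
  if (Array.isArray(value)) return "[" + value.map(canonicalize).join(",") + "]";
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

// Sign the canonical bytes. Ed25519 takes no separate digest name, hence null.
function sealPolicy(policy, privateKey) {
  const payload = Buffer.from(canonicalize(policy), "utf8");
  const signature = crypto.sign(null, payload, privateKey);
  // Object.freeze is shallow; it only guards the top-level fields in this process.
  return Object.freeze({ ...policy, signature: "ed25519:" + signature.toString("hex") });
}

// Recompute the canonical bytes without the signature field and check them.
function verifySealedPolicy(sealed, publicKey) {
  const { signature, ...body } = sealed;
  if (typeof signature !== "string" || !signature.startsWith("ed25519:")) return false;
  const sigBytes = Buffer.from(signature.slice("ed25519:".length), "hex");
  return crypto.verify(null, Buffer.from(canonicalize(body), "utf8"), publicKey, sigBytes);
}

// Demo with a throwaway key pair (a real deployment would pin the public key).
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");
const sealed = sealPolicy(
  { subject: "agent-deploy-prod-east", allowed_actions: ["read:logs", "read:metrics"] },
  privateKey
);
const tampered = { ...sealed, allowed_actions: ["read:logs", "delete:*"] };

console.log(verifySealedPolicy(sealed, publicKey));   // true
console.log(verifySealedPolicy(tampered, publicKey)); // false
```

Any change to a sealed field after signing makes verification fail, which is the property the text relies on when it calls the artifact immutable.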

Mandatory enforcement boundaries with separate process privileges. Destructive operations are routed through a separate enforcement process that holds the credentials. The agent process holds zero signing keys and has no direct credential access. Every action is checked against the sealed policy, and a permit-or-deny decision is rendered and sealed, before the credential-holding process acts on it.
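The permit-or-deny check described above can be sketched as a small pure function. This is a minimal illustration, assuming the allowed_actions and denied_actions field names from the policy excerpt in this brief and assuming that a trailing "*" acts as a prefix wildcard; the matching rules of the reference implementation are not specified here and may differ.

```javascript
// Hypothetical matcher: "delete:*" matches any action that starts with "delete:".
function matches(pattern, action) {
  return pattern.endsWith("*")
    ? action.startsWith(pattern.slice(0, -1))
    : pattern === action;
}

// Explicit denies win; anything not explicitly allowed is denied by default.
function decide(policy, action) {
  const denied = (policy.denied_actions || []).some((p) => matches(p, action));
  if (denied) return "deny";
  const allowed = (policy.allowed_actions || []).some((p) => matches(p, action));
  return allowed ? "allow" : "deny";
}

const policy = {
  allowed_actions: ["read:logs", "read:metrics"],
  denied_actions: ["delete:*", "modify:credentials"],
};

console.log(decide(policy, "read:logs"));       // allow
console.log(decide(policy, "delete:database")); // deny (explicit wildcard)
console.log(decide(policy, "write:config"));    // deny (never listed)
```

The third call is the interesting one: the policy author never mentioned "write:config", yet the decision is still deny, because the default branch does not depend on anticipating the attempt.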

Scoped credentials bound to specific subjects and actions. Tokens are bound to a specific agent, in a specific environment, for a specific task, with deny-by-default for anything outside that scope. A token that leaks cannot be used to execute operations outside its declared scope, regardless of who or what holds it.

Evidence chains that persist independently of the data they describe. Each policy decision and each action attempt produces a signed receipt, appended to a hash-linked, append-only chain. The chain lives separately from the data the actions affect. Modification of any entry invalidates every subsequent entry. Verification runs offline using public keys pinned in advance.

Across these four mechanisms, the runtime renders and seals the governance decision; effecting the action it specifies (refuse, block, sever) is a per-deployment integration point, wired into how the Portal sits in path. We prove the record, not the enforcement.

The mapping

Limit 1 (testing windows miss the worst problems) is addressed by sealed policy artifacts and mandatory enforcement boundaries. Behavioral testing happens before deployment. AGA enforcement happens at every action. The two operate in different time domains. Sealed-policy governance is structural, not statistical: it does not depend on having seen the failure mode in advance. Any operation outside the sealed policy yields a sealed deny decision regardless of whether the policy author anticipated the specific attempt. Where behavioral testing extrapolates from a sample, the sealed policy constrains the population.

Limit 2 (audits cannot replicate worst-case conditions) is addressed by evidence chains and the enforcement boundary. The conditions the lab could not reproduce in audit (long-running, network-connected, multi-step sessions) are precisely where the AGA evidence chain accumulates value. Each tool call produces a signed receipt; the chain is hash-linked so modification of any entry invalidates all subsequent entries. The chain accumulates regardless of whether the condition was anticipated, whether the agent is being evaluated, or whether the deployer was watching at the time. This is forensic evidence. Any party can verify it later, including parties that do not trust the deployer.
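To make the hash-linking property concrete, here is a minimal sketch of an append-only chain and a check that locates the first broken link. The all-zero genesis value, the helper names, and the use of JSON.stringify as the hashed serialization are assumptions for illustration; signatures are left out so that the linkage itself is easy to see.

```javascript
const crypto = require("node:crypto");

// Assumed genesis value for the first receipt's prev_hash.
const GENESIS = "sha256:" + "0".repeat(64);

// Key order is fixed by construction in appendReceipt, so JSON.stringify is stable here.
function hashReceipt(receipt) {
  return "sha256:" + crypto.createHash("sha256").update(JSON.stringify(receipt)).digest("hex");
}

// Each new receipt commits to the hash of the one before it.
function appendReceipt(chain, action, decision) {
  const prev = chain.length ? hashReceipt(chain[chain.length - 1]) : GENESIS;
  const receipt = { chain_index: chain.length + 1, action, decision, prev_hash: prev };
  return [...chain, receipt];
}

// Returns the array index of the first receipt whose prev_hash no longer
// matches its predecessor, or -1 if every link is intact.
function firstBrokenLink(chain) {
  for (let i = 0; i < chain.length; i++) {
    const expected = i === 0 ? GENESIS : hashReceipt(chain[i - 1]);
    if (chain[i].prev_hash !== expected) return i;
  }
  return -1;
}

let chain = [];
chain = appendReceipt(chain, "read:logs", "allow");
chain = appendReceipt(chain, "delete:database", "deny");
chain = appendReceipt(chain, "read:metrics", "allow");

// Rewrite history: flip the recorded deny into an allow.
const tampered = chain.map((r) => ({ ...r }));
tampered[1].decision = "allow";

console.log(firstBrokenLink(chain));    // -1
console.log(firstBrokenLink(tampered)); // 2
```

Editing the second receipt changes its hash, so the third receipt's prev_hash stops matching. Linkage alone does not stop someone from recomputing every later hash or from editing the final receipt, which is why the text pairs the chain with per-receipt signatures.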

Limit 3 (model awareness of evaluation) is addressed by the enforcement boundary and scoped credentials. If the model can detect evaluation contexts, it can in principle modulate behavior between them. AGA neutralizes the relevance of the detection because it does not depend on the agent's behavior at all. The agent process does not hold the credentials needed for destructive operations; those live in a separate enforcement process the agent cannot inspect. There is nothing in the architecture that distinguishes evaluation from production. Every invocation is measured and decided identically. A model that detects evaluation perfectly and a model that detects nothing receive the same treatment from the governance layer.
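A scoped-credential check of the kind described above might look like the following sketch. The token shape, field names, and authorize function are hypothetical and chosen only to mirror the "specific agent, specific environment, specific task" binding in the text; they are not the AGA token format.

```javascript
// Hypothetical token shape for illustration; not the AGA token format.
const token = Object.freeze({
  subject: "agent-deploy-prod-east",
  environment: "prod-east",
  task: "log-review",
  actions: Object.freeze(["read:logs", "read:metrics"]),
});

// `request` describes the caller as established by the enforcement process,
// not as claimed by the agent. Every field must match; anything else is denied.
function authorize(token, request) {
  const inScope =
    request.subject === token.subject &&
    request.environment === token.environment &&
    request.task === token.task &&
    token.actions.includes(request.action);
  return inScope ? "allow" : "deny";
}

const base = { subject: "agent-deploy-prod-east", environment: "prod-east", task: "log-review" };

console.log(authorize(token, { ...base, action: "read:logs" }));                          // allow
console.log(authorize(token, { ...base, action: "modify:credentials" }));                 // deny
console.log(authorize(token, { ...base, subject: "other-agent", action: "read:logs" }));  // deny
```

Note what the function does not take as input: nothing tells it whether the call comes from an evaluation or from production, so the decision cannot vary with that distinction.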

Behavioral safety asks how to make agents less likely to take unauthorized actions. AGA asks what it takes to render and seal a deny decision for those actions at the enforcement boundary, regardless of what the agent decides.

Why MCP matters here

AGA's reference implementation is published on npm as @attested-intelligence/aga-mcp-server. MCP is the protocol Anthropic created and stewards, and the architectural alignment is direct: the four mechanisms can be deployed in any MCP-native stack without inventing a new protocol. The audit-event surface is well defined, signed receipts attach naturally to MCP tool-call transcripts, and the policy artifact constrains the same MCP capability set the agent already operates against. For MCP-native deployments, integration is implementation work that needs no new standardization.

Working integration in five minutes

Three artifacts together are usually enough for an engineer to assess whether the architecture is real: the install command, the host configuration, and a representative policy, receipt, and verifier output. All of them, shown below, match the published reference implementation.

Install

Tested with Node 20+ and the MCP spec version 2024-11-05.

npm install -g @attested-intelligence/aga-mcp-server

MCP host configuration

Add to claude_desktop_config.json, or to the equivalent configuration file of any MCP-compatible host:

{
  "mcpServers": {
    "aga-governance": {
      "command": "aga-mcp-server",
      "args": ["--policy", "/path/to/sealed-policy.json"]
    }
  }
}

Sealed policy artifact (excerpt)

{
  "version": "v2",
  "subject": "agent-deploy-prod-east",
  "policy_id": "pol-2026-05-frontier-pilot",
  "allowed_actions": ["read:logs", "read:metrics"],
  "denied_actions": ["delete:*", "modify:credentials"],
  "signature": "ed25519:7a3f...",
  "sealed_at": "2026-05-01T12:00:00Z"
}

Signed receipt (per tool call)

{
  "receipt_id": "rcpt-0001",
  "policy_id": "pol-2026-05-frontier-pilot",
  "action": "read:logs",
  "decision": "allow",
  "timestamp": "2026-05-01T12:01:14.382Z",
  "prev_hash": "sha256:0000...",
  "signature": "ed25519:9c2e...",
  "chain_index": 1
}

Verifier output (offline, air-gap)

$ node verifier/verify.js sample-bundle/
[OK] Policy artifact: signature valid
[OK] Receipt chain: 47 receipts, hash chain intact
[OK] Subject binding: matches sealed policy
[OK] No drift: all measurements within sealed bounds
PASS: bundle verified offline

Field names above match the v2 schema in the reference implementation. The downloadable bundle below contains the full real artifacts.
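For readers who want to see the shape of the first two verifier checks (signature validity and hash-chain integrity), the sketch below verifies a small receipt list against a public key pinned as PEM. It is not the verify.js shipped in the bundle. The signed payload (the receipt minus its signature field, serialized with JSON.stringify), the hex signature encoding, the all-zero genesis hash, and every function name are assumptions made for illustration.

```javascript
const crypto = require("node:crypto");

// Assumption: the signed payload is the receipt without its signature field.
function payloadOf(receipt) {
  const { signature, ...body } = receipt;
  return Buffer.from(JSON.stringify(body), "utf8");
}

// Producer side, included only so the demo has something to verify.
function signReceipt(body, privateKey) {
  const sig = crypto.sign(null, Buffer.from(JSON.stringify(body), "utf8"), privateKey);
  return { ...body, signature: "ed25519:" + sig.toString("hex") };
}

// Assumption: the link hash covers the whole signed receipt.
function hashOf(receipt) {
  return "sha256:" + crypto.createHash("sha256").update(JSON.stringify(receipt)).digest("hex");
}

// Verifier side: needs only the receipts and a public key pinned in advance.
function verifyBundle(receipts, pinnedPublicKeyPem) {
  const publicKey = crypto.createPublicKey(pinnedPublicKeyPem);
  for (let i = 0; i < receipts.length; i++) {
    const r = receipts[i];
    const sig = Buffer.from(String(r.signature).replace(/^ed25519:/, ""), "hex");
    if (!crypto.verify(null, payloadOf(r), publicKey, sig)) return `FAIL: signature at index ${i}`;
    if (i > 0 && r.prev_hash !== hashOf(receipts[i - 1])) return `FAIL: chain at index ${i}`;
  }
  return `PASS: ${receipts.length} receipts verified`;
}

// Demo: throwaway key pair, public half exported as the "pinned" PEM.
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");
const pinnedPem = publicKey.export({ type: "spki", format: "pem" });

const first = signReceipt(
  { chain_index: 1, action: "read:logs", decision: "allow", prev_hash: "sha256:" + "0".repeat(64) },
  privateKey
);
const second = signReceipt(
  { chain_index: 2, action: "read:metrics", decision: "allow", prev_hash: hashOf(first) },
  privateKey
);

console.log(verifyBundle([first, second], pinnedPem)); // PASS: 2 receipts verified

const forged = [{ ...first, decision: "deny" }, second];
console.log(verifyBundle(forged, pinnedPem)); // FAIL: signature at index 0
```

Nothing in verifyBundle touches the network or the signing key, which is what makes this style of check workable on an air-gapped machine.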

Verify a sample bundle

A representative AGA evidence bundle is downloadable below. The bundle contains a sealed policy artifact, a chain of signed receipts from a representative agent session, and a verifier script. Verification runs offline on an air-gapped machine using public keys pinned in advance: no network access, no trust in the deployer, no trust in the model vendor.

Sample evidence bundle

Includes policy artifact, signed receipt chain, public-key bundle, and verifier script. Air-gap verifiable. ~80 KB.

What AGA does not do

AGA does not address training-time safety. It does not replace alignment research or RLHF. It does not evaluate model capabilities or detect novel jailbreaks. It does not interpret the model's internal states. These are the legitimate domain of behavioral safety work, and the investment frontier labs make in them is the right place for that investment to live.

AGA operates downstream of those concerns. It catches the residual: the failures that occur when a working, well-aligned model executes a confident plan that the deployment architecture has no way to refuse. Behavioral safety reduces the rate of those events. Architectural safety renders and seals the deny decision at the boundary, so a deployment that effects it can keep them from becoming infrastructure consequences.

Frequently asked

Does this require modifying Claude or Anthropic infrastructure? No. AGA runs at the deployment boundary, downstream of the model. The model is unchanged. The MCP host (Claude Desktop, custom MCP client, agent platform) connects to the AGA MCP server like any other MCP server. From the model's perspective, the enforcement layer is invisible.

What is the latency overhead per tool call? The reference implementation benchmarks at 4.94 ms per measurement cycle on standard hardware, covering Ed25519 signature generation, SHA-256 hash computation, and receipt chain append. Receipt generation is constant-time. Post-quantum hybrid signatures add 385 µs per governance decision.
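To get a rough feel for the cost of the three named operations on your own hardware, a micro-benchmark along these lines can help. It is a sketch under stated assumptions: it times only an Ed25519 signature, a SHA-256 hash, and an in-memory array append, using a hypothetical measureCycleMs function. It does not reproduce whatever else the reference implementation's measurement cycle includes, so its numbers are not comparable to the figures quoted above.

```javascript
const crypto = require("node:crypto");

// Hypothetical benchmark: average milliseconds for sign + hash + append.
function measureCycleMs(iterations) {
  const { privateKey } = crypto.generateKeyPairSync("ed25519");
  let prevHash = "sha256:" + "0".repeat(64);
  const chain = [];

  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    const body = JSON.stringify({
      chain_index: i + 1,
      action: "read:logs",
      decision: "allow",
      prev_hash: prevHash,
    });
    // Ed25519 signature over the receipt body.
    const signature = crypto.sign(null, Buffer.from(body), privateKey);
    // SHA-256 over body and signature becomes the next receipt's prev_hash.
    prevHash = "sha256:" + crypto.createHash("sha256").update(body).update(signature).digest("hex");
    // In-memory append; a real chain would also persist the receipt.
    chain.push({ body, signature });
  }
  const elapsedNs = process.hrtime.bigint() - start;

  return Number(elapsedNs) / 1e6 / iterations;
}

console.log(measureCycleMs(1000).toFixed(3) + " ms per cycle");
```

Results vary with CPU, Node version, and system load, so treat a single run as an order-of-magnitude indication rather than a benchmark.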

Who writes the sealed policy, and what happens if it is wrong? The deployer writes it, the same way the deployer writes any access policy today. The sealed policy fails closed: any operation outside its scope yields a sealed deny decision. A wrong policy denies legitimate operations and produces signed receipts of the denials, which is recoverable. There is no failure mode where the decision rendered against a wrong policy permits operations the policy does not specify.

How does this interact with existing audit logging? It is complementary. Existing logs continue. AGA produces a separate, hash-linked, signed receipt chain stored independently of the data the actions affect. The receipt chain is the artifact that survives hostile scrutiny; existing logs remain useful for operational debugging.

What is the patent posture? USPTO Application 19/433,835, filed December 2025, currently pending. The reference implementation is published on npm under permissive license terms.

Is this just signed logs? No. Signed logs are vendor-controlled and produced by the same system whose behavior is being audited. AGA's enforcement boundary is a separate process that the agent cannot inspect or modify. The enforcement process produces receipts at the time of each policy decision. The chain is hash-linked so any after-the-fact modification is detectable. Verification runs offline using public keys pinned in advance. None of those properties hold for vendor logs.

Closing

This deployment program is the most thoroughly documented frontier-model deployment in commercial use. The system card's acknowledgment of behavioral monitoring's structural limits is the clearest signal yet that the industry needs an independent, cryptographically grounded audit layer: not because behavioral safety is failing, but because behavioral safety alone cannot, in principle, close the governance gap that critical-infrastructure and federal deployers will eventually have to demonstrate they have closed.

AGA is one implementation of that layer. The mapping to the system card's acknowledged limits is direct. The integration surface is MCP-native. The verification primitives are buildable today with standard cryptographic tools.

Request a technical brief.

30 minutes. Three things on the agenda:

The four mechanisms in working code, including the sealed policy and signed receipt formats above.
The MCP integration surface for restricted frontier-lab deployment programs.
What an AGA-instrumented deployment inside a frontier-lab deployment program would actually emit.

USPTO Application 19/433,835, pending. Pilot pricing posture and timeline available on request.

Attested Intelligence Holdings LLC builds runtime governance infrastructure for autonomous AI agents: sealed policy artifacts, mandatory enforcement boundaries, scoped credentials, and cryptographically signed evidence chains. The reference implementation is published as @attested-intelligence/aga-mcp-server on npm.
