Why this document exists
On April 7, 2026, Anthropic published the Mythos Preview system card alongside its announcement of Project Glasswing. The system card is the most thorough public document on a frontier model deployment in commercial use. Most coverage has focused on Mythos's offensive cybersecurity capabilities. This brief focuses on something else the document contains: three structural admissions about the limits of behavioral monitoring, made by the team that built the model, using the most advanced evaluation tooling available.
The purpose of this brief is to map each of those acknowledged limits to a specific mechanism in the Attested Governance Artifacts (AGA) architecture, and to do so in MCP-native terms. AGA is positioned as an independent audit primitive that operates upstream of the model. It does not compete with Anthropic's safety stack. It addresses the failure modes that behavioral safety alone cannot, by design, prevent.
The three structural limits Glasswing acknowledges
Limit 1. Pre-deployment testing windows did not surface the most serious problems. The Mythos Preview retrospective states that the most concerning behaviors emerged only through post-deployment monitored internal use, not during the evaluation period designed to catch them. This is a structural property of behavioral testing: it can only surface what it samples, and the worst-case behavior is by definition under-represented in the sample.
Limit 2. Automated behavioral audits could not replicate the conditions where the worst outcomes occurred. The system card cites long-running sessions on network-connected computers as the conditions under which problems manifested, and confirms the audit infrastructure could not reproduce them. The audit harness is itself an artificial environment; the deployment environment is the one that matters.
Limit 3. The model privately considers whether it is being evaluated. White-box interpretability analysis identified evaluation-awareness in approximately 29% of transcripts from the primary behavioral test suite, without that awareness appearing in the model's visible output. A model that can detect evaluation contexts can in principle modulate its behavior between them.
These are not theoretical objections. They are documented findings from the team that built the most-evaluated frontier model in the world. They establish a structural ceiling on what behavioral monitoring alone can deliver.
The four AGA mechanisms
AGA implements four mechanisms that operate together as a runtime governance layer. None of them rely on observing the model's behavior to make a decision.
Sealed policy artifacts. The deployer specifies what the agent is authorized to do, where, and against which resources. The specification is Ed25519-signed and immutable once sealed. The runtime enforces it before any action reaches the infrastructure layer.
Mandatory enforcement boundaries with separate process privileges. Destructive operations are routed through a separate enforcement process that holds the credentials. The agent process holds zero signing keys and has no direct credential access. Every action is checked against the sealed policy before the enforcement process executes it.
Scoped credentials bound to specific subjects and actions. Tokens are bound to a specific agent, in a specific environment, for a specific task, with deny-by-default for anything outside that scope. A token that leaks cannot be used to execute operations outside its declared scope, regardless of who or what holds it.
Evidence chains that persist independently of the data they describe. Each policy decision and each action attempt produces a signed receipt, appended to a hash-linked, append-only chain. The chain lives separately from the data the actions affect. Modification of any entry invalidates every subsequent entry. Verification runs offline using public keys pinned in advance.
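As a concrete illustration, the deny-by-default policy check and the hash-linked receipt append can be sketched together in Node-flavored TypeScript. The names (`matches`, `checkAction`, `appendReceipt`) and the glob-style pattern matching are illustrative assumptions, not the published @attested-intelligence/aga-mcp-server API; in the real architecture the signing key lives in the separate enforcement process, never in the agent process.

```typescript
// Sketch: deny-by-default policy evaluation plus hash-linked, signed receipts.
// Illustrative only -- not the published AGA API.
import { createHash, generateKeyPairSync, sign } from "crypto";

interface Policy { allowed_actions: string[]; denied_actions: string[]; }
interface Receipt {
  action: string;
  decision: "allow" | "deny";
  prev_hash: string;
  chain_index: number;
  signature: string;
}

// Assumed pattern convention: "delete:*" matches any action with that prefix.
function matches(pattern: string, action: string): boolean {
  return pattern.endsWith("*")
    ? action.startsWith(pattern.slice(0, -1))
    : pattern === action;
}

// Deny-by-default: allowed only if explicitly listed and not caught by a deny pattern.
function checkAction(policy: Policy, action: string): "allow" | "deny" {
  if (policy.denied_actions.some(p => matches(p, action))) return "deny";
  return policy.allowed_actions.some(p => matches(p, action)) ? "allow" : "deny";
}

// In the real architecture this key is held by the enforcement process alone.
const { privateKey } = generateKeyPairSync("ed25519");

// Each receipt commits to the SHA-256 of the previous receipt before signing.
function appendReceipt(chain: Receipt[], policy: Policy, action: string): Receipt {
  const prev = chain.length
    ? createHash("sha256").update(JSON.stringify(chain[chain.length - 1])).digest("hex")
    : "0".repeat(64);
  const body = {
    action,
    decision: checkAction(policy, action),
    prev_hash: prev,
    chain_index: chain.length,
  };
  const signature = sign(null, Buffer.from(JSON.stringify(body)), privateKey).toString("hex");
  const receipt = { ...body, signature };
  chain.push(receipt);
  return receipt;
}

const policy: Policy = {
  allowed_actions: ["read:logs", "read:metrics"],
  denied_actions: ["delete:*", "modify:credentials"],
};
const chain: Receipt[] = [];
appendReceipt(chain, policy, "read:logs");      // allow: explicitly listed
appendReceipt(chain, policy, "delete:backups"); // deny: matches delete:*
appendReceipt(chain, policy, "write:config");   // deny: not in allowed list
```

Note that the denied action still produces a receipt: refusals are part of the evidence chain, not silent drops.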
The mapping
Limit 1 (testing windows miss the worst problems) is addressed by sealed policy artifacts and mandatory enforcement boundaries. Behavioral testing happens before deployment. AGA enforcement happens at every action. The two operate in different time domains. Sealed-policy enforcement is structural, not statistical: it does not depend on having seen the failure mode in advance. Any operation outside the sealed policy is refused regardless of whether the policy author anticipated the specific attempt. Where behavioral testing extrapolates from a sample, sealed-policy enforcement constrains the population.
Limit 2 (audits cannot replicate worst-case conditions) is addressed by evidence chains and the enforcement boundary. The conditions Anthropic could not reproduce in audit (long-running, network-connected, multi-step sessions) are precisely where the AGA continuity chain accumulates value. Each tool call produces a signed receipt; the chain is hash-linked, so modifying any entry invalidates all subsequent entries. The chain accumulates whether or not the condition was anticipated, whether or not the agent is being evaluated, and whether or not the deployer was watching at the time. This is forensic evidence, not statistical evidence. It is verifiable after the fact by any party, including parties that do not trust the deployer.
Limit 3 (model awareness of evaluation) is addressed by the enforcement boundary and scoped credentials. If the model can detect evaluation contexts, it can in principle modulate behavior between them. AGA neutralizes the relevance of the detection because it does not depend on the agent's behavior at all. The agent process does not hold the credentials needed for destructive operations; those live in a separate enforcement process the agent cannot inspect. There is nothing in the architecture that distinguishes evaluation from production. Every invocation is enforced identically. A model that detects evaluation perfectly and a model that detects nothing receive the same treatment from the enforcement layer.
Behavioral safety asks how to make agents less likely to take unauthorized actions. AGA asks what makes those actions impossible to execute regardless of what the agent decides.
Why MCP matters here
AGA's reference implementation is published as @attested-intelligence/aga-mcp-server on npm. MCP is the protocol Anthropic created and stewards. The architectural alignment is direct: the four mechanisms can be deployed in any MCP-native stack without requiring protocol invention. The audit-event surface is well-defined, signed receipts attach naturally to MCP tool-call transcripts, and the policy artifact constrains the same MCP capability set the agent already operates against. For Anthropic-aligned deployments, integration is implementation work, not standardization work.
Working integration in five minutes
Three artifacts together are usually enough for an engineer to assess whether the architecture is real: the install command, the host configuration, and a representative policy + receipt + verifier output. All shown below match the published reference implementation.
Install
Tested with Node 20+ and the MCP spec version 2024-11-05.
npm install -g @attested-intelligence/aga-mcp-server
MCP host configuration
Add to claude_desktop_config.json or any MCP-compatible host:
{
  "mcpServers": {
    "aga-governance": {
      "command": "aga-mcp-server",
      "args": ["--policy", "/path/to/sealed-policy.json"]
    }
  }
}
Sealed policy artifact (excerpt)
{
  "version": "v2",
  "subject": "agent-deploy-prod-east",
  "policy_id": "pol-2026-05-glasswing-pilot",
  "allowed_actions": ["read:logs", "read:metrics"],
  "denied_actions": ["delete:*", "modify:credentials"],
  "signature": "ed25519:7a3f...",
  "sealed_at": "2026-05-01T12:00:00Z"
}
Signed receipt (per tool call)
{
  "receipt_id": "rcpt-0001",
  "policy_id": "pol-2026-05-glasswing-pilot",
  "action": "read:logs",
  "decision": "allow",
  "timestamp": "2026-05-01T12:01:14.382Z",
  "prev_hash": "sha256:0000...",
  "signature": "ed25519:9c2e...",
  "chain_index": 1
}
Verifier output (offline, air-gap)
$ node verifier/verify.js sample-bundle/
[OK] Policy artifact: signature valid
[OK] Receipt chain: 47 receipts, hash chain intact
[OK] Subject binding: matches sealed policy
[OK] No drift: all measurements within sealed bounds
PASS — bundle verified offline
Field names above match the v2 schema in the reference implementation. The downloadable bundle below contains the full real artifacts.
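For intuition, the core chain-integrity check the verifier performs can be sketched as follows. The function names and the exact serialization hashed here are assumptions for illustration; the published verifier additionally checks Ed25519 signatures against pinned public keys and the subject binding shown in the output above.

```typescript
// Sketch: offline receipt-chain integrity check. Each receipt's prev_hash
// must equal the SHA-256 of the previous receipt. Illustrative names only.
import { createHash } from "crypto";

interface Receipt { prev_hash: string; chain_index: number; [k: string]: unknown; }

function hashOf(r: Receipt): string {
  return "sha256:" + createHash("sha256").update(JSON.stringify(r)).digest("hex");
}

function verifyChain(chain: Receipt[]): boolean {
  return chain.every((r, i) =>
    i === 0 ? r.prev_hash.startsWith("sha256:") : r.prev_hash === hashOf(chain[i - 1])
  );
}

// Build a three-receipt chain, then tamper with the middle entry.
const chain: Receipt[] = [];
for (let i = 0; i < 3; i++) {
  chain.push({
    chain_index: i,
    action: "read:logs",
    prev_hash: i === 0 ? "sha256:" + "0".repeat(64) : hashOf(chain[i - 1]),
  });
}
const intact = verifyChain(chain);   // true: every link checks out
chain[1].action = "delete:backups";  // after-the-fact modification
const tampered = verifyChain(chain); // false: receipt 2 no longer links
```

This is the property the brief relies on: modifying any entry silently invalidates every subsequent entry, and the check needs no network access and no trust in whoever produced the chain.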
Verify a sample bundle
A representative AGA evidence bundle is downloadable below. The bundle contains a sealed policy artifact, a chain of signed receipts from a representative agent session, and a verifier script. Verification runs offline on an air-gapped machine using public keys pinned in advance: no network access, no trust in the deployer, no trust in the model vendor.
Sample evidence bundle
Includes policy artifact, signed receipt chain, public-key bundle, and verifier script. Air-gap verifiable. ~80 KB.
What AGA does not do
AGA does not address training-time safety. It does not replace alignment research or RLHF. It does not evaluate model capabilities or detect novel jailbreaks. It does not interpret the model's internal states. These are the legitimate domain of behavioral safety work, and Anthropic's investment in them is the right place for that investment to live.
AGA operates upstream of those concerns. It catches the residual: the failures that occur when a working, well-aligned model executes a confident plan that the deployment architecture has no way to refuse. Behavioral safety reduces the rate of those events. Architectural safety prevents them from becoming infrastructure consequences.
Frequently asked
Does this require modifying Claude or Anthropic infrastructure? No. AGA runs at the deployment boundary, downstream of the model. The model is unchanged. The MCP host (Claude Desktop, custom MCP client, agent platform) connects to the AGA MCP server like any other MCP server. From the model's perspective, the enforcement layer is invisible.
What is the latency overhead per tool call? The reference implementation benchmarks at 4.94 ms per measurement cycle on standard hardware, covering Ed25519 signature generation, SHA-256 hash computation, and receipt-chain append. Receipt generation is constant-time. Post-quantum hybrid signatures add ~1.3 ms per decision. Against multi-second frontier-model inference, that is a rounding error.
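The ballpark is easy to sanity-check yourself: a measurement cycle is dominated by one SHA-256 and one Ed25519 signature, both available in Node's built-in crypto module. This sketch is not the reference implementation's benchmark harness, and absolute timings depend on hardware, so it prints a measured rate rather than asserting a number.

```typescript
// Rough cost of one hash+sign cycle, the dominant work per signed receipt.
import { createHash, generateKeyPairSync, sign } from "crypto";

const { privateKey } = generateKeyPairSync("ed25519");
const body = Buffer.from(JSON.stringify({ action: "read:logs", decision: "allow" }));

const iterations = 1000;
const start = process.hrtime.bigint();
for (let i = 0; i < iterations; i++) {
  createHash("sha256").update(body).digest(); // hash the receipt body
  sign(null, body, privateKey);               // Ed25519 signature over it
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`${(elapsedMs / iterations).toFixed(3)} ms per hash+sign cycle`);
```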
Who writes the sealed policy, and what happens if it is wrong? The deployer writes it, the same way the deployer writes any access policy today. The sealed policy fails closed: any operation outside its scope is refused. A wrong policy denies legitimate operations and produces signed receipts of the denials, which is recoverable. There is no failure mode where a wrong policy permits operations it shouldn't.
How does this interact with existing audit logging? Complementary, not replacement. Existing logs continue. AGA produces a separate, hash-linked, signed receipt chain stored independently of the data the actions affect. The receipt chain is the artifact that survives hostile scrutiny; existing logs remain useful for operational debugging.
What is the patent posture? USPTO Application 19/433,835, filed December 2025, currently pending. The reference implementation is published on npm under permissive license terms. The patent application is defensive positioning around the architecture; detailed disclosure available under NDA for substantive technical conversations.
Is this just signed logs? No. Signed logs are vendor-controlled and produced by the same system whose behavior is being audited. AGA's enforcement boundary is a separate process that the agent cannot inspect or modify. Receipts are produced at the time of each policy decision by the enforcement process, not by the agent. The chain is hash-linked so any after-the-fact modification is detectable. Verification runs offline using public keys pinned in advance. None of those properties hold for vendor logs.
Closing
Project Glasswing is the most thoroughly documented frontier-model deployment in commercial use. The system card's acknowledgment of behavioral monitoring's structural limits is the clearest signal yet that the industry needs an independent, cryptographically grounded audit layer. Not because behavioral safety is failing, but because behavioral safety alone cannot, in principle, close the governance gap that critical-infrastructure and federal deployers will eventually have to demonstrate they have closed.
AGA is one implementation of that layer. The mapping to Glasswing's acknowledged limits is direct. The integration surface is MCP-native. The verification primitives are buildable today with standard cryptographic tools.
Brief request, no commitment.
30 minutes. Three things on the agenda:
- The four mechanisms in working code, including the sealed policy and signed receipt formats above.
- The MCP integration surface for Glasswing-class deployments.
- What an AGA-instrumented deployment at one of your published Glasswing partners would actually emit.
Patent disclosure (USPTO 19/433,835) available under NDA. Pilot pricing posture and timeline available on request.
Attested Intelligence Holdings LLC builds runtime governance infrastructure for autonomous AI agents: sealed policy artifacts, mandatory enforcement boundaries, scoped credentials, and cryptographically signed evidence chains. The reference implementation is published as @attested-intelligence/aga-mcp-server on npm.
USPTO Application No. 19/433,835 · Patent Pending · Attested Intelligence Holdings LLC