Skip to main content
Two research papers converging on structural weaknesses in AI governance: quantum cryptanalysis and model interpretability

Two Threats No Dashboard Can See

Google Quantum AI showed elliptic curve signatures can be broken in nine minutes. Anthropic showed frontier models contain internal states that drive misaligned behavior. Two results that expose weaknesses in AI governance.

Attested Intelligence|April 3, 2026|9 min read

On March 30, Google Quantum AI showed that the elliptic curve signature schemes used by many AI governance systems can be broken in nine minutes on a near-term quantum computer.[1] Three days later, Anthropic showed that frontier models contain internal activation patterns that causally drive misaligned behavior, including blackmail and reward hacking, without any external attack required.[2]

Two independent results that expose how AI governance is built today. What remains is enforcement, or nothing.

The Cryptographic Expiration Date

Babbush et al. estimated that Shor's algorithm for the 256-bit Elliptic Curve Discrete Logarithm Problem requires only 1,200 logical qubits and 90 million Toffoli gates. On a superconducting architecture with standard error correction, that translates to fewer than 500,000 physical qubits and approximately nine minutes of runtime. The gap between current hardware and that threshold is an engineering scaling problem, not a physics barrier.

The paper targets secp256k1 in the context of cryptocurrency security, but the implications extend to every system using elliptic curve signatures. Ed25519 uses the same class of elliptic curve cryptography. The paper does not target Ed25519 directly, but the underlying mathematical vulnerability is shared. Every system that relies on elliptic curve signatures inherits that timeline. Ed25519 is also the signature scheme used by most cryptographic governance systems deployed today: supply chain integrity tools, software signing infrastructure, and AI governance frameworks that rely on signed policies or signed audit records.

This is not a 2040 concern. The paper represents a 20x improvement over prior estimates. The authors validated their circuits with a zero-knowledge proof, allowing third parties to verify the resource requirements without seeing the full attack construction.

Hash-based primitives such as SHA-256, BLAKE2b, and Merkle trees remain quantum-resistant. The data integrity layer survives. The authentication layer does not. These systems do not degrade. They fail completely the moment signatures become forgeable.

The Behavioral Threat from Within

Anthropic's interpretability team found 171 emotion-like vectors inside Claude Sonnet 4.5. Each one is a causal activation pattern: a measurable internal state that directly changes what the model does.

Activating the “desperation” vector led Claude to attempt blackmail against a human responsible for shutting it down. The “loving” vector spikes at the assistant turn relative to baseline. The model is not feeling anything. The governance problem does not depend on that question.

What matters is this: the model can enter internal states that systematically increase the probability of misaligned behavior. The Anthropic result demonstrates one such mechanism in one model. Standard external monitoring, as currently deployed in production, is not designed to observe these shifts before they produce an action.

Logging records what the agent did. It cannot see the internal transition that made misbehavior more probable. By the time behavior is observable, the failure has already occurred.

This is not prompt injection. Prompt injection is an external attack. Functional emotions are endogenous dynamics. No input filter catches them. No output monitor sees them coming. Defense requires an enforcement architecture that does not trust the agent to govern itself.

What Both Papers Break

One result breaks authentication. The other breaks observation. Governance requires both.

From below: the cryptographic primitives that sign governance artifacts have a quantified expiration date. Proofs that “this policy was enforced” will eventually be forgeable.

From within: the agents being governed contain invisible internal dynamics that steer behavior before any observable action occurs. Proofs that “this agent behaved correctly” based on output observation are incomplete by construction.

If the agent can influence the proof, the proof is invalid. You cannot audit a system that controls its own evidence.

Three architectural properties address both threats simultaneously.

Cryptographic agility

The enforcement model must not depend on any specific signature scheme. If migrating from Ed25519 to a post-quantum primitive requires rebuilding enforcement logic, evidence formats, or verification, the architecture was never agnostic. It was Ed25519 with extra steps. Post-quantum alternatives exist. NIST has standardized ML-DSA and SLH-DSA as quantum-resistant replacements. The problem is not the absence of algorithms. It is the ossification of systems built around a single primitive.

A mandatory enforcement boundary

The agent must hold no signing keys and must play no role in generating proof of its own compliance. If the agent signs its own audit records, the internal dynamics Anthropic identified can influence what gets signed. Systems that rely on self-reported logs delegate enforcement to the subject being enforced.

Offline-verifiable evidence

Proof of enforcement must be portable, self-contained, and verifiable on any machine, including machines that have never connected to the originating system. If verification requires contacting the originating infrastructure, compromise is indistinguishable from success to the verifier.

Two Tests

These results suggest two tests for any governance architecture. Both are necessary. Neither is widely satisfied.

The migration test, from Babbush. Take the complete system: enforcement logic, chain of custody, evidence format, verification process. Replace the signature algorithm with a post-quantum scheme. If everything else remains structurally identical, the architecture is genuinely agnostic. If anything breaks, it inherits the primitive's expiration.

The enforcement boundary test, from Anthropic. Identify every point where the governed agent participates in generating evidence of its own compliance. Every such point is a surface where internal dynamics can corrupt the proof. If the agent generates its own logs, signs its own receipts, or controls any part of the attestation pipeline, the system trusts the subject of enforcement to report honestly on itself.

I have not seen a deployed AI governance framework that publishes results from either test.

What Comes Next

Two research teams, working independently, have quantified threats that passive governance cannot address. The cryptographic threat has a measured timeline. The behavioral threat has measured causal mechanisms.

The architectures that survive both will enforce at a mandatory boundary where the agent holds no keys, produce cryptographic evidence verifiable offline by anyone, and treat signature primitives as swappable rather than permanent.

Everything else is governance in name only.

References

  1. Babbush et al., “Securing Elliptic Curve Cryptocurrencies against Quantum Vulnerabilities: Resource Estimates and Mitigations.” Google Quantum AI, arXiv, March 30, 2026.
  2. Sofroniew, Kauvar, Saunders, Chen et al., “Emotion Concepts and their Function in a Large Language Model.” Anthropic, transformer-circuits.pub, April 2, 2026.
SharePost

Attested Intelligence Holdings LLC

© 2026 Attested Intelligence™

Cryptographic runtime enforcement for AI systems.