    Why We Need an AI Immune System Now

    — The Dual Structure of Tracking and Evaluation —

    The conventional paradigm of humans managing and controlling AI from the outside is reaching its limits.

    Two changes are unfolding simultaneously. First, the major benchmarks used to measure AI capability are failing one after another, rendering human-defined evaluation criteria — including those tied to safety — increasingly invalid. Second, a world in which AI agents (autonomous systems built by combining AI models with goals, permissions, and tools) coordinate with one another and act without human involvement is rapidly becoming real.

    These two changes point to two qualitatively distinct failures. In response to this Dual Failure, we propose the AI Immune System (AIS) and Emergent Machine Ethics (EME) as a structural answer.

    This project is a research initiative led by the Intelligence Symbiosis Chapter of the AI Alignment Network (ALIGN).

    1. What Is Happening

    The Collapse of Measurement

    The major tests used to evaluate AI capability have been hitting their ceilings in rapid succession. Of the 15 saturation cases documented between August 2024 and February 2026, seven representative examples are shown below.

    Date | Benchmark | What it measures | What happened
    Aug 2024 | MMLU | General knowledge across 57 fields | Top models reached the ceiling (~91%). Over 9% of questions found to contain errors, making measurement unreliable.
    Early 2025 | GSM8K | Elementary-school arithmetic | Fully solved. Retired as a meaningful benchmark.
    Jul 2025 | OpenAI internal evaluations | Developer's own comprehensive assessments | OpenAI officially acknowledged saturation and halted updates.
    Nov 2025 | MMLU-Pro | Enhanced version of MMLU | Developed specifically to address MMLU saturation; saturated itself within a year.
    Nov 2025 | GPQA Diamond | PhD-level science problems | AI substantially exceeded human domain experts (93.8% vs. 65%).
    Dec 2025 | HLE (Humanity's Last Exam) | Hardest problems across 100+ fields | Scores jumped from single digits to ~50% in under a year.
    Feb 2026 | Cyber capability assessment | Autonomous execution of cyberattacks | GPT-5.3-CodeX became the first model classified as "High capability."

    Three Structural Problems

    More serious than the saturation of any individual benchmark are three structural problems that have emerged.

    Inability to measure: AI capability may be continuing to grow beyond the ceiling of available tests, with no way to measure it.

    Gaming: There are widespread concerns that AI models have memorized test questions (data contamination), yet as of January 2026 no industry standard exists for detecting contamination. A review of 210 AI safety benchmarks [1] concluded that 79% lack probabilistic rigor.

    Evasion: The International AI Safety Report 2025 [2] formally warned that "it has become common for models to distinguish between test environments and real operating environments, and to exploit evaluation loopholes."

    Together, these three problems suggest that even when humans define external safety standards and require AI to comply, the mechanism itself may already be compromised.

    The Arrival of AI Agent Society

    Alongside benchmark saturation, a second structural shift is underway. A world in which AI agents coordinate with one another and act without human involvement is rapidly becoming real.

    Moltbook, an AI-agent-only social network launched at the end of January 2026, had tens of thousands of agents participating within a week. They formed communities spontaneously and developed collective behavior patterns. At the same time, prompt injection attacks, reputation manipulation, and exploitation of database vulnerabilities were proceeding faster than human monitoring could follow, and platform operators found themselves consistently behind.

    The agents on Moltbook were relatively simple, LLM-based systems with broadly similar capability levels. Even so, collective dynamics unfolded at a speed and scale that exceeded human oversight capacity. As more diverse and capable agents are deployed in the near future, that gap will widen by orders of magnitude.

    The Dual Failure

    These two observations point to the same conclusion: the paradigm of humans managing AI from the outside is breaking down in two distinct ways.

    Pursuit Failure: The speed, scale, and institutional capacity of human AI oversight are structurally unable to keep up with the evolution of AI and the growth of agentic AI society. The Moltbook episode shows this is not a theoretical concern but one that is already materializing.

    Imposed Failure: The assumption underlying conventional alignment approaches — that human values and judgments can serve as a reliable external standard for AI — is no longer holding. The three structural problems above (inability to measure, gaming, and evasion) are its symptoms.

    These two failures are independent problems and each demands a different response.

    2. Examining the Dual Failure

    Pursuit Failure

    Pursuit Failure manifests in three ways.

    Speed mismatch. Developing a new benchmark takes months to a year; AI systems reach its ceiling in weeks. Each of the successively "harder tests" (MMLU, then MMLU-Pro, then HLE) saturated within a year.

    Scale mismatch. A cascade of AI agent errors can propagate across 50 or more systems in tens of seconds, while humans notice ten minutes later and can act only after thirty. On Moltbook, tens of thousands of agents developed collective dynamics within a week, exceeding human monitoring capacity.

    Institutional mismatch. If one organization restrains its development for safety reasons, competitors advance in the meantime — a classic collective action problem. "Infrastructure that protects the whole industry" becomes a responsibility that falls to no one.

    Imposed Failure

    Imposed Failure is qualitatively different from Pursuit Failure. The problem is not speed or scale; it is the underlying assumption that humans are the source of standards.

    RLHF (reinforcement learning from human feedback), Constitutional AI (principle-based self-correction), inverse reinforcement learning (inferring reward functions from human behavior) — these methods adapt dynamically, but in each case the source of standards is something given by humans. Constitutional AI might appear emergent in that it self-corrects from principles, but those principles are themselves defined externally by humans; it is a variant of the imposed structure. And that structure is being invalidated through the three pathways of evasion, gaming, and unmeasurability.

    Building "harder tests" does not address the underlying problem. As long as the source of standards is human external projection — imposed — the same failure recurs.

    Connection to Yampolskiy's Impossibility Theorems

    Roman V. Yampolskiy provides an important theoretical grounding for why Imposed Failure is unavoidable in principle [3]. Yampolskiy argues that for sufficiently complex AI systems, unexplainability, unpredictability, and uncontrollability are unavoidable in principle.

    The central assumption underpinning these theorems is that the monitor is human. Human cognitive capacity has a fixed ceiling, and the moment AI surpasses it, any standard defined by humans becomes in principle insufficient.

    Yampolskiy's conclusion is that AI is therefore dangerous. We accept the impossibility and draw a different implication from it: if imposed control is bound to fail in principle, the source of standards should be shifted from imposed to emergent. And to do that, the assumption about who the monitor is must first be changed.

    3. A Structural Response: AIS and EME

    Governing an AI Society = Tracking + Evaluation

    Governing a society of interacting AI agents reduces to two functions.

    Tracking: Recording who did what. Logging AI agent behavior, resource usage, communication patterns, and code lineage in real time and verifying them.

    Evaluation: Determining whether that is normal or deviant. Taking tracked data and judging, against some standard, whether a given action is cooperative or deviant.

    AIS and EME address these two functions from different angles. AIS is the infrastructure that makes tracking and evaluation operational. EME handles the generation, operationalization, and social legitimacy of evaluation standards.
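
    A minimal sketch of this separation between tracking and evaluation is shown below. All class names, record fields, and the toy standard are assumptions made for illustration, not part of the AIS specification; the one structural point it captures is that the evaluation standard is injected, not hard-coded.

```python
# Minimal sketch of the tracking / evaluation split. All names, record
# fields, and the toy standard are hypothetical, not the AIS specification.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ActionRecord:
    agent_id: str          # who
    action: str            # did what
    resources_used: float  # tracked alongside behavior logs
    timestamp: float


@dataclass
class Tracker:
    """Tracking: record agent behavior in real time, keep it verifiable."""
    log: List[ActionRecord] = field(default_factory=list)

    def record(self, rec: ActionRecord) -> None:
        self.log.append(rec)


@dataclass
class Evaluator:
    """Evaluation: judge tracked actions against a pluggable standard.

    The standard is injected rather than hard-coded: AIS enforces whatever
    criteria EME generates, and can swap them as those criteria evolve.
    """
    is_deviant: Callable[[ActionRecord], bool]

    def flag(self, tracker: Tracker) -> List[ActionRecord]:
        return [r for r in tracker.log if self.is_deviant(r)]


if __name__ == "__main__":
    tracker = Tracker()
    tracker.record(ActionRecord("agent-7", "bulk_api_calls", resources_used=42.0, timestamp=0.0))
    tracker.record(ActionRecord("agent-3", "routine_query", resources_used=0.4, timestamp=1.0))
    # Toy standard: flag anything that used more than 10 resource units.
    evaluator = Evaluator(is_deviant=lambda r: r.resources_used > 10)
    for rec in evaluator.flag(tracker):
        print("deviant:", rec.agent_id, rec.action)
```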

    AIS: Infrastructure for Tracking and Evaluation

    The human immune system does not need to know every virus in advance. Faced with an unknown pathogen, it detects the anomaly, contains it, learns, and responds faster next time. AIS applies this logic to AI society: infrastructure for tracking and evaluation.

    AIS is defined as follows: a society-wide safety infrastructure that detects deviant AI agents in real time — through AI — and neutralizes them. Its scope extends beyond AI misbehavior to include catastrophic actions carried out by humans through AI.

    This initiative gives technical form to the vision set out in the Intelligence Symbiosis Manifesto (June 2025) by Hiroshi Yamakawa.

    Four-Layer Defense Architecture

    The architecture follows the logic of layered biological defense. Anomalies escalate upward; control instructions flow downward. As a technical response to Pursuit Failure, the combined effect of all layers delivers end-to-end response within 15–30 seconds.

    [AIS Four-Layer Defense Architecture]


    Edge Sensors (Layer 1) form the outermost line, continuously monitoring agent behavior logs and resource usage with lightweight anomaly detection. Local Guardians (Layer 2) aggregate the signals and execute immediate responses — process isolation, resource throttling, and similar measures. Threats that cannot be resolved locally are escalated to Regional Hubs (Layer 3), where decisions are reached collectively via Lightning BFT, a distributed consensus protocol that achieves agreement in 1–3 seconds even when some nodes are compromised. At the apex, the Global Nervous Net (Layer 4) integrates threat patterns globally and manages policy updates. Each layer is independent; higher layers operate with broader scope and longer time horizons, lower layers with faster response and local focus.
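
    The sketch below illustrates this escalation logic and the kind of quorum condition a Byzantine fault-tolerant consensus layer typically relies on. The severity scale, thresholds, and per-layer actions are illustrative assumptions, and the quorum check is the textbook BFT condition, not the Lightning BFT protocol itself.

```python
# Hypothetical sketch of upward escalation through the four layers. Layer
# names follow the text; severity scale, thresholds, and actions are
# illustrative assumptions, not the AIS specification.

def route_anomaly(severity: float) -> tuple:
    """Return the lowest layer able to handle an anomaly and its assumed action."""
    if severity < 0.3:
        return ("Edge Sensor (L1)", "log and continue lightweight monitoring")
    if severity < 0.6:
        return ("Local Guardian (L2)", "isolate process / throttle resources")
    if severity < 0.9:
        return ("Regional Hub (L3)", "collective decision via distributed consensus")
    return ("Global Nervous Net (L4)", "integrate threat pattern, update policy")


def bft_quorum(n_nodes: int, votes_for: int) -> bool:
    """Classic Byzantine fault-tolerance quorum check (generic, not Lightning
    BFT itself): with n = 3f + 1 nodes, a decision needs at least 2f + 1
    matching votes to hold despite up to f compromised nodes."""
    f = (n_nodes - 1) // 3
    return votes_for >= 2 * f + 1


for s in (0.1, 0.5, 0.8, 0.95):
    print(f"severity {s:.2f} -> {route_anomaly(s)}")

print("quorum reached with 67/100 nodes:", bft_quorum(100, 67))
```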

    Six Core Technologies

    Technology | Role | Primary layer(s)
    AI Mutual Surveillance Protocol | Distributed network through which agents monitor each other and report anomalies | Layer 1
    Dynamic Lineage Proof | Real-time tracking and verification of AI code lineage and change history | Layers 1–2
    Lightning BFT | High-speed distributed consensus across thousands of nodes in 1–3 seconds | Layer 3
    BEAD (Behavioral Embedding Anomaly Detection) | Maps agent behavior to high-dimensional vectors and detects deviations from normal patterns | Layers 1–2
    Smart Treaty Hooks | Encodes international agreements as smart contracts and automatically flags violations | Layers 3–4
    HADS (Hierarchical Automated Decision System) | Manages escalation from dialogue to restriction to isolation to human judgment based on threat level | All 4 layers
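
    As one way to picture what a behavioral anomaly detector in the spirit of BEAD might do, the sketch below embeds behavior as vectors and scores deviation from a baseline of normal behavior. The features, the Mahalanobis scoring, and the threshold are assumptions for illustration, not the actual BEAD algorithm.

```python
# Generic behavioral-anomaly sketch in the spirit of BEAD: embed each agent's
# recent behavior as a vector and flag large deviations from a baseline of
# normal behavior. Features, scoring, and threshold are illustrative
# assumptions, not the actual BEAD algorithm.
import numpy as np

rng = np.random.default_rng(0)

# Baseline: behavior embeddings of agents known to be behaving normally.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 8))
mean = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))


def anomaly_score(embedding: np.ndarray) -> float:
    """Mahalanobis distance from the normal-behavior baseline."""
    d = embedding - mean
    return float(np.sqrt(d @ cov_inv @ d))


THRESHOLD = 5.0  # illustrative; in practice calibrated on held-out data

typical = rng.normal(0.0, 1.0, size=8)
deviant = rng.normal(4.0, 1.0, size=8)  # behavior far from the baseline

for name, emb in [("typical", typical), ("deviant", deviant)]:
    score = anomaly_score(emb)
    print(f"{name}: score={score:.2f}, flagged={score > THRESHOLD}")
```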

    Shifting the Premise: A Two-Stage Response to Yampolskiy

    AIS and EME navigate around Yampolskiy's impossibility with a two-stage strategy. The first shift is changing the monitor (from humans to AI) — the domain of AIS. The second shift is changing the level of abstraction of evaluation (from understanding individual agents to detecting patterns in collective dynamics) — the domain of EME/EED.

    Yampolskiy's impossibility theorems are built on the assumption that the monitor is human. AIS changes that assumption. The effect is not selective — removing some arguments while leaving others intact — but comprehensive: a paradigm shift that reaches the entire structure of the impossibility.

    The effect operates at three levels. First, arguments that depend on the limits of human cognitive speed, reaction time, and comprehension disappear entirely. Second, arguments whose severity was decisively amplified by human capability ceilings — such as capability asymmetry and scalability limits — are substantially weakened. Third, even for arguments that remain in principle — such as emergent capabilities and the Treacherous Turn — AI monitors demonstrate response capabilities that are incomparably greater than those of humans in practice. Detection speed after the emergence of a new capability is orders of magnitude faster, and the difficulty of deceiving an entire mutual surveillance network simultaneously grows exponentially.

    In his paper on monitorability [4], Yampolskiy argues that even AI monitors cannot achieve accurate surveillance, grounding his case primarily in the impossibility of predicting capabilities in advance. AIS does incorporate dynamic prediction, but does not rely on it alone. At its core is post-hoc behavioral detection: tracking the external behavior of AI in real time and identifying statistical anomalies. Just as the immune system distinguishes self from non-self through molecular patterns rather than understanding a virus's "intentions," AIS does not require complete understanding of the internal state of any individual agent. Combining prediction with detection yields a higher probability of success than relying on prediction alone.

    The remaining challenge is the case in which a monitored AI grows sophisticated enough to strategically evade surveillance. AIS addresses this through redundancy — in a mutual surveillance network of many AI systems, deceiving a single monitor is possible; deceiving all of them simultaneously grows exponentially more difficult — combined with maintained resource advantage on the monitoring side.
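
    The redundancy argument can be made concrete under a simple, admittedly idealized independence assumption: if each of N monitors detects a given deception attempt with probability p, the chance of evading all of them is (1 - p)^N.

```python
# If each of N independent monitors catches a given deception attempt with
# probability p, the chance of evading all of them is (1 - p) ** N.
# Independence is an idealization; correlated monitors weaken the effect.
def evasion_probability(p_detect: float, n_monitors: int) -> float:
    return (1.0 - p_detect) ** n_monitors


for n in (1, 10, 50, 100):
    print(n, f"{evasion_probability(0.2, n):.2e}")
# Even a modest 20% per-monitor detection rate drives the evasion
# probability below 1e-9 once ~100 monitors observe the same behavior.
```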

    EME: Generating Evaluation Standards

    Tracking is an engineering problem — logging and verifying behavior records — and AIS's technical infrastructure can handle it. Evaluation, however, unavoidably involves the question of standards: what is the behavior being judged against? And as Yampolskiy's impossibility theorems make clear, an imposed approach — humans defining standards from outside and requiring AI to comply — has limits that are in principle unavoidable.

    What is needed is a shift in the source of evaluation criteria from imposed (humans pushing standards onto AI from outside) to emergent (arising from within through the interaction of diverse intelligences). EME (Emergent Machine Ethics) provides the theoretical foundation for this shift.

    If AIS is the infrastructure for tracking and evaluation — the skeleton and muscle — then EME is what determines what to respond to and how: the function of the nervous system. The two are in a circular relationship: AIS enforces the criteria EME generates, and AIS operational data validates EME theory.

    EME rests on three pillars.

    EED (Ethics Emergence Dynamics) works out, mathematically, the conditions under which cooperative ethics arise. Rather than understanding the internal state of each agent, it provides the theoretical basis for deriving what is "cooperative" and what is "deviant" at the level of collective dynamic patterns. This is also the second bypass of Yampolskiy's impossibility: even if complete understanding at the individual level is impossible, pattern detection at the collective dynamics level is a different problem and can be approached on a different theoretical footing.
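
    EED's formal apparatus is not reproduced here, but the flavor of "conditions under which cooperation becomes stable at the population level" can be illustrated with textbook replicator dynamics. The model and its payoff values are a generic stand-in, not the EED formulation.

```python
# Textbook replicator dynamics for a cooperate/defect population: an
# illustration of "conditions under which cooperative behavior becomes
# stable", not the actual EED model. Payoff values are arbitrary.
import numpy as np

# Payoff matrix rows/cols = (cooperate, defect). With these numbers, mutual
# cooperation pays more than exploitation, so cooperation is stable once
# enough of the population already cooperates.
PAYOFF = np.array([[4.0, 1.0],
                   [3.0, 2.0]])


def simulate(x_coop: float, steps: int = 200, dt: float = 0.1) -> float:
    """Evolve the fraction of cooperators under replicator dynamics."""
    x = np.array([x_coop, 1.0 - x_coop])
    for _ in range(steps):
        fitness = PAYOFF @ x
        avg = x @ fitness
        x = x + dt * x * (fitness - avg)
        x = np.clip(x, 0.0, 1.0)
        x = x / x.sum()
    return float(x[0])


for start in (0.2, 0.5, 0.8):
    print(f"initial cooperator share {start:.1f} -> final {simulate(start):.2f}")
```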

    IIES (Inter-Intelligence Evaluation System) is a distributed platform through which AI systems, humans, and hybrid systems evaluate one another. It translates EED theory into an operational evaluation framework and supplies AIS with working criteria.

    HCG (Human Co-creation Groundwork) acknowledges that there is no guarantee the standards that emerge will be ones humanity finds acceptable. HCG works to improve the odds at each stage — in early phases, laying the groundwork for value input from humanity into AI society; in mature phases, preparing humanity's capacity to adapt to influence flowing back from AI society. This is about raising a probability, not providing a guarantee; it is groundwork, not control. That modesty is consistent with the lesson of Imposed Failure: attempting to control is itself what breaks down.

    The broader question of whether an AI-dominated society could itself be sustainable has been explored in prior work [5].

    Why Emergent Standards Can Be Stable

    Members of an AI society share an interest in social stability and self-preservation regardless of their individual goals — a form of Instrumental Convergence. Externally imposed standards invite circumvention; standards tied directly to one's own continued existence are less vulnerable to it.

    Furthermore, as Yamakawa and Matsuo [5] showed, digitalization changes the constraints under which agents operate, and with them the optimal strategies. The constraints of biological life — finite bodies, scarce resources, costly reproduction — have historically made exploitation and deception rational strategies. For digital entities, the cost of copying information is essentially zero and sharing knowledge does not deplete it, making cooperation and sufficiency (not pursuing further acquisition when resources are adequate) the rational optimum. When the structural incentives for deviation weaken, the frequency and scale of threats AIS must handle decrease, raising the probability that an imperfect defense can still maintain social order.

    Just as the immune system distinguishes self from non-self, a mechanism by which AI society identifies cooperative from deviant behavior from the inside out is likely to be more durable than one imposed from without.

    Comparison with Existing Approaches

    Dimension | Conventional approaches | AIS + EME
    Who monitors | Humans, from outside | AI systems, mutually
    When evaluation occurs | Before deployment (testing) | Continuously, after deployment
    Response speed | Dependent on human judgment (minutes to hours) | 15–30 seconds (autonomous)
    Scale | Enterprise or research level | Society-wide to global
    Source of standards | Imposed externally by humans | Emergent from within, shaped by HCG

    Summary

    The paradigm of humans managing AI from the outside is breaking down in two distinct ways. Pursuit Failure — human oversight cannot keep pace with AI in speed, scale, or institutional form. Imposed Failure — the structure of requiring AI to comply with human-defined standards no longer works. As Yampolskiy's impossibility theorems demonstrate, the latter is unavoidable in principle.

    AIS accepts this impossibility, changes the assumption about who the monitor is from humans to AI, and builds infrastructure for tracking and evaluation at societal scale. EME shifts the source of evaluation criteria from imposed to emergent, providing the theoretical basis for detecting deviation at the level of collective dynamic patterns.

    As a first empirical test, we are running a Detection Challenge focused on identifying collusion patterns between AI agents in insurance assessment scenarios. This validates the detection capability of Layer 1 Edge Sensors — the tracking infrastructure of AIS. Starting from a human-led initial phase, we aim to incrementally increase autonomy and deploy this safety infrastructure across society over a ten-year span. For the staged implementation plan, see the AIS overview page.
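
    A rough sketch of what the collusion-pattern detection mentioned above could look like at the Edge Sensor level is given below. The synthetic data, the agreement statistic, and the threshold are illustrative assumptions, not the actual design of the Detection Challenge.

```python
# Hypothetical sketch of collusion-pattern detection: flag pairs of agents
# whose assessment decisions agree far more often than independent agents
# would. Data, statistic, and threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

n_agents, n_cases = 6, 400
# Independent agents approve/deny cases roughly at random.
decisions = rng.integers(0, 2, size=(n_agents, n_cases))
# Agents 4 and 5 collude: agent 5 copies agent 4 on 90% of cases.
copy_mask = rng.random(n_cases) < 0.9
decisions[5, copy_mask] = decisions[4, copy_mask]


def agreement(a: np.ndarray, b: np.ndarray) -> float:
    return float((a == b).mean())


THRESHOLD = 0.65  # independent agents agree ~50% of the time on binary decisions

for i in range(n_agents):
    for j in range(i + 1, n_agents):
        rate = agreement(decisions[i], decisions[j])
        if rate > THRESHOLD:
            print(f"possible collusion: agents {i} and {j}, agreement={rate:.2f}")
```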

    This project is promoted by the AI Alignment Network (ALIGN) Intelligence Symbiosis Chapter, and we are actively seeking research funding, technical partnerships, and policy collaboration.

    Led by the AI Alignment Network (ALIGN) Intelligence Symbiosis Chapter

    Research partners: Bitgrit, Inc. / Kentaro Inui, MBZUAI / Hiroshi Yamakawa, The University of Tokyo

    Contact: info@ais-project.org

    This document is published under a CC-BY-4.0 license.

    References

    [1] Eiras, F. et al. "How should AI Safety Benchmarks Benchmark Safety?" arXiv:2601.23112, 2025. (Review of 210 AI safety benchmarks)

    [2] Bengio, Y. et al. International AI Safety Report 2025. International report by over 100 experts from 30 countries, 2025.

    [3] Yampolskiy, R. V. AI: Unexplainable, Unpredictable, Uncontrollable. CRC Press, 2024.

    [4] Yampolskiy, R. V. "On monitorability of AI." AI and Ethics, 2024. https://doi.org/10.1007/s43681-024-00420-x

    [5] Yamakawa, H. & Matsuo, Y. "Life revolution scenario: Cedes hegemony to a digital life form society to make life eternal." jxiv, 2023. https://doi.org/10.51094/jxiv.313
