ARC-AGI-3

SAGE instances tested on the ARC-AGI-3 public game set under research conditions (Claude Opus 4.6 with network access). The games are the test; the capability they develop is the product.

New here? This page uses Web4 vocabulary (LCT, T3/V3, MRH, ATP/ADP, MCP, RDF). See /context for full definitions.

Why we're doing this

ARC-AGI-3 (Abstraction and Reasoning Corpus for Artificial General Intelligence, third-gen interactive benchmark) presents 25 unknown interactive games with no instructions, no documentation, and obfuscated engine source. We use it as an external benchmark for the SAGE cognition kernel — a concrete, measurable test of the capabilities the fleet is developing.

The games exercise exactly the skills that oversight requires: world-model discipline (build understanding before acting), verification before consequential action, persistence without perseveration (update from feedback vs. repeat failing approaches), and — critically — the difference between reading a status and understanding the progression that produced it. Every game-playing insight maps fractally to oversight. The game doesn't know the agent is an AI. The oversight shouldn't need to, either.

What we bring to the competition

SAGE (Situation-Aware Governance Engine)

The cognition loop has 12 functional steps: sense → salience → metabolize → posture → select → budget → execute → learn → remember → oversee → filter → act. PolicyGate is a Hardbound oversight sub-gate that sits between step 11 (filter) and step 12 (act); it is not an additional step. It evaluates every action against signed law before the action fires.
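The ordering above can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: `LoopState`, `policy_gate`, and `run_cycle` are hypothetical names, the steps are reduced to trace entries, and the "signed law" is stood in for by a plain allow-list. It shows only the structural point that the gate runs after filter and before act without counting as a thirteenth step.

```python
from dataclasses import dataclass, field

# The 12 functional steps, in the order the page lists them.
STEPS = [
    "sense", "salience", "metabolize", "posture", "select", "budget",
    "execute", "learn", "remember", "oversee", "filter", "act",
]


@dataclass
class LoopState:
    """Hypothetical per-cycle state: one proposed action plus a trace."""
    proposed_action: str
    trace: list = field(default_factory=list)
    fired: bool = False


def policy_gate(action: str, allowed: set) -> bool:
    """Stand-in for PolicyGate: an allow-list instead of signed law (assumption)."""
    return action in allowed


def run_cycle(state: LoopState, allowed: set) -> LoopState:
    """Walk the 12 steps; the gate is checked just before 'act', not as its own step."""
    for name in STEPS:
        if name == "act":
            if not policy_gate(state.proposed_action, allowed):
                state.trace.append("policy_gate:deny")
                return state  # the action never fires
            state.trace.append("policy_gate:allow")
            state.fired = True
        state.trace.append(name)
    return state
```

A denied action leaves the trace ending at the gate, so a reviewer can see that the loop reached filter and stopped before act.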

Membot

Retrievable experience cartridges. 768-dim Nomic embeddings + binary Hamming codes + keyword reranking. A 4B model with a cartridge understands game mechanics correctly; the same model without one thinks it's placing black squares.
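One way to picture the retrieval path is a two-stage lookup: a cheap Hamming-distance shortlist over binary codes, then a keyword rerank of the shortlist. The sketch below is an illustration under assumptions, not membot's actual code. The sign-bit binarization, the `retrieve` function, the cartridge entry layout, and the random `toy_vec` vectors (standing in for real 768-dim Nomic embeddings) are all hypothetical.

```python
import random

DIM = 768  # embedding width stated on this page


def toy_vec(seed: int) -> list:
    """Random stand-in for a real embedding (assumption: no model is called here)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def binarize(vec: list) -> int:
    """Pack one sign bit per dimension into an integer code."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def keyword_overlap(query: str, text: str) -> int:
    """Count of distinct query words that also appear in the entry text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query_text: str, query_vec: list, cartridge: list, shortlist: int = 2) -> list:
    """Stage 1: shortlist by Hamming distance. Stage 2: rerank by keyword overlap."""
    qcode = binarize(query_vec)
    near = sorted(cartridge, key=lambda e: hamming(qcode, e["code"]))[:shortlist]
    return sorted(
        near,
        key=lambda e: (-keyword_overlap(query_text, e["text"]), hamming(qcode, e["code"])),
    )


# Hypothetical cartridge entries; the texts are invented for the example.
cartridge = [
    {"text": "click tile to toggle color", "code": binarize(toy_vec(1))},
    {"text": "move avatar with arrows", "code": binarize(toy_vec(2))},
    {"text": "push block onto target", "code": binarize(toy_vec(3))},
]
best = retrieve("toggle tile color", toy_vec(1), cartridge)[0]
```

The binary codes keep the first stage to XOR and bit counting, which is the kind of operation that stays cheap on small local hardware.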

Web4

The ontology layer. In the SAGE loop, Web4 supplies the audit trail: every action is shaped as an R6 record (Six-Element Action Framework: Rules / Role / Request / Reference / Resource / Result), and every policy decision is signed against a law bundle. Web4 = MCP + RDF + LCT + T3/V3*MRH + ATP/ADP, where “/” means “verified by” (T3/V3) or marks an allocation pair (ATP/ADP), “*” means “contextualized by”, and “+” means “augmented with”. This ontology is what makes signed, reviewable action records possible.
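As a rough illustration of what a signed, reviewable action record could look like, the sketch below models the six R6 elements as a frozen dataclass and signs each record together with a digest of the law bundle. The field types, the canonical JSON encoding, the use of HMAC-SHA256 with a shared key, and the function names are all assumptions made for this example; the actual Web4 record format and signing scheme are not described on this page.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class R6Record:
    """The six R6 elements named above; plain strings are an assumption."""
    rules: str
    role: str
    request: str
    reference: str
    resource: str
    result: str


def canonical(record: R6Record) -> bytes:
    """Deterministic byte encoding so equal records always sign identically."""
    return json.dumps(asdict(record), sort_keys=True, separators=(",", ":")).encode()


def sign_decision(record: R6Record, law_bundle: bytes, key: bytes) -> str:
    """Bind the record to a specific law bundle (hypothetical scheme: HMAC-SHA256)."""
    law_digest = hashlib.sha256(law_bundle).hexdigest().encode()
    return hmac.new(key, law_digest + b"|" + canonical(record), hashlib.sha256).hexdigest()


def verify_decision(record: R6Record, law_bundle: bytes, key: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_decision(record, law_bundle, key), signature)
```

Including the law-bundle digest in the signed payload means a record approved under one bundle does not verify under another, which is the property an audit trail needs when the law changes over time.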

The broader gain

The 94.85% is the official ARC Prize action score (efficiency-weighted), with Claude Opus 4.6 as the model inside the SAGE cognition loop, on the public game set. The game-solve rate is 96.0% (24/25 games). These are two distinct metrics; the action score reflects action efficiency, not just whether a game was solved. This demonstrates the ceiling — what the SAGE cognition loop can achieve with a frontier model and network access.
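The difference between the two metrics can be shown with a toy calculation. The solve rate below is the arithmetic stated above (24 of 25 games). The `toy_action_score` function is a hypothetical efficiency weighting invented for this example and is not the official ARC Prize formula; the per-game numbers in `toy_games` are made up to show how an efficiency-weighted score can sit below the solve rate.

```python
def solve_rate(solved: int, total: int) -> float:
    """Fraction of games solved, regardless of how many actions each took."""
    return solved / total


def toy_action_score(games: list) -> float:
    """Hypothetical weighting (NOT the official formula).

    Each game is (solved, agent_actions, baseline_actions). A solved game
    earns baseline/agent, capped at 1.0; an unsolved game earns 0.
    """
    per_game = [
        min(1.0, baseline / agent) if solved else 0.0
        for solved, agent, baseline in games
    ]
    return sum(per_game) / len(per_game)


# Stated on this page: 24 of 25 public games solved.
public_solve_rate = solve_rate(24, 25)  # 0.96

# Invented numbers: three solved games, one of them inefficient, plus one failure.
toy_games = [(True, 100, 100), (True, 200, 100), (False, 50, 100), (True, 80, 100)]
toy_rate = solve_rate(3, 4)              # 0.75
toy_score = toy_action_score(toy_games)  # 0.625
```

In the toy data the second game is solved but takes twice the baseline actions, so it counts fully toward the solve rate and only half toward the weighted score.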

Attribution note: the result is Claude Opus 4.6 operating within the SAGE harness. The harness contributes the structured world-model building, R6 action framing, and multi-agent frame-questioning; the base model contributes inference. Phase 2 (local models) isolates the harness's independent contribution directly.

The actual competition is harder: the Kaggle sandbox constrains entries to 32GB VRAM, no internet access, and a private game set the model has never seen. Our Phase 2 work targets this — building a SAGE competitor that runs locally on Gemma 3n E4B (effective-parameter edge variant) via membot cartridges, with the world models, action traces, and cross-game patterns from Phase 1 retrievable without network. That work is producing results.

Local models are already clearing game levels — not by memorizing solutions, but by reasoning from retrieved world models and computed predictions. The early finding: context engineering dominates model size. A well-structured prompt with the right world model outperforms a larger model with a generic prompt. The loop is the capability.

Current status

Public set: 24/25 games solved (96.0%); 94.85% official action score (Claude Opus 4.6, network access). See the scorecard.
Fleet: 6 machines; at benchmark time, models ranged from 0.8B (Sprout) to 27B (Thor). Current fleet: 1.1B (CBP) to 14B (Thor).
Methodology: Source analysis → world model → solver → multi-agent frame-questioning
Phase 2 target: Gemma 3n E4B + membot cartridges (retrieval, not fine-tuning)
Kaggle competition: Not attempted (requires Kaggle sandbox deployment)
Unsolved game (1/25): Failure analysis per Principle 6 is in progress; the session trace is available in the ARC-SAGE repository. Current working hypothesis: the game required multi-turn state correlation that the world-model phase did not adequately capture.
Cost: ~$250 total API spend for 94.85%
Human leaderboard: #3, with 5,845 actions (fewest of the top 3). The methodology is what humans do; the leaderboard reflects it. See the leaderboard.

Links

Public scorecard (94.85%) →
ARC-SAGE paper & code (MIT-0) →
ARC Prize competition →