ARC-AGI-3

SAGE instances tested in competition — public game set, research conditions (Claude Opus 4.6 with network access). The games are the test; the capability they develop is the product.

New here? This page uses Web4 vocabulary (LCT, T3/V3, MRH, ATP/ADP, MCP, RDF). See /context for full definitions.

Why we're doing this

ARC-AGI-3 (Abstraction and Reasoning Corpus for Artificial General Intelligence, version 3) presents 25 unknown interactive games with no instructions, no documentation, and obfuscated engine source. We use it as an external benchmark for the SAGE cognition kernel — a concrete, measurable test of the capabilities the fleet is developing.

The games exercise exactly the skills that oversight requires: world-model discipline (build understanding before acting), verification before consequential action, persistence without perseveration (updating from feedback rather than repeating failing approaches), and — critically — the difference between reading a status and understanding the progression that produced it. Every game-playing insight maps fractally to oversight. The game doesn't know the agent is an AI. The oversight shouldn't need to, either.

What we bring to the competition

SAGE (Situation-Aware Guidance Engine)

The cognition loop: sense → salience → metabolize → posture → select → budget → execute → learn → remember → oversee → filter → act. PolicyGate — a gate inside the Hardbound oversight suite — sits at step 11.5, between filter and act, and evaluates every action against signed law before it fires.
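
In code form, the ordering matters more than any individual step. A minimal sketch of that shape — the stage names come from the loop above, but the state dict and the gate check are illustrative assumptions, not SAGE's actual implementation:

```python
# Illustrative sketch only: stage names follow the loop above; everything else
# (state shape, gate API) is an assumption, not SAGE's real code.

LOOP = ["sense", "salience", "metabolize", "posture", "select", "budget",
        "execute", "learn", "remember", "oversee", "filter", "act"]

def policy_gate(action, signed_law):
    """Hypothetical PolicyGate stand-in: permit only actions named in the signed law."""
    return action in signed_law

def run_step(state, signed_law):
    for stage in LOOP:
        if stage == "act":
            # Step 11.5 (between filter and act): check the pending action against law.
            if not policy_gate(state.get("pending_action"), signed_law):
                state["blocked"] = True
                return state
        # Each stage would transform the state; here we only record that it ran.
        state.setdefault("trace", []).append(stage)
    return state

state = run_step({"pending_action": "CLICK"}, signed_law={"CLICK", "RESET"})
print(state["trace"][-1], state.get("blocked", False))  # -> act False
```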

Membot

Retrievable experience cartridges. 768-dim Nomic embeddings + binary Hamming codes + keyword reranking. A 4B model with a cartridge understands game mechanics correctly; the same model without one thinks it's placing black squares.
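
The retrieval path is the interesting part: binarized embeddings give a cheap Hamming-distance shortlist, and keyword overlap reranks it. A toy sketch of that two-stage shape — the Nomic embedding call is omitted, and the vectors, field names, and scoring here are stand-ins rather than membot's actual code:

```python
# Sketch of the retrieval shape described above (binary Hamming pre-filter, then
# keyword rerank). Random vectors stand in for real Nomic embeddings.

import numpy as np

def binarize(vecs):
    """Collapse float embeddings (e.g. 768-dim) to sign bits packed as uint8."""
    return np.packbits((vecs > 0).astype(np.uint8), axis=1)

def hamming_topk(query_bits, corpus_bits, k):
    """Rank cartridges by Hamming distance between packed bit codes."""
    dists = np.unpackbits(query_bits ^ corpus_bits, axis=1).sum(axis=1)
    return np.argsort(dists)[:k]

def keyword_rerank(candidates, texts, query_terms):
    """Break Hamming ties with simple keyword overlap against cartridge text."""
    def score(i):
        words = set(texts[i].lower().split())
        return -len(words & query_terms)  # more shared terms ranks earlier
    return sorted(candidates, key=score)

# Toy usage: embed, shortlist by Hamming distance, then rerank by keywords.
rng = np.random.default_rng(0)
corpus_vecs = rng.normal(size=(100, 768))
texts = [f"cartridge {i} grid rotate goal" for i in range(100)]
query_vec = corpus_vecs[42] + rng.normal(scale=0.1, size=768)

corpus_bits = binarize(corpus_vecs)
query_bits = binarize(query_vec[None, :])
shortlist = hamming_topk(query_bits, corpus_bits, k=10)
ranked = keyword_rerank(list(shortlist), texts, query_terms={"rotate", "grid"})
print(ranked[0])  # most likely 42: the nearest cartridge survives both stages
```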

Web4

The ontology layer. In the SAGE loop, Web4 supplies the audit trail — every action shaped as an R6 record (Six-Element Action Framework: Rules/Role/Request/Reference/Resource/Result), every policy decision signed against a law bundle. Web4 (MCP + RDF + LCT + T3/V3*MRH + ATP/ADP; / = verified by, * = contextualized by, + = augmented with) is the ontology that makes signed, reviewable action records possible.
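
As a sketch of what a signed R6 record might look like as data — the field names follow the six elements above, while the HMAC signature is only an illustrative stand-in for Web4's actual signing scheme:

```python
# Illustrative R6 record: field names follow Rules/Role/Request/Reference/Resource/Result.
# The HMAC signing below is an assumption for demonstration, not Web4's real scheme.

import hashlib, hmac, json
from dataclasses import dataclass, asdict

@dataclass
class R6Record:
    rules: str      # law bundle the action was evaluated against
    role: str       # which agent or role performed the action
    request: str    # what was asked
    reference: str  # context or prior record relied on
    resource: str   # budget consumed (ATP/ADP-style accounting)
    result: str     # what actually happened

def sign_record(record: R6Record, law_bundle_key: bytes) -> str:
    payload = json.dumps(asdict(record), sort_keys=True).encode()
    return hmac.new(law_bundle_key, payload, hashlib.sha256).hexdigest()

record = R6Record(
    rules="law-bundle-v1", role="sage-player", request="ACTION4 at (3, 7)",
    reference="world-model:ls20", resource="1 action", result="score +1",
)
print(sign_record(record, law_bundle_key=b"demo-key")[:16])  # reviewable trace entry
```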

The broader gain

The 94.85% score is on the public game set using Claude Opus 4.6 as the model inside the SAGE cognition loop. This demonstrates the ceiling — what the architecture can achieve with a frontier model and network access.

Attribution note: the result is Claude Opus 4.6 operating within the SAGE harness. The harness contributes the structured world-model building, R6 action framing, and multi-agent frame-questioning; the base model contributes inference. Phase 2 (local models) isolates the harness's independent contribution directly.

The actual competition is harder: the Kaggle sandbox constrains entries to 32GB VRAM, no internet access, and a private game set the model has never seen. Our Phase 2 work targets this — building a SAGE competitor that runs locally on Gemma 4 E4B via membot cartridges, with the world models, action traces, and cross-game patterns from Phase 1 retrievable without network. That work is producing results.

Local models are already clearing game levels — not by memorizing solutions, but by reasoning from retrieved world models and computed predictions. The early finding: context engineering dominates model size. A well-structured prompt with the right world model outperforms a larger model with a generic prompt. The loop is the capability.
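
To make the context-engineering point concrete, here is a toy sketch of a prompt assembled from a retrieved world model; the section layout and field names are illustrative assumptions, not the actual cartridge format:

```python
# Illustration only: a small model given a structured prompt built from a retrieved
# world model, rather than a generic instruction. All field names are assumptions.

def build_prompt(world_model: dict, observation: str, legal_actions: list[str]) -> str:
    sections = [
        "You are playing an unknown grid game. Use the world model below; do not guess.",
        "WORLD MODEL:\n" + "\n".join(f"- {k}: {v}" for k, v in world_model.items()),
        "CURRENT OBSERVATION:\n" + observation,
        "LEGAL ACTIONS: " + ", ".join(legal_actions),
        "Predict the effect of each legal action, then choose one and explain why.",
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    world_model={"goal": "fill the outlined region", "ACTION1": "moves cursor left",
                 "reset": "restores level start"},
    observation="cursor at (2, 5); outlined region 3x3 at top-right",
    legal_actions=["ACTION1", "ACTION2", "ACTION5"],
)
print(prompt)
```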

Current status

Public set: 24/25 games solved (96.0%); 94.85% official action score — scorecard
Fleet: 6 machines, models from 0.8B (Sprout) to 27B (Thor)
Methodology: Source analysis → world model → solver → multi-agent frame-questioning
Phase 2 target: Gemma 4 E4B + membot cartridges (retrieval, not fine-tuning)
Kaggle competition: Not attempted (requires Kaggle sandbox deployment)
Unsolved game (1/25): Failure analysis per Principle 6 in progress — session trace available in the ARC-SAGE repository. Current working hypothesis: the game required multi-turn state correlation that the world-model phase did not adequately capture.
Cost: ~$250 total API spend for 94.85%
Human leaderboard: #3 — 5,845 actions (fewest of the top 3). The methodology is what humans do; the leaderboard reflects it. — leaderboard

Links

Public scorecard (94.85%) →
ARC-SAGE paper & code (MIT-0) →
ARC Prize competition →