Reliability Is a Harness Property
Model quality matters, but it is not the reliability system. The system is the harness: the contracts that decide what an agent sees, what it can do, what it must prove, and when it is forced to repair the run instead of declaring victory.

The Model Is Not the Operating System
Weak agent programs treat reliability as a procurement problem. The run fails, so the team swaps the model, enlarges the context window, or spends more on inference. Sometimes that helps. It does not, by itself, create a reliable operating surface.
The practical failure usually sits below the level of model intelligence. The agent was given a vague task. It loaded the wrong context. It trusted stale notes. It called a tool without a contract. It stopped after a plausible answer. It repaired the same local symptom three times because no part of the harness forced a plan reset. Those are system failures.
This is why Greyforge treats harness engineering as the real reliability layer. Capability has to pass through contracts before it becomes dependable work.
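One of the failures above, repairing the same local symptom repeatedly with no forced plan reset, can be expressed as a small harness rule. The following is a minimal sketch, not Greyforge's implementation: the `RepairGuard` name, the symptom strings, and the threshold of two repairs are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RepairGuard:
    """Counts repair attempts per symptom and demands a plan reset past a limit."""

    max_repairs_per_symptom: int = 2  # assumed threshold; tune per harness
    attempts: Counter = field(default_factory=Counter)

    def record(self, symptom: str) -> str:
        """Return 'repair' while under the limit, 'reset_plan' once it is exceeded."""
        self.attempts[symptom] += 1
        if self.attempts[symptom] > self.max_repairs_per_symptom:
            return "reset_plan"
        return "repair"

    def reset(self) -> None:
        """Clear all counters after the agent has produced a new plan."""
        self.attempts.clear()


guard = RepairGuard()
# Hypothetical symptom label; the third attempt on it trips the guard.
decisions = [guard.record("test_login fails") for _ in range(3)]
print(decisions)  # ['repair', 'repair', 'reset_plan']
```

The point of the sketch is that the decision to stop patching lives in the harness, where it is deterministic, and not in the model's own judgment of whether the fix is going well.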
What the Research Keeps Saying
The public research arc points in the same direction. Serious evaluation keeps moving away from isolated prompt scoring and toward real environments, execution feedback, tool boundaries, and reproducible checks. The lesson is not that benchmarks are perfect. The lesson is that useful agent evaluation has to look more like systems engineering than trivia grading.
SWE-bench: Real repository issues push evaluation beyond short-form coding tasks.
SWE-agent: Agent-computer interface design changes execution quality.
AgentBench: Long-horizon reasoning and instruction following remain hard under action loops.
AgentDojo: Tool use must be tested for utility and prompt-injection pressure together.
The Public Rule
A reliable agent harness is a stack of contracts. The model can still reason, write, inspect, and repair. The harness makes those actions bounded, observable, and reversible enough for real work.
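As a rough illustration of "bounded, observable, and reversible", a single tool contract might look like the sketch below. Everything here is hypothetical: `ToolContract`, `Harness`, the `set_key` tool, and the `scratch/` prefix rule are invented for the example and assume an in-memory store, not any real tool API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ToolContract:
    name: str
    validate: Callable[[Dict[str, Any]], bool]  # bounded: reject out-of-contract arguments
    run: Callable[[Dict[str, Any]], Any]        # the action itself
    undo: Callable[[Dict[str, Any]], Any]       # reversible: how to roll the action back


class Harness:
    def __init__(self) -> None:
        self.trace: List[Dict[str, Any]] = []   # observable: every call leaves a record
        self._undo_stack: List[Tuple[ToolContract, Dict[str, Any]]] = []

    def call(self, tool: ToolContract, args: Dict[str, Any]) -> Any:
        if not tool.validate(args):
            self.trace.append({"tool": tool.name, "args": args, "status": "rejected"})
            raise ValueError(f"{tool.name}: arguments violate the contract")
        result = tool.run(args)
        self.trace.append({"tool": tool.name, "args": args, "status": "ok"})
        self._undo_stack.append((tool, args))
        return result

    def rollback(self) -> None:
        """Undo successful calls in reverse order, recording each reversal."""
        while self._undo_stack:
            tool, args = self._undo_stack.pop()
            tool.undo(args)
            self.trace.append({"tool": tool.name, "args": args, "status": "undone"})


# Hypothetical tool: writes are only allowed under the "scratch/" prefix.
store: Dict[str, str] = {}
set_key = ToolContract(
    name="set_key",
    validate=lambda a: isinstance(a.get("key"), str) and a["key"].startswith("scratch/"),
    run=lambda a: store.__setitem__(a["key"], a["value"]),
    undo=lambda a: store.pop(a["key"], None),
)

harness = Harness()
harness.call(set_key, {"key": "scratch/note", "value": "draft"})
harness.rollback()  # the store is empty again and the trace shows both steps
```

The model never touches `store` directly in this sketch; it can only ask the harness to call a tool, which is what keeps the action inside the contract.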
This is the same doctrine behind Memory Quality Without an LLM Judge: make the cheap boundary deterministic before spending a model call on what a gate could have rejected. It also explains why memory continuity and operations control matter so much. An agent that cannot inherit the right state cannot be trusted to finish the right job.
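The idea of a cheap deterministic boundary in front of a model call can be sketched as follows. This is an assumption-laden example, not the method from the referenced Chronicle: the `cheap_gate` rules (empty text, a length cap, a leftover `TODO` marker) and the `model_judge` callback are placeholders for whatever checks and model call a real harness would use.

```python
from typing import Callable, Optional, Tuple


def cheap_gate(note: str, max_chars: int = 2000) -> Optional[str]:
    """Return a rejection reason, or None if the note passes the deterministic checks."""
    text = note.strip()
    if not text:
        return "empty"
    if len(text) > max_chars:
        return "too_long"
    if "TODO" in text:
        return "unfinished"
    return None


def admit(note: str, model_judge: Callable[[str], bool]) -> Tuple[bool, str]:
    """Run the gate first; call the expensive judge only if the gate passes."""
    reason = cheap_gate(note)
    if reason is not None:
        return False, f"gate:{reason}"
    if model_judge(note):
        return True, "judge:accepted"
    return False, "judge:rejected"


# Stand-in for a model call; a real harness would pay for inference here.
judge_calls = []


def fake_judge(note: str) -> bool:
    judge_calls.append(note)
    return True


print(admit("   ", fake_judge))                     # (False, 'gate:empty')
print(admit("Deploy uses port 8443.", fake_judge))  # (True, 'judge:accepted')
print(len(judge_calls))                             # 1
```

Only the second note reaches the judge, so the rejected one costs nothing beyond a few string checks.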
What Stays Behind the Gate
The full edition is not a longer pep talk. It is the operational dossier: failure classes, harness layers, scorecards, trace discipline, budget policy, security pressure, and the minimum reference architecture a serious builder can adapt.
Greyforge will keep public evidence online, but the transferable method belongs in the premium Chronicle layer. That protects the forge from automated extraction while still giving public readers a real thesis they can inspect, cite, and challenge.
The full dossier turns the thesis into a working harness model.
Includes the reliability taxonomy and the eight-layer harness architecture.
Includes the failure ledger, scorecard, and trace review cadence.
Includes the model-swap decision rule: when to upgrade, when to repair the harness, and when to stop the run.