Reliability Is a Harness Property: The Agent Engineering Dossier

The Model Is Not the Operating System

Weak agent programs treat reliability as a procurement problem. The run fails, so the team swaps the model, raises the context window, or spends more on inference. Sometimes that helps. It does not create a reliable operating surface by itself.

The practical failure is usually lower than intelligence. The agent was given a vague task. It loaded the wrong context. It trusted stale notes. It called a tool without a contract. It stopped after a plausible answer. It repaired the same local symptom three times because no part of the harness forced a plan reset. Those are system failures.

This is why Greyforge treats harness engineering as the real reliability layer. Capability has to pass through contracts before it becomes dependable work.

What the Research Keeps Saying

The public research arc points in the same direction. Serious evaluation keeps moving away from isolated prompt scoring and toward real environments, execution feedback, tool boundaries, and reproducible checks. The lesson is not that benchmarks are perfect. The lesson is that useful agent evaluation has to look more like systems engineering than trivia grading.

SWE-bench

Real repository issues push evaluation beyond short-form coding tasks.

SWE-agent

Agent-computer interface design changes execution quality.

AgentBench

Long-horizon reasoning and instruction following remain hard under action loops.

AgentDojo

Tool use must be tested for utility and prompt-injection pressure together.

The Public Rule

A reliable agent harness is a stack of contracts. The model can still reason, write, inspect, and repair. The harness makes those actions bounded, observable, and reversible enough for real work.

The task contract tells the agent what done means before it starts.

The context contract decides what evidence is loaded and what stale memory is rejected.

The tool contract narrows authority before a command, write, or external call happens.

The verification contract decides whether a run can close or must repair itself.

This is the same doctrine behind Memory Quality Without an LLM Judge: make the cheap boundary deterministic before spending a model call on what a gate could have rejected. It also explains why memory continuity and operations control matter so much. An agent that cannot inherit the right state cannot be trusted to finish the right job.

What Stays Behind the Gate

The full edition is not a longer pep talk. It is the operational dossier: failure classes, harness layers, scorecards, trace discipline, budget policy, security pressure, and the minimum reference architecture a serious builder can adapt.

Greyforge will keep public evidence online, but the transferable method belongs in the premium Chronicle layer. That protects the forge from automated extraction while still giving public readers a real thesis they can inspect, cite, and challenge.

Reliability Is a Harness Property