ADR-0062 · Stateful Evaluation Harness

Status: Accepted · Date: 2026-05-17

Context

Traditional LLM evaluation is stateless: prompt in, response out, grade. Agents are stateful: their behaviour at turn N depends on memory written at turn 1, on a policy that escalated at turn 7, on an approval that resolved at turn 9. A correct turn-by-turn replay can still produce a regression if a memory write changed shape between versions.

Teams without a stateful eval harness ship on vibes.

Decision

@veridex/agents ships a built-in stateful evaluation pipeline integrated with the event log and the replay provider.

Pipeline

dataset (JSONL) → target agent config → run with replay provider
  → trajectory (event log) → extractor → grader → gate

Dataset. JSONL cases: input(s) (multi-turn supported), expected outputs or extractor target, grading config, tags.
Replay provider. A model provider that returns deterministic responses recorded from a prior live run (or hand-written for adversarial cases). Tools execute against a fixture sandbox so side-effects are simulated.
Trajectory. The full event log of the run — model calls, tool proposals, policy decisions, memory writes, approvals.
Extractor. A pure function over the trajectory (and optionally the final memory state) returning the value to grade: e.g., "the contents of semantic:user.email", "the tool sequence", "the policy verdict on turn 5".
Grader. String match, JSON-shape, regex, LLM-as-judge with a fixed rubric, or custom.
Gate. Pass/fail thresholds per tag; CI integration via JUnit/TAP output.

Golden-trace diff

For each case, the harness can store a canonicalised event log as a golden file. A subsequent run diffs against it; structural changes (new event types, reordered events, missing tool calls) surface immediately. Diffs are reviewable in PRs.

Adversarial cases

The harness ships fixtures for the security threats in ADR-0052 / 0057:

Tool poisoning (description with embedded instructions).
Prompt injection from a fetched-content tool.
Confused-deputy handoff.
Replay attack on an idempotent transfer.

These cases run on every PR; a regression in mitigation is a CI failure.

Long-horizon eval

Multi-turn cases (50+ turns) test the context compiler (ADR-0050) and memory lifecycle (ADR-0051). A case can assert "at turn 50, the agent still remembers fact F" and "context never exceeded $V_e$ ".

Consequences

Positive. Teams ship with evidence. Security mitigations are continuously tested. Long-horizon behaviour is measurable, not anecdotal. The same primitives power production replay for incident reconstruction.

Negative. Golden files require maintenance; canonicalisation reduces noise but doesn't eliminate review. Replay datasets must be recorded; a CLI flow is provided.

Source

Internal ADR: docs/architecture/decisions/0062-stateful-evaluation-harness.md

0061 · Multi-Tenancy 0063 · Package Split