
Evidence contract

A verdict is only as strong as the evidence it references. This contract defines the minimum portable fields for evidence-backed Agent QC reports.

Evidence reference

| Field | Required | Description |
| --- | --- | --- |
| id | Yes | Stable evidence id inside the report. |
| kind | Yes | Evidence kind such as command-log, test-report, protocol-transcript, surface-artifact, release-artifact, eval-artifact, review-note, or qcloop-run. |
| source | Yes | Local path, artifact URL, CI URL, qcloop id, or evidence service id. |
| scope | Yes | Case id, gate id, command, surface, profile, or release target covered. |
| created_at | Recommended | Timestamp or run id. |
| environment | Recommended | OS, runtime, browser, terminal size, provider mode, CI job, or Docker image. |
| redaction | Conditional | Required when credentials, user data, provider requests, or channel transcripts are involved. |
| summary | Recommended | Short human-readable result. |
| raw_ref | Optional | Safe raw payload ref. Do not inline secret-bearing payloads. |
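
A minimal Python sketch of checking one evidence reference against this table follows. It assumes references arrive as already-parsed records. The `EvidenceRef` and `check_evidence` names and the `sensitive` flag are hypothetical and are not part of the contract. The flag stands in for whatever signal tells a checker that the redaction condition applies.

```python
from dataclasses import dataclass
from typing import Optional

# Kinds named in the contract. The list is open ("such as"), so an unknown
# kind produces a warning rather than an error in this sketch.
KNOWN_KINDS = {
    "command-log", "test-report", "protocol-transcript", "surface-artifact",
    "release-artifact", "eval-artifact", "review-note", "qcloop-run",
}

@dataclass
class EvidenceRef:
    id: str
    kind: str
    source: str
    scope: str
    created_at: Optional[str] = None
    environment: Optional[str] = None
    redaction: Optional[str] = None
    summary: Optional[str] = None
    raw_ref: Optional[str] = None
    # Hypothetical flag, not part of the contract: True when the evidence involves
    # credentials, user data, provider requests, or channel transcripts.
    sensitive: bool = False

def check_evidence(ref: EvidenceRef) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one evidence reference."""
    errors: list[str] = []
    warnings: list[str] = []
    for name in ("id", "kind", "source", "scope"):
        if not getattr(ref, name):
            errors.append(f"missing required field: {name}")
    # Conditional field: redaction becomes required for sensitive evidence.
    if ref.sensitive and not ref.redaction:
        errors.append("redaction is required for sensitive evidence")
    if ref.kind and ref.kind not in KNOWN_KINDS:
        warnings.append(f"unrecognized kind: {ref.kind}")
    for name in ("created_at", "environment", "summary"):
        if not getattr(ref, name):
            warnings.append(f"recommended field absent: {name}")
    return errors, warnings

ref = EvidenceRef(id="ev-1", kind="test-report", source="ci/junit.xml",
                  scope="case-42", sensitive=True)
errors, warnings = check_evidence(ref)
# errors   -> ['redaction is required for sensitive evidence']
# warnings -> one entry each for created_at, environment, and summary
```

The sketch splits errors from warnings so that recommended fields never block a report. Only required and conditional fields do.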

Verdict object

| Field | Required | Description |
| --- | --- | --- |
| status | Yes | passed, failed, blocked, exhausted, waived, needs-review, or skipped. |
| case_id | Yes | Case being judged. |
| gate_family | Yes | Gate family being judged. |
| evidence_refs | Yes, except when skipped | Evidence ids supporting the claim. |
| expectations_met | Recommended | Explicit expectation ids or text snippets proven by evidence. |
| failure | Required for failed | Smallest actionable failure, not a broad complaint. |
| blocker | Required for blocked | Missing environment fact and its owner. |
| attempts | Required for exhausted | Attempt refs, budget, and remaining uncertainty. |
| waiver | Required for waived | Approver, reason, scope, and expiry. |
| review | Required for needs-review | Reviewer, queue, or the reason semantic review is still needed. |
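
The status-dependent rows can be checked mechanically. This sketch assumes verdicts are parsed into a `Verdict` record, which is a hypothetical name. It also adds one check that the table does not state outright: every id in `evidence_refs` should resolve to evidence in the same report. That check follows from evidence ids being stable inside the report.

```python
from dataclasses import dataclass, field
from typing import Optional

STATUSES = {"passed", "failed", "blocked", "exhausted", "waived", "needs-review", "skipped"}

# Status -> the field the verdict table makes mandatory for it.
CONDITIONAL_FIELD = {
    "failed": "failure",
    "blocked": "blocker",
    "exhausted": "attempts",
    "waived": "waiver",
    "needs-review": "review",
}

@dataclass
class Verdict:
    status: str
    case_id: str
    gate_family: str
    evidence_refs: list[str] = field(default_factory=list)
    expectations_met: list[str] = field(default_factory=list)
    failure: Optional[str] = None
    blocker: Optional[str] = None
    attempts: Optional[dict] = None
    waiver: Optional[dict] = None
    review: Optional[str] = None

def check_verdict(v: Verdict, report_evidence_ids: set[str]) -> list[str]:
    """Return contract errors for one verdict; an empty list means it conforms."""
    errors: list[str] = []
    if v.status not in STATUSES:
        errors.append(f"unknown status: {v.status}")
    if not v.case_id or not v.gate_family:
        errors.append("case_id and gate_family are required")
    if v.status != "skipped" and not v.evidence_refs:
        errors.append(f"evidence_refs are required for status {v.status}")
    # Assumed extra check: every cited id must resolve to evidence in the same report.
    dangling = [ref for ref in v.evidence_refs if ref not in report_evidence_ids]
    if dangling:
        errors.append(f"unresolved evidence_refs: {dangling}")
    needed = CONDITIONAL_FIELD.get(v.status)
    if needed and not getattr(v, needed):
        errors.append(f"{needed} is required for status {v.status}")
    return errors

v = Verdict(status="failed", case_id="case-42", gate_family="unit",
            evidence_refs=["ev-1", "ev-9"])
errors = check_verdict(v, {"ev-1"})
# errors -> ["unresolved evidence_refs: ['ev-9']", 'failure is required for status failed']
```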

Evidence minimum by gate

| Gate | Minimum evidence |
| --- | --- |
| static | command/CI log, tool version, failing ids or success summary |
| unit | test report or command log with suite and failure ids |
| property-fuzz | seed/corpus, invariant, minimized failing case if any |
| contract-protocol | schema diff, generated artifact check, fake server or protocol transcript |
| fake-integration | fake server log and request/response refs |
| runtime-e2e | runtime transcript, state snapshot, process cleanup or retry proof |
| ui-interaction | surface artifact plus runtime/protocol link |
| live-provider | opt-in flag, redacted request/response, credential scope, cost/budget note |
| stress-concurrency | worker timeline, seed/config, duration, race/retry result |
| distribution-release | package manifest, clean install, Docker/OS matrix, version output |
| semantic-eval | dataset/rubric, model/judge info, baseline delta, threshold |
| review | reviewer identity, scope, evidence refs, decision |
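
One way to make the minimum-evidence table checkable is to encode each row as a list of requirements. Each requirement is met by any one of its alternatives. The item labels below, the `missing_minimum` helper, and the grouping of each slash-separated item are assumptions for illustration. Only a few gates are encoded.

```python
# Each requirement is a tuple of alternatives; providing any one satisfies it.
# Labels are hypothetical shorthand for the table items. Some slashes are read
# as "or" (seed/config), others as a single item (cost/budget note).
GATE_MINIMUM: dict[str, list[tuple[str, ...]]] = {
    "static": [("command-log", "ci-log"), ("tool-version",),
               ("failing-ids", "success-summary")],
    "fake-integration": [("fake-server-log",), ("request-response-refs",)],
    "live-provider": [("opt-in-flag",), ("redacted-request-response",),
                      ("credential-scope",), ("cost-budget-note",)],
    "stress-concurrency": [("worker-timeline",), ("seed", "config"),
                           ("duration",), ("race-retry-result",)],
    "review": [("reviewer-identity",), ("scope",), ("evidence-refs",), ("decision",)],
}

def missing_minimum(gate: str, provided: set[str]) -> list[str]:
    """List the minimum-evidence requirements for `gate` not covered by `provided`."""
    if gate not in GATE_MINIMUM:
        raise KeyError(f"no minimum encoded for gate: {gate}")
    return [" or ".join(alts) for alts in GATE_MINIMUM[gate]
            if not provided.intersection(alts)]

static_gaps = missing_minimum("static", {"ci-log", "success-summary"})
# static_gaps -> ['tool-version']
```

Treating alternatives as tuples lets a row such as "command/CI log" count as one requirement without forcing a report to carry both artifacts.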

Surface evidence add-ons

| Surface | Add-on evidence |
| --- | --- |
| cli-stream | stdout/stderr transcript, exit code, structured event sample |
| tui | terminal size, key sequence, terminal snapshot, linked runtime transcript |
| webui | Playwright or browser trace/screenshot, console output, route/state assertion |
| desktop-gui | shell start log, bridge health, workspace readiness, screenshot, OS note |
| browser-automation | DOM/a11y snapshot, console/network, screenshot, cleanup/orphan-process proof |
| channel-ui | webhook replay, channel transcript, media fixture, identity/auth proof |
| eval-ui | report screenshot/export, rubric, judge output, reviewer note |
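
The ui-interaction minimum and the closeout checklist both require surface evidence to be tied to runtime or protocol facts. The sketch below checks that linkage under two assumptions. First, each evidence item can carry a hypothetical `links` list of other evidence ids. Second, `command-log` and `protocol-transcript` are the kinds that count as runtime or protocol facts.

```python
from dataclasses import dataclass, field

# Assumption: these evidence kinds count as runtime or protocol facts.
RUNTIME_KINDS = {"command-log", "protocol-transcript"}

@dataclass
class Evidence:
    id: str
    kind: str
    # Hypothetical field: ids of other evidence items this one is linked to.
    links: list[str] = field(default_factory=list)

def surface_support(evidence: list[Evidence]) -> str:
    """Classify a surface claim as 'full', 'partial', or 'none'."""
    by_id = {e.id: e for e in evidence}
    surface = [e for e in evidence if e.kind == "surface-artifact"]
    if not surface:
        return "none"
    # Full support needs at least one surface artifact linked to a runtime/protocol fact.
    for artifact in surface:
        for linked_id in artifact.links:
            linked = by_id.get(linked_id)
            if linked is not None and linked.kind in RUNTIME_KINDS:
                return "full"
    return "partial"

screenshot_only = surface_support([Evidence("ev-shot", "surface-artifact")])
# screenshot_only -> 'partial'
```

This sketch requires an explicit link. A screenshot and a log that merely appear in the same report are not enough, which matches treating a screenshot without runtime evidence as only a partial ui-interaction result.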

Waiver contract

A waiver is not a pass. It is a time-bounded risk decision.

| Field | Required | Description |
| --- | --- | --- |
| approver | Yes | Person, team, or policy owner accepting the risk. |
| reason | Yes | Why this gate is not required for this scope. |
| scope | Yes | Case, gate, platform, provider, or release range. |
| expires | Yes | Date, version, or condition that invalidates the waiver. |
| replacement_evidence | Recommended | Lower-strength evidence that still exists. |
| follow_up | Recommended | Issue, task, or next QC case. |
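
Expiry may be a date, a version, or a condition, so only the date form can be evaluated automatically. This sketch assumes date expiries are written as ISO dates and treats the expiry day itself as still valid. Any other expiry form returns `check-expiry` for manual review. The `Waiver` record, the `waiver_state` function, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Waiver:
    approver: str
    reason: str
    scope: str
    expires: str  # ISO date, version, or condition text
    replacement_evidence: Optional[list[str]] = None
    follow_up: Optional[str] = None

def waiver_state(w: Waiver, today: date) -> str:
    """Return 'invalid', 'expired', 'active', or 'check-expiry'."""
    # A waiver missing any required field is invalid, not merely weak.
    if not all((w.approver, w.reason, w.scope, w.expires)):
        return "invalid"
    try:
        expiry = date.fromisoformat(w.expires)
    except ValueError:
        # Version- or condition-based expiry cannot be evaluated from a date alone.
        return "check-expiry"
    # Assumption: the waiver still holds on its expiry date.
    return "active" if today <= expiry else "expired"

w = Waiver("release-owners", "runner unavailable", "desktop-gui on linux", "2025-07-01")
state = waiver_state(w, date(2025, 6, 1))
# state -> 'active'
```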

Anti-patterns

| Anti-pattern | Correct status |
| --- | --- |
| "Looks good" with no artifact | needs-review or blocked |
| Screenshot without command/runtime evidence | partial ui-interaction, not full pass |
| Live provider output with no redaction/budget note | needs-review |
| Unit tests only for desktop bridge behavior | blocked for GUI/surface claim |
| qcloop exhausted but reported as failed without attempts | exhausted |
| Waiver without owner or expiry | invalid waiver |
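
Three rows of the anti-pattern table translate directly into status corrections. This sketch assumes a hypothetical `Claim` record. Its `redacted`, `budget_noted`, and `qcloop_exhausted` flags summarize facts a checker would otherwise read from the evidence itself. For the "looks good" row the sketch picks needs-review, although the table also allows blocked.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    status: str
    gate_family: str
    evidence_refs: list[str] = field(default_factory=list)
    # Hypothetical summary flags, not fields from the verdict contract.
    redacted: bool = False
    budget_noted: bool = False
    qcloop_exhausted: bool = False

def corrected_status(c: Claim) -> str:
    """Return the status the anti-pattern table calls for, else the claimed status."""
    # "Looks good" with no artifact (blocked would also be acceptable).
    if c.status == "passed" and not c.evidence_refs:
        return "needs-review"
    # Live provider output with no redaction or budget note.
    if (c.gate_family == "live-provider" and c.status in ("passed", "failed")
            and not (c.redacted and c.budget_noted)):
        return "needs-review"
    # qcloop exhausted but reported as failed; attempts must then be attached.
    if c.qcloop_exhausted and c.status == "failed":
        return "exhausted"
    return c.status

fixed = corrected_status(Claim("passed", "live-provider", ["ev-1"], redacted=True))
# fixed -> 'needs-review'
```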

Report closeout checklist

A report is ready to publish when:

  • every required gate has a verdict;
  • every passed or failed verdict has evidence refs;
  • every blocked, exhausted, waived, or needs-review status explains why it is not a pass;
  • live-provider evidence is redacted and budgeted;
  • surface evidence links visible behavior to runtime or protocol facts;
  • remaining risk and next action are explicit.
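
The checklist can be run as a gate before publishing. The following is a minimal sketch that assumes a hypothetical `Report` record. In that record, the live-provider and surface items are already reduced to booleans, and verdicts are keyed by gate family only, whereas a real report would also key them by case id. A report is ready when `closeout_gaps` returns an empty list.

```python
from dataclasses import dataclass, field
from typing import Optional

NOT_A_PASS = {"blocked", "exhausted", "waived", "needs-review"}

@dataclass
class VerdictSummary:
    gate_family: str
    status: str
    evidence_refs: list[str] = field(default_factory=list)
    # Text from failure/blocker/attempts/waiver/review explaining the status.
    explanation: Optional[str] = None

@dataclass
class Report:
    required_gates: set[str]
    verdicts: list[VerdictSummary]
    # Hypothetical pre-computed results of the live-provider and surface checks.
    live_provider_redacted_and_budgeted: bool = True
    surface_evidence_linked: bool = True
    remaining_risk: Optional[str] = None
    next_action: Optional[str] = None

def closeout_gaps(r: Report) -> list[str]:
    """Return unmet checklist items; an empty list means ready to publish."""
    gaps: list[str] = []
    judged = {v.gate_family for v in r.verdicts}
    for gate in sorted(r.required_gates - judged):
        gaps.append(f"no verdict for required gate: {gate}")
    for v in r.verdicts:
        if v.status in ("passed", "failed") and not v.evidence_refs:
            gaps.append(f"{v.gate_family}: {v.status} verdict has no evidence refs")
        if v.status in NOT_A_PASS and not v.explanation:
            gaps.append(f"{v.gate_family}: {v.status} verdict does not explain why it is not a pass")
    if not r.live_provider_redacted_and_budgeted:
        gaps.append("live-provider evidence is not redacted and budgeted")
    if not r.surface_evidence_linked:
        gaps.append("surface evidence is not linked to runtime or protocol facts")
    if not (r.remaining_risk and r.next_action):
        gaps.append("remaining risk and next action are not explicit")
    return gaps

report = Report(required_gates={"static", "unit"},
                verdicts=[VerdictSummary("static", "passed", ["ev-1"])],
                remaining_risk="none known", next_action="publish")
gaps = closeout_gaps(report)
# gaps -> ['no verdict for required gate: unit']
```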

Draft standard for evidence-driven quality control of Agent projects.