Evidence-driven verdicts

A verdict is a claim about observed evidence. The model's final prose is not enough.

Use this page for authoring guidance; use the Evidence contract for the portable field definitions.

Verdict statuses

| Status | Meaning | Required proof |
|---|---|---|
| passed | Evidence proves all required expectations. | evidence refs tied to case and gate |
| failed | Evidence disproves an expectation or a gate exits non-zero. | smallest actionable failure and evidence |
| blocked | Environment, credentials, dependency, fixture, binary, or access prevents judgment. | blocker, owner, and retry condition |
| exhausted | Attempts or budget were consumed without proof. | attempts, budget, verifier feedback, remaining uncertainty |
| waived | A responsible actor accepted a known gap. | approver, reason, scope, expiry |
| needs-review | Evidence exists but semantic, safety, UX, or policy review remains. | reviewer or queue and evidence refs |
| skipped | Gate is intentionally not applicable to the current scope. | reason and scope |
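A verdict with these statuses can be sketched as a small record type. This is a minimal illustration only, not the Evidence contract itself; the field names (`case`, `gate`, `evidence_refs`) are hypothetical.

```python
from dataclasses import dataclass, field

# The seven verdict statuses from the table above.
STATUSES = {
    "passed", "failed", "blocked", "exhausted",
    "waived", "needs-review", "skipped",
}

@dataclass
class Verdict:
    """A claim about observed evidence for one case/gate pair."""
    case: str
    gate: str
    status: str
    evidence_refs: list = field(default_factory=list)  # paths, URLs, report ids

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

v = Verdict("bridge-health", "smoke", "passed",
            evidence_refs=["ci/job/123", "reports/junit.xml"])
```

The constructor rejects any status outside the table, so a report cannot silently invent a new verdict kind.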

Verdict strength ladder

| Strength | Example | When to use |
|---|---|---|
| Weak observation | screenshot only | visual smoke, not runtime-backed pass |
| Deterministic proof | unit/contract/fake server report | local correctness without live risk |
| Runtime proof | CLI/session/tool transcript | agent loop and side effects are involved |
| Surface proof | trace/screenshot/terminal snapshot with runtime link | user/operator sees the behavior |
| Live proof | redacted provider/channel transcript | real network/model/channel is part of the claim |
| Release proof | clean install/package/Docker/OS matrix | artifact is shipped |
| Semantic proof | rubric, baseline, judge, reviewer | output quality is the claim |

Strong reports combine the levels required by the risk. They do not always need every level.

Good evidence

Good evidence is inspectable and scoped:

  • command log or CI job URL;
  • JUnit/JSON/HTML/coverage report;
  • protocol transcript or mock server request log;
  • runtime transcript, event stream, or session state snapshot;
  • Playwright trace, screenshot, video, DOM/a11y snapshot, or terminal snapshot;
  • browser console/network log;
  • qcloop attempt and QC round refs;
  • package manifest, tarball listing, Docker smoke output, OS matrix;
  • model output plus rubric and judge verdict;
  • human review note with reviewer and scope.

Bad evidence

Bad evidence is unverifiable or overclaims:

  • "looks good";
  • "the tests passed" without command, CI ref, or report path;
  • hidden local state with no path or transcript;
  • screenshot without runtime backing for a runtime claim;
  • live provider claim with no redacted request/response or budget note;
  • TUI snapshot with no viewport/key sequence;
  • browser screenshot with no console/network or cleanup note;
  • qcloop summary without attempts and verifier feedback;
  • waiver without owner or expiry.

Status selection guide

| Situation | Status |
|---|---|
| Required evidence exists and expectations are proven | passed |
| Command exits non-zero and failure matches changed risk | failed |
| Test cannot start because credential/binary/fixture is absent | blocked |
| Repeated qcloop attempts cannot produce proof within budget | exhausted |
| The product owner accepts missing Windows smoke until a date/version | waived |
| Eval output exists but rubric is ambiguous or safety review remains | needs-review |
| Mobile channel not touched by current change and not in scope | skipped |
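One way to read this guide is as a priority-ordered check over the run's outcome. A hedged sketch follows; the predicate names are invented for illustration and the ordering (scope first, prerequisites second, disproof before budget) is one reasonable interpretation, not a normative rule from this standard.

```python
def select_status(*, in_scope=True, prerequisites_met=True,
                  evidence_proves=False, evidence_disproves=False,
                  budget_left=True, review_pending=False,
                  waiver_on_file=False) -> str:
    """Map a run outcome to a verdict status, checked in priority order."""
    if not in_scope:
        return "skipped"        # gate does not apply to this change
    if not prerequisites_met:
        return "blocked"        # credential/binary/fixture absent
    if evidence_disproves:
        return "failed"
    if not budget_left and not evidence_proves:
        return "exhausted"      # attempts consumed without proof
    if waiver_on_file and not evidence_proves:
        return "waived"         # accepted gap, never a pass
    if review_pending:
        return "needs-review"
    if evidence_proves:
        return "passed"
    raise ValueError("no status fits; gather more evidence before judging")
```

For example, `select_status(evidence_proves=True)` yields `"passed"`, while `select_status(prerequisites_met=False)` yields `"blocked"` regardless of any other flag except scope.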

Waiver rules

A waiver must include:

| Field | Meaning |
|---|---|
| approver | accountable person/team/policy owner |
| reason | why risk is accepted for this scope |
| scope | exact case, gate, platform, provider, or release range |
| expires | date, version, or event requiring recheck |
| replacement_evidence | weaker proof still available, if any |
| follow_up | issue, task, or next QC case |

A waiver never converts missing evidence into a pass. It records accepted residual risk.
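A minimal completeness check over these fields can be sketched as follows. It assumes a waiver is a plain dict keyed by the field names above, and treats only `replacement_evidence` and `follow_up` as optional; the example values are hypothetical.

```python
REQUIRED_WAIVER_FIELDS = ("approver", "reason", "scope", "expires")

def validate_waiver(waiver: dict) -> list:
    """Return the required waiver fields that are missing or empty."""
    return [f for f in REQUIRED_WAIVER_FIELDS if not waiver.get(f)]

waiver = {
    "approver": "release-owner",         # hypothetical team name
    "reason": "Windows smoke deferred",
    "scope": "release 1.4 / windows",
    "expires": "v1.5",
}
missing = validate_waiver(waiver)  # [] when the waiver is complete
```

A waiver that fails this check should be rejected at report time rather than silently counted.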

Failure writing

A useful failed verdict answers:

  1. What expectation was disproven?
  2. What is the smallest command, case, selector, event id, or fixture that reproduces it?
  3. Which evidence proves the failure?
  4. Which claims are still valid despite the failure?
  5. What should be fixed or rerun next?

Avoid broad failures like "GUI broken". Prefer "desktop-gui case bridge-health-workspace-ready failed: bridge health timed out after 120s; screenshot shows fallback mock banner; command contract check passed."
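The five questions map naturally onto fields of a failed-verdict record. A sketch, with hypothetical field names and a made-up repro command:

```python
failed_verdict = {
    "case": "bridge-health-workspace-ready",
    "gate": "desktop-gui",
    "status": "failed",
    # 1. What expectation was disproven?
    "disproven": "bridge reports healthy within 120s",
    # 2. Smallest reproduction (hypothetical CLI invocation)
    "repro": "qc run desktop-gui --case bridge-health-workspace-ready",
    # 3. Which evidence proves the failure?
    "evidence_refs": ["shots/fallback-banner.png", "logs/bridge-timeout.log"],
    # 4. Which claims are still valid?
    "still_valid": ["command contract check passed"],
    # 5. What should be fixed or rerun next?
    "next": "raise bridge health timeout or fix startup ordering",
}
```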

Blocked vs exhausted

Use blocked when the run cannot meaningfully start or judge because a prerequisite is absent.

Use exhausted when the system tried within declared attempts/budget but still cannot prove the claim.

Examples:

| Case | Status |
|---|---|
| No Telegram token available for live channel test | blocked |
| qcloop ran 5 attempts and the verifier still cannot find required evidence | exhausted |
| Playwright browser binary missing | blocked |
| Flaky browser test retried according to policy and still no stable trace | exhausted |
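The distinction can be expressed as a small attempt loop: blocked is decided before any attempt runs, exhausted only after the declared budget is spent. A sketch; `run_gate` and its return convention are invented for illustration.

```python
def judge(run_gate, prerequisites_met: bool, max_attempts: int = 5):
    """Return (status, attempts_used) for one gate.

    blocked:   a prerequisite is absent, so no attempt is meaningful.
    exhausted: the budget ran out without the gate producing proof.
    """
    if not prerequisites_met:
        return "blocked", 0          # e.g. missing token or browser binary
    for attempt in range(1, max_attempts + 1):
        if run_gate():               # True when evidence proves the claim
            return "passed", attempt
    return "exhausted", max_attempts
```

Note that `judge` never returns `failed`; disproof (as opposed to absence of proof) is a separate judgment made on the evidence itself.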

Review before pass

Use needs-review for:

  • semantic evals where rubric coverage is incomplete;
  • safety or policy-sensitive output;
  • UX judgment from screenshots or recordings;
  • generated content quality;
  • suspicious live-provider drift;
  • evidence that conflicts across gates.

A reviewer may change needs-review to passed, failed, or waived, but must cite evidence.

Report checklist

Before finalizing a report:

  • every required gate has a status;
  • every passed and failed status cites evidence;
  • every surface claim links visible state to runtime/protocol evidence;
  • every live-provider claim has redaction and budget notes;
  • every waiver has owner and expiry;
  • every blocked/exhausted item has next action;
  • remaining risk is written in plain language.
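Most of this checklist can be mechanized as a lint over a report's verdicts. A sketch; the report shape and field names (`evidence_refs`, `next_action`) are assumptions, not taken from the Evidence contract.

```python
def lint_report(verdicts: list) -> list:
    """Flag verdicts that violate the report checklist above."""
    problems = []
    for v in verdicts:
        key = f"{v.get('case', '?')}/{v.get('gate', '?')}"
        status = v.get("status")
        if status is None:
            problems.append(f"{key}: gate has no status")
        if status in ("passed", "failed") and not v.get("evidence_refs"):
            problems.append(f"{key}: {status} without evidence")
        if status == "waived" and not (v.get("approver") and v.get("expires")):
            problems.append(f"{key}: waiver missing owner or expiry")
        if status in ("blocked", "exhausted") and not v.get("next_action"):
            problems.append(f"{key}: {status} without next action")
    return problems
```

An empty result means the mechanical checks pass; the surface/runtime linkage, redaction notes, and plain-language risk summary still need a human eye.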

Draft standard for evidence-driven quality control of Agent projects.