Flow and taxonomy

This page is the complete Agent QC lifecycle and taxonomy reference. It mirrors the specification style used by Agent UI: explicit dimensions, fields, constraints, lifecycle stages, and validation cases.

Core contract

Agent QC is an evidence protocol for Agent project quality. A compatible QC plan classifies owned risk, selects gates, executes checks, stores evidence, and emits verdicts without turning model prose into proof.

Compatible QC reports MUST:

  • classify one or more project profiles;
  • name touched interaction surfaces when user-visible behavior is involved;
  • map each required gate to concrete local commands, CI jobs, qcloop items, or review steps;
  • preserve inspectable evidence refs for every passed, failed, blocked, exhausted, or waived verdict;
  • separate deterministic, runtime, surface, live-provider, release, and semantic-eval claims;
  • state limitations and waivers explicitly.

Compatible QC reports MUST NOT:

  • treat a final assistant answer as evidence without a linked artifact;
  • infer runtime success from UI text alone;
  • hide live-provider calls inside default deterministic tests;
  • collapse screenshots, traces, terminal snapshots, and protocol transcripts into one vague "UI checked" claim;
  • call a gate passed when required evidence is missing.
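
Two of the rules above lend themselves to a mechanical check: every evidence-backed verdict needs at least one ref, and a gate cannot be passed while required evidence is missing. The following is a minimal sketch, not part of the standard; the `GateVerdict` record and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Statuses for which the contract demands inspectable evidence refs.
EVIDENCE_BACKED = {"passed", "failed", "blocked", "exhausted", "waived"}


@dataclass
class GateVerdict:
    """Hypothetical record shape; the field names are illustrative only."""
    gate: str
    status: str
    evidence_refs: list = field(default_factory=list)
    required_evidence: list = field(default_factory=list)   # kinds the gate needs
    collected_evidence: list = field(default_factory=list)  # kinds actually stored


def contract_violations(verdicts):
    """Return readable violations of two rules from the lists above."""
    problems = []
    for v in verdicts:
        # MUST: preserve inspectable evidence refs for every such verdict.
        if v.status in EVIDENCE_BACKED and not v.evidence_refs:
            problems.append(f"{v.gate}: {v.status} verdict has no evidence refs")
        # MUST NOT: call a gate passed when required evidence is missing.
        missing = sorted(set(v.required_evidence) - set(v.collected_evidence))
        if v.status == "passed" and missing:
            problems.append(f"{v.gate}: passed but missing {', '.join(missing)}")
    return problems
```

A report generator could refuse to publish while `contract_violations` returns a non-empty list. The other rules, such as separating claim types, need human or schema-level checks that this sketch does not attempt.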

Lifecycle overview

change or release scope
  -> classify profiles
  -> identify touched surfaces
  -> assign fact owners and risk owners
  -> select gate lanes
  -> write behavior-level cases
  -> execute deterministic gates
  -> execute runtime and surface gates
  -> opt into live/release/eval gates when required
  -> collect evidence refs
  -> issue verdicts
  -> publish report, waivers, and next action

The flow applies to CLI agents, SDKs, MCP/tool gateways, channel bots, TUI/GUI/WebUI products, browser automation systems, schedulers, skills/plugins, distribution packages, and eval suites.
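
The flow above can be treated as an ordered list of stages. The sketch below assumes the arrows denote a strict order and that opt-in stages may be skipped; both are readings of the diagram rather than stated rules, and `first_out_of_order` is a hypothetical helper name.

```python
from typing import Optional

# Stage names copied from the flow above, in order.
LIFECYCLE = [
    "classify profiles",
    "identify touched surfaces",
    "assign fact owners and risk owners",
    "select gate lanes",
    "write behavior-level cases",
    "execute deterministic gates",
    "execute runtime and surface gates",
    "opt into live/release/eval gates when required",
    "collect evidence refs",
    "issue verdicts",
    "publish report, waivers, and next action",
]


def first_out_of_order(completed: list) -> Optional[str]:
    """Return the first stage that ran after a later stage, or None.

    Stages may be skipped; an unknown stage name raises ValueError.
    """
    last_index = -1
    for stage in completed:
        index = LIFECYCLE.index(stage)
        if index < last_index:
            return stage
        last_index = index
    return None
```

For example, a run log that issues verdicts before collecting evidence refs would be flagged at the "collect evidence refs" entry.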

Taxonomy dimensions

Project profile

Profiles describe owned project shape.

| Profile | Owns | Default risks |
| --- | --- | --- |
| agent-runtime-cli | agent loop, CLI, task execution, sandbox, tools, resume | stream drift, permissions, subprocess cleanup, resume consistency |
| agent-sdk-api | public SDK, generated client, API wrappers | signature drift, async cancellation, fake-server behavior |
| agent-tool-mcp-gateway | tool declarations, MCP/ACP bridge, connector runtime | protocol conformance, stdio/http recovery, resource permission |
| multi-channel-agent-gateway | chat/channel adapters, webhooks, auth, media | identity, webhook verification, media routing, secret redaction |
| agent-ui-tui-desktop | GUI, TUI, desktop shell, browser-visible flows | projection drift, stale success, bridge readiness, screenshots/traces |
| agent-skills-plugins | skills, plugins, manifests, loaders, marketplace | manifest drift, package boundary, trust policy, fixture install |
| background-agent-scheduler | cron, queues, workers, retries, long-running agents | duplicate work, lost checkpoints, race, stuck loop |
| agent-distribution-release | package, Docker, installers, cross-platform release | missing files, broken clean install, lock drift, supply chain |
| agent-evals-quality | task quality, model behavior, rubrics, generated outputs | prompt drift, judge instability, baseline regression, grounding gap |

Interaction surface

Surfaces describe where the behavior is observed.

| Surface | Use when | Required evidence |
| --- | --- | --- |
| cli-stream | stdout/stderr, JSONL/NDJSON, command UI | command, exit status, transcript, structured sample |
| tui | terminal UI, Ink, ratatui, curses | viewport, key sequence, terminal snapshot, runtime transcript |
| webui | browser dashboard, extension UI, QA/admin console | screenshot/trace, console log, route/state assertion |
| desktop-gui | Tauri, Electron, native shell | shell start, bridge health, workspace/session readiness, OS note |
| browser-automation | CDP, Playwright, browser-use, remote browser | DOM/a11y, screenshot, console/network, cleanup proof |
| channel-ui | chat app, QR, mobile, webhook-visible flows | channel transcript, media fixture, auth/webhook replay, redaction |
| eval-ui | QA dashboards and eval reports | rubric, judge output, baseline delta, reviewer note |
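
The per-surface evidence column can be used as a checklist, so a claim names exactly what is missing instead of a vague "UI checked". The sketch below copies the labels from the table. It treats each comma-separated entry, including slash forms such as "screenshot/trace", as one label, which is a simplifying assumption; `missing_surface_evidence` is a hypothetical helper.

```python
# Required evidence per surface, transcribed from the table above.
SURFACE_EVIDENCE = {
    "cli-stream": ["command", "exit status", "transcript", "structured sample"],
    "tui": ["viewport", "key sequence", "terminal snapshot", "runtime transcript"],
    "webui": ["screenshot/trace", "console log", "route/state assertion"],
    "desktop-gui": ["shell start", "bridge health",
                    "workspace/session readiness", "OS note"],
    "browser-automation": ["DOM/a11y", "screenshot", "console/network",
                           "cleanup proof"],
    "channel-ui": ["channel transcript", "media fixture",
                   "auth/webhook replay", "redaction"],
    "eval-ui": ["rubric", "judge output", "baseline delta", "reviewer note"],
}


def missing_surface_evidence(surface: str, collected: list) -> list:
    """List required evidence labels not yet collected, in table order.

    Raises KeyError for a surface that is not in the taxonomy.
    """
    return [label for label in SURFACE_EVIDENCE[surface] if label not in collected]
```

A TUI case that only stored a terminal snapshot would report the viewport, key sequence, and runtime transcript as outstanding.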

Gate family

Gate families describe validation style, not framework names.

| Family | Default use | Escalate when |
| --- | --- | --- |
| static | format, lint, type, schema, dependency hygiene | generated files or policy boundaries change |
| unit | deterministic local behavior | algorithms, parsers, reducers, adapters change |
| property-fuzz | invariants and generated input | parser, sandbox, path, protocol, serializer risk is high |
| contract-protocol | schema/API/command/tool surfaces | any wire shape, manifest, command, or SDK shape changes |
| fake-integration | local fake server or adapter flow | external API behavior is simulated |
| runtime-e2e | real CLI/task/session without live provider risk | loop, tool, permission, resume, subprocess flow changes |
| ui-interaction | GUI/TUI/WebUI/browser/channel visible behavior | users or operators observe the changed behavior |
| live-provider | opt-in real network/model/channel path | provider/channel behavior is part of the claim |
| stress-concurrency | races, queue, leases, retries, long runs | scheduler, parallel agents, workers, or locks change |
| distribution-release | package/install/Docker/OS matrix | anything shipped outside source changes |
| semantic-eval | task quality, prompt, rubric, judge | model behavior or output quality is the product |
| review | human/LLM review | safety, policy, UX, or semantic judgment is required |
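
Gate selection can be sketched as a lookup from change signals to gate families. The change tags below (`parser`, `wire-shape`, and so on) are hypothetical labels invented for this example. Treating `static` and `unit` as an always-on baseline is likewise an assumption, not something the table prescribes.

```python
# Assumed baseline: gates that run for every change in this sketch.
BASELINE = {"static", "unit"}

# Hypothetical change tags mapped to the families whose escalation
# condition in the table they are meant to approximate.
ESCALATION = {
    "parser": {"property-fuzz"},
    "wire-shape": {"contract-protocol"},
    "resume-flow": {"runtime-e2e"},
    "visible-ui": {"ui-interaction"},
    "scheduler": {"stress-concurrency"},
    "shipped-artifact": {"distribution-release"},
    "prompt": {"semantic-eval"},
}


def select_gates(change_tags: list) -> list:
    """Return the sorted gate families required for the given change tags.

    Unknown tags add nothing; they do not raise.
    """
    gates = set(BASELINE)
    for tag in change_tags:
        gates |= ESCALATION.get(tag, set())
    return sorted(gates)
```

The `live-provider` family is left out of the mapping on purpose, since the table describes it as opt-in rather than triggered automatically.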

Evidence kind

| Kind | Examples | Must include |
| --- | --- | --- |
| command-log | shell output, CI step, cargo/npm/pytest/vitest output | command, exit status, environment note |
| test-report | JUnit, JSON, coverage, HTML report | suite id, failing ids, artifact path or URL |
| protocol-transcript | fake server, MCP/ACP, WebSocket, HTTP transcript | request/response refs, redaction note |
| runtime-transcript | CLI JSONL, TUI-linked events, session state | run/session ids, event order, cleanup |
| surface-artifact | screenshot, video, Playwright trace, terminal snapshot | viewport/device/OS, action sequence |
| browser-diagnostic | console, network, DOM/a11y snapshot | route, selector or accessibility assertion |
| release-artifact | package manifest, tarball list, Docker smoke | version, platform, install command |
| eval-artifact | rubric, judge output, baseline diff | dataset, model/judge, threshold |
| review-note | human or LLM review | reviewer, scope, evidence refs, decision |
| qcloop-run | attempt and QC round refs | item value, attempt id, verifier feedback |

Verdict status

| Status | Meaning | Required fields |
| --- | --- | --- |
| passed | evidence proves all required expectations | evidence refs and scope |
| failed | evidence disproves an expectation or a gate failed | smallest actionable failure and evidence |
| blocked | missing environment, credential, dependency, fixture, or binary prevents judgment | blocker and owner |
| exhausted | attempts or budget ended without proof | attempt refs and remaining uncertainty |
| waived | accountable owner accepted known gap | approver, reason, scope, expiry |
| needs-review | evidence exists but judgment still needs semantic/safety review | reviewer or review queue |
| skipped | intentionally not applicable for this scope | reason and scope |
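
The required-fields column can be checked per status. In the sketch below each requirement is a tuple of acceptable keys, so "reviewer or review queue" is satisfied by either one. The snake_case key names are assumptions, since this page does not define a wire format for verdicts.

```python
# Each entry is a tuple of alternative keys; any one of them satisfies it.
REQUIRED_FIELDS = {
    "passed": [("evidence_refs",), ("scope",)],
    "failed": [("failure",), ("evidence_refs",)],
    "blocked": [("blocker",), ("owner",)],
    "exhausted": [("attempt_refs",), ("remaining_uncertainty",)],
    "waived": [("approver",), ("reason",), ("scope",), ("expiry",)],
    "needs-review": [("reviewer", "review_queue")],
    "skipped": [("reason",), ("scope",)],
}


def missing_fields(verdict: dict) -> list:
    """Return unmet requirements for the verdict's status.

    Alternatives are joined with "|" in the result. An unknown status
    raises KeyError; empty values count as missing.
    """
    missing = []
    for alternatives in REQUIRED_FIELDS[verdict["status"]]:
        if not any(verdict.get(key) for key in alternatives):
            missing.append("|".join(alternatives))
    return missing
```

A waiver recorded without an expiry would come back as `["expiry"]`, which matches the table's requirement that waivers stay time-bounded.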

Fact owners

Agent QC should name who owns each fact instead of treating the report as the owner of everything.

| Owner | Owns | QC responsibility |
| --- | --- | --- |
| Runtime | task/session/tool/permission state | capture transcript and state refs |
| Protocol/SDK | schemas, generated clients, adapters | capture contract diff and fake transcript |
| UI projection | visible rendering and user controls | capture surface artifact and runtime linkage |
| Evidence service | durable traces, replay, reviews | link evidence ids and export jobs |
| Policy/security | approvals, waivers, credentials, retention | record risk decision and scope |
| Artifact/release | deliverables, package contents, versions | capture manifest and install proof |
| Scheduler | leases, checkpoints, retries, workers | capture timeline and duplicate-work proof |
| Eval system | rubrics, judge outputs, baselines | capture dataset, threshold, and deltas |

Standard case envelope

A portable qc_case should carry these fields even when the JSON schema allows extension.

| Field | Required | Purpose |
| --- | --- | --- |
| id | yes | stable case id |
| project_profile | yes | one profile from the taxonomy |
| surface | recommended for visible cases | observation surface |
| target | yes | file, command, package, flow, API, or release target |
| risk_owner | recommended | runtime, protocol, UI, scheduler, release, eval, policy |
| required_gates | yes | gate families to satisfy |
| steps | yes | reproducible commands or interactions |
| expected | yes | behavior-level expectations |
| required_evidence | yes | artifacts needed for verdict |
| live_policy | conditional | opt-in, credential scope, redaction, budget |
| waiver_policy | conditional | owner, reason, expiry rules |
| verdict | after run | status and evidence refs |
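
As an illustration, a `qc_case` can be held as a plain dictionary and checked against the fields marked "yes". The case content below, including its id and the `agent run` command, is invented for the example. Reading "conditional" as "`live_policy` is needed once a `live-provider` gate is requested" is also an assumption.

```python
# Fields the table marks as required ("yes").
REQUIRED = ("id", "project_profile", "target", "required_gates",
            "steps", "expected", "required_evidence")


def envelope_problems(case: dict) -> list:
    """Return problems with a qc_case envelope; an empty list means it is usable."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED if name not in case]
    # Assumed meaning of "conditional": a live gate needs a live_policy.
    if "live-provider" in case.get("required_gates", []) and "live_policy" not in case:
        problems.append("live-provider gate requested without live_policy")
    return problems


# Hypothetical case: every value here is illustrative, not normative.
example_case = {
    "id": "cli-permission-denial-001",
    "project_profile": "agent-runtime-cli",
    "surface": "cli-stream",
    "target": "permission prompt in the task loop",
    "risk_owner": "runtime",
    "required_gates": ["unit", "runtime-e2e"],
    "steps": ["agent run --task fixtures/denied.json"],  # invented command
    "expected": ["denied tool call is reported and the task stops cleanly"],
    "required_evidence": ["command-log", "runtime-transcript"],
}
```

Extra keys are tolerated by this check, which matches the note that the JSON schema may allow extension.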

Standard report envelope

A portable QC report should answer:

| Field | Question |
| --- | --- |
| Scope | What change, release, or regression sweep is being judged? |
| Profiles | Which project profiles apply? |
| Surfaces | Which user/operator surfaces were touched? |
| Required gates | Which gates were required and why? |
| Executed gates | Which commands, CI jobs, qcloop runs, or reviews ran? |
| Evidence refs | Where are logs, traces, screenshots, transcripts, reports, and reviews? |
| Verdicts | Which cases passed, failed, blocked, exhausted, waived, or need review? |
| Remaining risk | What still should not be claimed? |
| Next action | Fix, rerun, review, release, or waive? |
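
The Verdicts and Next action fields can be derived from per-case statuses. The sketch below tallies statuses and proposes a next action using a priority order (failed, then blocked or exhausted, then needs-review). That order is an assumption made for illustration; the standard lists the possible actions but does not rank them.

```python
from collections import Counter


def summarize(case_statuses: dict) -> dict:
    """Tally verdict statuses by case id and suggest a next action.

    The priority order is an assumed policy, not part of the standard.
    """
    counts = Counter(case_statuses.values())
    if counts["failed"]:
        action = "fix"
    elif counts["blocked"] or counts["exhausted"]:
        action = "rerun"
    elif counts["needs-review"]:
        action = "review"
    else:
        action = "release"
    return {"counts": dict(counts), "next_action": action}
```

A project might instead route blocked cases to a waiver decision. The suggested action is only a starting point for whoever owns the report.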

Validation cases for the standard itself

A project can claim Agent QC compatibility only if these cases are representable:

  1. Codex-like runtime permission denial with CLI transcript, protocol event, and TUI row.
  2. Claude Code-like remote permission request with WebSocket/control transcript and TUI prompt.
  3. OpenClaw-like channel webhook replay with media fixture and redacted credential policy.
  4. Hermes-like scheduler restart with deterministic time, checkpoint, and duplicate-work proof.
  5. Desktop GUI native-bridge change with bridge health, workspace readiness, screenshot, and command-contract proof.
  6. Browser automation flow with DOM/a11y, screenshot, console/network, and cleanup evidence.
  7. Release smoke with package manifest, clean install, and platform note.
  8. Semantic eval regression with rubric, judge output, baseline delta, and reviewer note.

Draft standard for evidence-driven quality control of Agent projects.