Appearance
Quickstart
Use this flow when an agent, maintainer, CI job, or qcloop run needs to test an Agent project. The flow is portable across runtime, UI, gateway, scheduler, skill, eval, and release product shapes.
1. Define the QC scope
Write one sentence that names what is being judged.
Examples:
- "Release v1.4.0 of a Codex-like runtime CLI."
- "A Claude Code-like TUI remote-permission change."
- "An OpenClaw-like channel gateway auth/media change."
- "A Hermes-like scheduler restart fix."
- "A desktop GUI native-bridge change."
The scope decides which claims the report may make. Do not let a broad report imply untested surfaces.
2. Classify the project profile
Pick one or more profiles:
agent-runtime-cliagent-sdk-apiagent-tool-mcp-gatewaymulti-channel-agent-gatewayagent-ui-tui-desktopagent-skills-pluginsbackground-agent-scheduleragent-distribution-releaseagent-evals-quality
Classify by owned risk, not language. A Python project can own browser automation; a Rust project can own a TUI; a docs site can own a distribution/release surface.
3. Identify touched surfaces
Name where a user or operator observes the behavior:
cli-streamtuiwebuidesktop-guibrowser-automationchannel-uieval-ui
If behavior is visible, include qc_case.surface. A UI pass without surface evidence is incomplete.
4. Assign fact owners
For each case, decide which system owns the fact:
| Fact | Owner example | Evidence |
|---|---|---|
| runtime accepted work | agent runtime | stream or session transcript |
| tool call succeeded/failed | tool runtime or protocol adapter | tool id, progress, result/error |
| approval was resolved | policy/runtime action API | request id and response transcript |
| UI rendered a state | UI projection | screenshot, trace, terminal snapshot |
| artifact was created | artifact/release service | manifest, file, version, export ref |
| verdict passed | Agent QC report | linked evidence refs |
This prevents the common Agent UI failure where visible text is treated as runtime truth.
5. Select gate lanes
Use the gate matrix. At minimum, separate these lanes:
- deterministic local gates:
static,unit,contract-protocol,fake-integration; - runtime gates:
runtime-e2e,stress-concurrencywhen needed; - surface gates:
ui-interactionwith a named surface; - live gates:
live-provideronly with explicit opt-in; - release gates:
distribution-releasewhen anything is shipped; - semantic gates:
semantic-evalandreviewwhen quality judgment matters.
6. Write behavior-first cases
Each qc_case should state:
- behavior to prove;
- profile and surface;
- exact steps or commands;
- expected result;
- required gates;
- required evidence;
- status mapping for fail, blocked, exhausted, waived, and needs-review.
Avoid cases like "component exists". Prefer cases like "user denies a tool call; runtime records denial; TUI removes pending approval; no side effect occurs".
7. Define evidence policy
Before running, decide what counts as proof.
| Claim | Evidence |
|---|---|
| CLI stream works | command, exit status, stdout/stderr transcript, structured event sample |
| TUI works | viewport, key sequence, terminal snapshot, runtime transcript |
| WebUI works | Playwright/browser trace, screenshot, console/network log, route assertion |
| Desktop GUI works | shell start, bridge health, workspace readiness, screenshot, OS note |
| Browser automation works | DOM/a11y snapshot, screenshot, console/network, cleanup proof |
| Channel works | webhook replay, media fixture, auth proof, redacted channel transcript |
| Scheduler works | deterministic time/env, checkpoint, lease/reclaim, duplicate-work proof |
| Release works | package manifest, clean install, Docker/OS matrix, version output |
| Eval works | rubric, judge output, baseline delta, reviewer note |
See Evidence contract.
8. Use qcloop for repeatable independent checks
Use qcloop for many similar cases: files, channels, providers, commands, prompts, packages, or regression examples.
Do not use qcloop to hide missing project gates. A qcloop batch may produce verdicts, but bridge health, GUI smoke, package install smoke, and live-provider opt-in still need their own evidence.
9. Run gates from cheap to risky
A practical order is:
staticand schema checks;- targeted unit or contract tests;
- fake integration and runtime e2e;
- surface smoke or Playwright/TUI evidence;
- stress/concurrency where relevant;
- live provider or channel tests only with opt-in;
- release/install/Docker checks;
- semantic eval and review.
Stop early only when the failure makes downstream evidence meaningless. Otherwise record partial evidence and mark remaining gates accurately.
10. Report verdicts and limits
A complete report includes:
- scope;
- profiles and surfaces;
- gates required and gates executed;
- evidence refs;
- verdicts by case;
- blockers, exhausted attempts, waivers, and needs-review items;
- remaining risk;
- next action.
A report is not complete when it says only "tests passed" or "the agent checked it".
Minimal plan skeleton
json
{
"schema_version": "0.4.0",
"project": "example-agent",
"project_profiles": ["agent-runtime-cli", "agent-ui-tui-desktop"],
"required_gates": ["static", "contract-protocol", "runtime-e2e", "ui-interaction"],
"cases": [
{
"id": "permission-denial-tui",
"project_profile": "agent-ui-tui-desktop",
"surface": "tui",
"target": "remote permission prompt",
"required_gates": ["contract-protocol", "runtime-e2e", "ui-interaction"],
"steps": ["start fake runtime", "trigger high-risk tool", "deny approval"],
"expected": ["runtime records denial", "TUI removes pending prompt", "no side effect occurs"],
"required_evidence": ["protocol transcript", "terminal snapshot", "runtime transcript"]
}
]
}1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18