Test techniques and compositions

Agent QC gate families describe why a boundary must be checked. Test techniques describe how evidence is produced. Strong Agent QC plans combine techniques instead of relying on one broad command.

Use this page when a plan says only "run tests" or "do UI smoke" and you need a richer, inspectable strategy for Agent runtime, Agent UI, skills/plugins, browser automation, channel gateways, or release packages.

Evidence braid rule

A high-confidence Agent test usually braids five strands:

```text
white-box invariant -> protocol/contract -> black-box run -> surface artifact -> cleanup/review
```

Not every case needs all five, but every pass must state which strands are present and which claims remain unproven.
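One way to make that statement mechanical is to record the strands on every pass. The following is a minimal sketch only; `STRANDS` and `BraidReport` are hypothetical names for illustration and are not part of Agent QC.

```python
from dataclasses import dataclass, field

# The five strands named in the braid rule, in braid order.
STRANDS = (
    "white-box invariant",
    "protocol/contract",
    "black-box run",
    "surface artifact",
    "cleanup/review",
)


@dataclass
class BraidReport:
    """Hypothetical record of which strands back a passing case."""

    claim: str
    present: set = field(default_factory=set)
    unproven_claims: list = field(default_factory=list)

    def missing(self):
        # Strands this pass does not cover, in braid order.
        return [s for s in STRANDS if s not in self.present]

    def summary(self):
        unknown = self.present - set(STRANDS)
        if unknown:
            raise ValueError(f"unknown strands: {sorted(unknown)}")
        return {
            "claim": self.claim,
            "present": [s for s in STRANDS if s in self.present],
            "missing": self.missing(),
            "unproven_claims": list(self.unproven_claims),
        }


report = BraidReport(
    claim="approval overlay denies a file write",
    present={"white-box invariant", "black-box run"},
    unproven_claims=["overlay renders correctly in narrow terminals"],
)
print(report.summary()["missing"])
```

A reviewer can then reject any pass whose summary omits the `missing` and `unproven_claims` entries, rather than inferring coverage from a green check.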

Technique taxonomy

| Technique | What it proves | Required evidence | What it does not prove alone |
| --- | --- | --- | --- |
| Static/policy check | Formatting, types, import boundaries, generated drift, forbidden APIs | command log, SARIF or lint report, tool version | runtime behavior or UX |
| White-box unit test | Reducers, parsers, serializers, permission decisions, state machines | test report, fixture ids, assertion diffs | packaged app or user-visible behavior |
| Property/fuzz/metamorphic test | Invariants over large or generated input sets | seed, corpus, minimized failure, invariant text | exact user flow |
| Golden transcript | Stable CLI/runtime/protocol/event output shape | transcript file, update diff, dynamic-field normalization | visual layout or live provider quality |
| Snapshot test | Stable rendered output or serialized object | snapshot diff, viewport/device, update review | correctness of the source runtime fact |
| Contract/protocol test | Schema, tool declarations, SDK/API, manifest, transport behavior | schema diff, fake server transcript, generated artifact check | actual live provider behavior |
| Fake integration | Adapter/runtime behavior against a controlled local service | fake server log, request/response refs, fixture version | real provider drift |
| Black-box smoke | Minimal delivered behavior through public entrypoint | command/browser/app/channel log, exit status, screenshot when visible | deep edge cases |
| Runtime E2E | Agent loop, tools, permissions, resume, cleanup | runtime transcript, state snapshot, side-effect proof | UI projection unless linked |
| Surface E2E | User/operator can see and control the behavior | screenshot/trace/terminal frame, key/click/message sequence | underlying runtime truth unless linked |
| Replay/regression | Past failure remains fixed | replay fixture, old bug id, expected failure mode | new unknown failures |
| Stress/concurrency/chaos | Race, lease, retry, cancellation, long-running resilience | worker timeline, seed/config, duration, cleanup | semantic answer quality |
| Security/adversarial | Permission, prompt injection, path, SSRF, secret, policy boundaries | attack fixture, denial transcript, side-effect check | happy path usability |
| Semantic eval | Output quality, grounding, tool choice, policy adherence | dataset, rubric, model/judge, baseline delta | deterministic code correctness |
| Release/install smoke | Shipped artifact can install and run outside the source tree | package manifest, clean install, Docker/OS log, version output | source tree test coverage |

Black-box, white-box, and gray-box

| Mode | Agent QC use | Best targets | Evidence pattern |
| --- | --- | --- | --- |
| White-box | Prove internal invariants before a user flow exists | event reducers, permission policy, tool args sanitizer, stream parser, scheduler lease | unit/property report plus fixture ids |
| Black-box | Prove delivered behavior through public entrypoint | CLI command, SDK call, TUI flow, WebUI route, desktop shell, webhook, package install | command or interaction transcript plus exit/status and artifacts |
| Gray-box | Combine public behavior with internal instrumentation | runtime UI, browser agent, channel gateway, background scheduler | black-box run plus protocol/runtime transcript and state snapshot |

Agent projects need gray-box testing more often than ordinary apps because the visible output can be plausible while the runtime state is wrong.

Snapshot standards

Snapshots are useful only when they are scoped and reviewable.

| Snapshot kind | Use it for | Must include |
| --- | --- | --- |
| Text/golden transcript | CLI output, JSONL/NDJSON stream, model event normalization | stable fixture, exit status, dynamic id redaction |
| Terminal snapshot | TUI frame, approval overlay, footer/status row, composer | terminal size, key sequence, ANSI/Unicode policy |
| DOM/ARIA snapshot | WebUI accessibility tree, browser-mode component state | route, viewport/device, locator or role assertion |
| Screenshot/video | GUI/desktop/browser/channel report surface | action sequence, OS/browser/device, console/network note |
| Protocol/schema snapshot | generated schema, SDK wire contract, MCP/tool declaration | generator command, diff, compatibility note |
| Runtime state snapshot | session/thread/turn/tool/artifact/scheduler state | correlation ids, timestamp policy, cleanup note |
| Package manifest snapshot | tarball/image/install contents | version, platform, file allow/deny policy |

Snapshot rules:

  • Normalize timestamps, random ids, temp paths, and provider-specific text before snapshotting.
  • Review snapshot updates as product changes, not as mechanical noise.
  • Pair UI snapshots with runtime/protocol transcripts when the claim is more than visual layout.
  • Pair protocol snapshots with fake integration when the claim is more than schema shape.
  • Keep one focused snapshot per behavior; avoid giant snapshots that hide meaningful diffs.
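The first rule can be sketched as a small normalizer that runs before a transcript is compared with its golden file. This is an illustrative sketch, not a prescribed implementation: the placeholder tokens, the temp-path prefixes, and the `normalize` name are assumptions, and provider-specific text would need project-specific rules that are not shown here.

```python
import re

# Each rule maps a volatile pattern to a stable placeholder (placeholders are assumptions).
_RULES = [
    # ISO 8601 timestamps such as 2024-05-01T12:34:56.789Z or with a +02:00 offset.
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"),
     "<TIMESTAMP>"),
    # Canonical 8-4-4-4-12 UUIDs.
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"),
     "<UUID>"),
    # Temp paths under /tmp or /var/folders, up to the next quote or whitespace.
    (re.compile(r"(?:/tmp|/var/folders)/[^\s\"']+"), "<TMP_PATH>"),
]


def normalize(text):
    """Replace volatile fields so two runs of the same behavior compare equal."""
    for pattern, placeholder in _RULES:
        text = pattern.sub(placeholder, text)
    return text


line = ('{"id": "3f2b8c1e-9a4d-4e6f-8b2a-1c3d5e7f9a0b", '
        '"ts": "2024-05-01T12:34:56.789Z", "cwd": "/tmp/agent-x7k2/out.json"}')
print(normalize(line))
```

Keeping the rule list short and explicit makes a snapshot update reviewable: a new rule is itself a change that a reviewer can question.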

Codex-style TUI testing shows the value of terminal snapshots for approval overlays, footer modes, picker widths, request forms, narrow terminal heights, and diff/code blocks. Hermes-style TUI testing adds terminal mechanics such as OSC52, virtual history, Unicode, streaming markdown, queue state, and session lifecycle. Inspecting Claude Code-style local source shows that Ink TUI, remote permission, WebSocket control, and SDK stream adapters need a snapshot plus a control transcript, not a snapshot alone.

Smoke test ladder

Smoke tests are fast confidence checks. They do not replace runtime, contract, or surface evidence.

| Smoke level | Purpose | Examples | Exit rule |
| --- | --- | --- | --- |
| Import/build smoke | Prove package imports or builds | `cargo test -p crate`, `vitest run`, `python -m package --help` | fail fast on syntax/link/import break |
| Runtime smoke | Prove the agent loop starts with fake/local provider | `agent exec "hello"`, fake tool call, MCP list tools | transcript shows terminal status and cleanup |
| Surface smoke | Prove visible shell can open and reflect runtime state | TUI first frame, WebUI route, desktop bridge health, channel webhook replay | surface artifact plus runtime backing |
| Release smoke | Prove artifact works outside source tree | clean install, Docker start, package help/version | install log and manifest match release |
| Canary/live smoke | Prove real provider/channel still works | opt-in provider call, live channel ping, model profile probe | redacted transcript, budget, credential scope |

Use smoke for broad detection and then use targeted tests for diagnosis.

Testing Agent runtime

Runtime tests should treat the agent as a state machine, not as a text generator.

Minimum runtime invariants:

| Runtime area | Required cases | Evidence |
| --- | --- | --- |
| Turn lifecycle | accepted, queued, running, completed, failed, cancelled | event transcript, terminal status, exit code |
| Stream shape | partial text, reasoning/tool events, final text, terminal marker | JSONL/SSE fixture, parser report, golden transcript |
| Tool execution | declaration, argument validation, progress, result, error | tool id correlation, fake tool transcript, side-effect check |
| Permission/HITL | allow, deny, edit/input, timeout, cancel, reconnect | approval request/response transcript, surface frame |
| Files/processes | cwd, sandbox, patch/write, subprocess tree, cleanup | command log, path fixture, orphan-process proof |
| Resume/persistence | old session, crash/restart, checkpoint, artifact refs | state snapshot, replay transcript, cleanup note |
| Scheduler/parallelism | lease, retry, fanout/fanin, duplicate-work prevention | deterministic clock, worker timeline, stress/chaos result |
| Credential/provider scope | fake by default, live opt-in, redaction, budget | env scope, redacted request/response, waiver if missing |
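Treating the turn as a state machine means the test asserts on the order of status events, not only on the last message. The sketch below checks one turn's ordered statuses against a transition table. The status names come from the turn lifecycle row; the allowed transitions and the `check_turn` helper are assumptions for illustration, and a real runtime may permit a different graph.

```python
# Assumed transition graph; adjust to the runtime under test.
ALLOWED = {
    "accepted": {"queued", "running", "failed", "cancelled"},
    "queued": {"running", "failed", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
}
TERMINAL = {"completed", "failed", "cancelled"}


def check_turn(statuses):
    """Return a list of violations for one turn's ordered status events."""
    if not statuses:
        return ["no events recorded"]
    problems = []
    if statuses[0] != "accepted":
        problems.append(f"first status is {statuses[0]!r}, expected 'accepted'")
    for prev, cur in zip(statuses, statuses[1:]):
        if prev in TERMINAL:
            problems.append(f"event {cur!r} after terminal status {prev!r}")
        elif cur not in ALLOWED.get(prev, set()):
            problems.append(f"illegal transition {prev!r} -> {cur!r}")
    if statuses[-1] not in TERMINAL:
        problems.append(f"turn ended in non-terminal status {statuses[-1]!r}")
    return problems


print(check_turn(["accepted", "queued", "running", "completed"]))
print(check_turn(["accepted", "running"]))
```

The same helper can run over fixtures for the failed and cancelled paths, so the negative cases share one oracle with the happy path.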

Runtime anti-patterns:

  • asserting only final assistant text;
  • hiding provider calls inside default unit tests;
  • testing tool declaration but not invocation and failure;
  • testing success but not deny/cancel/abort/resume;
  • omitting cleanup proof for subprocesses, browsers, workers, or temp state.

Testing Agent UI

Agent UI tests must prove that visible surfaces are runtime-backed projections.

| UI area | What to test | Strong evidence |
| --- | --- | --- |
| Composer/input | submit, queued input, steer-current, attachments, paste, slash commands | key/click sequence, runtime input id, snapshot |
| Status | first status before text, retrying, blocked, failed, done | runtime event order, UI frame, timing metric |
| Tool cards | safe arg summary, progress, result, error, offload refs | tool id correlation, screenshot/terminal snapshot, transcript |
| Approval/HITL | pending, allow, deny, edit, timeout, cancellation | action request/response transcript, keyboard/a11y proof |
| Artifacts | create, diff, preview, export, failed save | artifact id/path, UI snapshot, export log |
| Evidence/replay | trace links, report export, old-session hydration | evidence ids, report screenshot, hydration log |
| Team/background | queued worker, running worker, failed/retried worker, handoff | delegation graph, task card snapshot, worker transcript |
| Empty/stale states | missing facts, bridge unavailable, reconnecting, blocked | safe fallback frame, console/network log, runtime state ref |
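The "tool id correlation" evidence in the tool cards row can be checked by joining a UI snapshot against the runtime transcript. This is a sketch under stated assumptions: the `tool_id` and `status` field names and the `unbacked_cards` helper are hypothetical, and a real project would read them from its own snapshot and event schema.

```python
def unbacked_cards(ui_cards, runtime_events):
    """Return (tool_id, reason) for UI tool cards the runtime transcript does not support."""
    # Keep the last status the runtime reported for each tool call id.
    latest = {}
    for event in runtime_events:
        latest[event["tool_id"]] = event["status"]

    problems = []
    for card in ui_cards:
        actual = latest.get(card["tool_id"])
        if actual is None:
            problems.append((card["tool_id"], "no runtime event"))
        elif actual != card["status"]:
            problems.append(
                (card["tool_id"], f"UI shows {card['status']!r}, runtime says {actual!r}")
            )
    return problems


runtime_events = [
    {"tool_id": "t1", "status": "running"},
    {"tool_id": "t1", "status": "completed"},
    {"tool_id": "t2", "status": "failed"},
]
ui_cards = [
    {"tool_id": "t1", "status": "completed"},
    {"tool_id": "t2", "status": "completed"},  # plausible on screen, wrong in the runtime
    {"tool_id": "t3", "status": "running"},    # card with no runtime backing
]
for problem in unbacked_cards(ui_cards, runtime_events):
    print(problem)
```

A screenshot of these three cards would look healthy; only the join exposes that two of them are not runtime-backed projections.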

Surface-specific upgrades:

  • TUI: multi-viewport, ANSI/Unicode width, Ctrl-C vs Esc semantics, resize, clipboard/OSC52 if supported.
  • WebUI: browser trace, DOM/ARIA snapshot, console/network, reload/resume, keyboard/a11y.
  • Desktop GUI: app shell start, bridge health, workspace readiness, native command contract, OS note.
  • Browser automation: screenshot plus DOM/a11y, console/network, unsafe navigation/SSRF fixtures, orphan cleanup.
  • Channel/mobile: webhook replay, media fixture, auth proof, redacted transcript, device/emulator logs.

Testing skills and plugins

Agent Skills-style systems need their own lifecycle tests. The standard lesson is progressive disclosure: a skill is a small package with metadata, instructions, optional scripts/assets, and evaluation evidence. Testing should follow that shape.

| Skill/plugin phase | Tests | Evidence |
| --- | --- | --- |
| Manifest/frontmatter | required fields, name/description, when-to-use, paths/hooks if supported | schema report, parse failure fixtures |
| Discovery/loading | user/project/bundled precedence, symlink canonicalization, duplicate names, disabled settings | loader transcript, fixture directory tree |
| Context budget | frontmatter-only routing, lazy loading, token/size limits | token estimate, selected skill list, rejection evidence |
| Scripts/assets | script existence, executable bit, relative path resolution, clean temp dir, no raw secrets | dry-run log, sandbox/env scope, asset manifest |
| Trust boundary | local vs managed vs remote/MCP skill policy, path traversal, hook restrictions | policy test, denial transcript, audit note |
| Runtime effect | skill changes allowed tools/prompts only through owning API | runtime event, tool declaration diff, UI status |
| Evaluation | clean-context task, assertion grading, transcript, human feedback loop | eval rubric, attempt transcripts, verifier output |
| Packaging/release | package contents, install fixture, marketplace/registry metadata | manifest snapshot, install smoke, version check |

Claude Code's local source exposes useful loader concerns: the SKILL.md directory format, frontmatter parsing, hooks validation, path frontmatter, symlink canonicalization, token estimation, duplicate detection, and treating remote MCP skills as untrusted. Agent QC generalizes those as skill/plugin gates; it does not require Claude Code's exact implementation.
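A discovery fixture for symlink canonicalization and duplicate names might look like the sketch below. It is not Claude Code's loader: `discover_skills` is a hypothetical helper, the skill name is assumed to be the directory name, and the caller is assumed to pass roots in whatever precedence order the project defines. It relies on `os.symlink`, so it assumes a platform where symlinks can be created.

```python
import os
import tempfile


def discover_skills(roots):
    """Scan roots in precedence order; return (selected, duplicates)."""
    selected = {}     # skill name -> canonical directory that won
    duplicates = []   # (name, canonical directory that was shadowed)
    seen_dirs = set()
    for root in roots:
        if not os.path.isdir(root):
            continue
        for entry in sorted(os.listdir(root)):
            # Canonicalize so a symlinked skill is not counted twice.
            skill_dir = os.path.realpath(os.path.join(root, entry))
            if not os.path.isfile(os.path.join(skill_dir, "SKILL.md")):
                continue
            if skill_dir in seen_dirs:
                continue
            seen_dirs.add(skill_dir)
            if entry in selected:
                duplicates.append((entry, skill_dir))
            else:
                selected[entry] = skill_dir
    return selected, duplicates


with tempfile.TemporaryDirectory() as tmp:
    project = os.path.join(tmp, "project")
    user = os.path.join(tmp, "user")
    for root in (project, user):
        os.makedirs(os.path.join(root, "review"))
        with open(os.path.join(root, "review", "SKILL.md"), "w") as f:
            f.write("---\nname: review\n---\n")
    # A symlink in the user root that points back at the project skill.
    os.symlink(os.path.join(project, "review"), os.path.join(user, "alias"))

    selected, duplicates = discover_skills([project, user])
    print(sorted(selected))
    print([name for name, _ in duplicates])
```

The fixture directory tree and the returned `duplicates` list together serve as the loader transcript for this case.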

Advanced composition recipes

Runtime + UI evidence braid

Use this when a runtime fact is visible in a TUI, WebUI, or desktop GUI.

```text
contract-protocol
  -> fake runtime transcript
  -> black-box user action
  -> surface snapshot/trace
  -> state snapshot + cleanup
```

Example claims: approval overlay, tool card progress, bridge health, queued worker state.

TUI approval braid

```text
white-box permission resolver
  -> protocol action_request fixture
  -> pseudo-terminal key sequence
  -> terminal snapshots for pending/allow/deny/cancel
  -> side-effect denial check
  -> subprocess cleanup
```

Add multi-viewport, Unicode/ANSI, Ctrl-C/Esc, and reconnect variants when the TUI is a core product surface.

Provider adapter ladder

```text
normalizer unit tests
  -> contract/schema snapshot
  -> fake provider replay
  -> runtime E2E with fake provider
  -> opt-in live canary
  -> semantic eval and reviewer note
```

Use this for LLM providers, browser providers, search providers, channel providers, or gateway backends.

Browser agent safety braid

```text
URL/path policy unit tests
  -> SSRF/file/credential attack fixtures
  -> Playwright/browser trace with DOM+a11y snapshot
  -> console/network log inspection
  -> orphan browser/tab cleanup proof
```

A screenshot-only pass is insufficient for browser automation.
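The first strand of this braid, the URL policy unit test, can be sketched with the standard library alone. The policy below is an assumption for illustration (`navigation_denial` and its rules are hypothetical): it checks only the literal URL, so DNS rebinding, redirects, and alternate IP encodings still need the fixture and trace strands further down the braid.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def navigation_denial(url):
    """Return a denial reason for url, or None if this sketch policy allows it."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return "unparseable URL"
    if parts.scheme not in ALLOWED_SCHEMES:
        return f"scheme {parts.scheme!r} not allowed"
    if parts.username or parts.password:
        return "embedded credentials not allowed"
    host = parts.hostname
    if not host:
        return "missing host"
    if host == "localhost" or host.endswith(".localhost"):
        return "loopback hostname"
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: resolution-time checks are out of scope for this sketch.
        return None
    if address.is_loopback or address.is_private or address.is_link_local:
        return f"non-public address {address}"
    return None


for candidate in (
    "https://example.com/docs",
    "file:///etc/passwd",
    "http://169.254.169.254/latest/meta-data/",
    "http://[::1]:8080/",
):
    print(candidate, "->", navigation_denial(candidate))
```

Each denied URL doubles as an attack fixture: the browser-level test replays it and asserts that no navigation and no network request occurred.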

Channel gateway braid

```text
auth verifier unit test
  -> webhook replay before body parsing
  -> media fixture and redaction check
  -> fake channel send transcript
  -> optional live channel canary
  -> report redaction review
```

Use separate gates for channel contract, media handling, live transport, and semantic model quality.
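The first two strands of the channel gateway braid (auth verifier, then replay before body parsing) can be illustrated with an HMAC check over the raw bytes. This is a sketch that assumes a hex-encoded HMAC-SHA256 signature scheme; real channels define their own header names and signing formats, and `handle_webhook` is a hypothetical handler.

```python
import hashlib
import hmac
import json


def sign(secret, raw_body):
    """Hex HMAC-SHA256 over the raw request bytes (assumed scheme)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def handle_webhook(secret, raw_body, signature_hex, parse=json.loads):
    """Verify the signature first; only authenticated bytes reach the parser."""
    expected = sign(secret, raw_body)
    # compare_digest avoids leaking the match position through timing.
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return {"status": 401, "payload": None}
    return {"status": 200, "payload": parse(raw_body)}


secret = b"test-secret"
body = b'{"event": "message", "text": "hi"}'
good_signature = sign(secret, body)

parsed_bodies = []  # spy: records every body the parser was asked to read


def spy_parse(raw):
    parsed_bodies.append(raw)
    return json.loads(raw)


ok = handle_webhook(secret, body, good_signature, spy_parse)
tampered = handle_webhook(secret, body + b" ", good_signature, spy_parse)
print(ok["status"], tampered["status"], len(parsed_bodies))
```

The spy parser is the point of the replay: the test proves not only that the tampered request was rejected, but that its body was never parsed.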

Scheduler/recovery braid

```text
deterministic clock unit test
  -> lease/checkpoint fake integration
  -> crash/restart replay
  -> concurrency stress or chaos kill
  -> duplicate-work oracle
  -> cleanup and ownership report
```

This is mandatory for background agents, multi-agent workers, and long-running jobs.
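The first strand, a deterministic clock unit test, can be sketched with an injected fake clock so that lease expiry never depends on wall time. `FakeClock` and `LeaseTable` are hypothetical names, and the in-memory lease semantics are an assumption; a real scheduler would persist leases and handle renewal.

```python
class FakeClock:
    """Manually advanced clock so expiry is reproducible."""

    def __init__(self, now=0.0):
        self.now = now

    def advance(self, seconds):
        self.now += seconds


class LeaseTable:
    """In-memory leases: one holder per job until the lease expires."""

    def __init__(self, clock, ttl):
        self.clock = clock
        self.ttl = ttl
        self._leases = {}  # job id -> (worker id, expiry time)

    def acquire(self, job_id, worker_id):
        holder = self._leases.get(job_id)
        live = holder is not None and holder[1] > self.clock.now
        if live and holder[0] != worker_id:
            return False  # another worker still owns the job
        # A fresh claim, a renewal by the holder, or a reclaim after expiry.
        self._leases[job_id] = (worker_id, self.clock.now + self.ttl)
        return True


clock = FakeClock()
leases = LeaseTable(clock, ttl=30)
results = [
    leases.acquire("job-1", "worker-a"),  # first claim
    leases.acquire("job-1", "worker-b"),  # contended while the lease is live
]
clock.advance(31)  # worker-a stalls past its lease
results.append(leases.acquire("job-1", "worker-b"))  # reclaim after expiry
results.append(leases.acquire("job-1", "worker-a"))  # stalled worker must not resume
print(results)
```

The final `False` is the duplicate-work oracle in miniature: a worker that lost its lease cannot quietly continue the job.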

Skill/plugin lifecycle braid

```text
manifest schema
  -> discovery/precedence fixture
  -> script/asset dry run in clean temp dir
  -> trust boundary denial tests
  -> clean-context skill eval
  -> package/install smoke
```

Use assertion grading and transcripts for skill quality, not only a lint pass.

Release confidence braid

```text
source tests
  -> generated/lock drift check
  -> package manifest snapshot
  -> clean install smoke
  -> first-run runtime smoke
  -> OS/Docker matrix
  -> live canary if advertised
```

A release claim is about the artifact, not only the repository.
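The package manifest strand can be sketched as an allow/deny check over the file list of the built artifact. The patterns, file names, and the `manifest_violations` helper are hypothetical; in practice the file list would come from the actual tarball or image, not from the source tree.

```python
from fnmatch import fnmatchcase


def manifest_violations(files, allow, deny):
    """Return (path, reason) for files that break the allow/deny policy."""
    violations = []
    for path in sorted(files):
        if any(fnmatchcase(path, pattern) for pattern in deny):
            violations.append((path, "denied"))
        elif not any(fnmatchcase(path, pattern) for pattern in allow):
            violations.append((path, "not allowed"))
    return violations


# Hypothetical contents listed from a built package.
files = [
    "package.json",
    "dist/cli.js",
    "dist/.env",
    "scripts/dev.sh",
    "tests/fixtures/key.pem",
]
allow = ["package.json", "dist/*"]
deny = ["*.env", "*.pem"]

for violation in manifest_violations(files, allow, deny):
    print(violation)
```

Deny rules are evaluated first, so a secret that lands inside an allowed directory is still reported.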

Technique selection matrix

| Claim | Minimum techniques | Stronger composition |
| --- | --- | --- |
| Runtime command works | black-box command smoke, exit status | contract, fake provider, stream golden, cleanup |
| Permission boundary works | white-box policy, runtime denial transcript | TUI/WebUI approval surface, side-effect oracle, reconnect/cancel |
| TUI is correct | terminal snapshot | runtime transcript, multi-viewport, Unicode/ANSI, interrupt |
| WebUI is correct | component/browser assertion | Playwright trace, DOM/ARIA, console/network, reload/resume |
| Desktop GUI is usable | shell start smoke | bridge health, workspace readiness, native contract, screenshot/trace |
| Browser agent is safe | screenshot + DOM | SSRF/navigation fixture, console/network, cleanup/orphan proof |
| Channel gateway works | contract fixture | webhook replay, media fixture, auth proof, live opt-in canary |
| Skill/plugin works | manifest parse | loader precedence, script dry run, trust boundary, clean-context eval |
| Scheduler is reliable | deterministic unit | restart/reclaim, stress, chaos kill, duplicate-work proof |
| Model quality improved | eval rubric | baseline delta, judge output, failure examples, human review |
| Package is releasable | build output | manifest snapshot, clean install, Docker/OS smoke, supply-chain check |

QC case fields for techniques

Add these fields to the case body or report extension when the project needs richer composition:

```json
{
  "techniques": ["white-box-unit", "contract-protocol", "black-box-smoke", "surface-snapshot", "cleanup-proof"],
  "box_mode": "gray-box",
  "snapshot_policy": "normalize dynamic ids; update only after reviewer approval",
  "smoke_level": "runtime|surface|release|live-canary",
  "runtime_backing": "fake-provider|real-runtime|live-provider|mock-bridge",
  "negative_cases": ["deny", "cancel", "malformed-stream", "restart"],
  "composition_rationale": "why this braid proves the claim"
}
```

These fields are intentionally advisory. Agent QC standardizes the evidence and verdict semantics; projects decide how to encode technique metadata in their local schema.
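Because the fields are advisory, a project that adopts them may want a light local check. The sketch below reads the pipe-separated values in the example as the permitted options for `smoke_level` and `runtime_backing`. The `validate_case` helper, and the choice to require `techniques`, `box_mode`, and `composition_rationale` while leaving the other fields optional, are assumptions for illustration only.

```python
ALLOWED_BOX_MODES = {"white-box", "black-box", "gray-box"}
ALLOWED_SMOKE_LEVELS = {"runtime", "surface", "release", "live-canary"}
ALLOWED_BACKINGS = {"fake-provider", "real-runtime", "live-provider", "mock-bridge"}


def validate_case(case):
    """Return a list of problems with a case's technique metadata."""
    problems = []
    techniques = case.get("techniques")
    if not isinstance(techniques, list) or not techniques:
        problems.append("techniques must be a non-empty list")
    if case.get("box_mode") not in ALLOWED_BOX_MODES:
        problems.append(f"box_mode must be one of {sorted(ALLOWED_BOX_MODES)}")
    # Optional fields are only checked when present.
    if "smoke_level" in case and case["smoke_level"] not in ALLOWED_SMOKE_LEVELS:
        problems.append(f"smoke_level must be one of {sorted(ALLOWED_SMOKE_LEVELS)}")
    if "runtime_backing" in case and case["runtime_backing"] not in ALLOWED_BACKINGS:
        problems.append(f"runtime_backing must be one of {sorted(ALLOWED_BACKINGS)}")
    if not str(case.get("composition_rationale", "")).strip():
        problems.append("composition_rationale must explain why the braid proves the claim")
    return problems


case = {
    "techniques": ["white-box-unit", "black-box-smoke", "cleanup-proof"],
    "box_mode": "gray-box",
    "smoke_level": "runtime",
    "runtime_backing": "fake-provider",
    "negative_cases": ["deny", "cancel"],
    "composition_rationale": "policy unit test plus denied run shows no side effect",
}
print(validate_case(case))
print(validate_case({"techniques": [], "box_mode": "grey-box", "smoke_level": "full"}))
```

Returning a list of problems instead of raising keeps the check usable as a report extension: every finding appears in one pass.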

Anti-patterns

| Anti-pattern | Correct replacement |
| --- | --- |
| One broad test command as proof for every profile | profile-specific gates plus explicit evidence refs |
| Snapshot update with no review note | snapshot diff review and behavior rationale |
| Smoke test marketed as full E2E | label as smoke and list remaining risks |
| White-box unit test used as UI proof | add surface artifact and runtime link |
| Black-box final text used as runtime proof | add structured event transcript and state snapshot |
| Live provider call hidden in unit tests | explicit live lane, budget, redaction, opt-in flag |
| Browser screenshot without DOM/console/network/cleanup | browser evidence bundle |
| Skill manifest lint only | loader, script, trust, clean-context eval, package smoke |

Draft standard for evidence-driven quality control of Agent projects.