Star project testing systems

This reference explains how several strong Agent projects organize testing. Agent QC does not copy their commands as a universal recipe. It extracts reusable test architecture: how each project separates deterministic tests from live provider risk, how UI/TUI/WebUI evidence is captured, and how runtime/protocol facts are connected to visible surfaces.

How to read this page

  • Treat each local repository as a case study, not as a normative dependency.
  • Copy the testing shape, not the exact stack.
  • Keep limitations explicit. The Claude Code local snapshot has useful interface code but no local package.json or workflow metadata, so this page does not claim upstream CI behavior for that snapshot.
  • When a project has UI, require both surface proof and runtime proof.

Agent UI and Agent Skills lessons applied here

This page treats Agent UI as a primary reference for surface testing. The reusable lessons are:

  • UI/TUI/WebUI/desktop states must be runtime-backed projections, not independent truth.
  • Final answer text must stay separate from reasoning, tool progress, approvals, artifacts, evidence, diagnostics, and team events.
  • Missing runtime facts must render as unknown, unavailable, stale, or blocked, not guessed success.
  • Controlled writes such as approval, interrupt, queue, steer, artifact edit, evidence export, review, or replay must go through the owning API.
  • Old sessions and long-running work need progressive hydration and surface-specific evidence.
  • Metrics such as first status, first text, bridge readiness, queue wait, trace size, and cleanup time are part of QC evidence.
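The "unknown, not guessed success" lesson can be sketched as a small projection function. This is a minimal illustration, not code from any of the inspected projects: `RuntimeFact`, `SurfaceState`, the `"ok"`/`"error"` status strings, and the 30-second freshness window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SurfaceState(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    UNKNOWN = "unknown"
    STALE = "stale"


@dataclass
class RuntimeFact:
    # Hypothetical shape: the status the runtime reported and how old it is.
    status: Optional[str]
    age_seconds: float


def project(fact: Optional[RuntimeFact], max_age_seconds: float = 30.0) -> SurfaceState:
    """Map a runtime fact to a surface state without ever guessing success."""
    if fact is None or fact.status is None:
        return SurfaceState.UNKNOWN  # missing fact: say so instead of assuming
    if fact.age_seconds > max_age_seconds:
        return SurfaceState.STALE  # an old fact is not current truth
    if fact.status == "ok":
        return SurfaceState.SUCCESS
    if fact.status == "error":
        return SurfaceState.FAILED
    return SurfaceState.UNKNOWN  # unrecognized status is also unknown
```

The design point is that success is reachable through exactly one branch, and only when the fact exists, is fresh, and says so explicitly.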

Agent Skills contributes the authoring style: short entrypoints, frontmatter, field tables, minimal examples, progressive disclosure, eval loops, assertion grading, and transcripts. Agent QC uses that style for quality plans rather than skill packages.

Framework documentation lessons

Official framework docs are used as examples of evidence shape, not as mandatory tool choices:

| Framework | Reusable QC lesson |
| --- | --- |
| Playwright | Projects/devices, webServer, retries, reporters, trace/screenshot/video policies, and test isolation are portable browser-evidence concepts. |
| Vitest | run, projects/workspaces, JSON/JUnit reporters, coverage, snapshots, and browser mode map JS projects into deterministic and browser lanes. |
| pytest | markers and -m selection, skip/xfail, parametrization, xdist, and JUnit-style reports help Python projects separate deterministic, integration, e2e, and live suites. |
| cargo nextest/Bazel | fast Rust workspace runs, no-fail-fast behavior, release binary builds, and generated schema checks show how runtime projects layer local and release evidence. |

Cross-project surface map

| Project | Runtime CLI / stream | TUI | WebUI | Desktop GUI | Browser automation | Channel/mobile | Eval/report UI | Release/distribution |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Codex | strong: codex exec, JSON/event processors, SSE fixtures | strong: ratatui/insta snapshots | indirect: app-server/client protocol surfaces | no desktop shell in inspected repo | limited through app/server tooling, not primary | no | review/protocol artifacts | strong: Bazel, release binaries, npm packages |
| Claude Code local snapshot | visible SDK stream adapters and commands | visible Ink surface and command views | not enough metadata to claim | no | remote bridge/control surfaces | no | no | not enough metadata to claim |
| OpenClaw | strong gateway/CLI/router tests | dedicated tui command and TUI lanes | strong control UI and QA Lab web runtime | platform release paths, mac/mobile scripts | QA Lab browser runtime, Docker/browser lanes | strong channel contracts, QR, Android/iOS, live transports | strong QA Lab scenarios/reports | strong Docker/install/release checks |
| Hermes Agent | strong Python CLI/gateway tests | strong ui-tui Vitest package | Vite/React dashboard package | no native shell in inspected repo | strong browser supervisor/CDP/Camofox/SSRF tests | strong gateway/channel tests | release notes and web/dashboard surfaces | Docker, uv lock, OSV, package checks |

Agent QC conclusion: a project can be "well tested" in one surface and still under-tested in another. Do not collapse all UI proof into one boolean.

Shared test architecture pattern

Across the projects, the useful pattern is:

  1. Local deterministic lane: format, lint, typecheck, unit, contract, fake integration.
  2. Runtime lane: real CLI/task/session flow with fake provider or local server.
  3. Surface lane: TUI/WebUI/GUI/browser/channel evidence with screenshots, snapshots, traces, or transcripts.
  4. Live lane: opt-in real provider/channel/model tests with redaction and budget.
  5. Distribution lane: install, Docker/package, cross-platform, release manifest, lock/supply-chain checks.
  6. Review/eval lane: semantic quality, rubric, baseline diff, human or LLM review.

Agent QC plans should name which lanes apply and which lanes are intentionally out of scope.
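One way to make "name which lanes apply and which are out of scope" checkable is to treat any lane that is neither claimed nor explicitly excluded as a plan defect. The sketch below assumes a hypothetical `QCPlan` structure and short lane identifiers; neither is defined by Agent QC itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# The six lanes from the list above, with illustrative short identifiers.
LANES = ("deterministic", "runtime", "surface", "live", "distribution", "review")


@dataclass
class QCPlan:
    applies: Set[str] = field(default_factory=set)
    # lane -> reason it is intentionally out of scope
    out_of_scope: Dict[str, str] = field(default_factory=dict)


def unaccounted_lanes(plan: QCPlan) -> List[str]:
    """Return lanes the plan neither covers nor explicitly excludes."""
    return [
        lane
        for lane in LANES
        if lane not in plan.applies and lane not in plan.out_of_scope
    ]


plan = QCPlan(
    applies={"deterministic", "runtime"},
    out_of_scope={"live": "no provider budget for this change"},
)
print(unaccounted_lanes(plan))  # ['surface', 'distribution', 'review']
```

A plan with a non-empty result has silent gaps, which is different from a plan that deliberately scopes a lane out and says why.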

Codex: runtime CLI plus TUI plus protocol stack

Local source: /Users/coso/Documents/dev/rust/codex.

Product shape

Codex combines several Agent product shapes:

  • Rust runtime CLI and task loop.
  • codex exec and structured stream outputs.
  • TUI implemented with ratatui and snapshot tests.
  • MCP/tool gateway, app server, app-server protocol, SDKs, release packaging, and sandbox layers.
  • Cross-platform sandbox and process execution policies.

How tests are organized

| Layer | Concrete signals | Agent QC interpretation |
| --- | --- | --- |
| Repository policy | root AGENTS.md tells contributors to run targeted crate tests first, then full cargo test/just test when common/core/protocol changed | targeted verification before broad sweeps |
| Local Rust lane | just test runs cargo nextest run --no-fail-fast; cargo test -p <crate> for focused work | deterministic unit and runtime-e2e evidence |
| Bazel lane | bazel test //... --keep_going, Bazel clippy, module lock checks, release binary builds | cross-toolchain parity and release confidence |
| Supply/policy lane | cargo-deny, codespell, clippy, argument-comment lint, lock checks | static and distribution-release hygiene |
| Sandbox/process lane | exec_policy_tests, Windows sandbox tests, sandbox tag tests, Landlock/bwrap/seatbelt-related tests | permission boundary and platform-specific runtime gates |
| Tool/protocol lane | MCP fixtures, app-server v2 protocol tests, schema fixture regeneration, dynamic tools, request permission tests | contract-protocol and fake integration |
| Stream lane | SSE end-to-end tests, fake response helpers, stream event utilities, JSON/event processor tests | stream shape evidence before semantic claims |
| TUI lane | ratatui/insta snapshots across chat widget, bottom pane, approval overlay, footer, request-user-input, MCP elicitation | TUI ui-interaction evidence |
| SDK/API lane | TypeScript SDK event/thread APIs and app-server client surfaces | agent-sdk-api contract evidence |
| Release lane | Bazel release binaries, npm/native package build scripts, Windows/zsh release workflows | distribution-release gate |

TUI details worth standardizing

Codex demonstrates that TUI testing needs more than a screenshot:

  • terminal width and height variants: narrow, standard, and large terminals;
  • approval overlays for exec, patch, network, cross-thread, and additional permissions;
  • footer states: idle, running, Ctrl-C quit, Ctrl-C interrupt, Esc hint, queue hint, mode indicator, context/token status;
  • request-user-input forms: options, freeform, multi-question, tight height, hidden options, long option text;
  • model/session pickers: model migration prompt, fixed/auto column widths, narrow rows, scroll states;
  • composer edge cases: paste, backspace after paste, slash popup, mention popup, plugin popup, remote image rows, shell-command mode;
  • history/chat frames: diff syntax, code blocks, completed hook output, pending input, stream deltas, compact/resume/fork shapes;
  • MCP and app-server states: MCP startup failures, elicitation forms, app-server collaboration and guardian review states;
  • platform-specific snapshots such as Windows approval popup variants.

Agent QC rule: a TUI pass should cite terminal snapshots and runtime transcripts. Snapshot-only proof shows rendering; transcript-linked proof shows the rendering came from the correct Agent event.
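The difference between snapshot-only and transcript-linked proof can be sketched as a check that a snapshot's recorded source event actually appears in the runtime transcript. The JSON-lines transcript format and the `source_event_id` and `event_id` field names are assumptions for illustration, not Codex's real formats.

```python
import json
from typing import Dict, List


def snapshot_is_runtime_backed(snapshot_meta: Dict, transcript_lines: List[str]) -> bool:
    """True only if the snapshot cites an event that exists in the transcript."""
    source_id = snapshot_meta.get("source_event_id")
    if source_id is None:
        return False  # snapshot-only proof: rendering without a runtime link
    event_ids = {
        json.loads(line).get("event_id")
        for line in transcript_lines
        if line.strip()
    }
    return source_id in event_ids


transcript = [
    '{"event_id": "evt-7", "type": "approval_requested"}',
    '{"type": "heartbeat"}',
]
print(snapshot_is_runtime_backed({"source_event_id": "evt-7"}, transcript))  # True
print(snapshot_is_runtime_backed({"source_event_id": "evt-9"}, transcript))  # False
```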

Runtime details worth standardizing

Codex separates deterministic runtime tests from live/provider risk:

  • fake model server and SSE fixtures test stream shape without burning provider budget;
  • app-server protocol tests assert wire shape independently from the TUI;
  • apply-patch tests cover CLI and tool surfaces;
  • exec/unified process tests preserve command output, cleanup, and failure semantics;
  • sandbox tests assert denied actions and platform policy transforms;
  • schema fixture writers make protocol drift reviewable.

Agent QC rule: CLI/runtime projects need contract-protocol and runtime-e2e gates before semantic-eval can be trusted.
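A stream-shape gate of this kind can run entirely against a recorded fixture. The sketch below parses a small Server-Sent Events fixture and checks structural properties before anyone evaluates the text itself. The `delta` and `completed` event names and the payload fields are hypothetical, not Codex's wire format, and the parser handles only `event:` and `data:` fields.

```python
import json
from typing import Dict, List, Tuple

Event = Tuple[str, Dict]


def parse_sse(text: str) -> List[Event]:
    """Parse blank-line-separated SSE blocks into (event type, JSON data) pairs."""
    events = []
    for block in text.strip().split("\n\n"):
        event_type, data_lines = "message", []
        for line in block.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            events.append((event_type, json.loads("\n".join(data_lines))))
    return events


def check_stream_shape(events: List[Event]) -> List[str]:
    """Return shape problems; an empty list means the stream is well formed."""
    if not events:
        return ["empty stream"]
    types = [event_type for event_type, _ in events]
    problems = []
    if types[-1] != "completed":
        problems.append("stream does not end with a completed event")
    if types.count("completed") > 1:
        problems.append("more than one completed event")
    return problems


FIXTURE = (
    'event: delta\ndata: {"text": "Hel"}\n\n'
    'event: delta\ndata: {"text": "lo"}\n\n'
    'event: completed\ndata: {"status": "ok"}\n\n'
)

events = parse_sse(FIXTURE)
print(check_stream_shape(events))  # []
```

Truncating the fixture before the final block makes the check fail, which is the negative case a deterministic lane can cover without spending provider budget.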

What to copy into Agent QC plans

For a Codex-like project, include cases such as:

  • denied unsafe command produces a visible controlled error and non-success runtime event;
  • apply-patch success/failure has stable CLI transcript and patch result;
  • MCP tool declaration round-trips through config, server fixture, runtime event, and TUI row;
  • Ctrl-C interrupts a running turn without leaving orphan subprocesses;
  • app-server protocol schema diff is reviewed when command shape changes;
  • release package contains expected native binaries and platform helpers.

Claude Code local snapshot: TUI/runtime surface under incomplete repo metadata

Local source: /Users/coso/Documents/dev/js/claudecode.

Source limitation

The inspected local snapshot contains source files under src/ and vendor/, but no local package.json, lockfile, or GitHub workflow metadata. Agent QC must therefore avoid claiming upstream CI, test commands, package coverage, or release guarantees from this snapshot. The useful signal is interface-surface shape.

Product surfaces visible in the snapshot

| Surface | Local indicators | Agent QC gate |
| --- | --- | --- |
| Ink TUI | src/ink.ts, .tsx command views, terminal focus/input/selection hooks, task views | TUI ui-interaction |
| Command palette | many src/commands/** handlers and renderable command views | command routing and TUI state snapshots |
| Remote session bridge | src/remote/RemoteSessionManager.ts, src/remote/SessionsWebSocket.ts, server direct connect manager | contract-protocol, runtime-e2e |
| Permission flow | remotePermissionBridge.ts, control schemas with can_use_tool, synthetic assistant/tool confirmation flow | high-risk TUI + protocol evidence |
| SDK stream | src/remote/sdkMessageAdapter.ts, src/entrypoints/agentSdkTypes.ts, stream/control schemas | agent-sdk-api stream contract |
| Skills/plugins | SDK schemas include skills/plugins; output style/plugin loading code is visible | agent-skills-plugins |

What the standard should require

A Claude Code-style TUI runtime should prove:

  • success, empty, error, cancelled, reconnecting, disconnected, and remote states render distinctly;
  • command views route to the same state transitions as slash commands or command palette entries;
  • permission prompts show tool name, request id, proposed input, permission suggestions, and deny/allow outcome;
  • remote permission responses preserve request correlation and behavior (allow, deny, or project-specific modes);
  • server-side cancellation removes or marks the pending prompt instead of leaving stale approvals visible;
  • reconnect/interrupt cannot leave the TUI showing stale success;
  • SDK stream adapters preserve event type, session id, tool-use id, and partial/final message semantics;
  • tool results from a remote server render as tool results, not as prompt echoes;
  • plugin/skill reload events cannot silently change allowed tools without a visible status or audit event.

Evidence recipe

A minimal QC case should collect:

  1. pseudo-terminal transcript of a remote permission request;
  2. TUI snapshot showing the synthetic confirmation row;
  3. WebSocket/control transcript for can_use_tool request and response;
  4. SDK stream fixture proving event conversion;
  5. negative case for cancellation or disconnect.

Agent QC rule: when repo metadata is incomplete, write the limitation into evidence_policy and require interface-level evidence instead of inventing a CI story.
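Item 3 of the recipe, the control transcript, is only useful if request correlation can be checked mechanically. The sketch below walks a list of control messages and reports resolved, still-pending, and orphaned request ids. Only the `can_use_tool` name comes from the inspected snapshot; the `request_id`, `behavior`, `permission_response`, and `cancel` names are illustrative assumptions, not the real control schema.

```python
from typing import Dict, List, Tuple


def correlate(messages: List[Dict]) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Return (resolved outcomes, pending request ids, orphan response ids)."""
    pending: Dict[str, Dict] = {}
    resolved: Dict[str, str] = {}
    orphans: List[str] = []
    for msg in messages:
        rid = msg.get("request_id")
        if msg.get("type") == "can_use_tool":
            pending[rid] = msg
        elif msg.get("type") in ("permission_response", "cancel"):
            if rid in pending:
                pending.pop(rid)
                # A cancel carries no behavior, so record it explicitly.
                resolved[rid] = msg.get("behavior", "cancelled")
            else:
                orphans.append(rid)  # response with no matching request
    return resolved, sorted(pending), orphans


messages = [
    {"type": "can_use_tool", "request_id": "r1", "tool_name": "Bash"},
    {"type": "can_use_tool", "request_id": "r2", "tool_name": "Edit"},
    {"type": "permission_response", "request_id": "r1", "behavior": "allow"},
    {"type": "permission_response", "request_id": "r9", "behavior": "deny"},
]
resolved, still_pending, orphans = correlate(messages)
print(resolved, still_pending, orphans)  # {'r1': 'allow'} ['r2'] ['r9']
```

A pending id at the end of a transcript is the stale-approval risk named above; an orphan id indicates broken correlation.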

OpenClaw: multi-channel gateway plus WebUI plus QA Lab

Local source: /Users/coso/Documents/dev/js/openclaw.

Product shape

OpenClaw is a dense Agent system:

  • multi-channel gateway for provider/channel integrations;
  • plugin ecosystem and plugin SDK;
  • CLI, gateway, TUI command, control WebUI, Android/iOS/macOS platform paths;
  • QA Lab extension with web runtime, browser runtime, scenario runner, live transport tests, and reports;
  • Docker/install/release smoke paths;
  • live provider lanes for models, gateways, and CLI backends.

How tests are organized

OpenClaw's package.json exposes many lanes. The important Agent QC pattern is the separation, not the number of commands.

| Layer | Concrete signals | Agent QC interpretation |
| --- | --- | --- |
| Test router | node scripts/test-projects.mjs, test:changed, test:max, serial/max-worker variants | changed-scope and profile-aware gate selection |
| Static/policy | check, lint, import-cycle checks, LOC checks, host env policy, webhook/auth boundary lints | static plus security policy |
| Unit/gateway lanes | test:unit, test:gateway, gateway client/server/method configs | deterministic runtime and gateway behavior |
| Contract lanes | test:contracts:channels, test:contracts:plugins, plugin SDK export/API checks, protocol generation checks | contract-protocol for channel/plugin/runtime boundaries |
| WebUI lane | test:ui, ui package tests, browser-playwright-style UI config tests | webui under ui-interaction |
| TUI/platform lanes | tui, TUI scripts, test:windows:ci, test:macos:ci, Android/iOS unit/integration scripts | surface-specific UI/platform proof |
| QA Lab lane | extensions/qa-lab scenario catalog, web runtime, browser runtime, reports, live transports, suite summary JSON | agent-evals-quality, eval-ui, webui, browser-automation |
| Channel lanes | channel configs for Telegram/Matrix/Discord/Feishu/Zalo/etc., webhook/media/auth tests | multi-channel-agent-gateway |
| Live provider lanes | test:live:*, live model profiles, live gateway Docker lanes, live CLI backend lanes for Claude/Codex/Gemini-style backends | explicit live-provider with opt-in |
| Docker/install lanes | install smoke, OpenWebUI Docker, MCP channels Docker, QR import, plugins Docker, gateway network Docker | distribution-release and runtime smoke |
| Release lanes | release:check, npm checks, plugin release checks, version sync | release readiness and package boundary proof |
| Performance lanes | startup bench, import duration, perf budget, memory checks | performance risk gates |

WebUI details worth standardizing

OpenClaw shows that WebUI proof should be layered:

  • component/state tests for navigation, chat normalization, settings, controller panels, usage panels, tool cards, and config surfaces;
  • browser-only tests for focus, markdown, sidebar status, external links, image opening, and browser APIs;
  • QA Lab web runtime tests for scenario execution and report rendering;
  • Docker-hosted OpenWebUI smoke to prove integration in a clean environment;
  • console/network evidence whenever browser behavior is under test.

Agent QC rule: when behavior depends on DOM, focus, browser APIs, markdown sanitization, navigation, or report rendering, webui evidence must include browser-level artifacts, not just jsdom/component tests.

Channel/provider details worth standardizing

OpenClaw makes four separations that Agent QC should require:

  • channel contract tests are not live channel tests;
  • fake provider integration is not live provider coverage;
  • media/webhook/auth replay is separate from model semantic quality;
  • plugin boundary tests are separate from runtime gateway tests.

Examples of useful case shapes:

  • secret refs are redacted and inactive channel credentials cannot be used;
  • QR import creates a scoped session and can be replayed in Docker smoke;
  • webhook body verification happens before parsing user content;
  • media attachments preserve type/size limits and redaction;
  • live transport credentials are leased, timed out, and redacted in reports;
  • control WebUI shows actual gateway status, not cached healthy state.

Agent QC rule: multi-channel-agent-gateway projects should never hide live-provider assumptions inside ordinary unit tests.
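The case shape "webhook body verification happens before parsing user content" can be illustrated with a handler that checks an HMAC over the raw bytes and only then decodes JSON. This is a generic sketch using Python's standard library, not OpenClaw's implementation; the SHA-256 hex signature scheme and the `handle_webhook` name are assumptions.

```python
import hashlib
import hmac
import json
from typing import Optional


def handle_webhook(secret: bytes, body: bytes, signature_hex: str) -> Optional[dict]:
    """Verify the signature over the raw body, then parse. Returns None if rejected."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison; untrusted bytes are not parsed before this passes.
    if not hmac.compare_digest(expected, signature_hex):
        return None
    return json.loads(body)


secret = b"test-only-secret"
body = b'{"text": "hi"}'
good_signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

print(handle_webhook(secret, body, good_signature))  # {'text': 'hi'}
# A bad signature is rejected without touching the (invalid) JSON body.
print(handle_webhook(secret, b"not json at all", "00"))  # None
```

The second call doubles as an ordering test: if parsing ran first, it would raise a JSON error instead of returning `None`.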

Hermes Agent: Python agent plus TUI plus browser/web tools plus scheduler

Local source: /Users/coso/Documents/dev/python/hermes-agent.

Product shape

Hermes combines:

  • Python Agent runtime, CLI, toolsets, gateway, and ACP/MCP adapters;
  • pytest-based backend tests;
  • browser, web provider, CDP, Camofox, Browserbase-style provider, and SSRF hardening tests;
  • cron/background scheduler, checkpointing, approval, restart/retry, and concurrency surfaces;
  • ui-tui Ink/React TUI package with Vitest tests;
  • web Vite/React dashboard package;
  • Docker image, uv lock, OSV/security, and release checks.

How tests are organized

| Layer | Concrete signals | Agent QC interpretation |
| --- | --- | --- |
| Canonical runner | scripts/run_tests.sh pins -n 4, TZ=UTC, LANG=C.UTF-8, PYTHONHASHSEED=0, activates venv, blanks credential env vars, excludes integration/e2e by default | reproducible local evidence and credential hygiene |
| Pytest backend | tests/ with gateway, cron, CLI, ACP, browser/tool, security, restart, retry, queue, platform tests | deterministic unit, fake-integration, runtime-e2e |
| Tool safety | write deny, file guards, symlink confusion, URL safety, yolo/approval modes, env passthrough | permission/sandbox gates |
| Browser/web | browser supervisor, browser hardening, CDP, local SSRF, Camofox state, web providers | browser-automation gates |
| Gateway/channel | Discord, Feishu, Matrix, Mattermost, Google Chat, QQBot, delivery, media, reconnect, dedup, pairing, roles/DM scope | channel-ui and gateway contracts |
| MCP/OAuth/ACP | MCP e2e, OAuth metadata, SSE transport, reconnect, circuit breaker, tool 401 handling, ACP approval isolation | contract-protocol and recovery |
| Scheduler | cron jobs, cron prompt injection, inactivity timeout, workdir, scheduler MCP init, checkpoint/session cleanup | background-agent-scheduler |
| TUI | ui-tui Vitest: terminal parity, viewport, virtual history, slash parity, streaming markdown, OSC52, clipboard, terminal modes | TUI ui-interaction |
| Web dashboard | web package uses Vite/React build and lint scripts | webui when dashboard behavior changes |
| Distribution | Dockerfile builds browser dashboard/TUI assets; uv lock; OSV/security notes | distribution-release and supply chain |

TUI/terminal details worth standardizing

Hermes TUI tests cover practical terminal mechanics:

  • text wrapping, virtual history heights, scroll, viewport stores, precision wheel;
  • terminal modes, truecolor, OSC52 clipboard, emoji, math Unicode, syntax/markdown;
  • slash command parity, gateway events, session lifecycle, queue handling, turn store, state isolation;
  • streaming markdown, reasoning/details rendering, subagent tree, status ticker;
  • text input navigation, pass-through, wrapping, completion, composer state.

Agent QC rule: TUI testing should include terminal input/output mechanics, not only component snapshots.

Browser and web details worth standardizing

Hermes browser/tool tests map directly to Agent QC:

  • browser supervisor health and orphan reaper;
  • browser hardening and local SSRF protections;
  • CDP override, browser console, and local provider behavior;
  • Camofox persistence/state isolation;
  • Brave/DDGS/SearXNG/Tavily-like web provider contracts;
  • CLI browser connect and gateway browser-related command tests.

Agent QC rule: browser automation gates must include safety and cleanup evidence, not only screenshots.
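As an example of the safety evidence meant here, a local SSRF guard can be tested deterministically against a table of URLs. The sketch below is a simplified illustration, not Hermes code: `is_blocked_target` is a hypothetical name, and the check inspects only literal hosts, whereas a real guard would also need to validate the addresses a hostname resolves to.

```python
import ipaddress
from urllib.parse import urlparse


def is_blocked_target(url: str) -> bool:
    """Block non-HTTP schemes and literal local/private/link-local addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return True
    host = parsed.hostname
    if host is None:
        return True
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: this sketch allows it; resolution checks are out of scope.
        return False
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_unspecified
    )


for candidate in (
    "http://127.0.0.1:9222/json",
    "http://169.254.169.254/latest/meta-data",
    "file:///etc/passwd",
    "https://example.com/",
):
    print(candidate, is_blocked_target(candidate))
```

Only the last URL in the loop is allowed; the first three are blocked as loopback, link-local, and non-HTTP respectively.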

Scheduler/channel details worth standardizing

Hermes shows why background agents need their own gate family:

  • cron prompt injection must be scanned after skills/context are assembled, not only at user input;
  • scheduler restart must not duplicate work or lose checkpoints;
  • inactivity timeout should track real tool activity, not wall-clock time alone;
  • gateway restart/retry/dedup tests should preserve message ids and delivery state;
  • credential-shaped environment variables must be blanked or scoped in tests.

Agent QC rule: a background scheduler pass should include deterministic clock/env settings, checkpoint evidence, and cleanup evidence.
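The combination of a deterministic clock and checkpoint evidence can be shown with a toy scheduler whose clock is injected and whose completed-job set survives a simulated restart. This is a minimal sketch under stated assumptions: `Scheduler`, `Checkpoint`, and the job dictionary shape are hypothetical and far simpler than a real cron subsystem.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Checkpoint:
    # Job ids that have already run; stands in for persisted state.
    completed: Set[str] = field(default_factory=set)


class Scheduler:
    def __init__(self, checkpoint: Checkpoint, now: Callable[[], float]):
        self.checkpoint = checkpoint
        self.now = now  # injected clock, so tests never read wall-clock time
        self.ran: List[str] = []

    def run_due(self, jobs: Dict[str, float]) -> None:
        """Run each job whose due time has passed, at most once per checkpoint."""
        for job_id, due_at in sorted(jobs.items()):
            if due_at <= self.now() and job_id not in self.checkpoint.completed:
                self.ran.append(job_id)
                self.checkpoint.completed.add(job_id)


jobs = {"daily-report": 50.0, "weekly-cleanup": 150.0}
checkpoint = Checkpoint()
fixed_clock = lambda: 100.0

first = Scheduler(checkpoint, fixed_clock)
first.run_due(jobs)

# Simulated restart: a new scheduler instance reuses the same checkpoint.
restarted = Scheduler(checkpoint, fixed_clock)
restarted.run_due(jobs)

print(first.ran, restarted.ran)  # ['daily-report'] []
```

The restarted instance running nothing is the duplicate-work evidence; the checkpoint contents are the artifact a QC pass would attach.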

Cross-project extraction

Agent QC generalizes these projects into ten reusable rules:

  1. Start from owned risk, not from language or framework.
  2. Split UI surface proof from runtime/protocol proof, then connect them with evidence refs.
  3. Keep fake integration, live provider, and release smoke as separate gates.
  4. For TUI/WebUI/GUI, preserve surface artifacts: snapshots, traces, screenshots, console logs, terminal transcripts.
  5. For browser automation, require DOM/a11y plus console/network plus cleanup evidence.
  6. For channel/mobile, separate webhook/media/auth replay from live provider tests.
  7. For background agents, pin deterministic time/env/worker settings and preserve checkpoint evidence.
  8. For SDK/protocol surfaces, use generated schema diffs and fake servers before live runs.
  9. For release claims, test package contents and installation paths, not just source tests.
  10. For incomplete local snapshots, record what was inspected and what cannot be inferred.

| Risk | Minimum case | Stronger case |
| --- | --- | --- |
| Permission prompt | snapshot/frame shows prompt, transcript shows request id | allow/deny/cancel/reconnect variants with protocol transcript |
| Tool stream | fake provider stream parses and renders | malformed stream, tool error, partial/final event, retry, abort |
| TUI rendering | one stable snapshot | multi-viewport, key sequence, Unicode/ANSI, runtime-linked transcript |
| WebUI control | component state test | browser trace, console/network, keyboard/a11y, reload/resume |
| Desktop bridge | shell starts | bridge health, workspace readiness, native command contract, screenshot |
| Browser control | screenshot | DOM/a11y, console/network, cleanup, SSRF/navigation safety |
| Channel adapter | contract fixture | webhook replay, media fixture, redacted transcript, live opt-in lane |
| Scheduler | deterministic unit | restart/reclaim, concurrency, checkpoint, duplicate-work prevention |
| Eval report | rubric exists | baseline delta, judge output, failing examples, reviewer note |
| Release | build succeeds | package manifest, install smoke, Docker/platform matrix, lock/security check |

Draft standard for evidence-driven quality control of Agent projects.