Star project testing systems

This reference explains how several strong Agent projects organize their testing. Agent QC does not copy their commands as a universal recipe; it extracts the reusable test architecture: how each project separates deterministic tests from live provider risk, how UI/TUI/WebUI evidence is captured, and how runtime/protocol facts are connected to visible surfaces.

How to read this page

Treat each local repository as a case study, not as a normative dependency.
Copy the testing shape, not the exact stack.
Keep limitations explicit. The Claude Code local snapshot has useful interface code but no local package.json or workflow metadata, so this page does not claim upstream CI behavior for that snapshot.
When a project has UI, require both surface proof and runtime proof.

Agent UI and Agent Skills lessons applied here

This page treats Agent UI as a primary reference for surface testing. The reusable lessons are:

UI/TUI/WebUI/desktop states must be runtime-backed projections, not independent truth.
Final answer text must stay separate from reasoning, tool progress, approvals, artifacts, evidence, diagnostics, and team events.
Missing runtime facts must render as unknown, unavailable, stale, or blocked, not guessed success.
Controlled writes such as approval, interrupt, queue, steer, artifact edit, evidence export, review, or replay must go through the owning API.
Old sessions and long-running work need progressive hydration and surface-specific evidence.
Metrics such as first status, first text, bridge readiness, queue wait, trace size, and cleanup time are part of QC evidence.
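
The "runtime-backed projection" and "missing facts are not guessed success" lessons can be sketched as a small projection function. This is a minimal illustration, not code from any of the inspected projects: `RuntimeFact`, `project_status`, and the 30-second freshness window are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeFact:
    """Hypothetical runtime fact as an owning API might report it."""
    status: Optional[str]          # e.g. "succeeded", "failed", "running"; None if never reported
    age_seconds: Optional[float]   # seconds since the runtime last confirmed the status


def project_status(fact: Optional[RuntimeFact], max_age_seconds: float = 30.0) -> str:
    """Return the label a surface may show for one runtime fact.

    The surface never invents "succeeded": anything it cannot back with a
    fresh runtime fact is shown as unavailable, unknown, or stale.
    """
    if fact is None:
        return "unavailable"   # the runtime did not answer at all
    if fact.status is None:
        return "unknown"       # the runtime answered but has no status yet
    if fact.age_seconds is None or fact.age_seconds > max_age_seconds:
        return "stale"         # a status exists but is too old to trust
    return fact.status


if __name__ == "__main__":
    print(project_status(None))
    print(project_status(RuntimeFact(status="succeeded", age_seconds=120.0)))
    print(project_status(RuntimeFact(status="succeeded", age_seconds=2.0)))
```

A surface test can then assert on the projected label and on the runtime fact that produced it, which keeps the two kinds of proof connected.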

Agent Skills contributes the authoring style: short entrypoints, frontmatter, field tables, minimal examples, progressive disclosure, eval loops, assertion grading, and transcripts. Agent QC uses that style for quality plans rather than skill packages.

Framework documentation lessons

Official framework docs are used as examples of evidence shape, not as mandatory tool choices:

Framework	Reusable QC lesson
Playwright	Projects/devices, `webServer`, retries, reporters, trace/screenshot/video policies, and test isolation are portable browser-evidence concepts.
Vitest	`run`, projects/workspaces, JSON/JUnit reporters, coverage, snapshots, and browser mode map JS projects into deterministic and browser lanes.
pytest	markers and `-m` selection, skip/xfail, parametrization, xdist, and JUnit-style reports help Python projects separate deterministic, integration, e2e, and live suites.
cargo nextest/Bazel	fast Rust workspace runs, no-fail-fast behavior, release binary builds, and generated schema checks show how runtime projects layer local and release evidence.

Cross-project surface map

Project	Runtime CLI / stream	TUI	WebUI	Desktop GUI	Browser automation	Channel/mobile	Eval/report UI	Release/distribution
Codex	strong: `codex exec`, JSON/event processors, SSE fixtures	strong: ratatui/insta snapshots	indirect: app-server/client protocol surfaces	no desktop shell in inspected repo	limited through app/server tooling, not primary	no	review/protocol artifacts	strong: Bazel, release binaries, npm packages
Claude Code local snapshot	visible SDK stream adapters and commands	visible Ink surface and command views	not enough metadata to claim	no	remote bridge/control surfaces	no	no	not enough metadata to claim
OpenClaw	strong gateway/CLI/router tests	dedicated `tui` command and TUI lanes	strong control UI and QA Lab web runtime	platform release paths, mac/mobile scripts	QA Lab browser runtime, Docker/browser lanes	strong channel contracts, QR, Android/iOS, live transports	strong QA Lab scenarios/reports	strong Docker/install/release checks
Hermes Agent	strong Python CLI/gateway tests	strong `ui-tui` Vitest package	Vite/React dashboard package	no native shell in inspected repo	strong browser supervisor/CDP/Camofox/SSRF tests	strong gateway/channel tests	release notes and web/dashboard surfaces	Docker, uv lock, OSV, package checks

Agent QC conclusion: a project can be "well tested" on one surface and still under-tested on another. Do not collapse all UI proof into one boolean.

Shared test architecture pattern

Across the projects, the useful pattern is a set of six lanes:

Local deterministic lane: format, lint, typecheck, unit, contract, fake integration.
Runtime lane: real CLI/task/session flow with fake provider or local server.
Surface lane: TUI/WebUI/GUI/browser/channel evidence with screenshots, snapshots, traces, or transcripts.
Live lane: opt-in real provider/channel/model tests with redaction and budget.
Distribution lane: install, Docker/package, cross-platform, release manifest, lock/supply-chain checks.
Review/eval lane: semantic quality, rubric, baseline diff, human or LLM review.

Agent QC plans should name which lanes apply and which lanes are intentionally out of scope.
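
One way to enforce that rule is to check that every lane is either applied or explicitly excluded. The sketch below assumes hypothetical lane identifiers derived from the list above; Agent QC may use different names, and `unaccounted_lanes` is an illustrative helper, not an existing API.

```python
from typing import Iterable, List

# Hypothetical identifiers for the six lanes described above.
LANES = (
    "local-deterministic",
    "runtime",
    "surface",
    "live",
    "distribution",
    "review-eval",
)


def unaccounted_lanes(applies: Iterable[str], out_of_scope: Iterable[str]) -> List[str]:
    """Return lanes that a plan neither applies nor explicitly excludes."""
    named = set(applies) | set(out_of_scope)
    unknown = named - set(LANES)
    if unknown:
        raise ValueError(f"unknown lanes: {sorted(unknown)}")
    # Preserve the canonical lane order so reports are stable.
    return [lane for lane in LANES if lane not in named]


if __name__ == "__main__":
    gaps = unaccounted_lanes(
        applies=["local-deterministic", "runtime", "surface"],
        out_of_scope=["live"],
    )
    print(gaps)  # lanes the plan forgot to mention
```

An empty result means the plan has taken a position on every lane; a non-empty result is a review finding, not necessarily a failure.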

Codex: runtime CLI plus TUI plus protocol stack

Local source: /Users/coso/Documents/dev/rust/codex.

Product shape

Codex combines several Agent product shapes:

Rust runtime CLI and task loop.
codex exec and structured stream outputs.
TUI implemented with ratatui and snapshot tests.
MCP/tool gateway, app server, app-server protocol, SDKs, release packaging, and sandbox layers.
Cross-platform sandbox and process execution policies.

How tests are organized

Layer	Concrete signals	Agent QC interpretation
Repository policy	root `AGENTS.md` tells contributors to run targeted crate tests first, then full `cargo test`/`just test` when common/core/protocol changed	targeted verification before broad sweeps
Local Rust lane	`just test` runs `cargo nextest run --no-fail-fast`; `cargo test -p <crate>` for focused work	deterministic `unit` and `runtime-e2e` evidence
Bazel lane	`bazel test //... --keep_going`, Bazel clippy, module lock checks, release binary builds	cross-toolchain parity and release confidence
Supply/policy lane	`cargo-deny`, codespell, clippy, argument-comment lint, lock checks	`static` and `distribution-release` hygiene
Sandbox/process lane	`exec_policy_tests`, Windows sandbox tests, sandbox tag tests, Landlock/bwrap/seatbelt-related tests	permission boundary and platform-specific runtime gates
Tool/protocol lane	MCP fixtures, app-server v2 protocol tests, schema fixture regeneration, dynamic tools, request permission tests	`contract-protocol` and fake integration
Stream lane	SSE end-to-end tests, fake response helpers, stream event utilities, JSON/event processor tests	stream shape evidence before semantic claims
TUI lane	ratatui/insta snapshots across chat widget, bottom pane, approval overlay, footer, request-user-input, MCP elicitation	TUI `ui-interaction` evidence
SDK/API lane	TypeScript SDK event/thread APIs and app-server client surfaces	`agent-sdk-api` contract evidence
Release lane	Bazel release binaries, npm/native package build scripts, Windows/zsh release workflows	`distribution-release` gate

TUI details worth standardizing

Codex demonstrates that TUI testing needs more than a screenshot:

terminal width and height variants: narrow, standard, and large terminals;
approval overlays for exec, patch, network, cross-thread, and additional permissions;
footer states: idle, running, Ctrl-C quit, Ctrl-C interrupt, Esc hint, queue hint, mode indicator, context/token status;
request-user-input forms: options, freeform, multi-question, tight height, hidden options, long option text;
model/session pickers: model migration prompt, fixed/auto column widths, narrow rows, scroll states;
composer edge cases: paste, backspace after paste, slash popup, mention popup, plugin popup, remote image rows, shell-command mode;
history/chat frames: diff syntax, code blocks, completed hook output, pending input, stream deltas, compact/resume/fork shapes;
MCP and app-server states: MCP startup failures, elicitation forms, app-server collaboration and guardian review states;
platform-specific snapshots such as Windows approval popup variants.

Agent QC rule: a TUI pass should cite terminal snapshots and runtime transcripts. Snapshot-only proof shows rendering; transcript-linked proof shows the rendering came from the correct Agent event.
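
The difference between snapshot-only and transcript-linked proof can be made checkable. The following is a sketch under stated assumptions: `SnapshotEvidence`, its `event_ids` field, and the three labels are hypothetical, and the inspected Codex snapshots are not claimed to carry such ids.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SnapshotEvidence:
    """Hypothetical record for one terminal snapshot."""
    name: str
    event_ids: Tuple[str, ...]   # runtime event ids the snapshot claims to render


def classify_tui_evidence(snapshot: SnapshotEvidence, transcript_event_ids: Iterable[str]) -> str:
    """Classify how strongly a snapshot is tied to the runtime transcript."""
    known = set(transcript_event_ids)
    if not snapshot.event_ids:
        return "rendering-only"      # proves layout, says nothing about the source event
    if any(event_id not in known for event_id in snapshot.event_ids):
        return "unlinked"            # cites an event the transcript never produced
    return "transcript-linked"


if __name__ == "__main__":
    transcript = ["evt-1", "evt-2"]
    overlay = SnapshotEvidence(name="approval_overlay_narrow", event_ids=("evt-2",))
    print(classify_tui_evidence(overlay, transcript))
```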

Runtime details worth standardizing

Codex separates deterministic runtime tests from live/provider risk:

fake model server and SSE fixtures test stream shape without burning provider budget;
app-server protocol tests assert wire shape independently from the TUI;
apply-patch tests cover CLI and tool surfaces;
exec/unified process tests preserve command output, cleanup, and failure semantics;
sandbox tests assert denied actions and platform policy transforms;
schema fixture writers make protocol drift reviewable.
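
As an illustration of testing stream shape from a fixture, the sketch below parses a Server-Sent Events fixture and asserts on event order without contacting a provider. The JSON payloads and the `type`/`text` fields are hypothetical; they are not the Codex wire format.

```python
import json
from typing import Any, Dict, List

# Hypothetical fixture: three SSE events separated by blank lines.
FIXTURE = (
    'data: {"type": "delta", "text": "Hel"}\n'
    "\n"
    'data: {"type": "delta", "text": "lo"}\n'
    "\n"
    'data: {"type": "completed"}\n'
    "\n"
)


def parse_sse_fixture(text: str) -> List[Dict[str, Any]]:
    """Parse the `data:` fields of an SSE fixture into JSON objects."""
    events: List[Dict[str, Any]] = []
    data_lines: List[str] = []
    for line in text.splitlines():
        if line == "":
            # A blank line dispatches the event collected so far.
            if data_lines:
                events.append(json.loads("\n".join(data_lines)))
                data_lines = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    if data_lines:
        events.append(json.loads("\n".join(data_lines)))
    return events


if __name__ == "__main__":
    events = parse_sse_fixture(FIXTURE)
    print([event["type"] for event in events])
    print("".join(event.get("text", "") for event in events))
```

Shape assertions such as "deltas precede the completed event" run on every commit; whether the text is any good is left to a separate semantic gate.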

Agent QC rule: CLI/runtime projects need contract-protocol and runtime-e2e gates before semantic-eval can be trusted.
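
That ordering can be expressed as a prerequisite table. This is a minimal sketch: the gate names come from this page, while `PREREQUISITES` and `trusted_gates` are hypothetical helpers for illustration.

```python
from typing import Dict, Tuple

# semantic-eval is only trusted when the structural gates passed first.
PREREQUISITES: Dict[str, Tuple[str, ...]] = {
    "semantic-eval": ("contract-protocol", "runtime-e2e"),
}


def trusted_gates(results: Dict[str, bool]) -> Dict[str, bool]:
    """Mark each gate as trusted only if it passed and its prerequisites passed."""
    trusted: Dict[str, bool] = {}
    for gate, passed in results.items():
        prerequisites = PREREQUISITES.get(gate, ())
        # A missing prerequisite result counts as not passed.
        trusted[gate] = passed and all(results.get(p, False) for p in prerequisites)
    return trusted


if __name__ == "__main__":
    print(trusted_gates({
        "contract-protocol": True,
        "runtime-e2e": False,
        "semantic-eval": True,
    }))
```

In the usage above, `semantic-eval` reports a pass but is not trusted because `runtime-e2e` failed.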

What to copy into Agent QC plans

For a Codex-like project, include cases such as:

denied unsafe command produces a visible controlled error and non-success runtime event;
apply-patch success/failure has stable CLI transcript and patch result;
MCP tool declaration round-trips through config, server fixture, runtime event, and TUI row;
Ctrl-C interrupts a running turn without leaving orphan subprocesses;
app-server protocol schema diff is reviewed when command shape changes;
release package contains expected native binaries and platform helpers.

Claude Code local snapshot: TUI/runtime surface under incomplete repo metadata

Local source: /Users/coso/Documents/dev/js/claudecode.

Source limitation

The inspected local snapshot contains source files under src/ and vendor/, but no local package.json, lockfile, or GitHub workflow metadata. Agent QC must therefore avoid claiming upstream CI, test commands, package coverage, or release guarantees from this snapshot. The useful signal is interface-surface shape.

Product surfaces visible in the snapshot

Surface	Local indicators	Agent QC gate
Ink TUI	`src/ink.ts`, `.tsx` command views, terminal focus/input/selection hooks, task views	TUI `ui-interaction`
Command palette	many `src/commands/**` handlers and renderable command views	command routing and TUI state snapshots
Remote session bridge	`src/remote/RemoteSessionManager.ts`, `src/remote/SessionsWebSocket.ts`, server direct connect manager	`contract-protocol`, `runtime-e2e`
Permission flow	`remotePermissionBridge.ts`, control schemas with `can_use_tool`, synthetic assistant/tool confirmation flow	high-risk TUI + protocol evidence
SDK stream	`src/remote/sdkMessageAdapter.ts`, `src/entrypoints/agentSdkTypes.ts`, stream/control schemas	`agent-sdk-api` stream contract
Skills/plugins	SDK schemas include skills/plugins; output style/plugin loading code is visible	`agent-skills-plugins`

What the standard should require

A Claude Code-style TUI runtime should prove:

success, empty, error, cancelled, reconnecting, disconnected, and remote states render distinctly;
command views route to the same state transitions as slash commands or command palette entries;
permission prompts show tool name, request id, proposed input, permission suggestions, and deny/allow outcome;
remote permission responses preserve request correlation and behavior (allow, deny, or project-specific modes);
server-side cancellation removes or marks the pending prompt instead of leaving stale approvals visible;
reconnect/interrupt cannot leave the TUI showing stale success;
SDK stream adapters preserve event type, session id, tool-use id, and partial/final message semantics;
tool results from a remote server render as tool results, not as prompt echoes;
plugin/skill reload events cannot silently change allowed tools without a visible status or audit event.
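
The request-correlation and stale-approval requirements can be sketched as a small state holder. `PermissionBoard` and its method names are hypothetical and are not taken from the inspected snapshot; only the allow/deny behaviors and the idea of a request id come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PermissionBoard:
    """Hypothetical model of pending and resolved permission prompts."""
    pending: Dict[str, str] = field(default_factory=dict)    # request id -> tool name
    resolved: Dict[str, str] = field(default_factory=dict)   # request id -> outcome

    def request(self, request_id: str, tool_name: str) -> None:
        self.pending[request_id] = tool_name

    def respond(self, request_id: str, behavior: str) -> None:
        """Apply a remote response; it must match a pending request id."""
        if behavior not in ("allow", "deny"):
            raise ValueError(f"unsupported behavior: {behavior!r}")
        if request_id not in self.pending:
            raise KeyError(f"no pending permission request {request_id!r}")
        del self.pending[request_id]
        self.resolved[request_id] = behavior

    def cancel(self, request_id: str) -> None:
        """Server-side cancellation marks the prompt instead of leaving it visible."""
        if self.pending.pop(request_id, None) is not None:
            self.resolved[request_id] = "cancelled"


if __name__ == "__main__":
    board = PermissionBoard()
    board.request("req-1", "Bash")
    board.request("req-2", "Write")
    board.respond("req-1", "allow")
    board.cancel("req-2")
    print(board.pending, board.resolved)
```

A TUI test would then assert that the rendered prompt list equals `pending` after each transition, so a cancelled or answered request can never remain on screen.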

Evidence recipe

A minimal QC case should collect:

pseudo-terminal transcript of a remote permission request;
TUI snapshot showing the synthetic confirmation row;
WebSocket/control transcript for can_use_tool request and response;
SDK stream fixture proving event conversion;
negative case for cancellation or disconnect.

Agent QC rule: when repo metadata is incomplete, write the limitation into evidence_policy and require interface-level evidence instead of inventing a CI story.
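
A limitation record of this kind might be derived mechanically from what was actually inspected. The sketch below is an assumption-heavy illustration: the claim names, the `CLAIM_REQUIRES` mapping, and the output keys are hypothetical, and only the `evidence_policy` concept comes from this page.

```python
from typing import Dict, List

# Hypothetical mapping from a claim to the metadata it depends on.
CLAIM_REQUIRES: Dict[str, List[str]] = {
    "interface-surface": ["src"],
    "test-commands": ["package.json"],
    "upstream-ci": ["workflows"],
    "release-guarantees": ["package.json", "workflows"],
}


def build_evidence_policy(inspected: Dict[str, bool]) -> Dict[str, List[str]]:
    """Split claims into supported and unsupported based on inspected metadata."""
    supported: List[str] = []
    unsupported: List[str] = []
    for claim, needs in CLAIM_REQUIRES.items():
        if all(inspected.get(name, False) for name in needs):
            supported.append(claim)
        else:
            unsupported.append(claim)
    limitations = sorted(name for name, present in inspected.items() if not present)
    return {
        "supported_claims": supported,
        "unsupported_claims": unsupported,
        "limitations": limitations,
    }


if __name__ == "__main__":
    snapshot = {
        "src": True,
        "vendor": True,
        "package.json": False,
        "lockfile": False,
        "workflows": False,
    }
    print(build_evidence_policy(snapshot))
```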

OpenClaw: multi-channel gateway plus WebUI plus QA Lab

Local source: /Users/coso/Documents/dev/js/openclaw.

Product shape

OpenClaw is a dense Agent system:

multi-channel gateway for provider/channel integrations;
plugin ecosystem and plugin SDK;
CLI, gateway, TUI command, control WebUI, Android/iOS/macOS platform paths;
QA Lab extension with web runtime, browser runtime, scenario runner, live transport tests, and reports;
Docker/install/release smoke paths;
live provider lanes for models, gateways, and CLI backends.

How tests are organized

OpenClaw's package.json exposes many lanes. The important Agent QC pattern is the separation, not the number of commands.

Layer	Concrete signals	Agent QC interpretation
Test router	`node scripts/test-projects.mjs`, `test:changed`, `test:max`, serial/max-worker variants	changed-scope and profile-aware gate selection
Static/policy	`check`, `lint`, import-cycle checks, LOC checks, host env policy, webhook/auth boundary lints	`static` plus security policy
Unit/gateway lanes	`test:unit`, `test:gateway`, gateway client/server/method configs	deterministic runtime and gateway behavior
Contract lanes	`test:contracts:channels`, `test:contracts:plugins`, plugin SDK export/API checks, protocol generation checks	`contract-protocol` for channel/plugin/runtime boundaries
WebUI lane	`test:ui`, `ui` package tests, browser-playwright-style UI config tests	`webui` under `ui-interaction`
TUI/platform lanes	`tui`, TUI scripts, `test:windows:ci`, `test:macos:ci`, Android/iOS unit/integration scripts	surface-specific UI/platform proof
QA Lab lane	`extensions/qa-lab` scenario catalog, web runtime, browser runtime, reports, live transports, suite summary JSON	`agent-evals-quality`, `eval-ui`, `webui`, `browser-automation`
Channel lanes	channel configs for Telegram/Matrix/Discord/Feishu/Zalo/etc., webhook/media/auth tests	`multi-channel-agent-gateway`
Live provider lanes	`test:live:*`, live model profiles, live gateway Docker lanes, live CLI backend lanes for Claude/Codex/Gemini-style backends	explicit `live-provider` with opt-in
Docker/install lanes	install smoke, OpenWebUI Docker, MCP channels Docker, QR import, plugins Docker, gateway network Docker	`distribution-release` and runtime smoke
Release lanes	`release:check`, npm checks, plugin release checks, version sync	release readiness and package boundary proof
Performance lanes	startup bench, import duration, perf budget, memory checks	performance risk gates

WebUI details worth standardizing

OpenClaw shows that WebUI proof should be layered:

component/state tests for navigation, chat normalization, settings, controller panels, usage panels, tool cards, and config surfaces;
browser-only tests for focus, markdown, sidebar status, external links, image opening, and browser APIs;
QA Lab web runtime tests for scenario execution and report rendering;
Docker-hosted OpenWebUI smoke to prove integration in a clean environment;
console/network evidence whenever browser behavior is under test.

Agent QC rule: when behavior depends on DOM, focus, browser APIs, markdown sanitization, navigation, or report rendering, webui evidence must include browser-level artifacts, not just jsdom/component tests.
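
A plan reviewer could apply that rule with a check like the one below. The behavior tags and artifact names are hypothetical labels chosen for this sketch, not identifiers from OpenClaw or Agent QC.

```python
from typing import Iterable

# Hypothetical tags for behavior that only a real browser can prove.
BROWSER_DEPENDENT = {
    "dom",
    "focus",
    "browser-api",
    "markdown-sanitization",
    "navigation",
    "report-rendering",
}

# Hypothetical names for browser-level artifacts.
BROWSER_ARTIFACTS = {"trace", "screenshot", "console-log", "network-log"}


def webui_evidence_ok(behaviors: Iterable[str], artifacts: Iterable[str]) -> bool:
    """Return True when the attached artifacts are sufficient for the behaviors."""
    needs_browser = bool(set(behaviors) & BROWSER_DEPENDENT)
    if not needs_browser:
        return True   # component-level evidence is acceptable
    return bool(set(artifacts) & BROWSER_ARTIFACTS)


if __name__ == "__main__":
    print(webui_evidence_ok(["focus"], ["component-test"]))
    print(webui_evidence_ok(["focus"], ["component-test", "trace"]))
```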

Channel/provider details worth standardizing

OpenClaw makes four separations that Agent QC should require:

channel contract tests are not live channel tests;
fake provider integration is not live provider coverage;
media/webhook/auth replay is separate from model semantic quality;
plugin boundary tests are separate from runtime gateway tests.

Examples of useful case shapes:

secret refs are redacted and inactive channel credentials cannot be used;
QR import creates a scoped session and can be replayed in Docker smoke;
webhook body verification happens before parsing user content;
media attachments preserve type/size limits and redaction;
live transport credentials are leased, timed out, and redacted in reports;
control WebUI shows actual gateway status, not cached healthy state.

Agent QC rule: multi-channel-agent-gateway projects should never hide live-provider assumptions inside ordinary unit tests.
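
Keeping live assumptions out of ordinary runs usually means an explicit opt-in. The following sketch assumes a hypothetical environment variable `AGENT_QC_LIVE` and hypothetical test names; it does not describe how OpenClaw's `test:live:*` scripts are implemented.

```python
from typing import Dict, List, Tuple

LIVE_OPT_IN_VAR = "AGENT_QC_LIVE"   # hypothetical opt-in switch


def select_tests(tests: List[Tuple[str, str]], env: Dict[str, str]) -> Dict[str, List[str]]:
    """Split (name, lane) pairs into tests to run and tests to skip."""
    live_enabled = env.get(LIVE_OPT_IN_VAR) == "1"
    plan: Dict[str, List[str]] = {"run": [], "skipped": []}
    for name, lane in tests:
        if lane == "live-provider" and not live_enabled:
            plan["skipped"].append(name)   # reported as skipped, never as passed
        else:
            plan["run"].append(name)
    return plan


if __name__ == "__main__":
    tests = [
        ("channel_contract_fixture", "contract-protocol"),
        ("fake_provider_round_trip", "fake-integration"),
        ("live_transport_send", "live-provider"),
    ]
    print(select_tests(tests, env={}))
    print(select_tests(tests, env={LIVE_OPT_IN_VAR: "1"}))
```

Reporting the skipped list alongside the run list keeps a default run from being read as live coverage.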

Hermes Agent: Python agent plus TUI plus browser/web tools plus scheduler

Local source: /Users/coso/Documents/dev/python/hermes-agent.

Product shape

Hermes combines:

Python Agent runtime, CLI, toolsets, gateway, and ACP/MCP adapters;
pytest-based backend tests;
browser, web provider, CDP, Camofox, Browserbase-style provider, and SSRF hardening tests;
cron/background scheduler, checkpointing, approval, restart/retry, and concurrency surfaces;
ui-tui Ink/React TUI package with Vitest tests;
web Vite/React dashboard package;
Docker image, uv lock, OSV/security, and release checks.

How tests are organized

Layer	Concrete signals	Agent QC interpretation
Canonical runner	`scripts/run_tests.sh` pins `-n 4`, `TZ=UTC`, `LANG=C.UTF-8`, `PYTHONHASHSEED=0`, activates venv, blanks credential env vars, excludes integration/e2e by default	reproducible local evidence and credential hygiene
Pytest backend	`tests/` with gateway, cron, CLI, ACP, browser/tool, security, restart, retry, queue, platform tests	deterministic `unit`, `fake-integration`, `runtime-e2e`
Tool safety	write deny, file guards, symlink confusion, URL safety, yolo/approval modes, env passthrough	permission/sandbox gates
Browser/web	browser supervisor, browser hardening, CDP, local SSRF, Camofox state, web providers	`browser-automation` gates
Gateway/channel	Discord, Feishu, Matrix, Mattermost, Google Chat, QQBot, delivery, media, reconnect, dedup, pairing, roles/DM scope	`channel-ui` and gateway contracts
MCP/OAuth/ACP	MCP e2e, OAuth metadata, SSE transport, reconnect, circuit breaker, tool 401 handling, ACP approval isolation	`contract-protocol` and recovery
Scheduler	cron jobs, cron prompt injection, inactivity timeout, workdir, scheduler MCP init, checkpoint/session cleanup	`background-agent-scheduler`
TUI	`ui-tui` Vitest: terminal parity, viewport, virtual history, slash parity, streaming markdown, OSC52, clipboard, terminal modes	TUI `ui-interaction`
Web dashboard	`web` package uses Vite/React build and lint scripts	`webui` when dashboard behavior changes
Distribution	Dockerfile builds browser dashboard/TUI assets; uv lock; OSV/security notes	`distribution-release` and supply chain
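
The canonical-runner row describes pinning locale, timezone, and hash seed while blanking credentials. The sketch below shows the same idea in Python; the pinned values are the ones named in the table, but the credential suffix list is an assumption and is not claimed to match what `scripts/run_tests.sh` actually blanks.

```python
from typing import Dict, Mapping

# Values named in the table above.
PINNED = {"TZ": "UTC", "LANG": "C.UTF-8", "PYTHONHASHSEED": "0"}

# Assumption: a simple suffix heuristic for credential-shaped variables.
CREDENTIAL_SUFFIXES = ("_API_KEY", "_TOKEN", "_SECRET")


def hermetic_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of `base` with pinned settings and blanked credentials."""
    env = dict(base)
    for name in list(env):
        if name.upper().endswith(CREDENTIAL_SUFFIXES):
            env[name] = ""   # blank rather than delete, so lookups still succeed
    env.update(PINNED)
    return env


if __name__ == "__main__":
    example = {"PATH": "/usr/bin", "EXAMPLE_API_KEY": "not-a-real-key", "TZ": "Asia/Tokyo"}
    print(hermetic_env(example))
```

The returned mapping can be passed as the `env` argument of `subprocess.run` so that the test process never sees host credentials.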

TUI/terminal details worth standardizing

Hermes TUI tests cover practical terminal mechanics:

text wrapping, virtual history heights, scroll, viewport stores, precision wheel;
terminal modes, truecolor, OSC52 clipboard, emoji, math Unicode, syntax/markdown;
slash command parity, gateway events, session lifecycle, queue handling, turn store, state isolation;
streaming markdown, reasoning/details rendering, subagent tree, status ticker;
text input navigation, pass-through, wrapping, completion, composer state.

Agent QC rule: TUI testing should include terminal input/output mechanics, not only component snapshots.

Browser and web details worth standardizing

Hermes browser/tool tests map directly to Agent QC browser-automation gates:

browser supervisor health and orphan reaper;
browser hardening and local SSRF protections;
CDP override, browser console, and local provider behavior;
Camofox persistence/state isolation;
Brave/DDGS/SearXNG/Tavily-like web provider contracts;
CLI browser connect and gateway browser-related command tests.

Agent QC rule: browser automation gates must include safety and cleanup evidence, not only screenshots.
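
As one example of safety evidence, a navigation guard can be tested deterministically. This is a simplified sketch using only the standard library; `is_blocked_target` is a hypothetical helper, not Hermes code, and it deliberately does not resolve hostnames, which a real SSRF guard would also need to do.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_target(url: str) -> bool:
    """Return True when a URL points at a local or otherwise internal address."""
    host = urlsplit(url).hostname
    if host is None:
        return True   # no network host at all, e.g. file:// URLs
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: this sketch lets it through. A production guard must
        # also check every address the name resolves to.
        return False
    return (
        address.is_private
        or address.is_loopback
        or address.is_link_local
        or address.is_reserved
        or address.is_unspecified
    )


if __name__ == "__main__":
    for candidate in (
        "http://127.0.0.1:9222/json",
        "http://169.254.169.254/latest/meta-data/",
        "http://[::1]:8080/",
        "https://8.8.8.8/",
    ):
        print(candidate, is_blocked_target(candidate))
```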

Scheduler/channel details worth standardizing

Hermes shows why background agents need their own gate family:

cron prompt injection must be scanned after skills/context are assembled, not only at user input;
scheduler restart must not duplicate work or lose checkpoints;
inactivity timeout should track real tool activity, not wall-clock time alone;
gateway restart/retry/dedup tests should preserve message ids and delivery state;
credential-shaped environment variables must be blanked or scoped in tests.

Agent QC rule: a background scheduler pass should include deterministic clock/env settings, checkpoint evidence, and cleanup evidence.
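
Duplicate-work prevention across a restart can be tested with an injected clock and a checkpoint that survives the restart. `FakeClock` and `JobLedger` are hypothetical names for this sketch; they do not describe the Hermes scheduler's actual data model.

```python
from typing import Dict, Optional, Tuple

RunKey = Tuple[str, float]   # (job id, scheduled time)


class FakeClock:
    """Deterministic clock so tests never depend on wall-clock time."""

    def __init__(self, now: float = 0.0) -> None:
        self.now = now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class JobLedger:
    """Hypothetical checkpoint of runs that were already claimed."""

    def __init__(self, clock: FakeClock, checkpoint: Optional[Dict[RunKey, float]] = None) -> None:
        self.clock = clock
        self.claimed: Dict[RunKey, float] = dict(checkpoint or {})

    def claim(self, job_id: str, scheduled_at: float) -> bool:
        """Claim one scheduled run; return False if it was claimed before."""
        key = (job_id, scheduled_at)
        if key in self.claimed:
            return False
        self.claimed[key] = self.clock.now
        return True


if __name__ == "__main__":
    clock = FakeClock()
    ledger = JobLedger(clock)
    print(ledger.claim("daily-report", 100.0))

    # Simulate a restart: a new ledger is rebuilt from the saved checkpoint.
    clock.advance(5.0)
    restored = JobLedger(clock, checkpoint=ledger.claimed)
    print(restored.claim("daily-report", 100.0))
    print(restored.claim("daily-report", 200.0))
```

The checkpoint dictionary doubles as evidence: it records which runs were claimed and at what simulated time.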

Cross-project extraction

Agent QC generalizes these projects into ten reusable rules:

Start from owned risk, not from language or framework.
Split UI surface proof from runtime/protocol proof, then connect them with evidence refs.
Keep fake integration, live provider, and release smoke as separate gates.
For TUI/WebUI/GUI, preserve surface artifacts: snapshots, traces, screenshots, console logs, terminal transcripts.
For browser automation, require DOM/a11y plus console/network plus cleanup evidence.
For channel/mobile, separate webhook/media/auth replay from live provider tests.
For background agents, pin deterministic time/env/worker settings and preserve checkpoint evidence.
For SDK/protocol surfaces, use generated schema diffs and fake servers before live runs.
For release claims, test package contents and installation paths, not just source tests.
For incomplete local snapshots, record what was inspected and what cannot be inferred.

Recommended Agent QC case mix

Risk	Minimum case	Stronger case
Permission prompt	snapshot/frame shows prompt, transcript shows request id	allow/deny/cancel/reconnect variants with protocol transcript
Tool stream	fake provider stream parses and renders	malformed stream, tool error, partial/final event, retry, abort
TUI rendering	one stable snapshot	multi-viewport, key sequence, Unicode/ANSI, runtime-linked transcript
WebUI control	component state test	browser trace, console/network, keyboard/a11y, reload/resume
Desktop bridge	shell starts	bridge health, workspace readiness, native command contract, screenshot
Browser control	screenshot	DOM/a11y, console/network, cleanup, SSRF/navigation safety
Channel adapter	contract fixture	webhook replay, media fixture, redacted transcript, live opt-in lane
Scheduler	deterministic unit	restart/reclaim, concurrency, checkpoint, duplicate-work prevention
Eval report	rubric exists	baseline delta, judge output, failing examples, reviewer note
Release	build succeeds	package manifest, install smoke, Docker/platform matrix, lock/security check
