Performance and reliability metrics

Agent projects often fail by appearing alive while queues, streams, tools, browsers, or background workers are stuck. QC evidence should capture enough timing and reliability data to explain perceived slowness and flaky behavior.

Agent QC does not mandate universal thresholds. Each project should define thresholds by product profile and risk.

Runtime responsiveness

| Metric | Meaning | Applies to |
| --- | --- | --- |
| submit_to_accept_ms | user action to runtime acceptance | CLI, TUI, GUI, WebUI |
| first_status_ms | first user-visible runtime status | Agent UI/TUI/desktop |
| first_text_delta_ms | first model/user-facing text delta | streams and chat UIs |
| first_tool_event_ms | first tool start/progress event | tool/runtime gates |
| interrupt_ack_ms | interrupt/cancel request to runtime acknowledgement | CLI/TUI/GUI |
| resume_ready_ms | old session or task resume to usable state | sessions, schedulers |

These metrics are inspired by Agent UI's separation of listener binding, runtime acceptance, first status, first text, and paint timing.
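
As an illustration of how the responsiveness metrics above might be derived, the following sketch computes them from a list of timestamped runtime events. The event kind names (`submit`, `accepted`, `status`, `text_delta`, `tool_event`) and the function name are assumptions made for this example; a real runtime would map its own event types onto the metrics.

```python
from typing import Optional

# Hypothetical mapping from metric name to the event kind that ends the interval.
EVENT_FOR_METRIC = {
    "submit_to_accept_ms": "accepted",
    "first_status_ms": "status",
    "first_text_delta_ms": "text_delta",
    "first_tool_event_ms": "tool_event",
}


def responsiveness_metrics(
    events: list[tuple[str, float]],
) -> dict[str, Optional[float]]:
    """Compute latencies from (kind, timestamp_ms) pairs.

    All intervals are measured from the first 'submit' event. A metric is
    None when the corresponding event never appeared after the submit.
    """
    submit_ts = next(ts for kind, ts in events if kind == "submit")
    metrics: dict[str, Optional[float]] = {}
    for metric, wanted in EVENT_FOR_METRIC.items():
        seen = [ts for kind, ts in events if kind == wanted and ts >= submit_ts]
        metrics[metric] = min(seen) - submit_ts if seen else None
    return metrics
```

Reporting a missing event as `None` rather than `0` keeps "never happened" distinguishable from "happened instantly", which matters when a stuck stream is the failure being investigated.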

Stream and projection health

| Metric | Meaning | Evidence |
| --- | --- | --- |
| event_sequence_gap_count | missing or out-of-order runtime events | protocol transcript |
| delta_backlog_depth | queued unrendered text/tool deltas | UI diagnostics |
| oldest_unrendered_delta_ms | oldest pending delta age | UI diagnostics |
| final_reconciliation_duplicates | duplicated streamed/final text count | transcript + surface artifact |
| stale_success_count | UI claimed success before runtime confirmation | runtime/UI comparison |
| missing_fact_fallback_count | unknown, unavailable, stale, or blocked projections | UI snapshot/report |
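
One possible way to count `event_sequence_gap_count` from a protocol transcript is sketched below. It assumes each runtime event carries an integer sequence number that increases by one, which the source does not specify. The counting rule is also an assumption: a skipped number counts once when it is skipped, and a late or repeated number counts once when it arrives.

```python
def event_sequence_gap_count(seqs: list[int]) -> int:
    """Count missing or out-of-order sequence numbers, in arrival order."""
    gaps = 0
    expected = seqs[0] if seqs else 0
    for seq in seqs:
        if seq < expected:
            # Arrived after a later event, or repeated: out of order.
            gaps += 1
        else:
            # Every number skipped before this event counts as missing.
            gaps += seq - expected
            expected = seq + 1
    return gaps
```

Under this rule a number that is skipped and later arrives out of order contributes twice, so projects that prefer to count it once would need a different rule.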

Tool and permission reliability

| Metric | Meaning | Evidence |
| --- | --- | --- |
| tool_start_to_result_ms | tool duration by tool id | tool transcript |
| tool_error_recovery_count | retry/recovery attempts after tool failure | runtime transcript |
| approval_pending_ms | time in human-in-the-loop state | action transcript |
| approval_correlation_failures | request/response id mismatches | protocol test |
| denied_side_effect_count | denied action still caused side effect | sandbox/process evidence |
| orphan_process_count | subprocess/browser workers left behind | cleanup evidence |
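
A protocol test for `approval_correlation_failures` could look roughly like the sketch below. It assumes approval request ids are unique strings and that each request should receive at most one response; both are assumptions for illustration, not requirements stated by Agent QC.

```python
def approval_correlation_failures(
    request_ids: list[str], response_ids: list[str]
) -> int:
    """Count responses that do not correlate with a pending request."""
    pending = set(request_ids)
    failures = 0
    for response_id in response_ids:
        if response_id in pending:
            pending.remove(response_id)  # matched exactly once
        else:
            # Unknown id, or a second response to an already answered request.
            failures += 1
    return failures
```

Requests that never receive a response are deliberately not counted here, since they would surface through `approval_pending_ms` instead.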

Browser, WebUI, TUI, and desktop reliability

| Surface | Metrics |
| --- | --- |
| webui | page load, first status paint, console error count, failed network count, trace size |
| desktop-gui | shell start, bridge health time, workspace readiness, native command timeout, mock fallback count |
| tui | first frame, redraw latency, viewport reflow failures, key handling failures, Unicode/ANSI rendering failures |
| browser-automation | navigation time, DOM ready, console/network errors, screenshot/trace success, cleanup/orphan count |
| channel-ui | webhook verification time, dedup count, media processing time, retry count, delivery ack time |

Playwright-style projects should retain trace/screenshot/video on failure and record browser project/device when relevant. Vitest browser-mode or component tests can prove component behavior, but browser-only APIs need browser evidence.

Scheduler and background reliability

| Metric | Meaning |
| --- | --- |
| lease_reclaim_ms | time to reclaim work after interrupted owner |
| checkpoint_age_ms | age of last durable checkpoint |
| duplicate_job_count | duplicate execution for same job id |
| lost_job_count | scheduled jobs not executed by deadline |
| retry_attempt_count | attempts per task before success/failure/exhaustion |
| worker_shutdown_ms | graceful worker termination time |
| queue_depth | pending work by queue or priority |

Hermes-style projects should pin deterministic clock/env for normal tests and reserve live provider/channel checks for explicit opt-in lanes.
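
The two job-level counters in the table above can be derived from a schedule and an execution log. The sketch below assumes a schedule that maps job id to a deadline in milliseconds and an execution log of `(job_id, finished_at_ms)` pairs; these shapes and the function name are hypothetical. Because the timestamps are passed in as data, the check runs against a deterministic clock.

```python
from collections import Counter


def scheduler_counts(
    scheduled: dict[str, float], executions: list[tuple[str, float]]
) -> dict[str, int]:
    """Derive duplicate_job_count and lost_job_count from an execution log."""
    runs = Counter(job_id for job_id, _ in executions)
    # Every execution beyond the first for a job id is a duplicate.
    duplicate_job_count = sum(n - 1 for n in runs.values() if n > 1)
    # A job is on time if at least one execution finished by its deadline.
    on_time = {
        job_id
        for job_id, finished_at in executions
        if job_id in scheduled and finished_at <= scheduled[job_id]
    }
    lost_job_count = len(set(scheduled) - on_time)
    return {
        "duplicate_job_count": duplicate_job_count,
        "lost_job_count": lost_job_count,
    }
```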

Release and distribution reliability

| Metric | Meaning |
| --- | --- |
| clean_install_ms | fresh install duration |
| package_size_bytes | package or image size |
| manifest_missing_count | expected files absent from package |
| version_mismatch_count | package/app/Cargo/Tauri/version drift |
| docker_smoke_ms | Docker smoke duration |
| platform_failure_count | OS matrix failures |
| lock_drift_count | lockfile or generated artifact drift |

Codex-style projects may use Bazel/nextest/release binaries. OpenClaw-style projects may use Docker/install smoke and plugin release checks. Agent QC only requires the evidence shape.
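
Whatever build tooling produces the inputs, two of the release counters reduce to simple comparisons. The sketch below assumes the file lists and version strings have already been collected; the function signatures and the file names in the usage note are illustrative only.

```python
def manifest_missing_count(expected: set[str], packaged: set[str]) -> int:
    """Count expected files that are absent from the built package."""
    return len(expected - packaged)


def version_mismatch_count(versions: dict[str, str], reference: str) -> int:
    """Count version sources that disagree with the reference release version."""
    return sum(1 for version in versions.values() if version != reference)
```

For example, `version_mismatch_count({"package.json": "1.2.0", "Cargo.toml": "1.2.0", "tauri.conf.json": "1.1.9"}, "1.2.0")` returns `1`.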

Suggested threshold policy

A QC plan SHOULD define:

| Threshold | Example |
| --- | --- |
| Local default | deterministic gates must pass with no live credentials |
| Surface smoke | first status or bridge health must appear within product-specific timeout |
| Flake budget | retry count and rerun policy for known flaky lanes |
| Live budget | provider/channel cost, credential scope, and timeout |
| Release budget | install time, package size, OS matrix, Docker smoke timeout |
| Waiver expiry | date/version when missing metric must be rechecked |
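
A threshold with a waiver expiry might be evaluated along the following lines. This is a minimal sketch: the `Threshold` fields, the three result strings, and the choice of a calendar date (rather than a version) for the waiver are assumptions made for the example. Each project would still set its own limits.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Threshold:
    metric: str
    max_value: float
    # Hypothetical field: last day a missing metric may be waived.
    waiver_expiry: Optional[date] = None


def evaluate(threshold: Threshold, observed: Optional[float], today: date) -> str:
    """Return 'pass', 'fail', or 'waived' for one metric observation."""
    if observed is None:
        # Missing metric: acceptable only while the waiver is still valid.
        if threshold.waiver_expiry is not None and today <= threshold.waiver_expiry:
            return "waived"
        return "fail"
    return "pass" if observed <= threshold.max_value else "fail"
```

Treating an expired waiver as a failure forces the missing metric to be rechecked instead of silently staying unmeasured.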

Evidence guidance

When a performance or reliability gate fails, preserve:

  • the command or interaction that started the run;
  • timestamps and environment;
  • trace/screenshot/transcript around the slow or flaky segment;
  • retry and cleanup outcome;
  • whether the failure blocks release, needs review, or can be waived.
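
The items listed above could be captured in a single serializable record so that they travel together with the failed run. The sketch below is one possible shape; the class name, field names, and disposition values are hypothetical and not prescribed by Agent QC.

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical labels for the three outcomes named in the list above.
DISPOSITIONS = {"blocks_release", "needs_review", "waivable"}


@dataclass
class FailureEvidence:
    command: str                      # command or interaction that started the run
    started_at: str                   # ISO 8601 timestamps
    finished_at: str
    environment: dict[str, str]       # OS, runtime version, relevant env vars
    artifacts: list[str] = field(default_factory=list)  # trace/screenshot/transcript paths
    retry_outcome: str = "not_retried"
    cleanup_outcome: str = "unknown"
    disposition: str = "needs_review"

    def __post_init__(self) -> None:
        if self.disposition not in DISPOSITIONS:
            raise ValueError(f"unknown disposition: {self.disposition}")

    def to_json(self) -> str:
        """Serialize the record for storage next to the other artifacts."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```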

Draft standard for evidence-driven quality control of Agent projects.