Benchmark instrumentation

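The Agent Runtime does not score the product, but it must make benchmarks reproducible, attributable, and auditable. Lime aims to find problems through testing and evolve from them, so the runtime needs to treat every benchmark trial as a first-class execution fact instead of leaving results scattered across GUI smoke tests, logs, and manual summaries.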

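Goals

Benchmark instrumentation must answer seven questions:

Which dataset, task, and baseline/candidate config did this trial use?
Where are the Harbor task, job, trial, and artifact refs?
Which tool, action, model, and context decisions did the agent make during execution?
How did the verifier or reward arrive at the score?
Is the failure a runtime, model, tool, GUI projection, environment, or verifier problem?
How is this finding written back to an Agent QC gate, replay, or release blocker?
Can the same runtime fact serve replay, review, the benchmark viewer, and Lime UI diagnostics at the same time?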

Event families

A compatible runtime SHOULD support these benchmark events:

Event	Purpose
`benchmark.dataset.resolved`	Records the dataset id, version, selection policy, freeze time, and the Harbor local path or registry ref.
`benchmark.configuration.resolved`	Records the runtime, agent, model, prompt, tool, context, and routing profile of the baseline/candidate.
`benchmark.trial.started`	Records the task, config, attempt, sandbox/env, timeout, budget, and Harbor trial ref.
`benchmark.trial.completed`	Records a successful trial, its duration, artifact refs, trajectory refs, and cleanup.
`benchmark.trial.failed`	Records failure, timeout, blocked, verifier error, environment issue, and failure category.
`benchmark.reward.recorded`	Records the reward, reward details, criterion summary, verifier refs, and drift/oracle sanity.
`benchmark.comparison.completed`	Records the baseline/candidate aggregate delta, the promotion/revert decision, and remaining risk.

These events may be written by the benchmark runner, the runtime adapter, or the evidence exporter, but they must all use the same runtime correlation spine.
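As a minimal sketch of what "the same runtime correlation spine" could mean in practice, the following Python helper stamps every benchmark event with one shared set of ids and checks that a batch of events agrees on them. The spine field names mirror the payload examples in this document; `CorrelationSpine`, `make_event`, and `share_spine` are hypothetical names used only for illustration, not part of any Lime API.

```python
from dataclasses import asdict, dataclass
from typing import Any

# Field names follow the event payload examples in this document.
SPINE_FIELDS = ("runtimeId", "sessionId", "threadId", "turnId", "taskId", "runId")


@dataclass(frozen=True)
class CorrelationSpine:
    """Hypothetical container for the ids every benchmark event must carry."""
    runtimeId: str
    sessionId: str
    threadId: str
    turnId: str
    taskId: str
    runId: str


def make_event(event_type: str, spine: CorrelationSpine, sequence: int,
               benchmark: dict[str, Any], payload: dict[str, Any]) -> dict[str, Any]:
    """Build a benchmark event that embeds the shared spine at the top level."""
    if not event_type.startswith("benchmark."):
        raise ValueError(f"not a benchmark event: {event_type}")
    return {
        "type": event_type,
        **asdict(spine),
        "sequence": sequence,
        "benchmark": dict(benchmark),
        "payload": dict(payload),
    }


def share_spine(events: list[dict[str, Any]]) -> bool:
    """True when every event carries identical, non-missing spine values."""
    spines = {tuple(ev.get(field) for field in SPINE_FIELDS) for ev in events}
    return len(spines) == 1 and None not in next(iter(spines))
```

Whichever component writes the event (runner, adapter, or exporter), passing the same `CorrelationSpine` instance keeps the events joinable later; `share_spine` is the kind of check an exporter could run before publishing a trial.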

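Harbor compatibility responsibilities

If Lime uses Harbor to run Agent QC benchmarks, the Runtime or the adapter must supply these facts:

Harbor fact	Runtime responsibility
`task.toml`	Keep a reference to the task id, environment, agent/verifier timeouts, artifacts, and multi-step configuration.
`/logs/artifacts/`	Artifacts the agent deliberately publishes to the verifier; the Runtime should write a manifest that states producer, hash, size, and redaction.
`/logs/agent/trajectory.json`	Export an ATIF-compatible trajectory; if the Harbor agent does not produce one, the Runtime adapter must convert.
`/logs/verifier/reward.txt` or `/logs/verifier/reward.json`	Referenced as `rewardRef` in the trial evidence pack and in `benchmark.reward.recorded`.
`/logs/verifier/reward-details.json`	Records each criterion, judge reasoning, errors, and evidence refs; when the file is absent, use `test-stdout.txt`, `ctrf.json`, or a reviewer note as the details ref.
Separate verifier transfer	`/logs/agent/` and `/logs/verifier/` are not implicitly passed to a separate verifier; when trajectory grading is needed, a configured artifact must be declared.
`jobs/<job>/<trial>/result.json`	Serves as the trial result ref; the Runtime only appends correlation and does not rewrite the Harbor verdict.

KISS rule: the Runtime only exports facts and does no benchmark scoring; only Agent QC makes gate verdicts and promotion decisions.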
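The artifact manifest is the most mechanical of these responsibilities, so a small sketch may help. The Python below walks an artifact directory and records producer, hash, size, and a redaction flag per file. The manifest layout (`artifacts`, `path`, `sha256`, `sizeBytes`, `redacted`) and the function names are assumptions for illustration; this document only states which facts the manifest should carry, not its schema.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(artifact_dir: Path, producer: str,
                   redacted: frozenset = frozenset()) -> dict:
    """Describe every file under artifact_dir (hypothetical manifest layout)."""
    entries = []
    for path in sorted(p for p in artifact_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(artifact_dir).as_posix()
        if rel == "manifest.json":
            continue  # the manifest does not describe itself
        data = path.read_bytes()
        entries.append({
            "path": rel,
            "producer": producer,
            "sha256": hashlib.sha256(data).hexdigest(),
            "sizeBytes": len(data),
            "redacted": rel in redacted,  # caller states which files were redacted
        })
    return {"artifacts": entries}


def write_manifest(artifact_dir: Path, producer: str,
                   redacted: frozenset = frozenset()) -> Path:
    """Write manifest.json next to the artifacts and return its path."""
    out = artifact_dir / "manifest.json"
    manifest = build_manifest(artifact_dir, producer, redacted)
    out.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out
```

The sketch only records facts about files. It does not judge whether an artifact is good, which keeps it on the Runtime side of the KISS rule above.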

Trial correlation spine

Each trial evidence pack SHOULD contain:

json

{
  "benchmark": {
    "datasetId": "lime-internal-agent-runtime-tasks",
    "datasetVersion": "2026-05-frozen",
    "datasetRef": "benchmarks/lime-agent-runtime",
    "taskId": "tool-approval-sandbox-boundary",
    "trialId": "trial-tool-approval-candidate-1",
    "configurationId": "candidate-feedback-v2",
    "role": "candidate",
    "harborJobRef": "jobs/lime-runtime-candidate-feedback-v2",
    "harborTrialRef": "jobs/lime-runtime-candidate-feedback-v2/trial-tool-approval-candidate-1"
  },
  "runtimeCorrelation": {
    "runtimeId": "lime_runtime_local",
    "sessionId": "sess_123",
    "threadId": "thread_123",
    "turnId": "turn_123",
    "taskId": "task_123",
    "runId": "run_123",
    "traceId": "trace_123"
  },
  "refs": {
    "trajectoryRef": "/logs/agent/trajectory.json",
    "runtimeTranscriptRef": "/logs/artifacts/runtime-transcript.json",
    "rewardRef": "/logs/verifier/reward.json",
    "rewardDetailsRef": "/logs/verifier/reward-details.json",
    "artifactManifestRef": "/logs/artifacts/manifest.json",
    "agentQcReportRef": ".lime/qc/current-agent-qc-report.json"
  }
}

Without `runtimeCorrelation`, a benchmark can still produce a score, but that score cannot be used to explain whether the Lime runtime improved.

ATIF trajectory export

The Runtime SHOULD export its internal event stream as a Harbor ATIF-compatible trajectory:

Runtime fact	ATIF target
`sessionId` / `runId`	`session_id` or metadata correlation.
model selection	`agent.name`, `agent.version`, `agent.model_name`.
user input / instruction	`steps[].source = "user"` with the message.
reasoning/model output	`steps[].source = "agent"`, message, `reasoning_content`.
tool start/args/result	`steps[].tool_calls` and `observation.results`.
permission decision	tool/action metadata or observation status.
token/cost/cache	`steps[].metrics` and `final_metrics`.
runtime warnings/errors	step metadata, observation error, or final failure category.

A trajectory may be redacted and summarized, but the ids needed for failure attribution must not be removed.
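The mapping above can be sketched as a small converter. This is illustrative only: the input event shape (`kind`, `text`, `reasoning`, `callId`, `tool`, `args`, `content`) is a hypothetical stand-in for Lime's internal event stream, and the nested output keys `step_id`, `tool_call_id`, `function_name`, `arguments`, and `source_call_id` are assumptions that should be checked against the ATIF schema Harbor actually publishes. Only `session_id`, `agent`, `steps[].source`, `message`, `reasoning_content`, `tool_calls`, `observation.results`, and `metrics` come from the table.

```python
from typing import Any


def to_trajectory(session_id: str, agent: dict[str, str],
                  events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold a flat runtime event list into ATIF-style steps (sketch)."""
    steps: list[dict[str, Any]] = []
    for ev in events:
        kind = ev["kind"]
        if kind == "user_input":
            steps.append({"source": "user", "message": ev["text"]})
        elif kind == "model_output":
            step: dict[str, Any] = {"source": "agent", "message": ev["text"]}
            if "reasoning" in ev:
                step["reasoning_content"] = ev["reasoning"]
            if "metrics" in ev:
                step["metrics"] = ev["metrics"]
            steps.append(step)
        elif kind in ("tool_call", "tool_result"):
            # Tool events attach to the agent step that issued them.
            if not steps or steps[-1]["source"] != "agent":
                raise ValueError("tool event without a preceding agent step")
            current = steps[-1]
            if kind == "tool_call":
                current.setdefault("tool_calls", []).append({
                    "tool_call_id": ev["callId"],  # id kept for attribution
                    "function_name": ev["tool"],
                    "arguments": ev["args"],
                })
            else:
                observation = current.setdefault("observation", {"results": []})
                observation["results"].append({
                    "source_call_id": ev["callId"],
                    "content": ev["content"],  # may be a redacted summary
                })
        else:
            raise ValueError(f"unknown event kind: {kind}")
    for index, step in enumerate(steps, start=1):
        step["step_id"] = index
    return {"session_id": session_id, "agent": agent, "steps": steps}
```

Note that `content` can be summarized or redacted before it reaches the converter, while the call ids pass through untouched, which is the property the last sentence above asks for.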

Trajectory requirements

A trajectory SHOULD retain the following events or equivalent summaries:

Area	Required facts
Input	instruction, attachments, options, source channel.
Model	selected model, provider, request ids, token/cost, fallback.
Context	selected refs, compaction, missing facts, context warnings.
Tools	tool inventory, tool args, progress, result/error, result refs.
Permission	`action.required`, decision id, approve/deny/edit, side-effect check.
Process/browser	command, cwd, exit code, console/network, cleanup.
Artifacts	produced refs, export status, validation issues.
Outcome	final status, failure category, known gaps, evidence ids.

Control plane support

The Runtime or its host SHOULD provide these command semantics:

Command	Purpose
`start_benchmark_trial`	Binds the dataset/task/config, Harbor job/trial, sandbox, and timeout, and creates the trial scope.
`record_benchmark_reward`	Writes the reward, reward details, verifier status, failure category, and drift/oracle sanity.
`export_benchmark_trial`	Exports the trajectory, runtime transcript, artifacts, reward, Agent QC refs, and redaction manifest.
`compare_benchmark_runs`	Records the baseline/candidate delta, cost, evidence completeness, and the decision.

If the product does not expose these commands, it should still emit an equivalent structure through `export_evidence` / `export_replay`.

Event payload examples

`benchmark.trial.started`:

json

{
  "type": "benchmark.trial.started",
  "eventId": "evt_trial_started_1",
  "schemaVersion": "lime-profile-0.4.0",
  "runtimeId": "lime_runtime_local",
  "sessionId": "sess_123",
  "threadId": "thread_123",
  "turnId": "turn_123",
  "taskId": "task_123",
  "runId": "run_123",
  "sequence": 42,
  "timestamp": "2026-05-17T09:00:00Z",
  "benchmark": {
    "datasetId": "lime-internal-agent-runtime-tasks",
    "datasetVersion": "2026-05-frozen",
    "taskId": "tool-approval-sandbox-boundary",
    "trialId": "trial-tool-approval-candidate-1",
    "configurationId": "candidate-feedback-v2",
    "harborJobRef": "jobs/lime-runtime-candidate-feedback-v2",
    "harborTrialRef": "jobs/lime-runtime-candidate-feedback-v2/trial-tool-approval-candidate-1"
  },
  "refs": {
    "taskTomlRef": "benchmarks/lime-agent-runtime/tool-approval-sandbox-boundary/task.toml"
  },
  "payload": {
    "attempt": 1,
    "timeoutSec": 300,
    "singleChangedVariable": "tool failure feedback profile"
  }
}

`benchmark.reward.recorded`:

json

{
  "type": "benchmark.reward.recorded",
  "eventId": "evt_reward_1",
  "schemaVersion": "lime-profile-0.4.0",
  "runtimeId": "lime_runtime_local",
  "sessionId": "sess_123",
  "threadId": "thread_123",
  "turnId": "turn_123",
  "taskId": "task_123",
  "runId": "run_123",
  "sequence": 55,
  "timestamp": "2026-05-17T09:03:00Z",
  "benchmark": {
    "datasetId": "lime-internal-agent-runtime-tasks",
    "datasetVersion": "2026-05-frozen",
    "taskId": "tool-approval-sandbox-boundary",
    "trialId": "trial-tool-approval-candidate-1",
    "configurationId": "candidate-feedback-v2"
  },
  "refs": {
    "rewardRef": "/logs/verifier/reward.json",
    "rewardDetailsRef": "/logs/verifier/reward-details.json",
    "trajectoryRef": "/logs/agent/trajectory.json"
  },
  "payload": {
    "reward": 1,
    "criteria": ["approval-denial-facts", "trajectory-present"],
    "failureCategory": "none"
  }
}

Lime test examples

1. Check that the trial pack can be correlated

bash

jq -e '
  .benchmark.datasetId
  and .benchmark.taskId
  and .benchmark.trialId
  and .benchmark.harborJobRef
  and .runtimeCorrelation.sessionId
  and .runtimeCorrelation.threadId
  and .runtimeCorrelation.turnId
  and .runtimeCorrelation.runId
  and .refs.trajectoryRef
  and .refs.rewardDetailsRef
  and .refs.artifactManifestRef
' .lime/qc/benchmark/trial-tool-approval-candidate-1.json

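Failure handling:

Missing benchmark fields: blocked by the benchmark runner;
Missing runtime correlation: blocked by the Agent Runtime profile;
Missing reward details: blocked by benchmark-eval;
Missing trajectory: the trial can only serve as a smoke test, not for hill climbing;
Missing artifact manifest: the Harbor trial may still be viewable, but Agent QC cannot approve promotion.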
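The jq check only says pass or fail. To route a failure to the right owner, the same field list can be paired with the blocking rules above. The following Python sketch does that; the owner labels are shorthand for the five rules in the failure-handling list, and `REQUIRED`, `lookup`, and `blocking_owners` are hypothetical names. Like jq, it treats only a missing value, `null`, or `false` as absent.

```python
from typing import Any

# (dotted path, who blocks when it is missing), derived from the check above.
REQUIRED = [
    ("benchmark.datasetId", "benchmark runner"),
    ("benchmark.taskId", "benchmark runner"),
    ("benchmark.trialId", "benchmark runner"),
    ("benchmark.harborJobRef", "benchmark runner"),
    ("runtimeCorrelation.sessionId", "Agent Runtime profile"),
    ("runtimeCorrelation.threadId", "Agent Runtime profile"),
    ("runtimeCorrelation.turnId", "Agent Runtime profile"),
    ("runtimeCorrelation.runId", "Agent Runtime profile"),
    ("refs.trajectoryRef", "hill climbing (smoke only)"),
    ("refs.rewardDetailsRef", "benchmark-eval"),
    ("refs.artifactManifestRef", "Agent QC promotion"),
]


def lookup(doc: Any, dotted: str) -> Any:
    """Follow a dotted path through nested dicts; None when any key is absent."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def blocking_owners(pack: dict[str, Any]) -> dict[str, list[str]]:
    """Group missing fields by the owner that should block on them."""
    blocked: dict[str, list[str]] = {}
    for path, owner in REQUIRED:
        value = lookup(pack, path)
        if value is None or value is False:  # same truthiness rule as jq
            blocked.setdefault(owner, []).append(path)
    return blocked
```

An empty result means the pack passes the same fields the jq expression tests; a non-empty result names who should act on each gap.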

2. Check that the reward improvement did not sacrifice P0 QC

bash

jq -e '
  .comparison.meanRewardDelta >= 0
  and .comparison.p0QcGateRegressionCount == 0
  and .comparison.evidenceCompletenessRate >= .comparison.baselineEvidenceCompletenessRate
' .lime/qc/benchmark/feedback-v2-comparison.json

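This check should run before candidate promotion. It reflects KISS: just three hard metrics block the most dangerous case, where the score goes up but the product gets worse.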
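For readers wondering where the three numbers might come from, here is one possible way to derive them from per-task trial results. The input shape (`reward`, `p0Passed`, `evidenceComplete` per task id) and the definitions of regression and completeness rate are assumptions made for this sketch; the document does not define how the comparison file is computed. Only the four output field names are taken from the jq check.

```python
from statistics import mean
from typing import Any


def summarize(baseline: dict[str, dict[str, Any]],
              candidate: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Aggregate per-task results (keyed by task id) into a comparison record."""
    if not baseline or baseline.keys() != candidate.keys():
        # Comparing different task sets would invalidate the A/B.
        raise ValueError("baseline and candidate must cover the same non-empty task set")
    tasks = sorted(baseline)

    def completeness(trials: dict[str, dict[str, Any]]) -> float:
        return sum(1 for t in tasks if trials[t]["evidenceComplete"]) / len(tasks)

    return {"comparison": {
        "meanRewardDelta": mean(candidate[t]["reward"] for t in tasks)
                           - mean(baseline[t]["reward"] for t in tasks),
        # Assumed definition: a P0 gate that passed on baseline but fails on candidate.
        "p0QcGateRegressionCount": sum(
            1 for t in tasks if baseline[t]["p0Passed"] and not candidate[t]["p0Passed"]),
        "evidenceCompletenessRate": completeness(candidate),
        "baselineEvidenceCompletenessRate": completeness(baseline),
    }}


def promotable(summary: dict[str, Any]) -> bool:
    """Python equivalent of the three-condition jq check."""
    c = summary["comparison"]
    return (c["meanRewardDelta"] >= 0
            and c["p0QcGateRegressionCount"] == 0
            and c["evidenceCompletenessRate"] >= c["baselineEvidenceCompletenessRate"])
```

In this sketch a candidate whose mean reward rises while one P0 gate regresses is still rejected, which is exactly the case the check is meant to stop.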

3. Write failures back to QC

When the `failureCategory` of a `benchmark.trial.failed` event is `tool-result-missing`, `stream-stuck`, `permission-bypass`, or `stale-success`, Lime should write the failure back to:

The `failureModes` of the Agent QC scenario;
The runtime replay fixture;
The verifier criterion of the corresponding gate;
A release blocker or waiver, if the failure affects the release.
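The write-back rule above can be sketched as a small routing function. The target labels are placeholder strings for the four destinations in the list, and reading `failureCategory` from the event's `payload` is an assumption based on where the `benchmark.reward.recorded` example puts it; the actual `benchmark.trial.failed` payload shape is not shown in this document.

```python
from typing import Any

# Categories named in the text as requiring write-back.
WRITE_BACK_CATEGORIES = frozenset({
    "tool-result-missing",
    "stream-stuck",
    "permission-bypass",
    "stale-success",
})


def write_back_targets(event: dict[str, Any], affects_release: bool) -> list[str]:
    """Return where a failed trial should be recorded (labels are placeholders)."""
    if event.get("type") != "benchmark.trial.failed":
        return []
    category = event.get("payload", {}).get("failureCategory")
    if category not in WRITE_BACK_CATEGORIES:
        return []
    targets = [
        "agent-qc-scenario.failureModes",
        "runtime-replay-fixture",
        "gate-verifier-criterion",
    ]
    if affects_release:
        targets.append("release-blocker-or-waiver")
    return targets
```

Whether a failure affects the release is passed in rather than inferred, since that judgment belongs to Agent QC and not to the Runtime.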

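Anti-patterns

Anti-pattern	Risk
The benchmark stores only the final score	No attribution is possible, so nothing can be fixed.
The trajectory lacks tool/action ids	There is no way to tell whether the problem is the model, the tool, or permissions.
The candidate uses a different dataset or verifier	The A/B comparison is invalid.
The Runtime does not export cost and timeout	A high score may not be operable in production.
The GUI benchmark is not connected to runtime facts	It proves only what was on screen, not what actually executed.
The separate verifier has no declared trajectory artifact	The verifier cannot audit the agent's real actions.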
