ci(smoke): add token-usage sanity checks to smoke workflows #5264
✅ Coverage Check Passed
Overall Coverage
📁 Per-file Coverage Changes (1 file)
Add a verify_token_usage job to smoke-copilot, smoke-claude, and smoke-codex that runs after the agent job on the downloaded agent artifact and fails the workflow when token accounting looks wrong.

The checker (scripts/ci/check-token-usage.js) enforces two invariants:
- Internal consistency: the sum of per-response records in token-usage.jsonl must exactly equal the aggregated agent_usage.json (input/output/cache_read/cache_write). This is engine-independent.
- cache_read_tokens must not be 0 across multiple responses, which is the symptom of the cached-token normalization bug.

ai_credits/ambient_context drift is reported as warnings only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
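The internal-consistency invariant described in this commit message can be illustrated with a minimal sketch. This is not the code from scripts/ci/check-token-usage.js: the helper names `sumRecords` and `checkConsistency`, the `FIELDS` list, and every field name other than `cache_read_tokens` are assumptions made for illustration.

```javascript
// Hypothetical sketch: helper names and most field names are assumptions,
// not taken from the real checker.
const FIELDS = ['input_tokens', 'output_tokens', 'cache_read_tokens', 'cache_write_tokens'];

// Parse JSONL text into an array of objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Sum each tracked field across all per-response records.
function sumRecords(records) {
  const totals = Object.fromEntries(FIELDS.map((field) => [field, 0]));
  for (const record of records) {
    for (const field of FIELDS) {
      totals[field] += Number(record[field] ?? 0);
    }
  }
  return totals;
}

// Return a list of human-readable mismatches; an empty list means consistent.
function checkConsistency(records, aggregate) {
  if (!aggregate) return ['aggregate usage is missing'];
  const totals = sumRecords(records);
  const errors = [];
  for (const field of FIELDS) {
    const expected = Number(aggregate[field] ?? 0);
    if (totals[field] !== expected) {
      errors.push(`${field}: per-response sum ${totals[field]} != aggregate ${expected}`);
    }
  }
  return errors;
}

const records = parseJsonl(
  '{"input_tokens":10,"output_tokens":5,"cache_read_tokens":4,"cache_write_tokens":0}\n' +
  '{"input_tokens":20,"output_tokens":7,"cache_read_tokens":6,"cache_write_tokens":1}\n'
);
const aggregate = { input_tokens: 30, output_tokens: 12, cache_read_tokens: 10, cache_write_tokens: 1 };
console.log(checkConsistency(records, aggregate)); // prints []
```

Because the comparison only uses the two files in the artifact, it does not depend on which engine produced them. A missing aggregate is treated as an error here, in line with the "hard fail on mismatch or missing aggregate" wording in the PR description.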
772a6b1 to 6cf652c
Pull request overview
This PR adds a CI-side “token usage sanity check” to the smoke workflows by introducing a small Node.js checker script (plus unit tests) and wiring a new verify_token_usage job into the generated smoke workflow graphs so token-usage inconsistencies fail the workflow.
Changes:
- Add `scripts/ci/check-token-usage.js` and `scripts/ci/check-token-usage.test.ts` to validate token-usage internal consistency and detect `cache_read_tokens == 0` across multi-turn runs.
- Extend smoke workflow sources (`smoke-copilot.md`, `smoke-claude.md`, `smoke-codex.md`) and regenerated lock workflows to run the checker against the downloaded `agent` artifact.
- Update the gVisor firewall comparison workflow to wait for Squid/Envoy readiness (but see review comments on the Squid probe).
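The `cache_read_tokens == 0` detection can be sketched as follows. The PR description says a zero total is a hard failure across multiple responses and only a warning below a min-requests threshold. The function name `checkCacheRead`, the `minRequests` option, and its default of 2 are assumptions for illustration, not the checker's actual interface.

```javascript
// Hypothetical sketch: function name, option name, and default threshold
// are assumptions, not taken from the real checker.
function checkCacheRead(records, { minRequests = 2 } = {}) {
  const totalCacheRead = records.reduce(
    (sum, record) => sum + Number(record.cache_read_tokens ?? 0),
    0
  );
  if (totalCacheRead > 0) {
    return { level: 'ok', message: `cache_read_tokens total is ${totalCacheRead}` };
  }
  // Zero cache reads on a very short run can be legitimate,
  // so only warn when there are fewer responses than the threshold.
  const level = records.length >= minRequests ? 'error' : 'warning';
  return { level, message: `cache_read_tokens is 0 across ${records.length} response(s)` };
}

console.log(checkCacheRead([{ cache_read_tokens: 0 }, { cache_read_tokens: 0 }]).level); // prints: error
console.log(checkCacheRead([{ cache_read_tokens: 0 }]).level); // prints: warning
console.log(checkCacheRead([{ cache_read_tokens: 0 }, { cache_read_tokens: 128 }]).level); // prints: ok
```

Returning a level instead of exiting directly keeps the guard easy to unit test; a caller would map `error` to a non-zero exit code.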
Show a summary per file
| File | Description |
|---|---|
| `scripts/ci/check-token-usage.js` | New zero-dependency Node checker that locates usage files in the agent artifact and enforces invariants. |
| `scripts/ci/check-token-usage.test.ts` | Unit tests for parsing, summation/consistency, cache-read guard, path resolution, and arg parsing. |
| `.github/workflows/smoke-copilot.md` | Adds verify_token_usage job to run the checker after agent. |
| `.github/workflows/smoke-claude.md` | Adds verify_token_usage job to run the checker after agent. |
| `.github/workflows/smoke-codex.md` | Adds verify_token_usage job to run the checker after agent. |
| `.github/workflows/smoke-copilot.lock.yml` | Regenerated locked workflow including verify_token_usage in the job graph. |
| `.github/workflows/smoke-claude.lock.yml` | Regenerated locked workflow including verify_token_usage (also includes additional generated deltas). |
| `.github/workflows/smoke-codex.lock.yml` | Regenerated locked workflow including verify_token_usage in the job graph. |
| `.github/workflows/test-gvisor-firewall-comparison.yml` | Replaces fixed sleeps with readiness loops for Squid/Envoy startup. |
Copilot's findings
- Files reviewed: 8/8 changed files
- Comments generated: 3
        const text = fs.readFileSync(agentUsage, 'utf8').trim();
        // agent_usage may be a single JSON object or a one-line JSONL file.
        const parsed = parseJsonl(text);
        aggregate = parsed.length > 0 ? parsed[parsed.length - 1] : null;
      }

      ]) ||
      findFileRecursive(root, 'agent_usage.json');

    - name: Setup Scripts
      id: setup
      uses: github/gh-aw-actions/setup@c0338fef4749d08c21f8f975fb0e37efa17dda47 # v0.79.8
      uses: github/gh-aw-actions/setup@5c2fe865bb4dc46e1450f6ee0d0541d759aea73a # v0.79.6
@copilot address review feedback

⏳ Copilot review left inline comments. @lpcox To proceed:

- Parse agent_usage with JSON.parse() first, fallback to JSONL
- Recursive fallback also searches for agent_usage.jsonl
- Restore smoke-claude.lock.yml to v0.79.8 versions and add only the verify_token_usage job
- Add 3 unit tests (20 total)
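The first item in that list, trying JSON.parse() before falling back to JSONL, might look roughly like the sketch below. `parseAggregate` is an illustrative name, not a function confirmed by the PR. One plausible motivation, stated here as an assumption, is that a pretty-printed multi-line JSON object cannot be parsed line by line as JSONL.

```javascript
// Hypothetical sketch: parseAggregate is an illustrative name,
// not taken from the PR.
function parseAggregate(text) {
  const trimmed = text.trim();
  if (trimmed.length === 0) return null;
  try {
    // First attempt: the whole file is one JSON value, possibly pretty-printed.
    return JSON.parse(trimmed);
  } catch {
    // Not a single JSON document; fall through to the JSONL handling below.
  }
  // Fallback: treat the file as JSONL and use the last non-empty line.
  // A malformed last line still throws, which surfaces a corrupt file.
  const lines = trimmed
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return JSON.parse(lines[lines.length - 1]);
}

const pretty = '{\n  "input_tokens": 30,\n  "output_tokens": 12\n}';
const jsonl = '{"input_tokens":1}\n{"input_tokens":30}\n';
console.log(parseAggregate(pretty).input_tokens); // prints 30
console.log(parseAggregate(jsonl).input_tokens); // prints 30
```

Taking the last JSONL line mirrors the `parsed[parsed.length - 1]` expression in the reviewed snippet.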
Addressed in e6e291c:
Three new unit tests added (20 total):

❌ Smoke Claude failed

✅ Contribution Check completed successfully!

🔑 Smoke Copilot PAT PAT auth validated. All systems operational. ✅

✅ Smoke Copilot BYOK AOAI (api-key) completed. Copilot AOAI BYOK (api-key) mode operational. 🔓

❌ Smoke Copilot BYOK reports failed. BYOK mode investigation needed...

✅ Smoke Copilot BYOK AOAI (Entra) completed. Copilot AOAI BYOK (Entra) mode operational. 🔓 Completed smoke test summary with comment and labels

📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤

📡 Smoke OTel Tracing completed. All tracing scenarios validated. ✅

🔌 Smoke Services - All services reachable! ✅

Chroot tests passed! Smoke Chroot - All security and functionality tests succeeded.

✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟

✅ Build Test Suite completed successfully!

🔥 Smoke Test: Copilot PAT — PASS
Overall: PASS · Auth mode: PAT (COPILOT_GITHUB_TOKEN)

✅ Smoke Gemini completed. All facets verified. 💎 Gemini smoke test completed with a FAIL status due to connectivity issues.

@lpcox

Chroot Version Comparison Results
Overall: ❌ FAILED — Python and Node.js versions differ between host and chroot.

Smoke Test: API Proxy OpenTelemetry Tracing
All 5 scenarios passed. OTEL tracing integration is functioning correctly.

🔬 Smoke Test Results
PR: ci(smoke): add token-usage sanity checks to smoke workflows
Overall:

@lpcox Smoke Test Results for Direct BYOK (Azure OpenAI Entra):

Smoke Test: GitHub Actions Services Connectivity
Overall: ❌ FAIL

Gemini Smoke Test Results
Overall Status: FAIL

Warning: Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:

    network:
      allowed:
        - defaults
        - "localhost"

See Network Configuration for more information.

🏗️ Build Test Suite Results
Overall: 8/8 ecosystems passed — ✅ PASS

Merged PRs:
Checks:
Overall: PASS

Warning: Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:

    network:
      allowed:
        - defaults
        - "registry.npmjs.org"

See Network Configuration for more information.

What
Adds a `verify_token_usage` job to the smoke-copilot, smoke-claude, and smoke-codex workflows. The job runs after the `agent` job, downloads the `agent` artifact, and runs `scripts/ci/check-token-usage.js` against it. The compiler wires the job into `conclusion.needs`, so a failure fails the workflow.

Checks enforced
The checker validates two engine-independent invariants over the agent artifact:
- Internal consistency: the sum of per-response records in `token-usage.jsonl` must exactly equal the aggregated `agent_usage.json` (input/output/cache_read/cache_write). Hard fail on mismatch or missing aggregate.
- `cache_read_tokens == 0` across multiple responses is a hard failure (the symptom of the cached-token normalization bug fixed in #5262, "fix(api-proxy): map OpenAI Responses API cached tokens to cache_read"). Below the min-requests threshold it only warns.

`ai_credits`/`ambient_context` drift is reported as warnings only.

Why internal consistency instead of engine-vs-proxy
Codex's engine-native telemetry (`turn.completed`) reports cumulative counts that diverge ~2x from the api-proxy per-request sum, making an engine-vs-proxy comparison infeasible. The internal-consistency invariant was verified exact for both codex and copilot real artifacts.

Tests
- `scripts/ci/check-token-usage.test.ts`: 17 unit tests (parsing, summation, consistency, cache-read guard, file location, arg parsing).
- `cache_read==0` correctly fails.

Notes
- The zero-dependency checker runs with `node` (no setup-node/npm ci needed in the verify job).
- `.md` sources edited, recompiled with `gh aw compile`, and post-processed via `scripts/ci/postprocess-smoke-workflows.ts`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>