
Fix cursor-cli E2E flakiness #654

Merged
khaong merged 31 commits into main from alisha/e2e-triage-local-only
Mar 17, 2026

alishakawaguchi commented Mar 7, 2026

Summary

Two targeted fixes for the Cursor CLI E2E tests, discovered during E2E triage:

  • Isolate the config dir: parallel cursor-cli E2E tests raced on the shared ~/.config/cursor/cli-config.json file, causing ENOENT errors. Each session now gets an isolated XDG_CONFIG_HOME with a pre-seeded cli-config.json.
  • Fix the wait pattern: WaitFor now waits for the Add a follow-up text, which only appears after the agent finishes, instead of the / commands prompt pattern, which is always visible in the status bar, even during "Thinking".
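
The per-session isolation described above could look roughly like the following. This is a minimal sketch, not the code in e2e/agents/cursor_cli.go: the helper names `newIsolatedConfigHome` and `sessionEnv`, the temp-dir prefix, and the `cursor/` subdirectory layout are assumptions chosen to mirror the `~/.config/cursor/cli-config.json` path mentioned in the description.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// newIsolatedConfigHome creates a private config root for one test session and
// pre-seeds cursor/cli-config.json with an empty JSON object.
// Hypothetical helper: the name, the temp-dir prefix, and the "cursor"
// subdirectory are assumptions, not taken from the PR's code.
func newIsolatedConfigHome() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "cursor-e2e-config-")
	if err != nil {
		return "", nil, fmt.Errorf("create temp config home: %w", err)
	}
	cleanup = func() { _ = os.RemoveAll(root) }

	cursorDir := filepath.Join(root, "cursor")
	if err := os.MkdirAll(cursorDir, 0o755); err != nil {
		cleanup()
		return "", nil, fmt.Errorf("create cursor config dir: %w", err)
	}
	seed := filepath.Join(cursorDir, "cli-config.json")
	if err := os.WriteFile(seed, []byte("{}\n"), 0o644); err != nil {
		cleanup()
		return "", nil, fmt.Errorf("seed cli-config.json: %w", err)
	}
	return root, cleanup, nil
}

// sessionEnv returns the parent environment with XDG_CONFIG_HOME appended.
// When this slice is used as exec.Cmd.Env, the last duplicate key wins, so
// the appended value overrides any inherited XDG_CONFIG_HOME.
func sessionEnv(root string) []string {
	return append(os.Environ(), "XDG_CONFIG_HOME="+root)
}

func main() {
	root, cleanup, err := newIsolatedConfigHome()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer cleanup() // mirrors "clean up temp config dir on session close"

	env := sessionEnv(root)
	fmt.Println(env[len(env)-1])
}
```

Because every session writes to its own directory, no two processes share a `cli-config.json` (or its temp file), which is what removes the race rather than merely narrowing it.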

🤖 Generated with Claude Code

alishakawaguchi and others added 17 commits March 6, 2026 11:42
…script

Automates E2E failure triage with three new components:
- scripts/download-e2e-artifacts.sh: reusable script to download CI artifacts
- .claude/skills/e2e-triage/SKILL.md: 7-step triage skill (classify flaky vs real bug, create PRs or issues)
- .github/workflows/e2e-triage.yml: workflow_run trigger that auto-runs Claude Opus on E2E failure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 1aa72dcd8a2b
Post "Claude is triaging..." when triage starts and a structured
summary with PR/issue links when it completes. The skill now writes
triage-summary.json which the workflow parses with jq for the Slack
message. Falls back to a warning if no summary is produced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 8e5dcc6ef8ab
…ifications

- Build Slack payload via jq (payload-file-path) instead of interpolating
  raw text into inline JSON, which broke on quotes/newlines in summaries
- Add secrets.E2E_SLACK_WEBHOOK_URL guard to "Build Slack summary" and
  "Notify Slack - triage complete" steps (matching the "started" step)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 7c1914052967
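
The commit above builds the Slack payload with jq so that quotes and newlines in the summary are escaped by a JSON encoder instead of being spliced into a JSON string by hand. The same idea is sketched below in Go with `encoding/json` as a swapped-in illustration; the workflow itself uses jq, and the `slackPayload` type and function names here are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// slackPayload is a minimal {"text": "..."} body of the kind accepted by
// Slack incoming webhooks. Hypothetical type for illustration only.
type slackPayload struct {
	Text string `json:"text"`
}

// buildSlackPayload lets the encoder escape quotes and newlines, which is
// what breaks when raw text is interpolated into inline JSON.
func buildSlackPayload(summary string) ([]byte, error) {
	return json.Marshal(slackPayload{Text: summary})
}

// decodeText parses a payload back, to show the summary survives intact.
func decodeText(payload []byte) (string, error) {
	var p slackPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return "", err
	}
	return p.Text, nil
}

func main() {
	summary := "Triage done: 2 \"flaky\" tests\nSee PR for details"
	payload, err := buildSlackPayload(summary)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
}
```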
Rewrite SKILL.md with dual-mode support (auto-detected via WORKFLOW_RUN_ID
env var): local mode runs tests with mise and re-runs failures up to 3
times, CI mode triggers e2e-isolated.yml workflows for re-run verification.
Classification now uses re-run results as the primary signal (all fail =
real-bug, mixed results = flaky).

Workflow changes: actions permission upgraded to write for gh workflow run,
timeout increased to 60m for re-run polling, Claude prompt updated with
CI mode hint and re-run instructions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 9f75c3effd9b
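
The re-run rule in the commit above (all fail = real-bug, mixed results = flaky) can be sketched as a small function. This is an illustration rather than code from the skill, which is a Markdown procedure; the type and function names are hypothetical. It assumes the input holds only the re-runs of a test that already failed once, so re-runs that all pass still count as mixed overall.

```go
package main

import (
	"errors"
	"fmt"
)

// Classification is the triage verdict for one failing test.
type Classification string

const (
	Flaky   Classification = "flaky"
	RealBug Classification = "real-bug"
)

// classifyReruns applies the rule from the commit message to the re-run
// results of a test that has already failed once.
// Assumption: any passing re-run means "mixed results", hence flaky.
func classifyReruns(rerunPassed []bool) (Classification, error) {
	if len(rerunPassed) == 0 {
		return "", errors.New("no re-run results to classify")
	}
	for _, passed := range rerunPassed {
		if passed {
			return Flaky, nil
		}
	}
	return RealBug, nil
}

func main() {
	for _, runs := range [][]bool{{false, false, false}, {false, true, false}} {
		verdict, err := classifyReruns(runs)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%v -> %s\n", runs, verdict)
	}
}
```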
Local mode now presents findings interactively and applies fixes
directly in the working tree instead of creating branches/PRs/issues:
- Step 4a: findings report, proposed fixes, user approval gate, in-place fixes
- Step 4b: unchanged CI behavior (batched PR for flaky, issues for real bugs)
- Step 5: local mode gets simpler summary table, no triage-summary.json

Entire-Checkpoint: 4e1d9cf59d52
Consistent test failures can be test infrastructure bugs (e2e/ code),
not product bugs (cmd/entire/cli/). Update classification signals,
fix lists, and action sections to distinguish the two.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 28c90fcc7266
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: ba6877944a6c
Replace duplicated artifact-reading steps in e2e-triage Step 1 with a
reference to debug-e2e's Debugging Workflow (steps 2-5), keeping the
collect list so classification inputs remain clear. Add Related Skills
section to README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: eb14496bde1e
Entire-Checkpoint: bb778fbab533
…d GitHub issues

- Remove workflow_run trigger from e2e-triage.yml (now dispatch-only)
- Remove issues permission and gh issue commands from CI mode
- Replace real-bug GitHub issues with structured CI log reports
- Add triage link to Slack failure notification in e2e.yml
- Update skill docs and README to reflect new behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 4725d29fd8ff
Remove CI mode (branch creation, PRs, triage-summary.json, CI re-runs
via gh workflow run) from the e2e-triage skill while preserving local
debugging of CI failures (downloading artifacts, analyzing them, running
tests locally). Also removes the e2e-triage.yml workflow and the triage
link from the E2E Slack failure notification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: c8ddf0fdb9df
Teach Step L1 to accept CI run references (latest, run ID, run URL)
and use scripts/download-e2e-artifacts.sh to fetch artifacts, skipping
local re-runs and jumping straight to shared analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: fb0c122d807e
Skip re-downloading when the artifact directory already exists and is
non-empty, printing a log message instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 5f75d4661043
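
The skip condition in the commit above (the directory exists and is non-empty) is implemented in a shell script in the PR. As an illustration of the same check, here is a Go sketch; the function name `shouldDownload` and the example directory are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// shouldDownload reports whether artifacts still need to be fetched:
// true when dir is missing or empty, false when it already has entries.
func shouldDownload(dir string) (bool, error) {
	entries, err := os.ReadDir(dir)
	if errors.Is(err, fs.ErrNotExist) {
		return true, nil // nothing downloaded yet
	}
	if err != nil {
		return false, fmt.Errorf("inspect %s: %w", dir, err)
	}
	return len(entries) == 0, nil
}

func main() {
	dir := "e2e-artifacts/example-run" // hypothetical artifact directory
	download, err := shouldDownload(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !download {
		fmt.Printf("artifacts already present in %s, skipping download\n", dir)
		return
	}
	fmt.Printf("would download artifacts into %s\n", dir)
}
```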
…ge skill

Pre-create ~/.config/cursor/ in Bootstrap() so the cursor CLI doesn't crash
with ENOENT when writing cli-config.json after accepting workspace trust on CI.
Follows the same pattern used by Claude, Gemini, and Droid agents.

Update e2e-triage skill to require running real E2E tests after applying fixes,
scoped by change type: agent-specific → that agent's full suite, shared infra →
all affected agents, prompt-only → just the affected test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: fa1e4e3fb457
Split the monolithic e2e-triage skill into three focused commands
(triage-ci, debug, implement) following the agent-integration plugin
pattern. Triage-ci is report-only, implement is action-only, and the
/e2e orchestrator runs both sequentially.

- Create .claude/plugins/e2e/ with plugin.json and command wrappers
- Create .claude/skills/e2e/ with orchestrator SKILL.md and 3 procedures
- Delete old .claude/skills/e2e-triage/ and .claude/skills/debug-e2e/
- Update all /debug-e2e references to /e2e:debug in agent-integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 2b582392c6b6
alishakawaguchi self-assigned this Mar 7, 2026
Copilot AI review requested due to automatic review settings March 7, 2026 00:56

cursor bot commented Mar 7, 2026

PR Summary

Medium Risk
Touches git hook/session checkpointing logic and introduces a new retry path on turn finalization, which could affect when/if checkpoints are created; changes are localized and covered by new tests.

Overview
Improves Cursor CLI E2E stability by running each tmux session with its own XDG_CONFIG_HOME (pre-seeded cli-config.json) to avoid parallel config races, and by waiting for a post-completion UI marker (Add a follow-up) instead of the always-visible prompt pattern.

Fixes a deferred-condensation edge case in manual commit hooks: PostCommit now records TurnCheckpointIDs when condensation is attempted (even if it fails), and finalizeAllTurnCheckpoints now retries missing checkpoints by calling CondenseSession on ErrCheckpointNotFound.

Adds focused tests covering checkpoint ID recording on condensation failure and deferred checkpoint creation (including the “still empty transcript” no-crash path).

Written by Cursor Bugbot for commit 6bbb43e.

Copilot AI left a comment

Pull request overview

Adds tooling and documentation to improve debugging/triage of flaky E2E runs (especially Cursor), including a helper script for downloading CI artifacts and new .claude skill/plugin docs for a triage→fix workflow.

Changes:

  • Add scripts/download-e2e-artifacts.sh to fetch and normalize GitHub Actions E2E artifacts locally.
  • Update Cursor E2E agent bootstrap to pre-create the Cursor config directory to avoid runtime ENOENT failures.
  • Add/update E2E triage/debug/implement skill + plugin documentation and refresh E2E README guidance.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `scripts/download-e2e-artifacts.sh` | New helper script to locate a run (latest/ID/URL), download artifacts, flatten wrapper dirs, and write `.run-info.json`. |
| `e2e/agents/cursor_cli.go` | Create Cursor config directory during bootstrap to reduce flaky failures. |
| `e2e/README.md` | Document local artifact download + new triage workflow references. |
| `cmd/entire/cli/strategy/common_test.go` | Minor formatting/alignment in tests. |
| `.claude/skills/e2e/triage-ci.md` | New CI triage procedure doc (download artifacts or rerun locally, classify flaky vs real-bug). |
| `.claude/skills/e2e/implement.md` | New procedure doc for applying fixes and verifying via scoped E2E runs. |
| `.claude/skills/e2e/debug.md` | Update/normalize debug procedure doc formatting. |
| `.claude/skills/e2e/SKILL.md` | Add orchestrator skill definition for the E2E triage→implement pipeline. |
| `.claude/skills/agent-integration/*.md` | Update references to use `/e2e:debug` instead of the old command name. |
| `.claude/plugins/e2e/**` | Add local plugin command wrappers + plugin metadata and README. |

alishakawaguchi and others added 7 commits March 6, 2026 17:09
Cursor's atomic config write (cli-config.json.tmp → cli-config.json)
races when parallel tests trigger "Workspace Trust Required"
simultaneously. Pre-seeding the file with {} in Bootstrap() avoids
the temp-file rename path entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 47c9b5f45145
The previous pre-seed fix wasn't sufficient because cursor's
write-after-trust-acceptance still uses atomic temp-file rename on the
shared ~/.config/cursor/cli-config.json. With 43 parallel tests, multiple
cursor processes race on the same .tmp file.

Fix: give each tmux session its own XDG_CONFIG_HOME + HOME pointing to an
isolated temp directory with a pre-seeded cli-config.json, following the
same pattern Claude Code uses with CLAUDE_CONFIG_DIR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: df802d707cb9
- Add per-session XDG_CONFIG_HOME with pre-seeded cli-config.json to
  prevent ENOENT race on parallel tests (keep real HOME for auth)
- Wait for "Add a follow-up" instead of PromptPattern() to avoid
  premature WaitFor settling during Thinking phase
- Clean up temp config dir on session close

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 121daaca3e20
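
The wait-pattern change in the commit above can be illustrated with a small polling loop. This is a sketch under stated assumptions, not the harness's `WaitFor`: the `waitForText` and `replay` helpers and the screen frames are invented for the example. It shows why a marker that is always on screen settles on the first poll, while a marker that only appears on completion settles when the agent is done.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// waitForText polls capture until its output contains marker, and returns
// the number of polls it took. Hypothetical stand-in for a tmux pane wait.
func waitForText(capture func() string, marker string, maxPolls int, interval time.Duration) (int, error) {
	for poll := 1; poll <= maxPolls; poll++ {
		if strings.Contains(capture(), marker) {
			return poll, nil
		}
		if poll < maxPolls {
			time.Sleep(interval)
		}
	}
	return maxPolls, fmt.Errorf("%q not seen after %d polls", marker, maxPolls)
}

// replay returns a capture function that yields frames in order and keeps
// returning the last frame once they run out.
func replay(frames []string) func() string {
	i := 0
	return func() string {
		if len(frames) == 0 {
			return ""
		}
		idx := i
		if idx >= len(frames) {
			idx = len(frames) - 1
		}
		i++
		return frames[idx]
	}
}

func main() {
	// Invented screen contents: the status-bar hint is present in every
	// frame, the follow-up prompt only in the final one.
	frames := []string{
		"Thinking...\n/ commands",
		"Thinking...\n/ commands",
		"Done.\nAdd a follow-up\n/ commands",
	}
	early, _ := waitForText(replay(frames), "/ commands", 5, 0)
	done, _ := waitForText(replay(frames), "Add a follow-up", 5, 0)
	fmt.Printf("status-bar marker matched on poll %d, completion marker on poll %d\n", early, done)
}
```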
When Cursor commits mid-turn before flushing its transcript, condensation
fails silently and the checkpoint ID is never recorded. This means the
commit trailer points to nonexistent checkpoint metadata.

Fix by separating intent from execution: record checkpoint IDs on
condensation *attempt* (not just success), and fall back to deferred
CondenseSession at stop time when UpdateCommitted returns
ErrCheckpointNotFound. No behavior change for agents that flush
transcripts before committing (Claude Code, Gemini, etc.).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: fb5a3e2fb741
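
The "separate intent from execution" fix in the commit above can be sketched as follows. Only the names `ErrCheckpointNotFound`, `UpdateCommitted`, and `CondenseSession` come from the PR text; the function signature, the callbacks, and the error handling here are assumptions, and how the real code treats a failed retry is not shown.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCheckpointNotFound mirrors the sentinel error named in the commit
// message; this declaration is a stand-in for the real one.
var ErrCheckpointNotFound = errors.New("checkpoint not found")

// finalizeTurnCheckpoints walks the checkpoint IDs recorded when
// condensation was attempted. If an update reports a missing checkpoint,
// it runs the deferred condensation for that ID.
// update and condense are hypothetical callbacks standing in for
// UpdateCommitted and CondenseSession.
func finalizeTurnCheckpoints(ids []string, update, condense func(id string) error) ([]string, error) {
	var retried []string
	for _, id := range ids {
		err := update(id)
		if err == nil {
			continue
		}
		if !errors.Is(err, ErrCheckpointNotFound) {
			return retried, fmt.Errorf("update %s: %w", id, err)
		}
		// The earlier attempt failed (for example, the transcript was not
		// flushed yet), so create the checkpoint now.
		if cerr := condense(id); cerr != nil {
			return retried, fmt.Errorf("deferred condense %s: %w", id, cerr)
		}
		retried = append(retried, id)
	}
	return retried, nil
}

func main() {
	existing := map[string]bool{"aaa111": true} // "bbb222" was attempted but never written
	update := func(id string) error {
		if !existing[id] {
			return fmt.Errorf("lookup %s: %w", id, ErrCheckpointNotFound)
		}
		return nil
	}
	condense := func(id string) error {
		existing[id] = true
		return nil
	}
	retried, err := finalizeTurnCheckpoints([]string{"aaa111", "bbb222"}, update, condense)
	fmt.Println(retried, err)
}
```

Recording the ID on attempt is what makes the retry possible: without it, the stop-time pass would have no record that a checkpoint was expected for that commit.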
Creates a local marketplace at .claude/plugins/ wrapping both the e2e and
agent-integration plugins, and registers it via extraKnownMarketplaces in
project settings so plugin subcommands (/e2e:debug, /agent-integration:research,
etc.) are discoverable by Claude Code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 13d3f0b6de12
alishakawaguchi and others added 5 commits March 9, 2026 12:28
- Standardize on `entire-logs/entire.log` in debug.md diagnostic table
- Update triage-ci command description to mention CI artifact support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: d9972bc12f95
alishakawaguchi changed the title from "WIP - Fix cursor E2E flaky tests" to "Fix cursor-cli E2E flakiness and deferred condensation bug" Mar 9, 2026
@alishakawaguchi

@cursor review

cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

alishakawaguchi marked this pull request as ready for review March 9, 2026 23:31
alishakawaguchi requested a review from a team as a code owner March 9, 2026 23:31
alishakawaguchi changed the title from "Fix cursor-cli E2E flakiness and deferred condensation bug" to "Fix cursor-cli E2E flakiness" Mar 16, 2026
khaong merged commit a026f26 into main Mar 17, 2026
3 checks passed
khaong deleted the alisha/e2e-triage-local-only branch March 17, 2026 23:52