pr-triage: add empirical validation simulator by andreahlert · Pull Request #66125 · apache/airflow

andreahlert · 2026-04-29T21:11:03Z

Summary

Adds a small Python simulator that validates the pr-triage skill's classification + action-selection logic against a live sample of open PRs on a target repo. Companion to the consolidation work in #66126 (the decision-table refactor of classify.md + suggested-actions.md into classify-and-act.md).

The simulator implements the spec's rules as pure Python functions and diffs the output across a 30-PR sample. Used as a regression check after editing row preconditions, glossary entries, or the Real-CI guard. A clean zero-divergence run is evidence that a spec edit did not change behaviour on the production queue.

python3 dev/pr_triage_simulator.py --repo apache/airflow --first 30
python3 dev/pr_triage_simulator.py --repo apache/airflow --first 30 --sort updated-desc

Files

dev/pr_triage_simulator.py — Python simulator. Stdlib only. Read-only (no mutations). Uses gh CLI for GraphQL + REST.
dev/pr-triage-validation.md — validation log: structural audit (every old rule mapped to a covering new row in pr-triage: collapse classify.md + suggested-actions.md into ordered decision table #66126's table), six fixture walk-throughs, empirical run results, and the documented intentional behaviour changes.

What it does

Fetches a page of open PRs via the same GraphQL shape the skill uses, plus the per-page action_required REST index, plus a one-shot recent_main_failures query (failing checks across the last 10 merged PRs on the base branch), plus per-PR commits_behind via REST compare.
Derives the same intermediate state the skill uses (rollup state, real-CI presence, unresolved threads, last-failed-check startedAt, viewer-triage-marker timestamp, follow-up-ping presence, etc).
Runs both classifiers (the historical two-file spec and the consolidated decision table from pr-triage: collapse classify.md + suggested-actions.md into ordered decision table #66126) and diffs the resulting (classification, action) tuple per PR.
Reports divergences with PR URL and old/new tuples for inspection.

Why this is separate from the consolidation PR

Keeps the consolidation diff in #66126 focused on the spec change. The simulator is a maintenance tool — useful but not normative — and lives under dev/ for that reason. It is not loaded by Claude Code, does not affect the skill's runtime behaviour, and has no production consumers.

Test plan

Run against apache/airflow open queue (oldest-updated, 30 PRs) — zero divergences
Run against apache/airflow open queue (newest-updated, 30 PRs) — zero divergences
Code-review pass complete (LGTM, zero blocking + zero high findings)
No mutations performed during any run (read-only by design)

Was generative AI tooling used to co-author this PR?

Yes

Adds dev/pr_triage_simulator.py and dev/pr-triage-validation.md. The simulator implements the pr-triage skill's classification + action-selection logic as pure Python functions and diffs the output against a reference implementation across a live page of open PRs on a target repo. Read-only — uses gh CLI for GraphQL and REST, mutates nothing. Use: regression check after editing classify-and-act.md row preconditions, glossary entries, or the Real-CI guard. A clean zero-divergence run on a 30-PR sample is evidence that a row edit did not change behaviour for the production queue. python3 dev/pr_triage_simulator.py --repo apache/airflow --first 30 The validation log captures the structural audit (every old rule mapped to a covering new row), six fixture walk-throughs covering the highest-stakes paths (workflow approval, static-check failures, the unresolved-thread heuristic positive and negative, bot-only rollup, viewer-is-author defense-in-depth), the empirical run results, and the documented intentional behaviour changes between the old two-file spec and the consolidated decision table. Stdlib only. No third-party dependencies. Signed-off-by: André Ahlert <andre@aex.partners>

potiuk · 2026-04-29T22:44:01Z

Hmmm. Actually I was thinking about adding tests for the skills .. And I have not decided yet what/whether to do it. This is an interesting approach. However... I think maybe a good idea will be to iterate a bit on it in https://github.com/apache/airflow-steward ? I am just about (tomorrow ?) going to move the pr skils there and we can continue iterating there. I also thought (and already created) the https://github.com/apache/airflow-steward-testync. repo for tests fof airlfow-steward- because one of the feature of "airflow-steward" is that you can add it to an external repo as sub-module and use the skils from it ( I already have it in the private security repo of ours - but I want to add it to airflow. This has some nice features - where we have clean split between what is "airflow" and what is "generic" - so I planned to use the "test" repo to test this integration (including automated setup) and re-connecting/re-syncing - and I thought that maybe tests could be there - I also wanted to shop around if there are some existing test frameworks for skills - I presume there are quite a number :)

…ecision table (#66126) * pr-triage: collapse classify.md + suggested-actions.md into ordered decision table The skill's first-pass logic was split across classify.md (decision matrix that turned PR state into one of seven classifications) and suggested-actions.md (per-classification table that picked the action verb). Reading them in lock-step on every classification was load-bearing, but the implicit ordering between them — C2b "evaluate before C2", the 9-row sub-ladder under deterministic_flag, the Real-CI-guard reclassification path — was invisible in either file in isolation. This commit replaces both files with a single ordered decision table at classify-and-act.md plus a companion rationale.md. The table is first-match-wins across 22 rows; each row produces a (classification, action, reason) tuple. Pre-filters F1-F5b run before the table. A precondition glossary names every compound predicate (has_deterministic_signal, ci_failures_only, unresolved_threads_only, unresolved_threads_only_likely_addressed, copilot_review_stale, static_check, recent_main_failures, flagged_prs_by_author, follow_up_ping, first_time_no_real_ci) once so the rows stay short. Behaviour-preserving for every real PR shape exercised by the empirical simulator added in companion PR #66125 — 60 PRs across both ends of the open queue, zero divergences. Six theoretical edge cases shift behaviour by design (Row 1 priority for pending_workflow_approval over the refusal-skip cases; Row 2 priority for stale_copilot_review over data-anomaly skip). All documented in the validation log; none appeared in the empirical sample. Refinements bundled in: - new follow_up_ping glossary entry — old C3 had a three-clause detection rule that was not reproducible from prose alone - new first_time_no_real_ci glossary entry — old C1 had an EXPECTED-rollup fallback for first-time contributors that Row 1 was originally missing; restored - Row 16 ("no real CI ran -> rebase") gained an explicit "author NOT first-time" guard. Same outcome as the old spec (C1 caught first-time contributors before suggest-action ran) but readable in isolation - Row 22 (data anomaly) carries an evaluation-order note in the cross-cutting hard rules — the row's table position is documentary; implementations evaluate it immediately before the passing rows - pr-stats skill cross-references updated to point at the new already_triaged rows in classify-and-act.md SKILL.md Step 2 ("Filter and classify") + Step 3 ("Compute suggested action per PR") collapsed into a single Step 2 ("Filter, classify, and pick action"). Cross-references in sibling files updated. Hot-path token cost on a typical triage invocation drops from ~728 to ~340 lines. Signed-off-by: André Ahlert <andre@aex.partners> * pr-triage: fix markdownlint MD032 + MD038 in classify-and-act.md - line 7: rewrap to avoid leading + being read as list marker - reason-template-rules: replace inline backticks containing single comma + space pair (which the no-space-in-code rule rejects) with prose Signed-off-by: André Ahlert <andre@aex.partners> * pr-triage: drop space-only code span to satisfy markdownlint MD038 Signed-off-by: André Ahlert <andre@aex.partners> --------- Signed-off-by: André Ahlert <andre@aex.partners>

- mypy-dev: avoid datetime|None in max() and rename loop var that shadowed Derived - doctoc + insert-license on dev/pr-triage-validation.md Signed-off-by: André Ahlert <andre@aex.partners>

potiuk · 2026-05-10T02:42:31Z

Let's move it to "steward"

andreahlert requested review from amoghrajesh, ashb, bugraoz93, choo121600, ephraimbuddy, gopidesupavan, jason810496, jedcunningham, jscheffl, potiuk and vatsrahul1001 as code owners April 29, 2026 21:11

boring-cyborg Bot added area:dev-tools backport-to-v3-2-test labels Apr 29, 2026

andreahlert mentioned this pull request Apr 29, 2026

pr-triage: collapse classify.md + suggested-actions.md into ordered decision table #66126

Merged

6 tasks

potiuk and others added 2 commits May 5, 2026 19:43

Merge branch 'main' into feat/pr-triage-validation-simulator

a2df4ad

pr-triage simulator: fix CI static checks

400b5f5

- mypy-dev: avoid datetime|None in max() and rename loop var that shadowed Derived - doctoc + insert-license on dev/pr-triage-validation.md Signed-off-by: André Ahlert <andre@aex.partners>

potiuk closed this May 10, 2026

andreahlert deleted the feat/pr-triage-validation-simulator branch May 11, 2026 08:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

pr-triage: add empirical validation simulator#66125

pr-triage: add empirical validation simulator#66125
andreahlert wants to merge 3 commits into
apache:mainfrom
andreahlert:feat/pr-triage-validation-simulator

andreahlert commented Apr 29, 2026 •

edited

Loading

Uh oh!

potiuk commented Apr 29, 2026

Uh oh!

potiuk commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

andreahlert commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files

What it does

Why this is separate from the consolidation PR

Test plan

Was generative AI tooling used to co-author this PR?

Uh oh!

potiuk commented Apr 29, 2026

Uh oh!

potiuk commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

andreahlert commented Apr 29, 2026 •

edited

Loading