Skip to content

pr-triage: add empirical validation simulator#66125

Closed
andreahlert wants to merge 3 commits into
apache:mainfrom
andreahlert:feat/pr-triage-validation-simulator
Closed

pr-triage: add empirical validation simulator#66125
andreahlert wants to merge 3 commits into
apache:mainfrom
andreahlert:feat/pr-triage-validation-simulator

Conversation

@andreahlert

@andreahlert andreahlert commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a small Python simulator that validates the pr-triage skill's classification + action-selection logic against a live sample of open PRs on a target repo. Companion to the consolidation work in #66126 (the decision-table refactor of classify.md + suggested-actions.md into classify-and-act.md).

The simulator implements the spec's rules as pure Python functions and diffs the output across a 30-PR sample. Used as a regression check after editing row preconditions, glossary entries, or the Real-CI guard. A clean zero-divergence run is evidence that a spec edit did not change behaviour on the production queue.

python3 dev/pr_triage_simulator.py --repo apache/airflow --first 30
python3 dev/pr_triage_simulator.py --repo apache/airflow --first 30 --sort updated-desc

Files

What it does

  1. Fetches a page of open PRs via the same GraphQL shape the skill uses, plus the per-page action_required REST index, plus a one-shot recent_main_failures query (failing checks across the last 10 merged PRs on the base branch), plus per-PR commits_behind via REST compare.
  2. Derives the same intermediate state the skill uses (rollup state, real-CI presence, unresolved threads, last-failed-check startedAt, viewer-triage-marker timestamp, follow-up-ping presence, etc).
  3. Runs both classifiers (the historical two-file spec and the consolidated decision table from pr-triage: collapse classify.md + suggested-actions.md into ordered decision table #66126) and diffs the resulting (classification, action) tuple per PR.
  4. Reports divergences with PR URL and old/new tuples for inspection.

Why this is separate from the consolidation PR

Keeps the consolidation diff in #66126 focused on the spec change. The simulator is a maintenance tool — useful but not normative — and lives under dev/ for that reason. It is not loaded by Claude Code, does not affect the skill's runtime behaviour, and has no production consumers.

Test plan

  • Run against apache/airflow open queue (oldest-updated, 30 PRs) — zero divergences
  • Run against apache/airflow open queue (newest-updated, 30 PRs) — zero divergences
  • Code-review pass complete (LGTM, zero blocking + zero high findings)
  • No mutations performed during any run (read-only by design)

Was generative AI tooling used to co-author this PR?
  • Yes

Adds dev/pr_triage_simulator.py and dev/pr-triage-validation.md.
The simulator implements the pr-triage skill's classification +
action-selection logic as pure Python functions and diffs the
output against a reference implementation across a live page of
open PRs on a target repo. Read-only — uses gh CLI for GraphQL
and REST, mutates nothing.

Use: regression check after editing classify-and-act.md row
preconditions, glossary entries, or the Real-CI guard. A clean
zero-divergence run on a 30-PR sample is evidence that a row
edit did not change behaviour for the production queue.

  python3 dev/pr_triage_simulator.py --repo apache/airflow --first 30

The validation log captures the structural audit (every old rule
mapped to a covering new row), six fixture walk-throughs covering
the highest-stakes paths (workflow approval, static-check
failures, the unresolved-thread heuristic positive and negative,
bot-only rollup, viewer-is-author defense-in-depth), the
empirical run results, and the documented intentional behaviour
changes between the old two-file spec and the consolidated
decision table.

Stdlib only. No third-party dependencies.

Signed-off-by: André Ahlert <andre@aex.partners>
@potiuk

potiuk commented Apr 29, 2026

Copy link
Copy Markdown
Member

Hmmm. Actually I was thinking about adding tests for the skills .. And I have not decided yet what/whether to do it. This is an interesting approach. However... I think maybe a good idea will be to iterate a bit on it in https://github.com/apache/airflow-steward ? I am just about (tomorrow ?) going to move the pr skils there and we can continue iterating there. I also thought (and already created) the https://github.com/apache/airflow-steward-testync. repo for tests fof airlfow-steward- because one of the feature of "airflow-steward" is that you can add it to an external repo as sub-module and use the skils from it ( I already have it in the private security repo of ours - but I want to add it to airflow. This has some nice features - where we have clean split between what is "airflow" and what is "generic" - so I planned to use the "test" repo to test this integration (including automated setup) and re-connecting/re-syncing - and I thought that maybe tests could be there - I also wanted to shop around if there are some existing test frameworks for skills - I presume there are quite a number :)

potiuk pushed a commit that referenced this pull request May 1, 2026
…ecision table (#66126)

* pr-triage: collapse classify.md + suggested-actions.md into ordered decision table

The skill's first-pass logic was split across classify.md (decision
matrix that turned PR state into one of seven classifications) and
suggested-actions.md (per-classification table that picked the
action verb). Reading them in lock-step on every classification
was load-bearing, but the implicit ordering between them — C2b
"evaluate before C2", the 9-row sub-ladder under
deterministic_flag, the Real-CI-guard reclassification path —
was invisible in either file in isolation.

This commit replaces both files with a single ordered decision
table at classify-and-act.md plus a companion rationale.md.
The table is first-match-wins across 22 rows; each row produces
a (classification, action, reason) tuple. Pre-filters F1-F5b run
before the table. A precondition glossary names every compound
predicate (has_deterministic_signal, ci_failures_only,
unresolved_threads_only, unresolved_threads_only_likely_addressed,
copilot_review_stale, static_check, recent_main_failures,
flagged_prs_by_author, follow_up_ping, first_time_no_real_ci)
once so the rows stay short.

Behaviour-preserving for every real PR shape exercised by the
empirical simulator added in companion PR #66125 — 60 PRs across
both ends of the open queue, zero divergences. Six theoretical
edge cases shift behaviour by design (Row 1 priority for
pending_workflow_approval over the refusal-skip cases; Row 2
priority for stale_copilot_review over data-anomaly skip). All
documented in the validation log; none appeared in the empirical
sample.

Refinements bundled in:

- new follow_up_ping glossary entry — old C3 had a three-clause
  detection rule that was not reproducible from prose alone
- new first_time_no_real_ci glossary entry — old C1 had an
  EXPECTED-rollup fallback for first-time contributors that
  Row 1 was originally missing; restored
- Row 16 ("no real CI ran -> rebase") gained an explicit
  "author NOT first-time" guard. Same outcome as the old spec
  (C1 caught first-time contributors before suggest-action ran)
  but readable in isolation
- Row 22 (data anomaly) carries an evaluation-order note in the
  cross-cutting hard rules — the row's table position is
  documentary; implementations evaluate it immediately before
  the passing rows
- pr-stats skill cross-references updated to point at the new
  already_triaged rows in classify-and-act.md

SKILL.md Step 2 ("Filter and classify") + Step 3 ("Compute
suggested action per PR") collapsed into a single Step 2
("Filter, classify, and pick action"). Cross-references in
sibling files updated. Hot-path token cost on a typical triage
invocation drops from ~728 to ~340 lines.

Signed-off-by: André Ahlert <andre@aex.partners>

* pr-triage: fix markdownlint MD032 + MD038 in classify-and-act.md

- line 7: rewrap to avoid leading + being read as list marker
- reason-template-rules: replace inline backticks containing
  single comma + space pair (which the no-space-in-code rule
  rejects) with prose

Signed-off-by: André Ahlert <andre@aex.partners>

* pr-triage: drop space-only code span to satisfy markdownlint MD038

Signed-off-by: André Ahlert <andre@aex.partners>

---------

Signed-off-by: André Ahlert <andre@aex.partners>
potiuk and others added 2 commits May 5, 2026 19:43
- mypy-dev: avoid datetime|None in max() and rename loop var that shadowed Derived
- doctoc + insert-license on dev/pr-triage-validation.md

Signed-off-by: André Ahlert <andre@aex.partners>
@potiuk

potiuk commented May 10, 2026

Copy link
Copy Markdown
Member

Let's move it to "steward"

@potiuk potiuk closed this May 10, 2026
@andreahlert andreahlert deleted the feat/pr-triage-validation-simulator branch May 11, 2026 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants