Tighten benchmark variance threshold to 10% (--variance-threshold)#22
Open
kmuralidharan91 wants to merge 1 commit into
Open
Tighten benchmark variance threshold to 10% (--variance-threshold)#22kmuralidharan91 wants to merge 1 commit into
kmuralidharan91 wants to merge 1 commit into
Conversation
Lifts the variance-confidence rule from prose-only into the script. The orchestrator's SKILL.md previously instructed the agent to flag a benchmark as noisy when (max - min) exceeds 20% of the median, but benchmark_builds.py emitted no variance metric, leaving every consumer to recompute the rule ad hoc. Changes - scripts/benchmark_builds.py: add --variance-threshold flag (default 10%), compute spread/spread_percent/high_variance per build type, and emit a Warning to stderr plus an entry in artifact "notes" auto-recommending --repeats=5 when any build type breaches the threshold and --repeats < 5. - schemas/build-benchmark.schema.json: add 1.3.0 to schema_version enum, declare optional variance_threshold_percent at the root, and declare optional spread_seconds/spread_percent/high_variance under stats. All additions are non-required so older artifacts remain valid. - skills/xcode-build-orchestrator/SKILL.md: update the benchmark-confidence prose at lines 30 and 108 to read summary.<type>.high_variance from the artifact rather than restate the threshold percentage in two places. - Sync benchmark_builds.py to the three bundled skill copies and the schema to its one bundled copy, per CONTRIBUTING.md lines 66 and 74. Wall-clock: this PR does not change build wall-clock time; it improves the signal-to-noise floor for detecting wall-clock changes between runs. The default 20% -> 10% tightening lets the orchestrator catch real ~5% changes that previously fell inside baseline noise.
There was a problem hiding this comment.
Pull request overview
Moves the “benchmark variance / confidence” rule into the benchmark artifact so downstream skills and consumers can rely on a single, script-owned variance signal when interpreting clean/cached clean/incremental timing results.
Changes:
- Add
--variance-threshold(default 10% of median) and emitspread_seconds,spread_percent, andhigh_variancein benchmarksummarystats. - Emit variance warnings (stderr + artifact
notes) when variance is high and--repeats < 5, and recordvariance_threshold_percentat the artifact root. - Bump benchmark artifact schema to include
1.3.0and the new optional variance-related fields; update orchestrator docs to usesummary.<type>.high_variance.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| skills/xcode-build-orchestrator/SKILL.md | Switches the orchestrator’s confidence check to consume summary.<type>.high_variance from artifacts. |
| skills/xcode-build-orchestrator/scripts/benchmark_builds.py | Adds variance threshold flag + variance stats emission + warning/notes behavior + schema v1.3.0. |
| skills/xcode-build-fixer/scripts/benchmark_builds.py | Sync of benchmark script changes (variance fields + schema v1.3.0). |
| skills/xcode-build-benchmark/scripts/benchmark_builds.py | Sync of benchmark script changes (variance fields + schema v1.3.0). |
| scripts/benchmark_builds.py | Canonical benchmark script updated to compute/emit variance and warn when under-repeating. |
| schemas/build-benchmark.schema.json | Extends schema_version enum to 1.3.0 and adds optional variance fields. |
| skills/xcode-build-benchmark/schemas/build-benchmark.schema.json | Bundled schema sync for the xcode-build-benchmark skill. |
Comment on lines
+30
to
+37
| parser.add_argument( | ||
| "--variance-threshold", | ||
| type=float, | ||
| default=10.0, | ||
| help="Percent of median above which the (max - min) spread is flagged as high variance. " | ||
| "Default: 10. High-variance benchmarks are not reliable for distinguishing real changes " | ||
| "from measurement noise; the script auto-recommends rerunning with --repeats=5 when the " | ||
| "threshold is exceeded.", |
Comment on lines
310
to
313
| artifact = { | ||
| "schema_version": "1.2.0" if "cached_clean" in runs else "1.1.0", | ||
| "schema_version": "1.3.0", | ||
| "variance_threshold_percent": args.variance_threshold, | ||
| "created_at": datetime.now(timezone.utc).isoformat(), |
| 2. Run `xcode-build-benchmark` to establish a baseline if no fresh benchmark exists. The benchmark script auto-detects `COMPILATION_CACHE_ENABLE_CACHING = YES` and includes cached clean builds that measure the realistic developer experience (warm cache). If the build fails to compile, check `git log` for a recent buildable commit. When working in a worktree, cherry-picking a targeted build fix from a feature branch is acceptable to reach a buildable state. If SPM packages reference gitignored directories in their `exclude:` paths (e.g., `__Snapshots__`), create those directories before building -- worktrees do not contain gitignored content and `xcodebuild -resolvePackageDependencies` will crash otherwise. | ||
| 3. Verify the benchmark artifact has non-empty `timing_summary_categories`. If empty, the timing summary parser may have failed -- re-parse the raw logs or inspect them manually. If `COMPILATION_CACHE_ENABLE_CACHING` is enabled, also verify the artifact includes `cached_clean` runs. | ||
| - **Benchmark confidence check**: For each build type (clean, cached clean, incremental), compare the min and max values. If the spread (max - min) exceeds 20% of the median, flag the benchmark as having high variance and recommend running additional repetitions (5+ runs) before drawing conclusions. High variance makes it difficult to distinguish real improvements from noise. After applying changes, only claim an improvement if the post-change median falls outside the baseline's min-max range. | ||
| - **Benchmark confidence check**: For each build type (clean, cached clean, incremental), read `summary.<type>.high_variance` from the benchmark artifact. The benchmark script flags this as `true` when the (max - min) spread exceeds the configured `--variance-threshold` (default 10% of the median) and emits an auto-recommendation to rerun with `--repeats=5` in `notes`. When any build type is flagged, treat the benchmark as inconclusive and rerun with more repetitions before drawing conclusions. After applying changes, only claim an improvement if the post-change median falls outside the baseline's min-max range. |
Comment on lines
+30
to
+37
| parser.add_argument( | ||
| "--variance-threshold", | ||
| type=float, | ||
| default=10.0, | ||
| help="Percent of median above which the (max - min) spread is flagged as high variance. " | ||
| "Default: 10. High-variance benchmarks are not reliable for distinguishing real changes " | ||
| "from measurement noise; the script auto-recommends rerunning with --repeats=5 when the " | ||
| "threshold is exceeded.", |
Comment on lines
310
to
313
| artifact = { | ||
| "schema_version": "1.2.0" if "cached_clean" in runs else "1.1.0", | ||
| "schema_version": "1.3.0", | ||
| "variance_threshold_percent": args.variance_threshold, | ||
| "created_at": datetime.now(timezone.utc).isoformat(), |
Comment on lines
310
to
313
| artifact = { | ||
| "schema_version": "1.2.0" if "cached_clean" in runs else "1.1.0", | ||
| "schema_version": "1.3.0", | ||
| "variance_threshold_percent": args.variance_threshold, | ||
| "created_at": datetime.now(timezone.utc).isoformat(), |
Comment on lines
+30
to
+37
| parser.add_argument( | ||
| "--variance-threshold", | ||
| type=float, | ||
| default=10.0, | ||
| help="Percent of median above which the (max - min) spread is flagged as high variance. " | ||
| "Default: 10. High-variance benchmarks are not reliable for distinguishing real changes " | ||
| "from measurement noise; the script auto-recommends rerunning with --repeats=5 when the " | ||
| "threshold is exceeded.", |
Comment on lines
310
to
313
| artifact = { | ||
| "schema_version": "1.2.0" if "cached_clean" in runs else "1.1.0", | ||
| "schema_version": "1.3.0", | ||
| "variance_threshold_percent": args.variance_threshold, | ||
| "created_at": datetime.now(timezone.utc).isoformat(), |
Comment on lines
13
to
+19
| "schema_version": { | ||
| "type": "string", | ||
| "enum": ["1.0.0", "1.1.0", "1.2.0"] | ||
| "enum": ["1.0.0", "1.1.0", "1.2.0", "1.3.0"] | ||
| }, | ||
| "variance_threshold_percent": { | ||
| "type": "number", | ||
| "minimum": 0 |
Comment on lines
13
to
+19
| "schema_version": { | ||
| "type": "string", | ||
| "enum": ["1.0.0", "1.1.0", "1.2.0"] | ||
| "enum": ["1.0.0", "1.1.0", "1.2.0", "1.3.0"] | ||
| }, | ||
| "variance_threshold_percent": { | ||
| "type": "number", | ||
| "minimum": 0 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The orchestrator's
SKILL.mddocuments a benchmark-confidence rule — flag a benchmark as noisy when(max - min) > 20% × medianand rerun with 5+ repetitions — butscripts/benchmark_builds.pyemitted no variance metric, so every consumer of.build-benchmark/<artifact>.jsonhad to recompute the rule ad hoc.This PR moves the rule into the script:
--variance-thresholdflag (percent of median; default tightened from 20% to 10%),spread_seconds/spread_percent/high_varianceper build type,Warning:line to stderr and appends it to the artifact'snoteswhen any build type exceeds the threshold and--repeats < 5, auto-recommending--repeats=5.Why 10% instead of 20%
Worked example: a 100s-median project with clean spreads of 90/100/110s sits exactly at the current 20% threshold. A real 5% regression then produces 95/105/115s — the new median lands inside the baseline's 90–110s range, so the orchestrator's "median outside baseline min-max" rule reports it as inconclusive. With a 10% threshold the original baseline is flagged as too noisy; rerunning with
--repeats=5typically tightens the spread, and the regression is caught.What changed
scripts/benchmark_builds.py:--variance-thresholdflag (default10.0);stats_fornow returnsspread_seconds,spread_percent,high_variancewhencount >= 2andmedian > 0;mainsetsvariance_threshold_percentat the artifact root and emits one Warning per breaching build type.schema_versionis now always1.3.0.schemas/build-benchmark.schema.json: adds"1.3.0"to theschema_versionenum (older versions kept so existing artifacts still validate); declares optionalvariance_threshold_percentat root and optionalspread_seconds/spread_percent/high_varianceunderstats. All additions are non-required.skills/xcode-build-orchestrator/SKILL.mdlines 30 and 108: replace the prose-only20% of medianrule with a read ofsummary.<type>.high_variancefrom the artifact (script default 10%). Single source of truth is now the script.benchmark_builds.pyto its three bundled skill copies and the schema to its one bundled copy perCONTRIBUTING.mdlines 66 and 74.Wall-clock impact
This PR does not change build wall-clock time. It improves the signal-to-noise floor for detecting wall-clock changes between runs.
Smoke-tested against NativeSkeletonApp
Two live runs at
--variance-threshold=10(default) and--variance-threshold=20. Both runs emitted the new variance fields and the correctvariance_threshold_percentat the artifact root. NSK was stable on the day (clean spreads 2.01% and 13.73% across the two runs respectively), so neither live run breached its threshold. Re-scoring the second run's measured durations at threshold=10 produceshigh_variance: Truefor both build types, confirming the auto-recommend Warning fires as designed.Out of scope (separate finding)
While validating, I noticed
scripts/benchmark_builds.pyemitsbuild_type: "zero-change"for--touch-file-less runs, but the schema'sruns.<*>.items.build_typeenum is["clean", "cached-clean", "incremental"]. This is pre-existing onmainand unrelated to variance — happy to follow up in a separate PR if useful.