Skip to content

Tighten benchmark variance threshold to 10% (--variance-threshold)#22

Open
kmuralidharan91 wants to merge 1 commit into
AvdLee:mainfrom
kmuralidharan91:feat/variance-threshold-flag
Open

Tighten benchmark variance threshold to 10% (--variance-threshold)#22
kmuralidharan91 wants to merge 1 commit into
AvdLee:mainfrom
kmuralidharan91:feat/variance-threshold-flag

Conversation

@kmuralidharan91
Copy link
Copy Markdown

Summary

The orchestrator's SKILL.md documents a benchmark-confidence rule — flag a benchmark as noisy when (max - min) > 20% × median and rerun with 5+ repetitions — but scripts/benchmark_builds.py emitted no variance metric, so every consumer of .build-benchmark/<artifact>.json had to recompute the rule ad hoc.

This PR moves the rule into the script:

  • adds a --variance-threshold flag (percent of median; default tightened from 20% to 10%),
  • computes spread_seconds / spread_percent / high_variance per build type,
  • emits a Warning: line to stderr and appends it to the artifact's notes when any build type exceeds the threshold and --repeats < 5, auto-recommending --repeats=5.

Why 10% instead of 20%

Worked example: a 100s-median project with clean spreads of 90/100/110s sits exactly at the current 20% threshold. A real 5% regression then produces 95/105/115s — the new median lands inside the baseline's 90–110s range, so the orchestrator's "median outside baseline min-max" rule reports it as inconclusive. With a 10% threshold the original baseline is flagged as too noisy; rerunning with --repeats=5 typically tightens the spread, and the regression is caught.

What changed

  • scripts/benchmark_builds.py: --variance-threshold flag (default 10.0); stats_for now returns spread_seconds, spread_percent, high_variance when count >= 2 and median > 0; main sets variance_threshold_percent at the artifact root and emits one Warning per breaching build type. schema_version is now always 1.3.0.
  • schemas/build-benchmark.schema.json: adds "1.3.0" to the schema_version enum (older versions kept so existing artifacts still validate); declares optional variance_threshold_percent at root and optional spread_seconds / spread_percent / high_variance under stats. All additions are non-required.
  • skills/xcode-build-orchestrator/SKILL.md lines 30 and 108: replace the prose-only 20% of median rule with a read of summary.<type>.high_variance from the artifact (script default 10%). Single source of truth is now the script.
  • Sync benchmark_builds.py to its three bundled skill copies and the schema to its one bundled copy per CONTRIBUTING.md lines 66 and 74.

Wall-clock impact

This PR does not change build wall-clock time. It improves the signal-to-noise floor for detecting wall-clock changes between runs.

Smoke-tested against NativeSkeletonApp

Two live runs at --variance-threshold=10 (default) and --variance-threshold=20. Both runs emitted the new variance fields and the correct variance_threshold_percent at the artifact root. NSK was stable on the day (clean spreads 2.01% and 13.73% across the two runs respectively), so neither live run breached its threshold. Re-scoring the second run's measured durations at threshold=10 produces high_variance: True for both build types, confirming the auto-recommend Warning fires as designed.

Out of scope (separate finding)

While validating, I noticed scripts/benchmark_builds.py emits build_type: "zero-change" for --touch-file-less runs, but the schema's runs.<*>.items.build_type enum is ["clean", "cached-clean", "incremental"]. This is pre-existing on main and unrelated to variance — happy to follow up in a separate PR if useful.

Lifts the variance-confidence rule from prose-only into the script. The
orchestrator's SKILL.md previously instructed the agent to flag a benchmark
as noisy when (max - min) exceeds 20% of the median, but benchmark_builds.py
emitted no variance metric, leaving every consumer to recompute the rule
ad hoc.

Changes
- scripts/benchmark_builds.py: add --variance-threshold flag (default 10%),
  compute spread/spread_percent/high_variance per build type, and emit a
  Warning to stderr plus an entry in artifact "notes" auto-recommending
  --repeats=5 when any build type breaches the threshold and --repeats < 5.
- schemas/build-benchmark.schema.json: add 1.3.0 to schema_version enum,
  declare optional variance_threshold_percent at the root, and declare
  optional spread_seconds/spread_percent/high_variance under stats. All
  additions are non-required so older artifacts remain valid.
- skills/xcode-build-orchestrator/SKILL.md: update the benchmark-confidence
  prose at lines 30 and 108 to read summary.<type>.high_variance from the
  artifact rather than restate the threshold percentage in two places.
- Sync benchmark_builds.py to the three bundled skill copies and the
  schema to its one bundled copy, per CONTRIBUTING.md lines 66 and 74.

Wall-clock: this PR does not change build wall-clock time; it improves the
signal-to-noise floor for detecting wall-clock changes between runs. The
default 20% -> 10% tightening lets the orchestrator catch real ~5% changes
that previously fell inside baseline noise.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Moves the “benchmark variance / confidence” rule into the benchmark artifact so downstream skills and consumers can rely on a single, script-owned variance signal when interpreting clean/cached clean/incremental timing results.

Changes:

  • Add --variance-threshold (default 10% of median) and emit spread_seconds, spread_percent, and high_variance in benchmark summary stats.
  • Emit variance warnings (stderr + artifact notes) when variance is high and --repeats < 5, and record variance_threshold_percent at the artifact root.
  • Bump benchmark artifact schema to include 1.3.0 and the new optional variance-related fields; update orchestrator docs to use summary.<type>.high_variance.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
skills/xcode-build-orchestrator/SKILL.md Switches the orchestrator’s confidence check to consume summary.<type>.high_variance from artifacts.
skills/xcode-build-orchestrator/scripts/benchmark_builds.py Adds variance threshold flag + variance stats emission + warning/notes behavior + schema v1.3.0.
skills/xcode-build-fixer/scripts/benchmark_builds.py Sync of benchmark script changes (variance fields + schema v1.3.0).
skills/xcode-build-benchmark/scripts/benchmark_builds.py Sync of benchmark script changes (variance fields + schema v1.3.0).
scripts/benchmark_builds.py Canonical benchmark script updated to compute/emit variance and warn when under-repeating.
schemas/build-benchmark.schema.json Extends schema_version enum to 1.3.0 and adds optional variance fields.
skills/xcode-build-benchmark/schemas/build-benchmark.schema.json Bundled schema sync for the xcode-build-benchmark skill.

Comment on lines +30 to +37
parser.add_argument(
"--variance-threshold",
type=float,
default=10.0,
help="Percent of median above which the (max - min) spread is flagged as high variance. "
"Default: 10. High-variance benchmarks are not reliable for distinguishing real changes "
"from measurement noise; the script auto-recommends rerunning with --repeats=5 when the "
"threshold is exceeded.",
Comment on lines 310 to 313
artifact = {
"schema_version": "1.2.0" if "cached_clean" in runs else "1.1.0",
"schema_version": "1.3.0",
"variance_threshold_percent": args.variance_threshold,
"created_at": datetime.now(timezone.utc).isoformat(),
2. Run `xcode-build-benchmark` to establish a baseline if no fresh benchmark exists. The benchmark script auto-detects `COMPILATION_CACHE_ENABLE_CACHING = YES` and includes cached clean builds that measure the realistic developer experience (warm cache). If the build fails to compile, check `git log` for a recent buildable commit. When working in a worktree, cherry-picking a targeted build fix from a feature branch is acceptable to reach a buildable state. If SPM packages reference gitignored directories in their `exclude:` paths (e.g., `__Snapshots__`), create those directories before building -- worktrees do not contain gitignored content and `xcodebuild -resolvePackageDependencies` will crash otherwise.
3. Verify the benchmark artifact has non-empty `timing_summary_categories`. If empty, the timing summary parser may have failed -- re-parse the raw logs or inspect them manually. If `COMPILATION_CACHE_ENABLE_CACHING` is enabled, also verify the artifact includes `cached_clean` runs.
- **Benchmark confidence check**: For each build type (clean, cached clean, incremental), compare the min and max values. If the spread (max - min) exceeds 20% of the median, flag the benchmark as having high variance and recommend running additional repetitions (5+ runs) before drawing conclusions. High variance makes it difficult to distinguish real improvements from noise. After applying changes, only claim an improvement if the post-change median falls outside the baseline's min-max range.
- **Benchmark confidence check**: For each build type (clean, cached clean, incremental), read `summary.<type>.high_variance` from the benchmark artifact. The benchmark script flags this as `true` when the (max - min) spread exceeds the configured `--variance-threshold` (default 10% of the median) and emits an auto-recommendation to rerun with `--repeats=5` in `notes`. When any build type is flagged, treat the benchmark as inconclusive and rerun with more repetitions before drawing conclusions. After applying changes, only claim an improvement if the post-change median falls outside the baseline's min-max range.
Comment on lines +30 to +37
parser.add_argument(
"--variance-threshold",
type=float,
default=10.0,
help="Percent of median above which the (max - min) spread is flagged as high variance. "
"Default: 10. High-variance benchmarks are not reliable for distinguishing real changes "
"from measurement noise; the script auto-recommends rerunning with --repeats=5 when the "
"threshold is exceeded.",
Comment on lines 310 to 313
artifact = {
"schema_version": "1.2.0" if "cached_clean" in runs else "1.1.0",
"schema_version": "1.3.0",
"variance_threshold_percent": args.variance_threshold,
"created_at": datetime.now(timezone.utc).isoformat(),
Comment on lines 310 to 313
artifact = {
"schema_version": "1.2.0" if "cached_clean" in runs else "1.1.0",
"schema_version": "1.3.0",
"variance_threshold_percent": args.variance_threshold,
"created_at": datetime.now(timezone.utc).isoformat(),
Comment on lines +30 to +37
parser.add_argument(
"--variance-threshold",
type=float,
default=10.0,
help="Percent of median above which the (max - min) spread is flagged as high variance. "
"Default: 10. High-variance benchmarks are not reliable for distinguishing real changes "
"from measurement noise; the script auto-recommends rerunning with --repeats=5 when the "
"threshold is exceeded.",
Comment on lines 310 to 313
artifact = {
"schema_version": "1.2.0" if "cached_clean" in runs else "1.1.0",
"schema_version": "1.3.0",
"variance_threshold_percent": args.variance_threshold,
"created_at": datetime.now(timezone.utc).isoformat(),
Comment on lines 13 to +19
"schema_version": {
"type": "string",
"enum": ["1.0.0", "1.1.0", "1.2.0"]
"enum": ["1.0.0", "1.1.0", "1.2.0", "1.3.0"]
},
"variance_threshold_percent": {
"type": "number",
"minimum": 0
Comment on lines 13 to +19
"schema_version": {
"type": "string",
"enum": ["1.0.0", "1.1.0", "1.2.0"]
"enum": ["1.0.0", "1.1.0", "1.2.0", "1.3.0"]
},
"variance_threshold_percent": {
"type": "number",
"minimum": 0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants