Tighten benchmark variance threshold to 10% (--variance-threshold) by kmuralidharan91 · Pull Request #22 · AvdLee/Xcode-Build-Optimization-Agent-Skill

kmuralidharan91 · 2026-05-04T04:45:51Z

Summary

The orchestrator's SKILL.md documents a benchmark-confidence rule — flag a benchmark as noisy when (max - min) > 20% × median and rerun with 5+ repetitions — but scripts/benchmark_builds.py emitted no variance metric, so every consumer of .build-benchmark/<artifact>.json had to recompute the rule ad hoc.

This PR moves the rule into the script:

- adds a --variance-threshold flag (percent of median; default tightened from 20% to 10%),
- computes spread_seconds / spread_percent / high_variance per build type,
- emits a Warning: line to stderr and appends it to the artifact's notes when any build type exceeds the threshold and --repeats < 5, auto-recommending --repeats=5.
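
The warning behaviour in the list above can be sketched roughly as follows. This is an illustration, not the PR's code: `collect_variance_warnings`, the shape of `summary`, the message wording, and the sample values are all assumptions. Only the `high_variance` and `spread_percent` field names, the `--repeats < 5` condition, and the `--repeats=5` recommendation come from the description.

```python
import sys


def collect_variance_warnings(summary, repeats, threshold_percent):
    """Return one warning string per build type flagged as high variance.

    Hypothetical helper: `summary` is assumed to map a build type to a
    stats dict that may contain `high_variance` and `spread_percent`.
    """
    if repeats >= 5:
        # Already at the recommended repetition count; nothing to recommend.
        return []
    warnings = []
    for build_type, stats in summary.items():
        if stats.get("high_variance"):
            warnings.append(
                f"Warning: {build_type} spread is {stats['spread_percent']:.2f}% "
                f"of median (threshold {threshold_percent:g}%); "
                "rerun with --repeats=5."
            )
    return warnings


# Illustrative values only.
summary = {
    "clean": {"spread_percent": 13.73, "high_variance": True},
    "incremental": {"spread_percent": 4.0, "high_variance": False},
}
notes = []
for message in collect_variance_warnings(summary, repeats=3, threshold_percent=10.0):
    print(message, file=sys.stderr)  # surfaced to whoever runs the script
    notes.append(message)            # and persisted in the artifact's notes
```

Writing the same string to both stderr and `notes` means an agent that only reads the artifact sees the same recommendation as a person watching the terminal.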

Why 10% instead of 20%

Worked example: a 100s-median project with clean spreads of 90/100/110s sits exactly at the current 20% threshold. A real 5% regression then produces 95/105/115s — the new median lands inside the baseline's 90–110s range, so the orchestrator's "median outside baseline min-max" rule reports it as inconclusive. With a 10% threshold the original baseline is flagged as too noisy; rerunning with --repeats=5 typically tightens the spread, and the regression is caught.
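
The arithmetic in the worked example can be checked with a few lines of Python. The helper name `spread_percent` is chosen here for illustration; the durations are the ones quoted above.

```python
from statistics import median


def spread_percent(durations):
    """(max - min) expressed as a percentage of the median."""
    return (max(durations) - min(durations)) * 100 / median(durations)


baseline = [90.0, 100.0, 110.0]
regressed = [95.0, 105.0, 115.0]  # the same runs after a ~5% slowdown

baseline_spread = spread_percent(baseline)
flagged_at_20 = baseline_spread > 20.0  # exactly at the threshold, not above it
flagged_at_10 = baseline_spread > 10.0

# Orchestrator rule: only claim a change when the new median falls
# outside the baseline's min-max range.
new_median = median(regressed)
conclusive = not (min(baseline) <= new_median <= max(baseline))

print(baseline_spread, flagged_at_20, flagged_at_10, new_median, conclusive)
```

With the old 20% threshold the baseline passes the variance check, yet the regression is still reported as inconclusive; with 10% the baseline itself is flagged first.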

What changed

- scripts/benchmark_builds.py: --variance-threshold flag (default 10.0); stats_for now returns spread_seconds, spread_percent, high_variance when count >= 2 and median > 0; main sets variance_threshold_percent at the artifact root and emits one Warning per breaching build type. schema_version is now always 1.3.0.
- schemas/build-benchmark.schema.json: adds "1.3.0" to the schema_version enum (older versions kept so existing artifacts still validate); declares optional variance_threshold_percent at root and optional spread_seconds / spread_percent / high_variance under stats. All additions are non-required.
- skills/xcode-build-orchestrator/SKILL.md lines 30 and 108: replace the prose-only 20% of median rule with a read of summary.<type>.high_variance from the artifact (script default 10%). Single source of truth is now the script.
- Sync benchmark_builds.py to its three bundled skill copies and the schema to its one bundled copy per CONTRIBUTING.md lines 66 and 74.
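
For readers who want a picture of the `stats_for` change without opening the diff, here is a minimal sketch. The real function body is not shown on this page, so the keys other than `spread_seconds`, `spread_percent`, and `high_variance` (and the signature itself) are assumptions made for illustration.

```python
from statistics import median


def stats_for(durations, variance_threshold_percent=10.0):
    """Sketch of per-build-type stats with the variance fields added.

    Only spread_seconds, spread_percent and high_variance are named in
    the PR; the other keys here are assumed.
    """
    if not durations:
        return {"count": 0}
    stats = {
        "count": len(durations),
        "min_seconds": min(durations),
        "max_seconds": max(durations),
        "median_seconds": median(durations),
    }
    # Variance needs at least two samples and a positive median,
    # otherwise the percentage is undefined.
    if stats["count"] >= 2 and stats["median_seconds"] > 0:
        spread = stats["max_seconds"] - stats["min_seconds"]
        percent = spread * 100 / stats["median_seconds"]
        stats["spread_seconds"] = spread
        stats["spread_percent"] = percent
        stats["high_variance"] = percent > variance_threshold_percent
    return stats


print(stats_for([90.0, 100.0, 110.0]))
print(stats_for([42.0]))  # single sample: no variance fields emitted
```

Omitting the fields rather than emitting nulls matches the schema change, where all three are optional.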

Wall-clock impact

This PR does not change build wall-clock time. It improves the signal-to-noise floor for detecting wall-clock changes between runs.

Smoke-tested against NativeSkeletonApp

Two live runs at --variance-threshold=10 (default) and --variance-threshold=20. Both runs emitted the new variance fields and the correct variance_threshold_percent at the artifact root. NativeSkeletonApp was stable on the day (clean spreads of 2.01% and 13.73% in the first and second run respectively), so neither live run breached its own threshold. Re-scoring the second run's measured durations at threshold=10 produces high_variance: True for both build types, confirming the auto-recommend Warning fires as designed.
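
The re-scoring step amounts to comparing the same measured spread against a different threshold. A small sketch using only the two clean spreads reported above (the helper name and dictionary labels are made up for illustration):

```python
def is_high_variance(spread_percent, threshold_percent):
    """True when the spread strictly exceeds the threshold."""
    return spread_percent > threshold_percent


# Clean-build spreads as reported for the two live runs.
reported_clean_spreads = {
    "run 1 (threshold 10)": 2.01,
    "run 2 (threshold 20)": 13.73,
}

rescored = {
    label: {t: is_high_variance(spread, t) for t in (10.0, 20.0)}
    for label, spread in reported_clean_spreads.items()
}

for label, result in rescored.items():
    print(label, result)
```

Run 2 passes at 20% but is flagged at 10%, which is the case the tightened default is meant to catch.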

Out of scope (separate finding)

While validating, I noticed scripts/benchmark_builds.py emits build_type: "zero-change" for --touch-file-less runs, but the schema's runs.<*>.items.build_type enum is ["clean", "cached-clean", "incremental"]. This is pre-existing on main and unrelated to variance — happy to follow up in a separate PR if useful.
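
A quick way to surface this kind of mismatch is to compare emitted values against the enum directly. The sketch below assumes `runs` maps a group name to a list of run dicts, and the group in which a zero-change run appears is a guess; only the enum values and the `"zero-change"` string are taken from the finding above.

```python
# Enum values as quoted from the schema.
SCHEMA_BUILD_TYPES = {"clean", "cached-clean", "incremental"}


def unknown_build_types(runs):
    """Return build_type values that the schema enum does not allow."""
    found = {run["build_type"] for group in runs.values() for run in group}
    return sorted(found - SCHEMA_BUILD_TYPES)


# Hypothetical artifact fragment for illustration.
runs = {
    "clean": [{"build_type": "clean"}],
    "incremental": [{"build_type": "zero-change"}],
}

print(unknown_build_types(runs))
```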

Lifts the variance-confidence rule from prose-only into the script. The orchestrator's SKILL.md previously instructed the agent to flag a benchmark as noisy when (max - min) exceeds 20% of the median, but benchmark_builds.py emitted no variance metric, leaving every consumer to recompute the rule ad hoc.

Changes

- scripts/benchmark_builds.py: add --variance-threshold flag (default 10%), compute spread/spread_percent/high_variance per build type, and emit a Warning to stderr plus an entry in artifact "notes" auto-recommending --repeats=5 when any build type breaches the threshold and --repeats < 5.
- schemas/build-benchmark.schema.json: add 1.3.0 to schema_version enum, declare optional variance_threshold_percent at the root, and declare optional spread_seconds/spread_percent/high_variance under stats. All additions are non-required so older artifacts remain valid.
- skills/xcode-build-orchestrator/SKILL.md: update the benchmark-confidence prose at lines 30 and 108 to read summary.<type>.high_variance from the artifact rather than restate the threshold percentage in two places.
- Sync benchmark_builds.py to the three bundled skill copies and the schema to its one bundled copy, per CONTRIBUTING.md lines 66 and 74.

Wall-clock: this PR does not change build wall-clock time; it improves the signal-to-noise floor for detecting wall-clock changes between runs. The default 20% -> 10% tightening lets the orchestrator catch real ~5% changes that previously fell inside baseline noise.

Copilot

Pull request overview

Moves the “benchmark variance / confidence” rule into the benchmark artifact so downstream skills and consumers can rely on a single, script-owned variance signal when interpreting clean/cached clean/incremental timing results.

Changes:

- Add --variance-threshold (default 10% of median) and emit spread_seconds, spread_percent, and high_variance in benchmark summary stats.
- Emit variance warnings (stderr + artifact notes) when variance is high and --repeats < 5, and record variance_threshold_percent at the artifact root.
- Bump benchmark artifact schema to include 1.3.0 and the new optional variance-related fields; update orchestrator docs to use summary.<type>.high_variance.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| skills/xcode-build-orchestrator/SKILL.md | Switches the orchestrator’s confidence check to consume `summary.<type>.high_variance` from artifacts. |
| skills/xcode-build-orchestrator/scripts/benchmark_builds.py | Adds variance threshold flag + variance stats emission + warning/notes behavior + schema v1.3.0. |
| skills/xcode-build-fixer/scripts/benchmark_builds.py | Sync of benchmark script changes (variance fields + schema v1.3.0). |
| skills/xcode-build-benchmark/scripts/benchmark_builds.py | Sync of benchmark script changes (variance fields + schema v1.3.0). |
| scripts/benchmark_builds.py | Canonical benchmark script updated to compute/emit variance and warn when under-repeating. |
| schemas/build-benchmark.schema.json | Extends schema_version enum to 1.3.0 and adds optional variance fields. |
| skills/xcode-build-benchmark/schemas/build-benchmark.schema.json | Bundled schema sync for the xcode-build-benchmark skill. |

+    parser.add_argument(
+        "--variance-threshold",
+        type=float,
+        default=10.0,
+        help="Percent of median above which the (max - min) spread is flagged as high variance. "
+        "Default: 10. High-variance benchmarks are not reliable for distinguishing real changes "
+        "from measurement noise; the script auto-recommends rerunning with --repeats=5 when the "
+        "threshold is exceeded.",
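
The hunk above is truncated, so here is a self-contained sketch of how the flag parses in isolation. The parser description and shortened help text are placeholders; only the flag name, type, and default are taken from the diff.

```python
import argparse

# Illustrative parser containing only the new flag, not the full script.
parser = argparse.ArgumentParser(description="variance flag demo")
parser.add_argument(
    "--variance-threshold",
    type=float,
    default=10.0,
    help="Percent of median above which the (max - min) spread is "
    "flagged as high variance. Default: 10.",
)

# Passing an explicit argv list keeps the demo independent of sys.argv.
default_args = parser.parse_args([])
loose_args = parser.parse_args(["--variance-threshold", "20"])

# argparse exposes the dashed flag as an underscored attribute.
print(default_args.variance_threshold, loose_args.variance_threshold)
```

Passing `--variance-threshold 20` restores the previous 20% behaviour for projects that are known to be noisy.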


    artifact = {
-        "schema_version": "1.2.0" if "cached_clean" in runs else "1.1.0",
+        "schema_version": "1.3.0",
+        "variance_threshold_percent": args.variance_threshold,
        "created_at": datetime.now(timezone.utc).isoformat(),


 2. Run `xcode-build-benchmark` to establish a baseline if no fresh benchmark exists. The benchmark script auto-detects `COMPILATION_CACHE_ENABLE_CACHING = YES` and includes cached clean builds that measure the realistic developer experience (warm cache). If the build fails to compile, check `git log` for a recent buildable commit. When working in a worktree, cherry-picking a targeted build fix from a feature branch is acceptable to reach a buildable state. If SPM packages reference gitignored directories in their `exclude:` paths (e.g., `__Snapshots__`), create those directories before building -- worktrees do not contain gitignored content and `xcodebuild -resolvePackageDependencies` will crash otherwise.
 3. Verify the benchmark artifact has non-empty `timing_summary_categories`. If empty, the timing summary parser may have failed -- re-parse the raw logs or inspect them manually. If `COMPILATION_CACHE_ENABLE_CACHING` is enabled, also verify the artifact includes `cached_clean` runs.
-   - **Benchmark confidence check**: For each build type (clean, cached clean, incremental), compare the min and max values. If the spread (max - min) exceeds 20% of the median, flag the benchmark as having high variance and recommend running additional repetitions (5+ runs) before drawing conclusions. High variance makes it difficult to distinguish real improvements from noise. After applying changes, only claim an improvement if the post-change median falls outside the baseline's min-max range.
+   - **Benchmark confidence check**: For each build type (clean, cached clean, incremental), read `summary.<type>.high_variance` from the benchmark artifact. The benchmark script flags this as `true` when the (max - min) spread exceeds the configured `--variance-threshold` (default 10% of the median) and emits an auto-recommendation to rerun with `--repeats=5` in `notes`. When any build type is flagged, treat the benchmark as inconclusive and rerun with more repetitions before drawing conclusions. After applying changes, only claim an improvement if the post-change median falls outside the baseline's min-max range.
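
On the consuming side, the updated confidence check reduces to reading one boolean per build type. A minimal sketch, assuming the artifact has a top-level `summary` object keyed by build type (the key names and sample values here are illustrative, and treating a missing field as "not flagged" is a choice made for this example rather than documented behaviour):

```python
import json


def flagged_build_types(artifact):
    """Return build types whose summary reports high_variance == true.

    Artifacts written before schema 1.3.0 lack the field; this sketch
    treats them as not flagged.
    """
    summary = artifact.get("summary", {})
    return [
        build_type
        for build_type, stats in summary.items()
        if stats.get("high_variance") is True
    ]


# Hypothetical artifact fragment.
artifact = json.loads("""
{
  "schema_version": "1.3.0",
  "variance_threshold_percent": 10.0,
  "summary": {
    "clean": {"spread_percent": 13.73, "high_variance": true},
    "incremental": {"spread_percent": 2.01, "high_variance": false}
  }
}
""")

noisy = flagged_build_types(artifact)
if noisy:
    print("Inconclusive; rerun with more repetitions:", ", ".join(noisy))
```

A stricter consumer might instead recompute the spread from min, max, and median when the field is absent.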


    "schema_version": {
      "type": "string",
-      "enum": ["1.0.0", "1.1.0", "1.2.0"]
+      "enum": ["1.0.0", "1.1.0", "1.2.0", "1.3.0"]
+    },
+    "variance_threshold_percent": {
+      "type": "number",
+      "minimum": 0


AvdLee requested a review from Copilot May 23, 2026 18:49

Copilot started reviewing on behalf of AvdLee May 23, 2026 18:50

Copilot AI reviewed May 23, 2026

kmuralidharan91 wants to merge 1 commit into AvdLee:main from kmuralidharan91:feat/variance-threshold-flag
