CI: minimize redundant build matrix to reduce cache pressure and flake exposure #6163

@hjmjohnson

ITK's CI matrix currently runs 17 build jobs per PR spread across GitHub Actions and Azure Pipelines. Several of these jobs are strict subsets of others (notably, every *Python Azure job is a superset of its non-Python sibling on the same OS/arch), and the MinSizeRel config used by all Azure Linux/Windows lanes tests an optimization profile almost no end user actually ships. The redundancy has real costs. Per-branch ccache/sccache entries multiply by job count and routinely push against the GitHub Actions per-repo cache limit, and each extra job is another roll of the dice on transient external fetches (the SWIG tarball mirror, pixi solver retries, ExternalData hosts).

This issue proposes consolidating the matrix from 17 → ~12 build jobs while preserving every coverage axis we currently care about (Win/Mac/Linux × x86_64/arm64 × Python × legacy-removed × C++20 × MSVC v142/v143 × shared/static).

Current matrix (what each job tests)
| # | Pipeline / Job | OS | Arch | Std / Build | Libs | Python | Legacy | Notes |
|---|---|---|---|---|---|---|---|---|
| CI_01 | GH arm.yml linux-arm | ubuntu-24.04-arm | arm64 | Release | static | | | Only native arm64 Linux |
| CI_02 | GH arm.yml macos-rosetta | macos-15 | x86_64 | Release | static | | | Only x86_64 macOS |
| CI_03 | GH arm.yml macos-py | macos-15 | arm64 | Release | static | 3.11 | | Overlaps Az MacOSPython |
| CI_04 | GH pixi.yml linux | ubuntu-22.04 | x86_64 | pixi (pixi) | | | | pixi-managed toolchain |
| CI_05 | GH pixi.yml windows | windows-2022 | x86_64 | pixi (pixi) | | | | pixi-managed toolchain |
| CI_06 | GH pixi.yml macos | macos-15 | arm64 | pixi (pixi) | | | | pixi-managed toolchain |
| CI_08 | Az Linux | ubuntu-22.04 | x86_64 | C++17 / MinSizeRel | static | | | subset of CI_11 |
| CI_09 | Az LinuxLegacyRemoved | ubuntu-22.04 | x86_64 | C++17 / MinSizeRel | static | | LEGACY_REMOVE=ON | only legacy-removed lane |
| CI_10 | Az LinuxCxx20 | ubuntu-24.04 | x86_64 | C++20 / MinSizeRel | static | | | only C++20 lane |
| CI_11 | Az LinuxPython | ubuntu-22.04 | x86_64 | C++17 / MinSizeRel | static | 3.10 | | |
| CI_12 | Az MacOS | macos-15 | arm64 | Release | shared | | | subset of CI_13 modulo shared-libs |
| CI_13 | Az MacOSPython | macos-15 | arm64 | Release | static | 3.10 | | |
| CI_14 | Az Windows | windows-2022 | x86_64 | MinSizeRel | shared | | | subset of CI_15 |
| CI_15 | Az WindowsPython | windows-2022 | x86_64 | MinSizeRel | shared | 3.11 | | |
| CI_16 | Az Batch v143 | windows-2022 | x86_64 | Release | shared | | | rolling/batch only |
| CI_17 | Az Batch v142 | windows-2022 | x86_64 | Release | shared | | | only v142 toolset |

Recommendations

A. Drop strict-subset jobs (3 deletions)

Per the rule "a Python build is a superset of a non-Python build on the same OS/arch/toolchain" — if a *Python job is green, the corresponding non-Python job adds no signal:

  • Delete AzurePipelinesLinux.yml job Linux (CI_08) — strict subset of LinuxPython (CI_11).
  • Delete AzurePipelinesWindows.yml (CI_14) — strict subset of WindowsPython (CI_15).
  • Delete AzurePipelinesMacOS.yml (CI_12) — overlaps MacOSPython (CI_13). To keep one shared-library macOS build, set BUILD_SHARED_LIBS=ON in MacOSPython instead of maintaining a separate lane.
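
The subset rule can be checked mechanically. The following is a minimal sketch, assuming each job is reduced to a set of (axis, value) pairs transcribed by hand from the matrix table; the `JOBS` dictionary and the `strict_subsets` helper are hypothetical names, and the build-type axis is left out for brevity.

```python
# Hypothetical model: each job is a set of (axis, value) coverage pairs,
# hand-transcribed from the matrix table. Build type is omitted for brevity.
JOBS = {
    "Linux (CI_08)": {("os", "ubuntu-22.04"), ("arch", "x86_64"),
                      ("std", "C++17"), ("libs", "static")},
    "LinuxPython (CI_11)": {("os", "ubuntu-22.04"), ("arch", "x86_64"),
                            ("std", "C++17"), ("libs", "static"),
                            ("python", "3.10")},
    "MacOS (CI_12)": {("os", "macos-15"), ("arch", "arm64"),
                      ("libs", "shared")},
    "MacOSPython (CI_13)": {("os", "macos-15"), ("arch", "arm64"),
                            ("libs", "static"), ("python", "3.10")},
    "Windows (CI_14)": {("os", "windows-2022"), ("arch", "x86_64"),
                        ("libs", "shared")},
    "WindowsPython (CI_15)": {("os", "windows-2022"), ("arch", "x86_64"),
                              ("libs", "shared"), ("python", "3.11")},
}


def strict_subsets(jobs):
    """Return (redundant, covering) pairs where one job's coverage
    is a proper subset of another's."""
    return sorted(
        (a, b)
        for a in jobs
        for b in jobs
        if a != b and jobs[a] < jobs[b]  # '<' is proper-subset on sets
    )


for redundant, covering in strict_subsets(JOBS):
    print(f"{redundant} is covered by {covering}")
```

In this simplified model MacOS (CI_12) is not reported as a strict subset, because its shared-libs value differs from MacOSPython (CI_13). That is the reason the third bullet pairs the deletion with setting BUILD_SHARED_LIBS=ON in MacOSPython.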

B. Merge orthogonal Linux axes (1 deletion)

LinuxLegacyRemoved (CI_09) and LinuxCxx20 (CI_10) are both Ubuntu / gcc / MinSizeRel / static / no-Python lanes that differ in exactly one CMake flag each. They are orthogonal:

  • Combine into a single LinuxLegacyRemovedCxx20 job on ubuntu-24.04 with both ITK_LEGACY_REMOVE=ON and CMAKE_CXX_STANDARD=20. Failure-mode separation is rarely needed in CI; local reproduction handles the rare bisect.
  • Set BUILD_EXAMPLES:BOOL=ON on this same lane so examples coverage (currently OFF everywhere on Azure) is preserved without burdening the Python wrapping jobs that already dominate per-PR wall time. This makes LinuxLegacyRemovedCxx20 the comprehensive non-Python signal: legacy-removal + C++20 + examples + MinSizeRel canary, all in one job.

C. Standardize on Release, keep one MinSizeRel canary

MinSizeRel (-Os) tests an optimization profile our users overwhelmingly do not ship — they ship Release (-O3) or RelWithDebInfo. Currently 5 of 7 Azure jobs use MinSizeRel. The historical reason (small artifacts on free-tier runners) no longer applies on current Azure/GitHub images.

  • Switch LinuxPython, WindowsPython to Release. MacOS* already uses Release.
  • Keep the merged LinuxLegacyRemovedCxx20 on MinSizeRel as a single canary that the unusual optimizer config still builds.
  • Consider adding one Debug lane (asserts on, no NDEBUG) with budget freed up — catches a strictly more useful bug class than MinSizeRel.

D. Python-version spread (no deletions, just policy)

Python wrapping coverage doesn't need duplicating per OS — staggering the Python version across the surviving Python jobs so the union covers our supported range is sufficient:

  • Linux Python 3.10 → keep 3.10
  • macOS Python 3.10 → bump to 3.12
  • Windows Python 3.11 → keep 3.11

This way each supported Python sees exposure on at least one OS, instead of two jobs both running 3.10.
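
A quick way to sanity-check such a spread is to compare the union of per-OS versions against the supported range. This is a minimal sketch; the `SUPPORTED` set is an assumption inferred from the versions named in this issue, not an authoritative statement of ITK's supported Python range.

```python
# Assumption: supported range inferred from the versions named in this issue.
SUPPORTED = {"3.10", "3.11", "3.12"}

# Per-OS Python versions on the Azure Python lanes, before and after.
current = {"Linux": "3.10", "macOS": "3.10", "Windows": "3.11"}
proposed = {"Linux": "3.10", "macOS": "3.12", "Windows": "3.11"}


def uncovered(assignment, supported=SUPPORTED):
    """Return the supported versions that no lane exercises."""
    return supported - set(assignment.values())


print("current gaps: ", sorted(uncovered(current)))
print("proposed gaps:", sorted(uncovered(proposed)))
```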

E. Keep as-is

  • arm.yml jobs CI_01, CI_02 — only native arm64 Linux and only x86_64 macOS coverage.
  • pixi.yml jobs CI_04–CI_06 — fast PR signal with a different toolchain provenance from Azure; intentional cross-check.
  • AzurePipelinesBatch v142 + v143 (CI_16, CI_17) — only MSVC v142 coverage and run on integration only, not per-PR; cheap.

Net result: 17 → 12 PR build jobs

| | Before | After |
|---|---|---|
| Build jobs per PR | 17 | 12 (−5) |
| MinSizeRel jobs | 5 | 1 |
| Release jobs | 11 | 10 |
| Legacy-removed lanes | 1 | 1 |
| C++20 lanes | 1 | 1 (merged with legacy) |
| Python wrapping coverage (OSes) | 3 | 3 |
| Native arm64 coverage | yes | yes |
| x86_64 macOS coverage | yes | yes |
| MSVC v142 coverage | yes (batch) | yes (batch) |
| shared-lib coverage on each OS | yes | yes |

Estimated cache and compute savings

These are order-of-magnitude estimates; concrete numbers will need a measurement pass on a representative PR.

Cache footprint (ccache / sccache / pixi cache combined):

  • Per-job cache size for ITK on a warm build is typically 0.5–2 GB depending on platform (Windows/MSVC sccache largest, pixi-managed Linux smaller). Call it ~1.2 GB average.
  • GitHub Actions for ITK enforces a 45 GB per-repo cache cap with LRU eviction. We currently push past it routinely, which is why warm builds frequently regress to cold-cache wall times on busy days.
  • Removing 3 GH-side jobs should free ~3–4 GB of steady-state cache pressure, which means measurably fewer LRU evictions on active PR queues. Switching to Release may also shrink per-job cache size marginally, since it produces fewer/smaller object files than MinSizeRel on some toolchains.
  • Azure pipeline caches are sized differently, but the same logic applies; expect ~5–7 GB less active cache surface across both systems.
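
The combined figure follows from the per-job average. This is a back-of-the-envelope sketch using the ~1.2 GB average and the 17 → 12 job counts quoted in this issue; it assumes one cache entry per job and ignores the per-branch multiplication, so treat the result as a lower-bound illustration rather than a measurement.

```python
import math

AVG_CACHE_GB = 1.2                 # rough per-job average quoted above
JOBS_BEFORE, JOBS_AFTER = 17, 12   # job counts from this issue


def cache_surface_gb(jobs, avg_gb=AVG_CACHE_GB):
    """Active cache surface, assuming one cache entry per job."""
    return jobs * avg_gb


saved_gb = cache_surface_gb(JOBS_BEFORE) - cache_surface_gb(JOBS_AFTER)
print(f"before: {cache_surface_gb(JOBS_BEFORE):.1f} GB")
print(f"after:  {cache_surface_gb(JOBS_AFTER):.1f} GB")
print(f"saved:  {saved_gb:.1f} GB")
```

The result of about 6 GB sits inside the ~5–7 GB range given for both systems combined.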

Compute time (PR end-to-end wall clock):

  • Azure jobs that are dropped (CI_08, CI_12, CI_14): each runs ~45–75 minutes, including provisioning, configuration, build, and test. Removing three of them saves roughly 3 hours of agent-time per PR.
  • Merging CI_09+CI_10 saves another job's startup + configure overhead even though the build itself is similar in size, roughly 30–45 minutes per PR.
  • Switching MinSizeRel → Release on the surviving jobs is approximately wall-clock-neutral on build (slightly more inlining, slightly less code-size optimization) but tests run noticeably faster under Release due to better vectorization → expect ~5–10% faster ctest phases on the affected jobs.
  • Total per-PR wall-clock saved on the critical path: ~30–60 minutes (jobs run in parallel, so the saving comes from removing slow lanes, not from summing).
  • Total per-PR agent-minutes saved (billable / queue-pressure metric): ~3.5–4 hours.
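
The agent-time total can be reproduced from the ranges above. This is a minimal sketch that simply sums the quoted per-job ranges; the constant names are hypothetical, and the bounds it yields are wider than the ~3.5–4 hour estimate because they combine the extreme ends of every range.

```python
DROPPED_JOBS = 3                    # CI_08, CI_12, CI_14
JOB_MINUTES = (45, 75)              # quoted per-job range, in minutes
MERGE_SAVING_MINUTES = (30, 45)     # quoted saving from merging CI_09+CI_10


def agent_minutes_saved():
    """Return (low, high) agent-minutes saved per PR."""
    low = DROPPED_JOBS * JOB_MINUTES[0] + MERGE_SAVING_MINUTES[0]
    high = DROPPED_JOBS * JOB_MINUTES[1] + MERGE_SAVING_MINUTES[1]
    return low, high


low, high = agent_minutes_saved()
print(f"agent-time saved per PR: {low / 60:.2f} to {high / 60:.2f} hours")
```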

Flake-rate reduction:

  • Each Azure job pulls SWIG tarball, ExternalData, pixi solver, gcc/clang from apt, etc. Empirically, the per-job flake rate from external fetches is on the order of 1–3% when mirrors are healthy and spikes much higher during outages.
  • With 17 jobs, the probability that at least one hits a transient fetch failure on a given PR is roughly 1 − (1 − p)^17 ≈ 15–40% depending on conditions.
  • With 12 jobs, the same calculation yields 11–30% — a meaningful drop in spurious red CI without any change to actual test coverage.
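
The two ranges above can be reproduced with a few lines. This sketch assumes each job flakes independently with the same per-job probability p, which is a simplification: a mirror outage would hit many jobs at once.

```python
def p_any_flake(p, n_jobs):
    """Probability that at least one of n_jobs independent jobs flakes,
    given a per-job flake probability p."""
    return 1 - (1 - p) ** n_jobs


for n_jobs in (17, 12):
    low = p_any_flake(0.01, n_jobs)   # healthy-mirror end of the 1-3% range
    high = p_any_flake(0.03, n_jobs)  # upper end of the 1-3% range
    print(f"{n_jobs} jobs: {low:.1%} to {high:.1%}")
```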

Caveat: all numbers assume current runner hardware and current cache infrastructure behavior. A short measurement pass (one PR before, one after the consolidation) would confirm.

Suggested rollout

  1. Land deletion of AzurePipelinesLinux.yml, AzurePipelinesWindows.yml, AzurePipelinesMacOS.yml in one commit. Trivially revertible.
  2. Merge LinuxLegacyRemoved + LinuxCxx20 → LinuxLegacyRemovedCxx20 in a second commit on ubuntu-24.04 with BUILD_EXAMPLES:BOOL=ON to preserve examples coverage on the comprehensive non-Python lane.
  3. Switch LinuxPython and WindowsPython from MinSizeRel → Release in a third commit.
  4. (Optional) Add a single LinuxDebug lane in a fourth commit using budget freed by steps 1–3.
  5. Tag a maintainer to confirm Azure-side pipeline definitions on dev.azure.com/itkrobotmacospython are updated to match the deleted YAML files.
