ci: artifact handoff + buildkit /proc/acpi fix + auto-versioned kernel releases #50

Merged: lacraig2 merged 5 commits into main from workspace/ci-artifact-handoff on Jun 19, 2026
@lacraig2 lacraig2 commented Jun 17, 2026

Contributor

This PR makes five related changes to the kernel CI, most of them in .github/workflows/build.yml.

1. Build→aggregate handoff via workflow artifacts

The kernel pipeline (build matrix → aggregate) passed per-target tarballs through a shared hostPath (/home/runner/_shared/runs/$GITHUB_RUN_ID/build-output): build wrote them, aggregate read them back. That only works while all rehosting-arc runners are pinned to one node.

Now:

  • build uploads kernels-<target> via actions/upload-artifact
  • aggregate pulls them with actions/download-artifact (merge-multiple) into a workspace-local build-output/, then combines/releases exactly as before.

The obsolete per-run runs/ cleanup is removed (artifacts expire on retention; the workspace dir is ephemeral). Combine/merge/release logic is unchanged.
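
As a rough local illustration of the handoff, the sketch below approximates what a merge-multiple download does: the contents of every kernels-<target> artifact land directly in one build-output/ directory instead of in per-artifact subdirectories. The target names, file names, and the `merge_artifacts` helper are hypothetical, and the files are placeholders rather than real tarballs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# merge_artifacts SRC DEST: copy the contents of every SRC/kernels-* directory
# straight into DEST, approximating a merge-multiple artifact download.
merge_artifacts() {
  local src=$1 dest=$2 dir
  mkdir -p "$dest"
  for dir in "$src"/kernels-*/; do
    [ -d "$dir" ] || continue        # glob matched nothing
    cp -a "$dir". "$dest"/           # "dir/." copies the contents, not the dir
  done
}

demo=$(mktemp -d)
for target in x86_64 arm64 powerpc; do   # hypothetical target names
  mkdir -p "$demo/artifacts/kernels-$target"
  # Placeholder payload; the file name is an assumption for illustration.
  echo "$target" > "$demo/artifacts/kernels-$target/kernels-$target.tar.gz"
done

merge_artifacts "$demo/artifacts" "$demo/build-output"
ls "$demo/build-output"
```

Because all artifacts share one destination, file names must be unique per target, otherwise a later artifact would overwrite an earlier one.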

Why: removes the last cross-node dependency in rehosting CI, so the rehosting-arc scale set can drop its single-node nodeSelector and spread across both cluster nodes (downstream kube change).

2. Fix buildkit /proc/acpi regression (unblocks all builds)

The buildx setup pinned image=moby/buildkit:master. A recent master regressed runc on the kernel-5.4 self-hosted runners:

runc run failed: ... can't mask dir "/proc/acpi": mount ... MS_RDONLY ... invalid argument

failing the first RUN of every image build — which is why the build matrix went all-red after 2026-05-17 with no workflow change of its own. Dropping the image= pin lets buildx use its default pinned-stable buildkit (runc v1.3.3). network=host and the registry config are kept. Mirrors rehosting/penguin c35bedc5 (same fix also applied in rehosting/embedded-toolchains).

3. Auto-versioned kernel releases (like rehosting/penguin)

Releases now cut on merges to main and dev_* tags (was: manual v* tag pushes). Version is computed by reecetech/version-increment (use_api) and the release is tagged vX.Y.Z; dev_* tags publish as prereleases.

  • The v* push trigger is dropped — the release step now creates v* tags, so a v* trigger would re-fire the workflow indefinitely.
  • workflow_dispatch still runs build+aggregate (to validate the pipeline) but does not publish a release.
  • Per-ref concurrency added so concurrent main merges can't race the version bump onto the same tag.
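
The release rules above can be summarized as a small decision function. This is only a sketch of the described behavior, not the workflow's actual condition: `release_mode` is a hypothetical helper, and its two arguments stand in for the values GitHub exposes as GITHUB_EVENT_NAME and GITHUB_REF.

```shell
#!/usr/bin/env bash
set -euo pipefail

# release_mode EVENT REF prints one of: release, prerelease, none.
release_mode() {
  local event=$1 ref=$2
  if [ "$event" != "push" ]; then
    echo none                        # e.g. workflow_dispatch: build+aggregate only
    return
  fi
  case "$ref" in
    refs/heads/main) echo release ;;     # merge to main: vX.Y.Z release
    refs/tags/dev_*) echo prerelease ;;  # dev_* tag: prerelease
    *)               echo none ;;        # anything else, including v* tags
  esac
}

release_mode push refs/heads/main
release_mode push refs/tags/dev_test          # hypothetical tag name
release_mode workflow_dispatch refs/heads/main
```

In the real workflow a v* tag push never reaches this decision because the trigger itself was dropped; the `none` branch here only mirrors that outcome.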

Test plan

  • PR run: build matrix passes (buildkit fix verified live).
  • workflow_dispatch run: aggregate downloads all kernels-* artifacts, combines, and produces kernels-latest.tar.gz + kernel-devel-all.tar.gz.
  • Round-trip contents equivalence verified by local simulation of upload → merge-multiple download → combine/osi.config-merge.
  • (post-merge) downstream kube nodeSelector drop → runners spread to both nodes.

4. Slim kernel-devel-all (~75–85% smaller)

The per-target kernel-devel tree shipped the full source/build trees: tools/ (88 MB x86_64 / 267 MB arm64), prebuilt .o (183 MB), boot images/vmlinux (166 MB), and .cmd (23 MB), none of which an out-of-tree module build reads. Pruned in _in_container_build.sh to the modules_prepare result (Makefile/.config/Module.symvers, include/ incl. generated, arch/<arch>/include + Makefiles, scripts/ host tools). Keeps tools/objtool and arch/powerpc/lib/crtsavres.o (igloo_driver links it on ppc). Also fixes a latent bug where the x86_64 arch/${short_arch}/boot removal was a no-op (real dir is arch/x86), so ~120 MB of bzImage/vmlinux shipped.
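
A minimal sketch of this kind of pruning, run against a throwaway tree, is shown below. It is not the code in _in_container_build.sh: `prune_devel_tree` and the demo file names are hypothetical, and the realmode and .c removals are left out for brevity.

```shell
#!/usr/bin/env bash
set -euo pipefail

# prune_devel_tree DIR: drop files an out-of-tree module build does not read.
prune_devel_tree() {
  local tree=$1
  # Boot images: glob over arch/* so x86_64 (real dir arch/x86) is covered too.
  rm -rf "$tree"/arch/*/boot
  # Everything directly under tools/ except tools/objtool.
  if [ -d "$tree/tools" ]; then
    find "$tree/tools" -mindepth 1 -maxdepth 1 ! -name objtool -exec rm -rf {} +
  fi
  # kbuild .cmd files everywhere.
  find "$tree" -type f -name '*.cmd' -delete
  # Objects everywhere except scripts/, tools/, and the powerpc crtsavres.o.
  find "$tree" -type f -name '*.o' \
    ! -path "$tree/scripts/*" ! -path "$tree/tools/*" \
    ! -path "$tree/arch/powerpc/lib/crtsavres.o" -delete
}

# Build a tiny stand-in tree (file names are illustrative only).
demo=$(mktemp -d)
mkdir -p "$demo"/arch/x86/boot "$demo"/arch/powerpc/lib \
         "$demo"/tools/objtool "$demo"/tools/perf "$demo"/scripts "$demo"/kernel
touch "$demo"/arch/x86/boot/bzImage "$demo"/arch/powerpc/lib/crtsavres.o \
      "$demo"/tools/objtool/objtool "$demo"/tools/perf/perf.c \
      "$demo"/scripts/fixdep.o "$demo"/kernel/fork.o \
      "$demo"/kernel/.fork.o.cmd "$demo"/Makefile

prune_devel_tree "$demo"
find "$demo" -type f | sort
```

Globbing over arch/*/boot rather than building the path from a short arch name sidesteps the x86_64 versus arch/x86 mismatch described above.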

Measured (run 27795353512): x86_64 devel 244 MB → 25 MB compressed (605 → 111 MB uncompressed), including the kernel .c strip; powerpc → 14 MB / 61 MB.
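
As a cross-check on the measured x86_64 figures, the reduction can be computed directly from the before/after sizes. `percent_smaller` is a hypothetical helper that rounds to the nearest whole percent using integer arithmetic.

```shell
#!/usr/bin/env bash
set -euo pipefail

# percent_smaller BEFORE AFTER: size reduction as a whole percent (rounded).
percent_smaller() {
  local before=$1 after=$2
  echo $(( ((before - after) * 100 + before / 2) / before ))
}

echo "x86_64 compressed:   $(percent_smaller 244 25)% smaller"    # 90
echo "x86_64 uncompressed: $(percent_smaller 605 111)% smaller"   # 82
```

So the x86_64 devel tree comes out roughly 90% smaller compressed and 82% smaller uncompressed.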

Verified: all 13 builds green; igloo_driver modules build green against the slimmed trees (incl. .c removal) for x86_64 (4.10 + 6.13) and powerpc (6.13, exercises the crtsavres.o link). Confirmed crtsavres.o is preserved where it exists and was already absent (not removed by this change) for powerpc64/powerpc-4.10.

5. Node-agnostic kernel sources (per-node lazy cache)

prebuild used to populate the shared /home/runner/_shared/linux_sources hostPath and every build job mounted it — so build had to land on prebuild's node (the reason rehosting-arc was nodeSelector-pinned). Moved source prep into a per-node "Ensure kernel sources on this node" step in each build job: SHA-keyed under linux_sources/<key>/, flock-arbitrated (first job on a node populates, the rest reuse), atomic publish + .ready stamp, node-local bare clone for fast file:// submodule fetches, 14-day GC. prebuild is now matrix-discovery only.
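
The flock arbitration and atomic publish can be sketched as follows. This is a simplified illustration under stated assumptions, not the workflow's actual step: `ensure_sources` and `populate_sources` are hypothetical names, the key is a made-up value, the .ready stamp is assumed to live inside the keyed directory, and `flock` from util-linux is assumed to be installed on the runner.

```shell
#!/usr/bin/env bash
set -euo pipefail

# populate_sources DIR: stand-in for the real source checkout (hypothetical).
populate_sources() {
  echo "placeholder kernel tree" > "$1/README"
}

# ensure_sources ROOT KEY: make ROOT/KEY complete, populating it at most once.
ensure_sources() {
  local root=$1 key=$2
  mkdir -p "$root"
  (
    flock 9                                  # block until the lock holder finishes
    if [ -f "$root/$key/.ready" ]; then
      echo "Reusing cached $key"
    else
      echo "Populating $key"
      rm -rf "$root/$key" "$root/$key.tmp"   # clear leftovers from an interrupted run
      mkdir "$root/$key.tmp"
      populate_sources "$root/$key.tmp"
      mv "$root/$key.tmp" "$root/$key"       # rename within one filesystem
      touch "$root/$key/.ready"              # only stamped trees are trusted
    fi
  ) 9>"$root/$key.lock"
}

demo=$(mktemp -d)
ensure_sources "$demo/linux_sources" abc123   # hypothetical key
ensure_sources "$demo/linux_sources" abc123
```

The first call prints a Populating line and the second a Reusing cached line, matching the one-populate, rest-reuse pattern the lock is meant to enforce.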

Verified (run 27828830325): all 13 builds green; logs confirm exactly one job logged Populating and the other 12 Reusing cached for the same key on the same node — the flock arbitration works. This removes the last node-coupling, so dropping the nodeSelector (task 3) is now safe.
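
The SHA-keyed layout and the 14-day GC from change 5 could look roughly like the sketch below. Both helpers are hypothetical: how the workflow actually derives its key is not shown in this PR description, so `sources_key` simply hashes "path sha" lines read from stdin, and `gc_sources` assumes a tree's directory mtime is refreshed whenever a job reuses it. `touch -d` in the demo is GNU-specific.

```shell
#!/usr/bin/env bash
set -euo pipefail

# sources_key: read "path sha" lines on stdin, print a short order-independent key.
sources_key() {
  LC_ALL=C sort | sha256sum | cut -c1-16
}

# gc_sources ROOT DAYS: best-effort removal of keyed trees older than DAYS days.
gc_sources() {
  local root=$1 days=$2
  find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" \
    -exec rm -rf {} + || true
}

# Made-up submodule pins, for illustration only.
printf 'linux/6.13 1111\nlinux/4.10 2222\n' | sources_key

demo=$(mktemp -d)
mkdir "$demo/old" "$demo/new"
touch -d '20 days ago' "$demo/old"    # simulate a tree nobody has used recently
gc_sources "$demo" 14
ls "$demo"
```

Sorting before hashing means the key depends only on the set of pinned SHAs, so bumping any one submodule yields a new key and a fresh directory rather than an in-place update.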

Validation status

  • buildkit fix: all 13 builds pass live (run 27762128012 / 27795353512).
  • round-trip: workflow_dispatch aggregate downloaded all kernels-*, combined, produced both tarballs; publish correctly skipped (run 27762128012).
  • kernel-devel slimming: ~88% smaller, igloo_driver builds green (x86_64 + powerpc).
  • node-agnostic sources: per-node flock cache verified (1 populate / 12 reuse, run 27828830325).
  • (post-merge) downstream kube nodeSelector drop — now unblocked (prebuild→build coupling resolved by change 5).

Luke Craig and others added 2 commits June 17, 2026 18:57
The build matrix wrote per-target kernel tarballs to a shared hostPath
(/home/runner/_shared/runs/$GITHUB_RUN_ID) and the aggregate job read them
back. That only works while all rehosting-arc runners are pinned to one node.
Pass the tarballs through actions/upload-artifact + download-artifact instead,
removing the cross-node dependency so rehosting CI can run across both cluster
nodes. Combine/merge/release logic is unchanged.

Two CI fixes to build.yml:

1. Drop the image=moby/buildkit:master pin from the buildx setup. A recent
   buildkit master regressed runc on the kernel-5.4 self-hosted runners
   ("can't mask dir /proc/acpi ... MS_RDONLY ... invalid argument"), failing
   the first RUN of every image build (this is why all build-matrix jobs
   started failing). Letting buildx use its default pinned-stable buildkit
   fixes it; network=host and the registry config are kept. Mirrors
   rehosting/penguin c35bedc5.

2. Rework kernel releasing to auto-version like rehosting/penguin:
   - Release on merges to main and on dev_* tags (was: manual v* tag pushes).
   - Compute the version with reecetech/version-increment (use_api) and tag
     the release vX.Y.Z; dev_* tags publish as prereleases.
   - Drop the v* push trigger -- the release now *creates* v* tags, which
     would otherwise re-trigger the workflow indefinitely.
   - workflow_dispatch still runs build+aggregate (to validate the pipeline)
     but does not publish a release.
   - Add per-ref concurrency so concurrent main merges can't race the version
     bump onto the same tag.
@lacraig2 lacraig2 changed the title ci: hand build→aggregate kernel artifacts via workflow artifacts ci: artifact handoff + buildkit /proc/acpi fix + auto-versioned kernel releases Jun 18, 2026
lacraig2 added 3 commits June 18, 2026 19:20
The per-target kernel-devel tree shipped the full source/build trees: tools/
(88MB x86_64, 267MB arm64), prebuilt .o objects (183MB), boot images / vmlinux
(166MB), and .cmd files (23MB) -- none of which an out-of-tree module build
(make -C $KDIR M=$PWD modules, e.g. igloo_driver) reads. Measured x86_64 devel
artifact: 244MB compressed / 605MB uncompressed; arm64 100MB / 397MB; aggregate
release asset ~1+GB.

Prune the staged tree after assembly, keeping only what kbuild needs for an
external module: Makefile/.config/Module.symvers, include/ (incl. generated),
arch/<arch>/include + Makefiles, and scripts/ host tools. Drops:
- arch/*/boot + realmode (also fixes a latent bug: the existing
  arch/${short_arch}/boot removal is a no-op for x86_64 since the real dir is
  arch/x86, so ~120MB of bzImage/vmlinux shipped).
- tools/ except tools/objtool (kept in case kbuild runs objtool on module
  objects when CONFIG_OBJTOOL=y).
- *.cmd everywhere; *.o everywhere EXCEPT arch/powerpc/lib/crtsavres.o
  (igloo_driver links it for ppc targets), scripts/, and tools/.

.c source removal is intentionally left as a follow-up. Estimated: x86_64
605MB->~100MB, arm64 397MB->~80MB uncompressed; aggregate ~1GB -> a few hundred MB.

Tier-2 follow-up to the devel slimming: an external-module build (make -C $KDIR
M=$PWD modules) compiles the module's own sources against prebuilt objects and
headers and never recompiles in-tree .c, so drop arch/<arch>/ and include/ .c
from the staged devel tree. scripts/ and tools/ sources are kept in case a host
tool needs rebuilding. Further shrinks the artifact on top of the tools/boot/.o
removals.

build jobs mounted /home/runner/_shared/linux_sources, populated only by
prebuild on prebuild's node -- so build had to land on the same node, which is
why rehosting-arc was pinned with a nodeSelector. This is the remaining
cross-node coupling after the build->aggregate artifact handoff.

Move source preparation out of prebuild (now matrix-discovery only) and into a
per-node "Ensure kernel sources on this node" step in each build job:

- Key the staged tree on the pinned linux/<ver> submodule SHAs, under
  linux_sources/<key>/. A node reuses its tree across runs and only
  re-populates when a submodule is actually bumped; distinct keys never clobber
  each other (so overlapping runs on different SHAs are safe).
- Arbitrate with flock on the shared fs: the first build job on a node
  populates; the rest block then reuse. GH 'concurrency' is cross-node and
  can't serialize same-node jobs, and this also removes the cp/rsync --delete
  race the old single shared dir hit (observed as rsync exit 24).
- Publish atomically (populate into <key>.tmp, then mv + .ready stamp) so a
  partial tree is never consumed. Node-local bare clone is kept for fast
  file:// submodule fetches. Best-effort GC of keyed trees untouched for 14d.

With this, a build job is self-sufficient on whichever node it lands on, so the
rehosting-arc nodeSelector can be dropped (task 3) to spread across both nodes.
@lacraig2 lacraig2 merged commit 4962e2d into main Jun 19, 2026
15 checks passed