ci: artifact handoff + buildkit /proc/acpi fix + auto-versioned kernel releases #50
Merged
The build matrix wrote per-target kernel tarballs to a shared hostPath (/home/runner/_shared/runs/$GITHUB_RUN_ID) and the aggregate job read them back. That only works while all rehosting-arc runners are pinned to one node. Pass the tarballs through actions/upload-artifact + download-artifact instead, removing the cross-node dependency so rehosting CI can run across both cluster nodes. Combine/merge/release logic is unchanged.
Two CI fixes to build.yml:
1. Drop the image=moby/buildkit:master pin from the buildx setup. A recent
   buildkit master regressed runc on the kernel-5.4 self-hosted runners
   ("can't mask dir /proc/acpi ... MS_RDONLY ... invalid argument"), failing
   the first RUN of every image build (this is why all build-matrix jobs
   started failing). Letting buildx use its default pinned-stable buildkit
   fixes it; network=host and the registry config are kept. Mirrors
   rehosting/penguin c35bedc5.
2. Rework kernel releasing to auto-version like rehosting/penguin:
   - Release on merges to main and on dev_* tags (was: manual v* tag pushes).
   - Compute the version with reecetech/version-increment (use_api) and tag
     the release vX.Y.Z; dev_* tags publish as prereleases.
   - Drop the v* push trigger -- the release now *creates* v* tags, which
     would otherwise re-trigger the workflow indefinitely.
   - workflow_dispatch still runs build+aggregate (to validate the pipeline)
     but does not publish a release.
   - Add per-ref concurrency so concurrent main merges can't race the version
     bump onto the same tag.
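The trigger rules in item 2 can be read as a small decision table. The sketch below restates them as a shell function; it is an illustration of the described behaviour, not the workflow's actual implementation. The function name `release_mode` and its output strings are hypothetical, and the arguments are assumed to carry the values GitHub Actions exposes as `GITHUB_EVENT_NAME` and `GITHUB_REF`.

```shell
#!/bin/sh
# Hypothetical helper: classify a workflow run by event name and ref.
# Prints one of: release, prerelease, none.
release_mode() {
  local event="$1" ref="$2"
  # Manual runs validate build+aggregate but never publish.
  if [ "$event" = "workflow_dispatch" ]; then
    echo none
    return
  fi
  case "$ref" in
    refs/heads/main) echo release ;;     # merge to main: release tagged vX.Y.Z
    refs/tags/dev_*) echo prerelease ;;  # dev_* tag: published as a prerelease
    *)               echo none ;;        # everything else, including v* tags
  esac
}
```

For example, `release_mode push refs/tags/dev_test` prints `prerelease`, while a `v*` tag falls through to `none`, which is the property that keeps the release step from re-triggering itself.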
The per-target kernel-devel tree shipped the full source/build trees: tools/
(88MB x86_64, 267MB arm64), prebuilt .o objects (183MB), boot images / vmlinux
(166MB), and .cmd files (23MB) -- none of which an out-of-tree module build
(make -C $KDIR M=$PWD modules, e.g. igloo_driver) reads. Measured x86_64 devel
artifact: 244MB compressed / 605MB uncompressed; arm64 100MB / 397MB; aggregate
release asset ~1+GB.
Prune the staged tree after assembly, keeping only what kbuild needs for an
external module: Makefile/.config/Module.symvers, include/ (incl. generated),
arch/<arch>/include + Makefiles, and scripts/ host tools. Drops:
- arch/*/boot + realmode (also fixes a latent bug: the existing
arch/${short_arch}/boot removal is a no-op for x86_64 since the real dir is
arch/x86, so ~120MB of bzImage/vmlinux shipped).
- tools/ except tools/objtool (kept in case kbuild runs objtool on module
objects when CONFIG_OBJTOOL=y).
- *.cmd everywhere; *.o everywhere EXCEPT arch/powerpc/lib/crtsavres.o
(igloo_driver links it for ppc targets), scripts/, and tools/.
.c source removal is intentionally left as a follow-up. Estimated: x86_64
605MB->~100MB, arm64 397MB->~80MB uncompressed; aggregate ~1GB -> a few hundred MB.
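The drop list above can be sketched with `rm` and `find`. This is a minimal sketch under stated assumptions, not the code in the build script: the function name `prune_devel_tree` is hypothetical, the tree path is assumed to contain no glob characters, and a `find` that supports `-mindepth`, `-maxdepth`, and `-delete` (GNU or BSD) is assumed.

```shell
#!/bin/sh
# Hypothetical sketch of the pruning rules; $1 is the staged devel tree.
prune_devel_tree() {
  local tree="$1"
  # Boot images and realmode blobs, for every arch dir that has them.
  # Globbing arch/* avoids the arch/x86_64 vs arch/x86 naming mismatch.
  rm -rf "$tree"/arch/*/boot "$tree"/arch/*/realmode
  # Drop everything directly under tools/ except tools/objtool.
  if [ -d "$tree/tools" ]; then
    find "$tree/tools" -mindepth 1 -maxdepth 1 ! -name objtool \
      -exec rm -rf {} +
  fi
  # kbuild command files (.foo.o.cmd) are not needed by external modules.
  find "$tree" -type f -name '*.cmd' -delete
  # Prebuilt objects go everywhere except scripts/, tools/, and the one
  # object igloo_driver links on ppc targets.
  find "$tree" -type f -name '*.o' \
    ! -path "$tree/scripts/*" \
    ! -path "$tree/tools/*" \
    ! -path "$tree/arch/powerpc/lib/crtsavres.o" \
    -delete
}
```

Because the exclusions are expressed as `-path` tests on the same `find` that deletes, a kept file such as `crtsavres.o` is never a candidate for removal in the first place, rather than being deleted and restored.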
Tier-2 follow-up to the devel slimming: an external-module build (make -C $KDIR M=$PWD modules) compiles the module's own sources against prebuilt objects and headers and never recompiles in-tree .c, so drop arch/<arch>/ and include/ .c from the staged devel tree. scripts/ and tools/ sources are kept in case a host tool needs rebuilding. Further shrinks the artifact on top of the tools/boot/.o removals.
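A minimal sketch of this follow-up step, assuming the same staged-tree layout as above. The function name `strip_c_sources` and its arguments are hypothetical; the second argument is the kernel's arch directory name (for example `x86`, not `x86_64`).

```shell
#!/bin/sh
# Hypothetical sketch: remove in-tree .c files from arch/<arch>/ and include/,
# leaving scripts/ and tools/ sources alone so host tools can be rebuilt.
strip_c_sources() {
  local tree="$1" arch_dir="$2" d
  for d in "$tree/arch/$arch_dir" "$tree/include"; do
    if [ -d "$d" ]; then
      find "$d" -type f -name '*.c' -delete
    fi
  done
}
```

Headers next to the removed sources are untouched, since only `*.c` is matched.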
build jobs mounted /home/runner/_shared/linux_sources, populated only by prebuild on prebuild's node -- so build had to land on the same node, which is why rehosting-arc was pinned with a nodeSelector. This is the remaining cross-node coupling after the build->aggregate artifact handoff.
Move source preparation out of prebuild (now matrix-discovery only) and into a per-node "Ensure kernel sources on this node" step in each build job:
- Key the staged tree on the pinned linux/<ver> submodule SHAs, under linux_sources/<key>/. A node reuses its tree across runs and only re-populates when a submodule is actually bumped; distinct keys never clobber each other (so overlapping runs on different SHAs are safe).
- Arbitrate with flock on the shared fs: the first build job on a node populates; the rest block then reuse. GH 'concurrency' is cross-node and can't serialize same-node jobs, and this also removes the cp/rsync --delete race the old single shared dir hit (observed as rsync exit 24).
- Publish atomically (populate into <key>.tmp, then mv + .ready stamp) so a partial tree is never consumed.
Node-local bare clone is kept for fast file:// submodule fetches. Best-effort GC of keyed trees untouched for 14d.
With this, a build job is self-sufficient on whichever node it lands on, so the rehosting-arc nodeSelector can be dropped (task 3) to spread across both nodes.
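The populate-or-reuse arbitration described above can be sketched as follows. This is a simplified illustration, not the actual workflow step: the function name `ensure_sources`, the `<key>.lock` file name, and the convention of passing the populate command as trailing arguments are all assumptions, and the bare clone and 14-day GC are left out. It relies on the util-linux `flock` command.

```shell
#!/bin/sh
# Hypothetical sketch. Usage: ensure_sources ROOT KEY POPULATE_CMD...
# POPULATE_CMD is invoked with the temporary directory as its last argument.
ensure_sources() {
  local root="$1" key="$2"
  shift 2
  local final="$root/$key" tmp="$root/$key.tmp"
  mkdir -p "$root"
  (
    # Block until this job holds the per-key lock (held on fd 9 until the
    # subshell exits), so only one job on the node populates at a time.
    flock 9
    if [ -f "$final/.ready" ]; then
      echo "Reusing cached $key"
      exit 0
    fi
    echo "Populating $key"
    rm -rf "$tmp" "$final"   # clear leftovers from an interrupted run
    mkdir -p "$tmp"
    "$@" "$tmp" || exit 1    # caller-supplied populate command
    mv "$tmp" "$final"       # rename within one filesystem is atomic
    touch "$final/.ready"    # stamp last: no stamp means not consumable
  ) 9>"$root/$key.lock"
}
```

Jobs that arrive while the first one is populating wait on `flock`, then see the `.ready` stamp and take the reuse branch. Because the lock file is per key, runs pinned to different submodule SHAs do not block each other.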
This PR makes three related changes to `.github/workflows/build.yml`.

1. Build→aggregate handoff via workflow artifacts

The kernel pipeline (`build` matrix → `aggregate`) passed per-target tarballs through a shared hostPath (`/home/runner/_shared/runs/$GITHUB_RUN_ID/build-output`): `build` wrote them, `aggregate` read them back. That only works while all `rehosting-arc` runners are pinned to one node.

Now:
- `build` uploads `kernels-<target>` via `actions/upload-artifact`
- `aggregate` pulls them with `actions/download-artifact` (`merge-multiple`) into a workspace-local `build-output/`, then combines/releases exactly as before.

The obsolete per-run `runs/` cleanup is removed (artifacts expire on retention; the workspace dir is ephemeral). Combine/merge/release logic is unchanged.

Why: removes the last cross-node dependency in rehosting CI, so the `rehosting-arc` scale set can drop its single-node `nodeSelector` and spread across both cluster nodes (downstream kube change).

2. Fix buildkit `/proc/acpi` regression (unblocks all builds)

The buildx setup pinned `image=moby/buildkit:master`. A recent master regressed runc on the kernel-5.4 self-hosted runners ("can't mask dir /proc/acpi ... MS_RDONLY ... invalid argument"), failing the first `RUN` of every image build, which is why the build matrix went all-red after 2026-05-17 with no workflow change of its own. Dropping the `image=` pin lets buildx use its default pinned-stable buildkit (runc v1.3.3). `network=host` and the registry config are kept. Mirrors rehosting/penguin `c35bedc5` (same fix also applied in rehosting/embedded-toolchains).

3. Auto-versioned kernel releases (like rehosting/penguin)
- Releases now cut on merges to main and `dev_*` tags (was: manual `v*` tag pushes).
- Version is computed by `reecetech/version-increment` (`use_api`) and the release is tagged `vX.Y.Z`; `dev_*` tags publish as prereleases.
- `v*` push trigger is dropped: the release step now creates `v*` tags, so a `v*` trigger would re-fire the workflow indefinitely.
- `workflow_dispatch` still runs `build` + `aggregate` (to validate the pipeline) but does not publish a release.
- `concurrency` added so concurrent main merges can't race the version bump onto the same tag.

Test plan
- `workflow_dispatch` run: `aggregate` downloads all `kernels-*` artifacts, combines, and produces `kernels-latest.tar.gz` + `kernel-devel-all.tar.gz`.
- `merge-multiple` download → combine/osi.config-merge.
- `nodeSelector` drop → runners spread to both nodes.

4. Slim kernel-devel-all (~75–85% smaller)
The per-target kernel-devel tree shipped the full source/build trees: `tools/` (88 MB x86_64 / 267 MB arm64), prebuilt `.o` (183 MB), boot images/vmlinux (166 MB), `.cmd` (23 MB), none of which an out-of-tree module build reads. Pruned in `_in_container_build.sh` to the modules_prepare result (Makefile/.config/Module.symvers, include/ incl. generated, arch/<arch>/include + Makefiles, scripts/ host tools). Keeps `tools/objtool` and `arch/powerpc/lib/crtsavres.o` (igloo_driver links it on ppc). Also fixes a latent bug where the x86_64 `arch/${short_arch}/boot` removal was a no-op (real dir is `arch/x86`), so ~120 MB of bzImage/vmlinux shipped.

Measured (run 27795353512): x86_64 devel 244 MB → 25 MB compressed (605 → 111 MB uncompressed), including the kernel `.c` strip; powerpc → 14 MB / 61 MB.

Verified: all 13 builds green; igloo_driver modules build green against the slimmed trees (incl. `.c` removal) for x86_64 (4.10 + 6.13) and powerpc (6.13, exercises the crtsavres.o link). Confirmed `crtsavres.o` is preserved where it exists and was already absent (not removed by this change) for powerpc64/powerpc-4.10.

5. Node-agnostic kernel sources (per-node lazy cache)
`prebuild` used to populate the shared `/home/runner/_shared/linux_sources` hostPath and every `build` job mounted it, so build had to land on prebuild's node (the reason `rehosting-arc` was nodeSelector-pinned). Moved source prep into a per-node "Ensure kernel sources on this node" step in each build job: SHA-keyed under `linux_sources/<key>/`, flock-arbitrated (first job on a node populates, the rest reuse), atomic publish + `.ready` stamp, node-local bare clone for fast `file://` submodule fetches, 14-day GC. `prebuild` is now matrix-discovery only.

Verified (run 27828830325): all 13 builds green; logs confirm exactly one job logged `Populating` and the other 12 `Reusing cached` for the same key on the same node, so the flock arbitration works. This removes the last node-coupling, so dropping the nodeSelector (task 3) is now safe.

Validation status
kernels-*, combined, produced both tarballs; publish correctly skipped (run 27762128012).