perf(engine): bump worker count cap for high-core hosts (hf#732 PR 1/5) #756
Conversation
jrusso1020
left a comment
Verdict: APPROVE — clean worker-cap bump
Per Rule 5: required CI all green at 327a47b1 — Lint, Build, Test, Typecheck, regression, all regression-shards (including hdr), preview-regression, Smoke: global install, player-perf, CodeQL. Windows tests + render failed but those are pre-existing red on main (confirmed across the stack); not introduced by this PR.
Originally skipped this one per Vance's ❌ signal in Slack — re-reviewing now at Miguel's request.
Math checks out
defaultSafeMaxWorkers() = Math.max(6, Math.min(16, Math.floor(cpus().length / 8)))
Worked through the table from the PR description:
| cpus() | floor/8 | min(16, …) | max(6, …) |
|---|---|---|---|
| 4 | 0 | 0 | 6 |
| 8 | 1 | 1 | 6 |
| 32 | 4 | 4 | 6 |
| 48 | 6 | 6 | 6 |
| 64 | 8 | 8 | 8 |
| 96 | 12 | 12 | 12 |
| 128 | 16 | 16 | 16 |
| 192 | 24 | 16 | 16 |
Boundary at 48 cores (still 6) is the right place to keep typical dev/CI hosts unchanged. The 64+ tier scales linearly until the 16-worker ceiling. PR description's "<=32-core hosts: unchanged (still 6). On 64/96/128-core hosts: 8/12/16" matches.
The /8 divisor rationale is in the docstring: "leaves headroom for each Chrome worker's SwiftShader compositor + the shader-blend thread pool" — empirically validated by the 2.22× speedup PRs (#759/#760) downstream.
ABSOLUTE_MAX_WORKERS = 24
Explicit --workers N now surfaces up to 24 (was clamped at 10). Above 24 the cap intentionally clamps — docstring at :65-69 documents the reason ("CDP-protocol dispatch through Node's main event loop and OS scheduling noise overwhelms any further parallelism"). Reasonable upper bound.
Observations (non-blocking)
- No new test for the function — the change from `const DEFAULT_SAFE_MAX_WORKERS = 6` to `defaultSafeMaxWorkers()` switches from a value to a function but doesn't add a test pinning the formula at boundary core counts (8, 48, 64, 96, 128, 192). The existing 7 `parallelCoordinator` tests pass per the PR description — they pin the call-site behavior but not the underlying formula. Worth a small follow-up like:

  ```ts
  it("clamps default workers to [6, 16] based on cpu count", () => {
    // mock cpus()
    expect(defaultSafeMaxWorkers()).toBe(6);  // 32 cores
    expect(defaultSafeMaxWorkers()).toBe(8);  // 64 cores
    expect(defaultSafeMaxWorkers()).toBe(16); // 128 cores
    expect(defaultSafeMaxWorkers()).toBe(16); // 192 cores (capped)
  });
  ```
Not gating — boundary math is simple enough to read off, and the downstream stack tests it integratively.
- `cpus()` syscall on each `calculateOptimalWorkers` call — `cpus()` returns an array of CPU info objects; on Linux it reads `/proc/cpuinfo`. Minor cost in absolute terms, and it's called per render-start, not per frame. Caching the count at module load would be tidier:

  ```ts
  const CPU_COUNT = cpus().length;

  function defaultSafeMaxWorkers(): number {
    return Math.max(6, Math.min(16, Math.floor(CPU_COUNT / 8)));
  }
  ```

  But this loses the (admittedly hypothetical) ability to respond to hot-plug CPU changes during a long-running process. Negligible either way.
Praise
- The function-form rationale (high-core hosts surfacing their hardware to auto concurrency) is exactly the right framing — operators shouldn't need to know the magic `--workers N` value for their hardware.
- Inline comments explain WHY (CDP dispatch ceiling, SwiftShader headroom), not just WHAT. Future-me will understand the constants without git-blame archaeology.
- Foundation PR is correctly small and self-contained — the rest of the stack builds on this without polluting the diff.
Review by Rames Jusso (pr-review)
miguel-heygen
left a comment
Clean, well-scoped change. Reviewed the logic and CI state:
Correctness: The defaultSafeMaxWorkers() math holds — max(6, min(16, floor(cpus/8))) returns 6 for ≤48 cores (no behavior change for typical hosts), scales linearly to 16 at 128 cores, and caps there. The /8 divisor leaves ample headroom for SwiftShader + shader-blend threads per worker. ABSOLUTE_MAX_WORKERS bump to 24 only applies to explicit --workers N requests, so auto-mode users are unaffected.
Runtime safety: `cpus().length` is evaluated per call (not at module load), so hot-plug CPU or container cgroup changes are handled correctly. The `os.cpus` import was already present — no new dependency.
Tests: The existing 7 parallelCoordinator tests pass. The two calculateOptimalWorkers tests exercise both auto-mode and explicit-request paths. A minor nice-to-have would be a test that asserts calculateOptimalWorkers(1000, undefined, { concurrency: "auto" }) returns the expected value for the CI host's core count (or a mocked os.cpus), but that's not blocking — the function is pure math and the formula is trivially verifiable.
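Something like this would cover it (a sketch only; the import path and the options argument's shape are guesses from the call quoted above, and fully mocking `node:os` assumes the module uses nothing else from it):

```ts
import { expect, it, vi } from "vitest";

// Pretend we're on a 96-core host. vi.mock is hoisted above the import below.
vi.mock("node:os", () => ({
  cpus: () => new Array(96).fill({ model: "mock", speed: 0 }),
}));

import { calculateOptimalWorkers } from "./parallelCoordinator";

it("auto mode derives the worker cap from the core count", () => {
  // floor(96 / 8) = 12, inside the [6, 16] clamp
  expect(calculateOptimalWorkers(1000, undefined, { concurrency: "auto" })).toBe(12);
});
```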
CI failures (file size check, windows render/tests) are all pre-existing and unrelated to this change. Required checks (build, lint, typecheck, test, CLI smoke, regressions) all pass.
LGTM. Good base for the hf#732 stack.
— Magi
Follow-up to hf#732. The hybrid layered/shader-blend path saturates well past 6 DOM workers on multi-core hosts, but the engine's `calculateOptimalWorkers` clamped both the explicit `--workers N` request and the `auto` default to hardcoded ceilings of 10 and 6 respectively. On the 96-core validation host, w=8, w=12, and w=16 all collapsed to 6-10 internally, masking whether the wall is algorithmically bound or just cpu-starved by the cap.

Two changes:

* `ABSOLUTE_MAX_WORKERS`: 10 -> 24. Explicit `--workers 16` now surfaces 16 DOM sessions instead of being silently truncated to 10. The new ceiling is still finite because CDP-protocol dispatch serializes through Node's main event loop; past ~24 we expect noise to dominate signal.
* `DEFAULT_SAFE_MAX_WORKERS` (a constant) -> `defaultSafeMaxWorkers()` returning `Math.max(6, Math.min(16, Math.floor(cpuCount / 8)))`. On 8/16/32-core: 6 (unchanged). On 64-core: 8. On 96-core: 12. On 128-core: 16. The /8 divisor leaves headroom for each Chrome worker's SwiftShader compositor + the in-process shader-blend `worker_threads` pool, both CPU-heavy.

No behavior change for typical hosts (<=32 cores). Unlocks high-core hosts to consume more parallelism, which is a prerequisite for the hybrid shader-transition path stacked on top of this PR.

Tests: existing 7 parallelCoordinator tests pass unchanged.

PR 1 of 5 in the hf#732 decomposition stack.

-- Vai
Co-Authored-By: Vai <vai@heygen.com>
Force-pushed from 327a47b to f108f1e.
## Summary

PR 2 of 5 in the hf#732 decomposition stack. Adds a `worker_threads`-based pool that offloads PNG decode + alpha-blit onto a fixed-size pool. **No production wiring yet** — the pool stands alone and ships behind a later PR in the stack.

### New files

- `packages/producer/src/services/pngDecodeBlitWorker.ts` — worker entry. Imports from `@hyperframes/engine/alpha-blit` (zero-import TS source, survives the `new Worker(<path>)` loader boundary).
- `packages/producer/src/services/pngDecodeBlitWorkerPool.ts` — fixed-size pool with a `run()` API. Uses `transferList` for buffer ownership transfer (no 16bpc HDR buffer copies; see the sketch after this description).
- `packages/producer/src/services/pngDecodeBlitWorkerPool.test.ts` — 6 vitest tests pinning byte-equivalence with the inline path, transferList correctness, concurrent dispatch, and termination semantics. All pass.

### Build wiring

- `packages/cli/tsup.config.ts`: second tsup entry emits `dist/pngDecodeBlitWorker.js` next to `dist/cli.js`. Without this entry the pool's `new Worker(<path>)` would fail at runtime in the shipped CLI.
- `packages/producer/build.mjs`: third esbuild entry mirrors the wiring for direct producer consumers.
- `packages/engine/package.json`: adds an `./alpha-blit` subpath export pointing at `src/utils/alphaBlit.ts`.

## Stack

Stacked on top of #756 (PR 1: worker-count cap). No behavior change in any render.

## Test plan

- [x] 6 pool tests pass
- [x] Producer + engine typecheck clean
- [x] oxlint clean

— Vai
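For readers skimming the stack, a simplified sketch of the fixed-size pool + `transferList` pattern named above. The class and message shapes are illustrative, not the shipped `run()` API, and the real worker entry presumably transfers its result back the same way via `parentPort.postMessage(out, [out])`:

```ts
import { Worker } from "node:worker_threads";

// A task: input buffer in, result buffer out. Ownership of both crosses the
// thread boundary via transferList, so neither is copied.
type Task = {
  buf: ArrayBuffer;
  resolve: (out: ArrayBuffer) => void;
  reject: (err: Error) => void;
};

class FixedWorkerPool {
  private idle: Worker[] = [];
  private queue: Task[] = [];

  constructor(workerPath: string, size: number) {
    for (let i = 0; i < size; i++) this.idle.push(new Worker(workerPath));
  }

  run(buf: ArrayBuffer): Promise<ArrayBuffer> {
    return new Promise((resolve, reject) => {
      this.queue.push({ buf, resolve, reject });
      this.dispatch();
    });
  }

  private dispatch(): void {
    while (this.idle.length && this.queue.length) {
      const worker = this.idle.pop()!;
      const task = this.queue.shift()!;
      const done = (settle: () => void) => {
        worker.removeAllListeners("message");
        worker.removeAllListeners("error");
        this.idle.push(worker); // return the worker before settling
        this.dispatch();
        settle();
      };
      worker.once("message", (out: ArrayBuffer) => done(() => task.resolve(out)));
      worker.once("error", (err: Error) => done(() => task.reject(err)));
      // The transferList argument moves `buf` into the worker with zero
      // copies; that is the key property for 16bpc HDR frames.
      worker.postMessage(task.buf, [task.buf]);
    }
  }

  async terminate(): Promise<void> {
    await Promise.all(this.idle.map((w) => w.terminate()));
  }
}
```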
EXPERIMENTAL spike. Adds a Node-side WebGPU compositor backed by the `webgpu` npm package (Dawn) and wires it into the existing shader-transition worker as an opt-in alternative to the CPU blend. Default OFF — set `HF_DAWN_WEBGPU=1` or pass `--gpu-shader-blend` to the CLI to engage.

### Motivation

The hf#677 chain (PRs #756-#760) gets us 1.95x on Mac via a CPU shader worker pool, ring-buffered pipelining, and a hybrid layered/parallel capture path. The remaining ceiling is the per-pixel JS blend itself — scalar f64 math in v8. On any host with a real GPU (Mac/Metal, Linux/Vulkan, Windows/D3D), moving the blend to the GPU via Dawn should drop blend wall time on an 854x480 rgb48le frame from ~150-910 ms (depending on shader complexity) to a few ms, projecting 3-5x end-to-end on top of the existing cascade — see the `reference_5x_shader_perf_alternatives.md` memo (option B). beginFrame is structurally Mac-unavailable (crbug.com/40656275); native WebGPU is the next-best Mac-viable lever.

The Dawn binding (`webgpu@0.4.0`, dawn-gpu/node-webgpu, maintained by Dawn upstream — Corentin Wallez, Kai Ninomiya) ships a `darwin-universal` prebuilt that targets Apple Metal directly, plus Linux x64/arm64 Vulkan and Windows x64 D3D. Verified locally that the package loads on Linux x64; adapter init returns null in the sandbox (no Vulkan driver) and the worker transparently falls back to the existing CPU path. Mac numbers must be measured on Vance's laptop — there's no GPU host in CI.

### Changes

- `packages/producer/src/services/shaderTransitionGpu.ts` (NEW) — Node-side compositor: dynamic-imports `webgpu`, requests an adapter + device, compiles a WGSL compute shader per supported transition, manages per-size GPU resources, dispatches blends, reads back to `rgb48le`. Returns structured init failures rather than throwing — the caller can always fall back to CPU. `HF_DAWN_FORCE_FAIL=1` short-circuits init for testability of the fallback path.
- `packages/producer/src/services/shaderTransitionWorker.ts` — augmented: on the first message, if `HF_DAWN_WEBGPU=1`, dynamic-imports the GPU module and probes once. On success, supported shaders run on the GPU. Unsupported shaders (`glitch`, `domain-warp`, `swirl-vortex`, the rest) and any host without a GPU adapter fall through to the existing `TRANSITIONS[shader] ?? crossfade` CPU path with zero behavior change. Mid-render GPU failure disables the GPU path for the rest of the worker's life and falls back — the frame still completes. (A sketch of this control flow follows this list.)
- `packages/cli/src/commands/render.ts` — new `--gpu-shader-blend` boolean flag (default false). When set, the CLI exports `HF_DAWN_WEBGPU=1` into `process.env` before any worker pool spawns. Env-var plumbing chosen over threading the flag through the orchestrator -> stage -> pool -> worker chain because env vars cross the `worker_threads` boundary unchanged.
- `packages/producer/build.mjs` + `packages/cli/tsup.config.ts` — `webgpu` added to the `external` list in both bundlers. The Dawn binding ships a 70+ MB native `.dawn.node` binary per platform — it must be loaded from the user's node_modules at runtime, not inlined into the bundle.
- `packages/producer/package.json` + `packages/cli/package.json` — `webgpu@^0.4.0` added to `optionalDependencies` on both. Optional because (a) it's a native binary that may fail to install on some hosts and (b) the feature is opt-in.
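To make the opt-in/fallback behavior concrete, a sketch of the control flow described in the `shaderTransitionWorker.ts` bullet above. The helper names (other than `SHADERS_WGSL`, which the description mentions) are stand-ins, not the module's actual exports, and the `webgpu` package entry point shown is an assumption; only the cascade itself is taken from the description:

```ts
// Stand-ins for the real module's exports (sketch only).
declare const SHADERS_WGSL: Record<string, string>;
declare function blendOnGpu(device: unknown, shader: string, a: Buffer, b: Buffer, t: number): Promise<Buffer>;
declare function blendOnCpu(shader: string, a: Buffer, b: Buffer, t: number): Buffer;

type GpuInit = { ok: true; device: unknown } | { ok: false; reason: string };

let gpu: GpuInit | null = null; // probed once, on the first message
let gpuDisabled = false;        // latched after any mid-render GPU failure

async function initGpuCompositor(): Promise<GpuInit> {
  if (process.env.HF_DAWN_FORCE_FAIL === "1") {
    return { ok: false, reason: "HF_DAWN_FORCE_FAIL=1" };
  }
  try {
    // Entry-point shape is an assumption; requestAdapter/requestDevice are
    // standard WebGPU API. A null adapter means no usable GPU on this host.
    const { create } = await import("webgpu");
    const adapter = await create([]).requestAdapter();
    if (!adapter) return { ok: false, reason: "requestAdapter() returned null" };
    return { ok: true, device: await adapter.requestDevice() };
  } catch (err) {
    return { ok: false, reason: String(err) }; // structured failure, never throws
  }
}

async function blendFrame(shader: string, a: Buffer, b: Buffer, t: number): Promise<Buffer> {
  if (process.env.HF_DAWN_WEBGPU === "1" && !gpuDisabled) {
    gpu ??= await initGpuCompositor();
    if (gpu.ok && shader in SHADERS_WGSL) {
      try {
        return await blendOnGpu(gpu.device, shader, a, b, t);
      } catch (err) {
        console.warn(`GPU blend failed; CPU for the rest of this worker: ${err}`);
        gpuDisabled = true; // logged once per worker, not per frame
      }
    }
  }
  return blendOnCpu(shader, a, b, t); // canonical, bit-exact path
}
```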
### Tests

- `packages/producer/src/services/shaderTransitionGpu.test.ts` (NEW) — 3 vitest tests: `HF_DAWN_FORCE_FAIL` short-circuit, init never throws, and PSNR-vs-CPU >= 50 dB on hosts where the adapter is available (auto-skipped on the Linux sandbox with a log line). One representative shader (`crossfade`) is ported to WGSL end-to-end as proof of correctness. The harness is shader-agnostic; porting more is a mechanical add to `SHADERS_WGSL` in `shaderTransitionGpu.ts`. Unsupported shaders transparently fall through to CPU even when the flag is on.

GPU compute uses f32 + u16 storage; CPU canonical uses f64, so bit-exact equality is not realistic. The default-off path keeps CI byte-equality pins intact (the existing fixtures hit the CPU path unchanged). Fixtures that exercise the GPU path must use a PSNR pin (>= 50 dB, documented in the new test; a sketch of the metric follows at the end of this description). The fallback CPU path is bit-exact with the canonical CPU implementation.

Bundled CLI worker smoke (854x480 single crossfade blend, see the investigation log for the harness):

| config | wall | notes |
|---|---:|---|
| CPU baseline (`HF_DAWN_WEBGPU` unset) | 67 ms | reference |
| `HF_DAWN_WEBGPU=1` (fell back, no GPU) | 70 ms | +3 ms init+probe, then CPU |
| `HF_DAWN_WEBGPU=1` + `HF_DAWN_FORCE_FAIL=1` | 66 ms | fallback verified |

Both fallback runs produced byte-identical output to the CPU baseline checksum — fallback fidelity confirmed. **Mac numbers: Vance, please fill in.** Pull the branch, run a shader-transition fixture both with and without `--gpu-shader-blend`, and drop the wall-time pair in a comment.

### Fallback matrix

Anywhere the GPU path can't run, the existing CPU path runs unchanged:

- `webgpu` package not installed (optional dep skipped on install) -> CPU
- `webgpu` loads but `requestAdapter()` returns null (no GPU) -> CPU
- WGSL pipeline compile fails -> CPU
- `--gpu-shader-blend` set but shader not in `SHADERS_WGSL` -> CPU
- Mid-render GPU failure -> log + disable GPU + finish on CPU

Failure reason is logged once per worker (not per frame). No crash path.

### Next steps

- Port the remaining shaders to WGSL (mechanical: `flashThroughWhite`, `chromaticSplit`, `sdfIris`, `glitch`, `lightLeak`, `crossWarpMorph`, `whipPan`, `cinematicZoom`, `gravitationalLens`, `rippleWaves`, `swirlVortex`, `thermalDistortion`, `domainWarp`, `ridgedBurn`).
- Once Mac wall-time confirms the lever, decide on flip-default-on or keep opt-in.
- Update fixture pins for any fixture that needs to exercise the GPU path under CI (PSNR-only).
- A persistent device + pipeline cache shared across workers (currently per-worker init).

_PR drafted by Vai_
Co-Authored-By: Vai <vai@heygen.com>
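For concreteness, the PSNR pin reduces to standard math over the 16-bit samples. A minimal sketch, assuming both outputs are equal-length `rgb48le` frames viewed as `Uint16Array`:

```ts
// PSNR in dB between two 16-bit-per-channel frames. Peak value is 0xffff.
function psnr16(a: Uint16Array, b: Uint16Array): number {
  let sumSq = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sumSq += d * d;
  }
  const mse = sumSq / a.length;
  if (mse === 0) return Infinity; // bit-exact match
  return 10 * Math.log10((0xffff * 0xffff) / mse);
}

// e.g. expect(psnr16(gpuOut, cpuOut)).toBeGreaterThanOrEqual(50);
```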

Summary
PR 1 of 5 in the hf#732 decomposition stack. Bumps `parallelCoordinator`'s worker-count caps so high-core hosts can actually surface their hardware to renders:

- `ABSOLUTE_MAX_WORKERS`: 10 → 24 (explicit `--workers 16` now surfaces 16 DOM sessions instead of being silently clamped).
- `DEFAULT_SAFE_MAX_WORKERS` constant → `defaultSafeMaxWorkers()` function returning `max(6, min(16, floor(cpus/8)))`. On <=32-core hosts: unchanged (still 6). On 64/96/128-core hosts: 8/12/16.

No behavior change for typical hosts. Required prerequisite for the hybrid shader-transition path landed in PR 4.
Test plan
`parallelCoordinator` tests pass.

Stack

This is the base of the hf#732 decomposition stack:

Replaces the closed hf#732. See that issue for the original investigation; the architectural mismatch with #733's `captureHdrStage` extraction made a clean rebase impossible.

— Vai
Relates to #677