
perf(engine): bump worker count cap for high-core hosts (hf#732 PR 1/5)#756

Merged
vanceingalls merged 1 commit into main from vai/677-1-worker-cap on May 13, 2026
Conversation

Collaborator

@vanceingalls vanceingalls commented May 12, 2026

Summary

PR 1 of 5 in the hf#732 decomposition stack. Bumps parallelCoordinator's worker-count caps so high-core hosts can actually surface their hardware to renders:

  • ABSOLUTE_MAX_WORKERS: 10 → 24 (explicit --workers 16 now surfaces 16 DOM sessions instead of being silently clamped).
  • DEFAULT_SAFE_MAX_WORKERS constant → defaultSafeMaxWorkers() function returning max(6, min(16, floor(cpus/8))). On <=32-core hosts: unchanged (still 6). On 64/96/128-core hosts: 8/12/16.

No behavior change for typical hosts. Required prerequisite for the hybrid shader-transition path landed in PR 4.
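A condensed sketch of the two changed values as described above (the surrounding parallelCoordinator module is not shown; the `cpuCount` parameter is added here only so the formula is easy to exercise, whereas per the PR the real function reads `cpus().length` directly):

```typescript
import { cpus } from "node:os";

// Ceiling for explicit --workers N requests (was 10).
const ABSOLUTE_MAX_WORKERS = 24;

// Auto-mode default: 6 on typical hosts, scaling to 16 at 128+ cores.
// The /8 divisor leaves headroom for each worker's SwiftShader
// compositor and the shader-blend thread pool.
function defaultSafeMaxWorkers(cpuCount: number = cpus().length): number {
  return Math.max(6, Math.min(16, Math.floor(cpuCount / 8)));
}
```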

Test plan

  • Existing 7 parallelCoordinator tests pass
  • Engine typecheck clean
  • oxlint clean

Stack

This is the base of the hf#732 decomposition stack:

  1. PR 1 (this) — perf(engine): worker-count cap bump
  2. PR 2 — feat(producer): add pngDecodeBlitWorkerPool
  3. PR 3 — feat(producer): add shaderTransitionWorkerPool
  4. PR 4 — perf(producer): hybrid layered/parallel path (the 2.22× speedup)
  5. PR 5 — perf(producer): pipeline capture and shader-blend per-frame

Replaces the closed hf#732. See that issue for the original investigation; the architectural mismatch with #733's captureHdrStage extraction made a clean rebase impossible.

— Vai


Relates to #677

Collaborator

@jrusso1020 jrusso1020 left a comment


Verdict: APPROVE — clean worker-cap bump

Per Rule 5: required CI all green at 327a47b1 — Lint, Build, Test, Typecheck, regression, all regression-shards (including hdr), preview-regression, Smoke: global install, player-perf, CodeQL. Windows tests + render failed but those are pre-existing red on main (confirmed across the stack); not introduced by this PR.

Originally skipped this one per Vance's ❌ signal in Slack — re-reviewing now at Miguel's request.

Math checks out

defaultSafeMaxWorkers() = Math.max(6, Math.min(16, Math.floor(cpus().length / 8)))

Worked through the table from the PR description:

| cpus() | floor(/8) | min(16, …) | max(6, …) |
|---:|---:|---:|---:|
| 4 | 0 | 0 | 6 |
| 8 | 1 | 1 | 6 |
| 32 | 4 | 4 | 6 |
| 48 | 6 | 6 | 6 |
| 64 | 8 | 8 | 8 |
| 96 | 12 | 12 | 12 |
| 128 | 16 | 16 | 16 |
| 192 | 24 | 16 | 16 |

Boundary at 48 cores (still 6) is the right place to keep typical dev/CI hosts unchanged. The 64+ tier scales linearly until the 16-worker ceiling. PR description's "<=32-core hosts: unchanged (still 6). On 64/96/128-core hosts: 8/12/16" matches.

The /8 divisor rationale is in the docstring: "leaves headroom for each Chrome worker's SwiftShader compositor + the shader-blend thread pool" — empirically validated by the 2.22× speedup PRs (#759/#760) downstream.

ABSOLUTE_MAX_WORKERS = 24

Explicit --workers N now surfaces up to 24 (was clamped at 10). Above 24 the cap intentionally clamps — docstring at :65-69 documents the reason ("CDP-protocol dispatch through Node's main event loop and OS scheduling noise overwhelms any further parallelism"). Reasonable upper bound.

Observations (non-blocking)

  1. No new test for the function — the change from const DEFAULT_SAFE_MAX_WORKERS = 6 to defaultSafeMaxWorkers() switches from a value to a function but doesn't add a test pinning the formula at boundary cores (8, 48, 64, 96, 128, 192). Existing 7 parallelCoordinator tests pass per the PR description — they pin the call-site behavior but not the underlying formula. Worth a small follow-up like:

    // assumes vi.mock("node:os") at the top of the test file
    it("clamps default workers to [6, 16] based on cpu count", () => {
      const cases: Array<[cores: number, expected: number]> = [[32, 6], [64, 8], [128, 16], [192, 16]];
      for (const [cores, expected] of cases) {
        vi.mocked(cpus).mockReturnValue(new Array(cores).fill({}) as ReturnType<typeof cpus>);
        expect(defaultSafeMaxWorkers()).toBe(expected);
      }
    });

    Not gating — boundary math is simple enough to read off, and the downstream stack tests it integratively.

  2. cpus() syscall on each calculateOptimalWorkers call. cpus() returns an array of CPU info objects; on Linux it reads /proc/cpuinfo. The cost is minor in absolute terms, and it's called per render-start, not per frame. Caching the count at module load would be tidier:

    const CPU_COUNT = cpus().length;
    function defaultSafeMaxWorkers(): number { return Math.max(6, Math.min(16, Math.floor(CPU_COUNT / 8))); }

    But this loses the (admittedly hypothetical) ability to respond to hot-plug CPU changes during a long-running process. Negligible either way.

Praise

  • The function-form rationale (high-core hosts surfacing their hardware to auto concurrency) is exactly the right framing — operators shouldn't need to know the magic --workers N value for their hardware.
  • Inline comments explain WHY (CDP dispatch ceiling, SwiftShader headroom) not just WHAT. Future-me will understand the constants without git-blame archaeology.
  • Foundation PR is correctly small and self-contained — the rest of the stack builds on this without polluting the diff.

Review by Rames Jusso (pr-review)

Collaborator

@miguel-heygen miguel-heygen left a comment


Clean, well-scoped change. Reviewed the logic and CI state:

Correctness: The defaultSafeMaxWorkers() math holds — max(6, min(16, floor(cpus/8))) returns 6 for ≤48 cores (no behavior change for typical hosts), scales linearly to 16 at 128 cores, and caps there. The /8 divisor leaves ample headroom for SwiftShader + shader-blend threads per worker. ABSOLUTE_MAX_WORKERS bump to 24 only applies to explicit --workers N requests, so auto-mode users are unaffected.

Runtime safety: cpus().length is evaluated per-call (not at module load), so hot-plug or container cgroup changes are handled correctly. The os.cpus import was already present; no new dependency.

Tests: The existing 7 parallelCoordinator tests pass. The two calculateOptimalWorkers tests exercise both auto-mode and explicit-request paths. A minor nice-to-have would be a test that asserts calculateOptimalWorkers(1000, undefined, { concurrency: "auto" }) returns the expected value for the CI host's core count (or a mocked os.cpus), but that's not blocking — the function is pure math and the formula is trivially verifiable.

CI failures (file size check, windows render/tests) are all pre-existing and unrelated to this change. Required checks (build, lint, typecheck, test, CLI smoke, regressions) all pass.

LGTM. Good base for the hf#732 stack.

— Magi

Follow-up to hf#732. The hybrid layered/shader-blend path saturates
well past 6 DOM workers on multi-core hosts, but the engine's
`calculateOptimalWorkers` clamped both the explicit `--workers N`
request and the `auto` default to a hardcoded ceiling of 10 / 6
respectively. On the 96-core validation host, w=8, w=12, w=16 all
collapsed to 6-10 internally, masking whether the wall is
algorithmically bound or just cpu-starved by the cap.

Two changes:

* `ABSOLUTE_MAX_WORKERS`: 10 -> 24. Explicit `--workers 16` now
  surfaces 16 DOM sessions instead of being silently truncated to 10.
  The new ceiling is still finite because CDP-protocol dispatch
  serializes through Node's main event loop; past ~24 we expect noise
  to dominate signal.

* `DEFAULT_SAFE_MAX_WORKERS` (a constant) -> `defaultSafeMaxWorkers()`
  returning `Math.max(6, Math.min(16, Math.floor(cpuCount / 8)))`.
  On 8/16/32-core: 6 (unchanged). On 64-core: 8. On 96-core: 12. On
  128-core: 16. The /8 divisor leaves headroom for each Chrome
  worker's SwiftShader compositor + the in-process shader-blend
  `worker_threads` pool, both CPU-heavy.

No behavior change for typical hosts (<=32 cores). Unlocks high-core
hosts to consume more parallelism, which is a prerequisite for the
hybrid shader-transition path stacked on top of this PR.
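The interaction of the two caps can be sketched as follows (`resolveWorkerCount` is a hypothetical name for illustration; the real call site is `calculateOptimalWorkers`, whose full signature is not shown in this commit message):

```typescript
const ABSOLUTE_MAX_WORKERS = 24; // ceiling for explicit --workers N (was 10)

// Hypothetical condensation of the clamping described above.
function resolveWorkerCount(
  requested: number | "auto",
  cpuCount: number,
): number {
  const safeDefault = Math.max(6, Math.min(16, Math.floor(cpuCount / 8)));
  if (requested === "auto") return safeDefault; // auto mode: hardware-scaled default
  return Math.min(requested, ABSOLUTE_MAX_WORKERS); // explicit: clamps at 24, not 10
}
```

On the 96-core validation host this gives `resolveWorkerCount("auto", 96) === 12` and lets an explicit `--workers 16` through uncapped.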

Tests: existing 7 parallelCoordinator tests pass unchanged.

PR 1 of 5 in the hf#732 decomposition stack.

-- Vai

Co-Authored-By: Vai <vai@heygen.com>
@vanceingalls vanceingalls force-pushed the vai/677-1-worker-cap branch from 327a47b to f108f1e on May 13, 2026 20:36
@vanceingalls vanceingalls merged commit 57b6858 into main May 13, 2026
39 checks passed
@vanceingalls vanceingalls deleted the vai/677-1-worker-cap branch May 13, 2026 21:04
vanceingalls added a commit that referenced this pull request May 13, 2026
## Summary

PR 2 of 5 in the hf#732 decomposition stack. Adds a `worker_threads`-based pool that offloads PNG decode + alpha-blit onto a fixed-size pool. **No production wiring yet** — the pool stands alone and ships behind a later PR in the stack.

### New files

- `packages/producer/src/services/pngDecodeBlitWorker.ts` — worker entry. Imports from `@hyperframes/engine/alpha-blit` (zero-import TS source, survives the `new Worker(<path>)` loader boundary).
- `packages/producer/src/services/pngDecodeBlitWorkerPool.ts` — fixed-size pool with `run()` API. Uses `transferList` for buffer ownership transfer (no 16bpc HDR buffer copies).
- `packages/producer/src/services/pngDecodeBlitWorkerPool.test.ts` — 6 vitest tests pinning byte-equivalence with inline path, transferList correctness, concurrent dispatch, termination semantics. All pass.

### Build wiring

- `packages/cli/tsup.config.ts`: second tsup entry emits `dist/pngDecodeBlitWorker.js` next to `dist/cli.js`. Without this entry the pool's `new Worker(<path>)` would fail at runtime in the shipped CLI.
- `packages/producer/build.mjs`: third esbuild entry mirrors the wiring for direct producer consumers.
- `packages/engine/package.json`: adds `./alpha-blit` subpath export pointing at `src/utils/alphaBlit.ts`.
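The zero-copy handoff the pool's `run()` relies on can be illustrated in isolation (hypothetical message shape, not the pool's actual API):

```typescript
import { MessageChannel } from "node:worker_threads";

// Transferring an ArrayBuffer moves ownership instead of copying,
// which is what avoids duplicating 16bpc HDR frame buffers.
const frame = new Uint16Array(854 * 480 * 3); // rgb48le-sized payload
const { port1, port2 } = new MessageChannel();
port1.postMessage({ frame: frame.buffer }, [frame.buffer]);
// The sender's buffer is detached after the transfer:
console.log(frame.buffer.byteLength); // 0
port1.close();
port2.close();
```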

## Stack

Stacked on top of #756 (PR 1: worker-count cap). No behavior change in any render.

## Test plan

- [x] 6 pool tests pass
- [x] Producer + engine typecheck clean
- [x] oxlint clean

— Vai
vanceingalls added a commit that referenced this pull request May 14, 2026
EXPERIMENTAL spike. Adds a Node-side WebGPU compositor backed by the
`webgpu` npm package (Dawn) and wires it into the existing
shader-transition worker as an opt-in alternative to the CPU blend.
Default OFF — set `HF_DAWN_WEBGPU=1` or pass `--gpu-shader-blend` to the
CLI to engage.

The hf#677 chain (PRs #756-#760) gets us 1.95x on Mac via a CPU shader
worker pool, ring-buffered pipelining, and a hybrid layered/parallel
capture path. The remaining ceiling is the per-pixel JS blend itself —
scalar f64 math in v8. On any host with a real GPU (Mac/Metal,
Linux/Vulkan, Windows/D3D), moving the blend to the GPU via Dawn
should drop blend wall time on a 854x480 rgb48le frame from
~150-910 ms (depending on shader complexity) to a few ms, projecting
3-5x end-to-end on top of the existing cascade — see the
`reference_5x_shader_perf_alternatives.md` memo (option B).

beginFrame is structurally Mac-unavailable
(crbug.com/40656275); native WebGPU is the next-best Mac-viable lever.

The Dawn binding (`webgpu@0.4.0`, dawn-gpu/node-webgpu, maintained by
Dawn upstream — Corentin Wallez, Kai Ninomiya) ships a `darwin-universal`
prebuilt that targets Apple Metal directly, plus Linux x64/arm64
Vulkan and Windows x64 D3D. Verified locally that the package loads on
Linux x64; adapter init returns null in the sandbox (no Vulkan driver)
and the worker transparently falls back to the existing CPU path.
Mac numbers must be measured on Vance's laptop — there's no GPU host
in CI.

- `packages/producer/src/services/shaderTransitionGpu.ts` (NEW)
  Node-side compositor: dynamic-imports `webgpu`, requests an adapter
  + device, compiles a WGSL compute shader per supported transition,
  manages per-size GPU resources, dispatches blends, reads back to
  `rgb48le`. Returns structured init failures rather than throwing —
  the caller can always fall back to CPU. `HF_DAWN_FORCE_FAIL=1`
  short-circuits init for testability of the fallback path.
- `packages/producer/src/services/shaderTransitionWorker.ts`
  Augmented: on the first message, if `HF_DAWN_WEBGPU=1`, dynamic-imports
  the GPU module and probes once. On success, supported shaders run on
  the GPU. Unsupported shaders (`glitch`, `domain-warp`, `swirl-vortex`,
  the rest) and any host without a GPU adapter fall through to the
  existing `TRANSITIONS[shader] ?? crossfade` CPU path with zero
  behavior change. Mid-render GPU failure disables the GPU path for the
  rest of the worker's life and falls back — the frame still completes.
- `packages/cli/src/commands/render.ts`
  New `--gpu-shader-blend` boolean flag (default false). When set, the
  CLI exports `HF_DAWN_WEBGPU=1` into `process.env` before any worker
  pool spawns. Env-var plumbing chosen over threading the flag through
  the orchestrator -> stage -> pool -> worker chain because env vars
  cross the `worker_threads` boundary unchanged.
- `packages/producer/build.mjs` + `packages/cli/tsup.config.ts`
  `webgpu` added to the `external` list in both bundlers. The Dawn
  binding ships a 70+ MB native `.dawn.node` binary per platform —
  it must be loaded from the user's node_modules at runtime, not
  inlined into the bundle.
- `packages/producer/package.json` + `packages/cli/package.json`
  `webgpu@^0.4.0` added to `optionalDependencies` on both. Optional
  because (a) it's a native binary that may fail to install on some
  hosts and (b) the feature is opt-in.
- `packages/producer/src/services/shaderTransitionGpu.test.ts` (NEW)
  3 vitest tests: `HF_DAWN_FORCE_FAIL` short-circuit, init never
  throws, and PSNR-vs-CPU >= 50 dB on hosts where the adapter is
  available (auto-skipped on Linux sandbox with a log line).

One representative shader (`crossfade`) is ported to WGSL end-to-end
as proof of correctness. The harness is shader-agnostic; porting more
is a mechanical add to `SHADERS_WGSL` in `shaderTransitionGpu.ts`.
Unsupported shaders transparently fall through to CPU even when the
flag is on.
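For context, the CPU-canonical blend that the crossfade WGSL port must approximate is a per-channel lerp (a sketch, not the repo's actual implementation):

```typescript
// Sketch of a crossfade blend over rgb48le (u16-per-channel) frames:
// out = a*(1 - t) + b*t per channel, computed in f64 and rounded.
function crossfade(a: Uint16Array, b: Uint16Array, t: number): Uint16Array {
  const out = new Uint16Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = Math.round(a[i] * (1 - t) + b[i] * t);
  }
  return out;
}
```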

GPU compute uses f32 + u16 storage; CPU canonical uses f64. Bit-exact
equality is not realistic. The default-off path keeps CI byte-equality
pins intact (the existing fixtures hit the CPU path unchanged). Fixtures
that exercise the GPU path must use a PSNR pin (>= 50 dB documented in
the new test). The fallback CPU path is bit-exact with the
canonical CPU implementation.
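A PSNR pin of the kind described could be computed along these lines (a sketch, not the repo's actual fixture helper; the peak value is 65535 for u16 storage):

```typescript
// PSNR between two 16-bit-per-channel frames; Infinity means bit-exact.
function psnr(a: Uint16Array, b: Uint16Array): number {
  let sumSq = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sumSq += d * d;
  }
  const mse = sumSq / a.length;
  return mse === 0 ? Infinity : 10 * Math.log10((65535 * 65535) / mse);
}
```

A >= 50 dB pin tolerates small f32-vs-f64 rounding differences while still catching real blend regressions.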

Bundled CLI worker smoke (854x480 single crossfade blend, see the
investigation log for the harness):

| config | wall | notes |
|---|---:|---|
| CPU baseline (`HF_DAWN_WEBGPU` unset) | 67 ms | reference |
| `HF_DAWN_WEBGPU=1` (fell back, no GPU) | 70 ms | +3 ms init+probe, then CPU |
| `HF_DAWN_WEBGPU=1` + `HF_DAWN_FORCE_FAIL=1` | 66 ms | fallback verified |

Both fallback runs produced byte-identical output to the CPU baseline
checksum — fallback fidelity confirmed.

**Mac numbers: Vance, please fill in.** Pull the branch, run a
shader-transition fixture both with and without `--gpu-shader-blend`,
and drop the wall-time pair in a comment.

Anywhere the GPU path can't run, the existing CPU path runs unchanged:

- `webgpu` package not installed (optional dep skipped on install) -> CPU
- `webgpu` loads but `requestAdapter()` returns null (no GPU) -> CPU
- WGSL pipeline compile fails -> CPU
- `--gpu-shader-blend` set but shader not in `SHADERS_WGSL` -> CPU
- Mid-render GPU failure -> log + disable GPU + finish on CPU

Failure reason is logged once per worker (not per frame). No crash path.
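The structured-failure contract behind that cascade can be sketched as follows (`GpuInit` and `initGpu` are hypothetical names; only `shaderTransitionGpu.ts` and the `HF_DAWN_FORCE_FAIL` hook come from the PR itself):

```typescript
// GpuInit/initGpu are hypothetical names for illustration.
type GpuInit =
  | { ok: true } // device + compiled pipelines would live here
  | { ok: false; reason: string };

// Init returns a result object instead of throwing, so the worker can
// always fall through to the CPU blend path.
async function initGpu(): Promise<GpuInit> {
  if (process.env.HF_DAWN_FORCE_FAIL === "1") {
    return { ok: false, reason: "forced failure (HF_DAWN_FORCE_FAIL)" };
  }
  const specifier = "webgpu"; // non-literal so type resolution is skipped
  const mod = await import(specifier).catch(() => null); // optional dep
  if (mod === null) {
    return { ok: false, reason: "webgpu package not installed" };
  }
  // Adapter request + WGSL pipeline compilation would follow here;
  // a null adapter (no GPU driver) would also yield { ok: false }.
  return { ok: true };
}
```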

- Port the remaining shaders to WGSL (mechanical: `flashThroughWhite`,
  `chromaticSplit`, `sdfIris`, `glitch`, `lightLeak`, `crossWarpMorph`,
  `whipPan`, `cinematicZoom`, `gravitationalLens`, `rippleWaves`,
  `swirlVortex`, `thermalDistortion`, `domainWarp`, `ridgedBurn`).
- Once Mac wall-time confirms the lever, decide on flip-default-on or
  keep opt-in.
- Update fixture pins for any fixture that needs to exercise the GPU
  path under CI (PSNR-only).
- A persistent device + pipeline cache shared across workers (currently
  per-worker init).

_PR drafted by Vai_

Co-Authored-By: Vai <vai@heygen.com>
