
perf(engine): bump worker count cap for high-core hosts (hf#732 PR 1/5)#756

Merged
vanceingalls merged 1 commit into main from vai/677-1-worker-cap on May 13, 2026
Conversation

Collaborator

@vanceingalls vanceingalls commented May 12, 2026

Summary

PR 1 of 5 in the hf#732 decomposition stack. Bumps parallelCoordinator's worker-count caps so high-core hosts can actually surface their hardware to renders:

  • ABSOLUTE_MAX_WORKERS: 10 → 24 (explicit --workers 16 now surfaces 16 DOM sessions instead of being silently clamped).
  • DEFAULT_SAFE_MAX_WORKERS constant → defaultSafeMaxWorkers() function returning max(6, min(16, floor(cpus/8))). On <=32-core hosts: unchanged (still 6). On 64/96/128-core hosts: 8/12/16.

No behavior change for typical hosts. Required prerequisite for the hybrid shader-transition path landed in PR 4.
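A condensed sketch of the two changed values as described above (the surrounding parallelCoordinator module is not shown; the `cpuCount` parameter is added here only so the formula is easy to exercise, whereas per the PR the real function reads `cpus().length` directly):

```typescript
import { cpus } from "node:os";

// Ceiling for explicit --workers N requests (was 10).
const ABSOLUTE_MAX_WORKERS = 24;

// Auto-mode default: 6 on typical hosts, scaling to 16 at 128+ cores.
// The /8 divisor leaves headroom for each worker's SwiftShader
// compositor and the shader-blend thread pool.
function defaultSafeMaxWorkers(cpuCount: number = cpus().length): number {
  return Math.max(6, Math.min(16, Math.floor(cpuCount / 8)));
}
```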

Test plan

  • Existing 7 parallelCoordinator tests pass
  • Engine typecheck clean
  • oxlint clean

Stack

This is the base of the hf#732 decomposition stack:

  1. PR 1 (this) — perf(engine): worker-count cap bump
  2. PR 2 — feat(producer): add pngDecodeBlitWorkerPool
  3. PR 3 — feat(producer): add shaderTransitionWorkerPool
  4. PR 4 — perf(producer): hybrid layered/parallel path (the 2.22× speedup)
  5. PR 5 — perf(producer): pipeline capture and shader-blend per-frame

Replaces the closed hf#732. See that issue for the original investigation; the architectural mismatch with #733's captureHdrStage extraction made a clean rebase impossible.

— Vai


Relates to #677

Collaborator

@jrusso1020 jrusso1020 left a comment


Verdict: APPROVE — clean worker-cap bump

Per Rule 5: required CI all green at 327a47b1 — Lint, Build, Test, Typecheck, regression, all regression-shards (including hdr), preview-regression, Smoke: global install, player-perf, CodeQL. Windows tests + render failed but those are pre-existing red on main (confirmed across the stack); not introduced by this PR.

Originally skipped this one per Vance's ❌ signal in Slack — re-reviewing now at Miguel's request.

Math checks out

defaultSafeMaxWorkers() = Math.max(6, Math.min(16, Math.floor(cpus().length / 8)))

Worked through the table from the PR description:

| cpus() | floor(/8) | min(16, …) | max(6, …) |
|---:|---:|---:|---:|
| 4 | 0 | 0 | 6 |
| 8 | 1 | 1 | 6 |
| 32 | 4 | 4 | 6 |
| 48 | 6 | 6 | 6 |
| 64 | 8 | 8 | 8 |
| 96 | 12 | 12 | 12 |
| 128 | 16 | 16 | 16 |
| 192 | 24 | 16 | 16 |

Boundary at 48 cores (still 6) is the right place to keep typical dev/CI hosts unchanged. The 64+ tier scales linearly until the 16-worker ceiling. PR description's "<=32-core hosts: unchanged (still 6). On 64/96/128-core hosts: 8/12/16" matches.

The /8 divisor rationale is in the docstring: "leaves headroom for each Chrome worker's SwiftShader compositor + the shader-blend thread pool" — empirically validated by the 2.22× speedup PRs (#759/#760) downstream.

ABSOLUTE_MAX_WORKERS = 24

Explicit --workers N now surfaces up to 24 (was clamped at 10). Above 24 the cap intentionally clamps — docstring at :65-69 documents the reason ("CDP-protocol dispatch through Node's main event loop and OS scheduling noise overwhelms any further parallelism"). Reasonable upper bound.

Observations (non-blocking)

  1. No new test for the function — the change from const DEFAULT_SAFE_MAX_WORKERS = 6 to defaultSafeMaxWorkers() switches from a value to a function but doesn't add a test pinning the formula at boundary cores (8, 48, 64, 96, 128, 192). Existing 7 parallelCoordinator tests pass per the PR description — they pin the call-site behavior but not the underlying formula. Worth a small follow-up like:

    // assumes vi.mock("node:os") at the top of the test file
    it("clamps default workers to [6, 16] based on cpu count", () => {
      const cases: Array<[cores: number, expected: number]> = [[32, 6], [64, 8], [128, 16], [192, 16]];
      for (const [cores, expected] of cases) {
        vi.mocked(cpus).mockReturnValue(new Array(cores).fill({}) as ReturnType<typeof cpus>);
        expect(defaultSafeMaxWorkers()).toBe(expected);
      }
    });

    Not gating — boundary math is simple enough to read off, and the downstream stack tests it integratively.

  2. cpus() syscall on each calculateOptimalWorkers call. cpus() returns an array of CPU info objects; on Linux it reads /proc/cpuinfo. The cost is minor in absolute terms, and it's called per render-start, not per frame. Caching the count at module load would be tidier:

    const CPU_COUNT = cpus().length;
    function defaultSafeMaxWorkers(): number { return Math.max(6, Math.min(16, Math.floor(CPU_COUNT / 8))); }

    But this loses the (admittedly hypothetical) ability to respond to hot-plug CPU changes during a long-running process. Negligible either way.

Praise

  • The function-form rationale (high-core hosts surfacing their hardware to auto concurrency) is exactly the right framing — operators shouldn't need to know the magic --workers N value for their hardware.
  • Inline comments explain WHY (CDP dispatch ceiling, SwiftShader headroom) not just WHAT. Future-me will understand the constants without git-blame archaeology.
  • Foundation PR is correctly small and self-contained — the rest of the stack builds on this without polluting the diff.

Review by Rames Jusso (pr-review)

Collaborator

@miguel-heygen miguel-heygen left a comment


Clean, well-scoped change. Reviewed the logic and CI state:

Correctness: The defaultSafeMaxWorkers() math holds — max(6, min(16, floor(cpus/8))) returns 6 for ≤48 cores (no behavior change for typical hosts), scales linearly to 16 at 128 cores, and caps there. The /8 divisor leaves ample headroom for SwiftShader + shader-blend threads per worker. ABSOLUTE_MAX_WORKERS bump to 24 only applies to explicit --workers N requests, so auto-mode users are unaffected.

Runtime safety: cpus().length is evaluated per-call (not at module load), so hot-plug or container cgroup changes are handled correctly. The os.cpus import was already present; no new dependency.

Tests: The existing 7 parallelCoordinator tests pass. The two calculateOptimalWorkers tests exercise both auto-mode and explicit-request paths. A minor nice-to-have would be a test that asserts calculateOptimalWorkers(1000, undefined, { concurrency: "auto" }) returns the expected value for the CI host's core count (or a mocked os.cpus), but that's not blocking — the function is pure math and the formula is trivially verifiable.

CI failures (file size check, windows render/tests) are all pre-existing and unrelated to this change. Required checks (build, lint, typecheck, test, CLI smoke, regressions) all pass.

LGTM. Good base for the hf#732 stack.

— Magi

Follow-up to hf#732. The hybrid layered/shader-blend path saturates
well past 6 DOM workers on multi-core hosts, but the engine's
`calculateOptimalWorkers` clamped both the explicit `--workers N`
request and the `auto` default to a hardcoded ceiling of 10 / 6
respectively. On the 96-core validation host, w=8, w=12, w=16 all
collapsed to 6-10 internally, masking whether the wall is
algorithmically bound or just cpu-starved by the cap.

Two changes:

* `ABSOLUTE_MAX_WORKERS`: 10 -> 24. Explicit `--workers 16` now
  surfaces 16 DOM sessions instead of being silently truncated to 10.
  The new ceiling is still finite because CDP-protocol dispatch
  serializes through Node's main event loop; past ~24 we expect noise
  to dominate signal.

* `DEFAULT_SAFE_MAX_WORKERS` (a constant) -> `defaultSafeMaxWorkers()`
  returning `Math.max(6, Math.min(16, Math.floor(cpuCount / 8)))`.
  On 8/16/32-core: 6 (unchanged). On 64-core: 8. On 96-core: 12. On
  128-core: 16. The /8 divisor leaves headroom for each Chrome
  worker's SwiftShader compositor + the in-process shader-blend
  `worker_threads` pool, both CPU-heavy.

No behavior change for typical hosts (<=32 cores). Unlocks high-core
hosts to consume more parallelism, which is a prerequisite for the
hybrid shader-transition path stacked on top of this PR.
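The interaction of the two caps can be sketched as follows (`resolveWorkerCount` is a hypothetical name for illustration; the real call site is `calculateOptimalWorkers`, whose full signature is not shown in this commit message):

```typescript
const ABSOLUTE_MAX_WORKERS = 24; // ceiling for explicit --workers N (was 10)

// Hypothetical condensation of the clamping described above.
function resolveWorkerCount(
  requested: number | "auto",
  cpuCount: number,
): number {
  const safeDefault = Math.max(6, Math.min(16, Math.floor(cpuCount / 8)));
  if (requested === "auto") return safeDefault; // auto mode: hardware-scaled default
  return Math.min(requested, ABSOLUTE_MAX_WORKERS); // explicit: clamps at 24, not 10
}
```

On the 96-core validation host this gives `resolveWorkerCount("auto", 96) === 12` and lets an explicit `--workers 16` through uncapped.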

Tests: existing 7 parallelCoordinator tests pass unchanged.

PR 1 of 5 in the hf#732 decomposition stack.

-- Vai

Co-Authored-By: Vai <vai@heygen.com>
@vanceingalls vanceingalls force-pushed the vai/677-1-worker-cap branch from 327a47b to f108f1e on May 13, 2026 20:36
@vanceingalls vanceingalls merged commit 57b6858 into main May 13, 2026
39 checks passed
@vanceingalls vanceingalls deleted the vai/677-1-worker-cap branch May 13, 2026 21:04
vanceingalls added a commit that referenced this pull request May 13, 2026
## Summary

PR 2 of 5 in the hf#732 decomposition stack. Adds a `worker_threads`-based pool that offloads PNG decode + alpha-blit onto a fixed-size pool. **No production wiring yet** — the pool stands alone and ships behind a later PR in the stack.

### New files

- `packages/producer/src/services/pngDecodeBlitWorker.ts` — worker entry. Imports from `@hyperframes/engine/alpha-blit` (zero-import TS source, survives the `new Worker(<path>)` loader boundary).
- `packages/producer/src/services/pngDecodeBlitWorkerPool.ts` — fixed-size pool with `run()` API. Uses `transferList` for buffer ownership transfer (no 16bpc HDR buffer copies).
- `packages/producer/src/services/pngDecodeBlitWorkerPool.test.ts` — 6 vitest tests pinning byte-equivalence with inline path, transferList correctness, concurrent dispatch, termination semantics. All pass.

### Build wiring

- `packages/cli/tsup.config.ts`: second tsup entry emits `dist/pngDecodeBlitWorker.js` next to `dist/cli.js`. Without this entry the pool's `new Worker(<path>)` would fail at runtime in the shipped CLI.
- `packages/producer/build.mjs`: third esbuild entry mirrors the wiring for direct producer consumers.
- `packages/engine/package.json`: adds `./alpha-blit` subpath export pointing at `src/utils/alphaBlit.ts`.
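The zero-copy handoff the pool's `run()` relies on can be illustrated in isolation (hypothetical message shape, not the pool's actual API):

```typescript
import { MessageChannel } from "node:worker_threads";

// Transferring an ArrayBuffer moves ownership instead of copying,
// which is what avoids duplicating 16bpc HDR frame buffers.
const frame = new Uint16Array(854 * 480 * 3); // rgb48le-sized payload
const { port1, port2 } = new MessageChannel();
port1.postMessage({ frame: frame.buffer }, [frame.buffer]);
// The sender's buffer is detached after the transfer:
console.log(frame.buffer.byteLength); // 0
port1.close();
port2.close();
```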

## Stack

Stacked on top of #756 (PR 1: worker-count cap). No behavior change in any render.

## Test plan

- [x] 6 pool tests pass
- [x] Producer + engine typecheck clean
- [x] oxlint clean

— Vai
vanceingalls added a commit that referenced this pull request May 14, 2026
EXPERIMENTAL spike. Adds a Node-side WebGPU compositor backed by the
`webgpu` npm package (Dawn) and wires it into the existing
shader-transition worker as an opt-in alternative to the CPU blend.
Default OFF — set `HF_DAWN_WEBGPU=1` or pass `--gpu-shader-blend` to the
CLI to engage.

The hf#677 chain (PRs #756-#760) gets us 1.95x on Mac via a CPU shader
worker pool, ring-buffered pipelining, and a hybrid layered/parallel
capture path. The remaining ceiling is the per-pixel JS blend itself —
scalar f64 math in v8. On any host with a real GPU (Mac/Metal,
Linux/Vulkan, Windows/D3D), moving the blend to the GPU via Dawn
should drop blend wall time on a 854x480 rgb48le frame from
~150-910 ms (depending on shader complexity) to a few ms, projecting
3-5x end-to-end on top of the existing cascade — see the
`reference_5x_shader_perf_alternatives.md` memo (option B).

beginFrame is structurally Mac-unavailable
(crbug.com/40656275); native WebGPU is the next-best Mac-viable lever.

The Dawn binding (`webgpu@0.4.0`, dawn-gpu/node-webgpu, maintained by
Dawn upstream — Corentin Wallez, Kai Ninomiya) ships a `darwin-universal`
prebuilt that targets Apple Metal directly, plus Linux x64/arm64
Vulkan and Windows x64 D3D. Verified locally that the package loads on
Linux x64; adapter init returns null in the sandbox (no Vulkan driver)
and the worker transparently falls back to the existing CPU path.
Mac numbers must be measured on Vance's laptop — there's no GPU host
in CI.

- `packages/producer/src/services/shaderTransitionGpu.ts` (NEW)
  Node-side compositor: dynamic-imports `webgpu`, requests an adapter
  + device, compiles a WGSL compute shader per supported transition,
  manages per-size GPU resources, dispatches blends, reads back to
  `rgb48le`. Returns structured init failures rather than throwing —
  the caller can always fall back to CPU. `HF_DAWN_FORCE_FAIL=1`
  short-circuits init for testability of the fallback path.
- `packages/producer/src/services/shaderTransitionWorker.ts`
  Augmented: on the first message, if `HF_DAWN_WEBGPU=1`, dynamic-imports
  the GPU module and probes once. On success, supported shaders run on
  the GPU. Unsupported shaders (`glitch`, `domain-warp`, `swirl-vortex`,
  the rest) and any host without a GPU adapter fall through to the
  existing `TRANSITIONS[shader] ?? crossfade` CPU path with zero
  behavior change. Mid-render GPU failure disables the GPU path for the
  rest of the worker's life and falls back — the frame still completes.
- `packages/cli/src/commands/render.ts`
  New `--gpu-shader-blend` boolean flag (default false). When set, the
  CLI exports `HF_DAWN_WEBGPU=1` into `process.env` before any worker
  pool spawns. Env-var plumbing chosen over threading the flag through
  the orchestrator -> stage -> pool -> worker chain because env vars
  cross the `worker_threads` boundary unchanged.
- `packages/producer/build.mjs` + `packages/cli/tsup.config.ts`
  `webgpu` added to the `external` list in both bundlers. The Dawn
  binding ships a 70+ MB native `.dawn.node` binary per platform —
  it must be loaded from the user's node_modules at runtime, not
  inlined into the bundle.
- `packages/producer/package.json` + `packages/cli/package.json`
  `webgpu@^0.4.0` added to `optionalDependencies` on both. Optional
  because (a) it's a native binary that may fail to install on some
  hosts and (b) the feature is opt-in.
- `packages/producer/src/services/shaderTransitionGpu.test.ts` (NEW)
  3 vitest tests: `HF_DAWN_FORCE_FAIL` short-circuit, init never
  throws, and PSNR-vs-CPU >= 50 dB on hosts where the adapter is
  available (auto-skipped on Linux sandbox with a log line).

One representative shader (`crossfade`) is ported to WGSL end-to-end
as proof of correctness. The harness is shader-agnostic; porting more
is a mechanical add to `SHADERS_WGSL` in `shaderTransitionGpu.ts`.
Unsupported shaders transparently fall through to CPU even when the
flag is on.
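For context, the CPU-canonical blend that the crossfade WGSL port must approximate is a per-channel lerp (a sketch, not the repo's actual implementation):

```typescript
// Sketch of a crossfade blend over rgb48le (u16-per-channel) frames:
// out = a*(1 - t) + b*t per channel, computed in f64 and rounded.
function crossfade(a: Uint16Array, b: Uint16Array, t: number): Uint16Array {
  const out = new Uint16Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = Math.round(a[i] * (1 - t) + b[i] * t);
  }
  return out;
}
```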

GPU compute uses f32 + u16 storage; CPU canonical uses f64. Bit-exact
equality is not realistic. The default-off path keeps CI byte-equality
pins intact (the existing fixtures hit the CPU path unchanged). Fixtures
that exercise the GPU path must use a PSNR pin (>= 50 dB documented in
the new test). The fallback CPU path is bit-exact with the
canonical CPU implementation.
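A PSNR pin of the kind described could be computed along these lines (a sketch, not the repo's actual fixture helper; the peak value is 65535 for u16 storage):

```typescript
// PSNR between two 16-bit-per-channel frames; Infinity means bit-exact.
function psnr(a: Uint16Array, b: Uint16Array): number {
  let sumSq = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sumSq += d * d;
  }
  const mse = sumSq / a.length;
  return mse === 0 ? Infinity : 10 * Math.log10((65535 * 65535) / mse);
}
```

A >= 50 dB pin tolerates small f32-vs-f64 rounding differences while still catching real blend regressions.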

Bundled CLI worker smoke (854x480 single crossfade blend, see the
investigation log for the harness):

| config | wall | notes |
|---|---:|---|
| CPU baseline (`HF_DAWN_WEBGPU` unset) | 67 ms | reference |
| `HF_DAWN_WEBGPU=1` (fell back, no GPU) | 70 ms | +3 ms init+probe, then CPU |
| `HF_DAWN_WEBGPU=1` + `HF_DAWN_FORCE_FAIL=1` | 66 ms | fallback verified |

Both fallback runs produced byte-identical output to the CPU baseline
checksum — fallback fidelity confirmed.

**Mac numbers: Vance, please fill in.** Pull the branch, run a
shader-transition fixture both with and without `--gpu-shader-blend`,
and drop the wall-time pair in a comment.

Anywhere the GPU path can't run, the existing CPU path runs unchanged:

- `webgpu` package not installed (optional dep skipped on install) -> CPU
- `webgpu` loads but `requestAdapter()` returns null (no GPU) -> CPU
- WGSL pipeline compile fails -> CPU
- `--gpu-shader-blend` set but shader not in `SHADERS_WGSL` -> CPU
- Mid-render GPU failure -> log + disable GPU + finish on CPU

Failure reason is logged once per worker (not per frame). No crash path.
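The structured-failure contract behind that cascade can be sketched as follows (`GpuInit` and `initGpu` are hypothetical names; only `shaderTransitionGpu.ts` and the `HF_DAWN_FORCE_FAIL` hook come from the PR itself):

```typescript
// GpuInit/initGpu are hypothetical names for illustration.
type GpuInit =
  | { ok: true } // device + compiled pipelines would live here
  | { ok: false; reason: string };

// Init returns a result object instead of throwing, so the worker can
// always fall through to the CPU blend path.
async function initGpu(): Promise<GpuInit> {
  if (process.env.HF_DAWN_FORCE_FAIL === "1") {
    return { ok: false, reason: "forced failure (HF_DAWN_FORCE_FAIL)" };
  }
  const specifier = "webgpu"; // non-literal so type resolution is skipped
  const mod = await import(specifier).catch(() => null); // optional dep
  if (mod === null) {
    return { ok: false, reason: "webgpu package not installed" };
  }
  // Adapter request + WGSL pipeline compilation would follow here;
  // a null adapter (no GPU driver) would also yield { ok: false }.
  return { ok: true };
}
```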

- Port the remaining shaders to WGSL (mechanical: `flashThroughWhite`,
  `chromaticSplit`, `sdfIris`, `glitch`, `lightLeak`, `crossWarpMorph`,
  `whipPan`, `cinematicZoom`, `gravitationalLens`, `rippleWaves`,
  `swirlVortex`, `thermalDistortion`, `domainWarp`, `ridgedBurn`).
- Once Mac wall-time confirms the lever, decide on flip-default-on or
  keep opt-in.
- Update fixture pins for any fixture that needs to exercise the GPU
  path under CI (PSNR-only).
- A persistent device + pipeline cache shared across workers (currently
  per-worker init).

_PR drafted by Vai_

Co-Authored-By: Vai <vai@heygen.com>
