fix(moq-video): probe NVIDIA driver libs before NVENC init (avoid abort on GPU-less box)#1844

Closed
kixelated wants to merge 1 commit into dev from claude/moq-video-dev-audit-o7687o


@kixelated

Collaborator

What

Reviewing the just-merged #1840 surfaced a pre-existing robustness bug that my NVENC H.265 change widened: Kind::Auto aborts the process on a box without the NVIDIA driver instead of falling back to software.

Root cause (verified empirically on a GPU-less Linux box)

  • cudarc (fallback-dynamic-loading) and nvidia-video-codec-sdk (dynamic-loading) resolve their entry points via dlopen and panic! when the library is missing, rather than returning an error. cudarc's culib() calls panic_no_lib_found(...).
  • The workspace builds with panic = "abort" (both [profile.dev] and [profile.release]).
  • So Nvenc::open → CudaContext::new(0) panics → process aborts, before backend::open can fall through to openh264. The .map_err(...)? never runs.

This defeats the crate's central packaging goal ("one portable binary reaches the GPU at runtime, falls back to software where the driver is absent"). It's reachable via the default publish_capture path (Options::kind defaults to Auto). #1840 made it newly reachable for H.265, since NVENC now advertises that codec.

Proof: a temporary Auto+H264 test aborted with a cudarc panic backtrace at CudaContext::new; this box has no libcuda in ldconfig.

Fix

Nvenc::open now dlopen-probes libcuda / libnvidia-encode up front (via libloading, already in the tree through cudarc) and returns a clean Err if either is absent, so the fallback chain proceeds to openh264. A driver-present-but-GPU-absent box still fails through the normal CUresult path, which was already handled.
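A minimal sketch of the probing idea, not the code from this PR: the PR uses libloading, while this version declares `dlopen`/`dlclose` by hand so it builds with only the standard library on a Unix target. The names `lib_present` and `probe_driver_libs` are hypothetical, and the `unsafe extern` syntax assumes a reasonably recent Rust toolchain.

```rust
use std::ffi::{c_char, c_int, c_void, CString};

// Declared by hand so the sketch needs no external crates. On Unix these
// symbols are provided by the C library that Rust's std already links.
unsafe extern "C" {
    fn dlopen(filename: *const c_char, flags: c_int) -> *mut c_void;
    fn dlclose(handle: *mut c_void) -> c_int;
}

// RTLD_NOW has the value 2 on Linux and macOS.
const RTLD_NOW: c_int = 2;

/// Returns true if the dynamic loader can open `name`.
/// A missing library yields a null handle, never a panic.
fn lib_present(name: &str) -> bool {
    let Ok(c_name) = CString::new(name) else {
        return false; // name contained an interior NUL byte
    };
    // SAFETY: `c_name` is a valid NUL-terminated string, and a non-null
    // handle is closed immediately after the check.
    unsafe {
        let handle = dlopen(c_name.as_ptr(), RTLD_NOW);
        if handle.is_null() {
            return false;
        }
        dlclose(handle);
        true
    }
}

/// Hypothetical stand-in for the up-front check: report the first library
/// that cannot be loaded as an ordinary error the caller can fall back on.
fn probe_driver_libs(libs: &[&str]) -> Result<(), String> {
    match libs.iter().copied().find(|lib| !lib_present(lib)) {
        Some(missing) => Err(format!("{missing} not found; skipping NVENC")),
        None => Ok(()),
    }
}
```

A caller would run something like `probe_driver_libs(&["libcuda.so.1", "libnvidia-encode.so.1"])` before touching cudarc; those sonames are the usual Linux names and are an assumption here, since the PR text only names the libraries as libcuda and libnvidia-encode.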

Probing presence (not catch_unwind) is the right tool here precisely because panic = "abort" makes unwinding-based recovery impossible.

Test

auto_h264_falls_back_without_driver (Linux + nvenc feature) asserts backend::open with Auto returns a backend rather than aborting. It holds on a GPU box (NVENC opens) and a driverless box (openh264 fallback) alike, guarding the panic regression. Full cargo test -p moq-video --features nvenc: 14 pass.
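The property this test guards can be modeled in a few lines. This is a simplified sketch, not the crate's API: `Backend`, `open_nvenc`, `open_openh264`, and `open_auto` are hypothetical names, and the `driver_present` flag stands in for the real library probe.

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Nvenc,
    OpenH264,
}

// Hypothetical hardware opener: a missing driver is reported as Err
// instead of panicking, which is what lets the chain continue.
fn open_nvenc(driver_present: bool) -> Result<Backend, String> {
    if !driver_present {
        return Err("NVIDIA driver libraries not found".to_string());
    }
    Ok(Backend::Nvenc)
}

// Hypothetical software opener, assumed to be always available.
fn open_openh264() -> Result<Backend, String> {
    Ok(Backend::OpenH264)
}

// Auto selection: try hardware first, fall back to software on any error.
fn open_auto(driver_present: bool) -> Result<Backend, String> {
    open_nvenc(driver_present).or_else(|_err| open_openh264())
}
```

Under this model `open_auto` returns some backend whether or not the driver is present, which is the same "returns a backend rather than aborting" condition the regression test asserts.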

Related (not fixed here)

The VAAPI backend almost certainly has the identical latent bug: cros-libva dlopens libva and likely panics on a miss, so Auto+--features vaapi on a libva-less box would also abort. I couldn't build/verify it in this environment (no libva headers for cros-libva's bindgen), so I left it out rather than fix blind. Tracked as a follow-up in #1837. (Written by Claude)

🤖 Generated with Claude Code



`Kind::Auto` (the default used by `publish_capture`) aborted the process
on a box without the NVIDIA driver instead of falling back to software.
cudarc and nvidia-video-codec-sdk resolve their entry points via dlopen
and `panic!` when the library is missing rather than returning an error,
and the workspace builds with `panic = "abort"`, so `CudaContext::new`
took the whole process down before `backend::open` could try openh264.
This defeated the crate's "one portable binary falls back to software on
a GPU-less machine" goal, and was newly reachable for H.265 once NVENC
started advertising it.

`Nvenc::open` now dlopen-probes libcuda and libnvidia-encode first and
returns a clean error if either is absent, so the fallback chain proceeds
to openh264. A driver-present-but-GPU-absent box still fails through the
normal `CUresult` path, which was already handled. Add a regression test
(`auto_h264_falls_back_without_driver`) that asserts `Auto` returns a
backend rather than aborting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KVFm4YtH5u71sZzW6uaZC5

Collaborator Author

Closing as superseded by #1819 ("make hardware encoders always-on"), which landed on dev after this PR opened and independently includes the same fix: a driver_libs_present() probe of libcuda / libnvidia-encode before NVENC init (same panic-under-panic = "abort" rationale) plus a missing_driver_errors_instead_of_panicking regression test. The driver-probe is already on dev, so there's nothing left here to merge.

Note: #1819 made the Linux hardware encoders always-on (dropped the nvenc/vaapi feature gates), so the VAAPI equivalent of this bug is now reachable on every libva-less box too. Tracked in #1837. (Written by Claude)



@kixelated kixelated closed this Jun 21, 2026
auto-merge was automatically disabled June 21, 2026 03:12

Pull request was closed

@kixelated kixelated deleted the claude/moq-video-dev-audit-o7687o branch June 26, 2026 21:57