
feat(core): Add NVENC Encoding Support via VA-API#427

Draft
efortin wants to merge 49 commits into elFarto:master from efortin:feat/nvenc-support

Conversation

@efortin efortin commented Apr 2, 2026

TL;DR

Disclaimer: I used to run Windows with a long-lived Ubuntu WSL setup, and when I switched to native Linux at home I didn't want to reintroduce it. Instead of going back to Windows, I decided to fix my Steam Remote Play setup with AI. It works, it's tested, but it carries the energy of 3AM debugging and "just one more fix". Review accordingly.

This PR adds VAEntrypointEncSlice (hardware encoding) to nvidia-vaapi-driver by wrapping NVIDIA's NVENC API. Any application using VA-API for encoding — Steam Remote Play, ffmpeg, GStreamer, OBS, Chromium — can now use NVIDIA hardware encoding on Linux.

For Blackwell GPUs (RTX 50xx) where NVIDIA dropped 32-bit CUDA support, a shared memory bridge delegates encoding to a 64-bit helper daemon. This is the exact scenario that breaks Steam Remote Play for every NVIDIA user on Linux.

What was broken

Steam Remote Play encoding pipeline on NVIDIA Linux:
1. Try NVENC direct → "NVENC - No CUDA support" (32-bit CUDA broken)
2. Try VA-API encode → fails (nvidia-vaapi-driver doesn't support it)
3. Fallback to libx264 software → 20fps, unusable

This has been open for 2+ years. Issue #116 (45+ thumbs up). Affects every NVIDIA GPU user on Linux who wants Steam Remote Play.

What this PR does

1. VA-API encode support (H.264 + HEVC)

Adds VAEntrypointEncSlice for:

  • H.264: Constrained Baseline, Main, High
  • HEVC: Main, Main10 (10-bit)

After this, vainfo shows encode entrypoints alongside the existing decode entrypoints. ffmpeg h264_vaapi and hevc_vaapi work out of the box.

2. Shared memory bridge (when CUDA is unavailable)

On Blackwell GPUs, 32-bit cuInit() fails with error 100. Steam's encoding runs in a 32-bit process (steamui.so).

Solution: a 64-bit helper daemon (nvenc-helper) that does the CUDA/NVENC work. The 32-bit driver communicates via shared memory (for frame pixels) and a Unix socket (for control and bitstream).

Steam 32-bit → vaDeriveImage → writes NV12 directly to shared memory
  → 16-byte signal via Unix socket
    → nvenc-helper 64-bit: cuMemcpy2D from SHM → persistent GPU buffer → NVENC
    ← HEVC/H.264 bitstream via socket (~10-30KB)
  ← VA-API coded buffer filled
← Steam streams to client

The bridge activates only when cuInit() fails. On systems where CUDA works (64-bit, or 32-bit pre-Blackwell), the driver uses NVENC directly — no helper, no overhead.

3. Everything else that was needed

Getting from "vainfo shows EncSlice" to "Steam Remote Play actually works" required fixing a cascade of issues:

  • vaDeriveImage implementation: Steam writes captured frames through derived images, not vaPutImage
  • DRM surface allocation without CUDA: GPU-backed surfaces via kernel DRM ioctls, no CUDA needed
  • NV12 pitch/height alignment: encoder uses 1088 (MB-aligned), surface has 1080 — copy only 1080 lines
  • Periodic IDR keyframes (every 60 frames): Steam sets intra_period=3600 — client can't recover from packet loss
  • IDR on idr_pic_flag from picture params: forward client keyframe requests to NVENC
  • Dead client detection via poll() timeout: helper was blocking forever on dead connections
  • NVIDIA opaque fds vs DMA-BUF fds: cuImportExternalMemory needs nvFd, not drmFd

Test results

71 automated tests across 4 suites, plus manual Steam validation.

Automated C test suite (meson test)

  • test_encode (full encode cycles, leak checks): 11 tests, all PASS
  • test_encode_config (capabilities, error paths, surfaces): 35 tests, all PASS
  • test_ipc_fuzz (IPC protocol robustness/security): 8 tests, all PASS
  • test_gstreamer (GStreamer VA-API encode pipelines): 17 tests, all PASS

GStreamer integration tests

End-to-end encode through gst-launch-1.0 with vaapih264enc / vaapih265enc:

  • H.264 320x240, 30 frames → fakesink: PASS
  • H.264 1080p, 60 frames → mp4 (5MB): PASS
  • H.264 720p CBR 2Mbps, 90 frames: PASS
  • H.264 CBR bitrate accuracy (5.6Mbps vs 5Mbps target, within 30%): PASS
  • H.264 VBR bitrate control (4.0Mbps, target 5Mbps): PASS
  • H.264 256x256 small resolution: PASS
  • H.264 4K, 5 frames: PASS
  • HEVC 320x240, 30 frames → fakesink: PASS
  • HEVC 1080p, 60 frames → mp4: PASS
  • HEVC 4K, 5 frames: PASS
  • H.264 encode → decode round-trip: PASS
  • vaapih264dec still available: PASS
  • vaapih265dec still available: PASS
  • 10 sequential H.264 pipeline restarts: PASS
  • H.264 1080p60, 300 frames sustained: PASS

Security testing

  • AddressSanitizer (ASAN): no buffer overflows, no use-after-free, no double-free in our code; 128-byte leak detected in the original project's list.c (pre-existing, not introduced by this PR)
  • UndefinedBehaviorSanitizer (UBSAN): no integer overflow, no null dereference, no alignment issues
  • IPC fuzz, invalid command (0xFF): PASS — helper rejects cleanly
  • IPC fuzz, zero-size init payload: PASS — rejected, no crash
  • IPC fuzz, truncated message (disconnect mid-transfer): PASS — helper survives
  • IPC fuzz, payload_size=0xFFFFFFFF (4GB malloc bomb): PASS — capped at 64MB, rejected
  • IPC fuzz, encode without init: PASS — rejected with error status
  • IPC fuzz, 50 rapid connect/disconnect cycles: PASS — helper stays alive, no fd leak
  • IPC fuzz, CMD_CLOSE without init: PASS — accepted gracefully
  • IPC fuzz, double init (re-init encoder): PASS — old encoder destroyed, new one created
  • Helper stability after all fuzz tests: 17 fds (stable), 46MB RSS (no growth), 0.2% CPU idle

Manual integration tests

  • vainfo encode entrypoints: PASS — 5 EncSlice profiles
  • H.264 1080p30 (ffmpeg): PASS — High profile, valid output
  • HEVC 1080p30 (ffmpeg): PASS — Main profile, valid output
  • HEVC Main10 10-bit: PASS — yuv420p10le
  • 1440p60 stress (60s): PASS — 3600 frames, no crash
  • Bitrate control (CBR 5Mbps): PASS — within 20% of target
  • NVDEC decode regression: PASS — unchanged
  • GPU encode (nvidia-smi): PASS — 12% encoder util, 159fps
  • Sequential encodes (leak check): PASS — 10 runs, 0 errors
  • 32-bit driver init: PASS — 5 encode, 0 decode entrypoints
  • Steam Remote Play (Mac Steam Link): PASS — VAAPI H264, 60fps, 0% loss
  • Steam Remote Play (Legion Go): PASS — VAAPI HEVC, 60fps
  • nvenc-helper systemd service: PASS — auto-start, auto-restart

Performance optimizations

The shared memory bridge went through several optimization rounds:

  • Baseline (socket transfer): ~8ms; 3MB frame sent over Unix socket per frame
  • Shared memory (memfd): ~6ms; frame data in SHM, only 16-byte signal over socket
  • SHM zero-copy redirect: ~5ms; vaDeriveImage maps directly to SHM, skip memcpy
  • Eliminate redundant memset: ~4ms; only zero 8 padding rows, not entire 3MB buffer
  • Persistent CUDA buffer + cuMemcpy2D: ~3ms; GPU DMA engine handles host→device + pitch in HW
  • CUDA context kept active per session: ~2.8ms; eliminate per-frame cuCtxPushCurrent/PopCurrent

Final pipeline (1080p NV12):

Steam writes NV12 → SHM (zero-copy via vaDeriveImage)
  → 16-byte signal via socket
  → Helper: 2× cuMemcpy2D (host→device, DMA engine) → persistent CUDA buffer
  → NVENC encodes from VRAM (no PCIe upload at encode time)
  → Bitstream back via socket (~10-30KB)

Code hardening

All code reviewed for production reliability:

  • Zero warnings at -Dwarning_level=3, zero cppcheck issues
  • All CUDA/NVENC return values checked (no silent failures)
  • Socket frame_size capped at 64MB (prevents malloc bomb from corrupt data)
  • File descriptors tracked and closed (no fd leaks, verified with /proc/pid/fd)
  • Dead client detection via poll() with 5s timeout
  • Derived image buffer ownership tracked (sentinel prevents double-free)
  • DMA-BUF fds properly closed on partial import failure
  • NVIDIA opaque fds closed in surface destroy
  • Pre-allocated bitstream output buffer (no per-frame malloc)
  • CUDA context kept pushed for entire client session (no per-frame sync)

What Steam actually uses

From streaming logs, Steam's ffmpeg VA-API encode pipeline uses:

  • Sequence params (resolution, bitrate, framerate, GOP): used by Steam; fully mapped to NVENC
  • Picture params (coded_buf, idr_pic_flag): used by Steam; working, IDR forwarded
  • Rate control misc (bits_per_second, target_percentage): used by Steam; applied to NVENC RC
  • Framerate misc: used by Steam; applied
  • HRD misc (buffer_size): used by Steam; applied to NVENC vbvBufferSize
  • Packed headers (SEQ+PIC+SLICE+MISC): used by Steam; accepted (NVENC generates its own, no warning)
  • Quality level: quality=0 (default); VAConfigAttribEncQualityRange reported
  • vaDeriveImage + vaMapBuffer: used by Steam every frame; implemented, zero-copy SHM redirect
  • vaExportSurfaceHandle: not used; implemented but Steam doesn't call it
  • vaPutImage: not used; implemented but Steam uses vaDeriveImage instead

Known limitations

No B-frames

frameIntervalP=1 always. NVENC with enablePTD=1 and B-frames returns NV_ENC_ERR_NEED_MORE_INPUT for reordered frames, producing empty coded buffers. ffmpeg 6.x vaapi_encode asserts on empty coded buffers. With enablePTD=0, NVENC requires full DPB (Decoded Picture Buffer) reference frame management which Intel drivers handle in hardware but NVENC delegates to the caller.

Not a problem for streaming (B-frames add latency). For offline transcoding with B-frames, use h264_nvenc/hevc_nvenc directly.

Packed headers

Driver advertises full packed header support (SEQ+PIC+SLICE+MISC). NVENC generates its own SPS/PPS/VPS headers internally. Application-provided packed headers are accepted and silently skipped.

32-bit encode-only

When the shared memory bridge is active (CUDA unavailable), only encoding works — no hardware decode. Steam only needs encode on the server side, so this is fine.

HDR

VA-API encode specification does not include color metadata fields (colour_primaries, transfer_characteristics) in sequence parameter structs. Intel drivers have the same limitation — HDR metadata only passes through packed headers (which NVENC generates internally). HDR encode requires direct NVENC (hevc_nvenc with -color_primaries bt2020).

How to test

git clone https://github.com/efortin/nvidia-vaapi-driver
cd nvidia-vaapi-driver && git checkout feat/nvenc-support

Then follow the step-by-step guide for your distro.

No environment variables needed — just launch Steam.

Hardware tested

  • GPU: NVIDIA GeForce RTX 5070 Ti (Blackwell, 16GB GDDR7)
  • Driver: 580.126.09 / 580.126.18 (open kernel modules)
  • OS: Ubuntu 24.04 LTS, Fedora 43
  • CUDA: 13.0
  • Steam client: 32-bit (steamui.so)
  • Clients: macOS Steam Link, SteamOS Legion Go

efortin added 22 commits April 2, 2026 22:00
Wrap NVIDIA's NVENC API behind the VA-API encoding interface, enabling
any application using VA-API for encoding (Steam Remote Play, GStreamer,
ffmpeg h264_vaapi/hevc_vaapi) to use NVIDIA hardware encoding on Linux.

Supported profiles: H.264 (Baseline/Main/High), HEVC (Main/Main10).

Uses low-latency P4 preset with no B-frames for synchronous encode,
optimal for game streaming. Gracefully degrades to decode-only if
libnvidia-encode.so is unavailable.
- Fix NVENC session leak on cuCtxPopCurrent failure in nvCreateContext
- Fix coded buffer object leak on bitstreamData allocation failure
- Fix EOS flush ordering: flush encoder before freeing output buffer
- Fix integer overflow in rate control bitrate calculation (uint32 * uint32)
- Fix nvPutImage to respect src/dest offset parameters per VA-API spec
- Add missing bitrate extraction from HEVC sequence parameters
- Remove dead NVENCInputSurface struct and unused macros
- Remove unused drv parameter from nvRenderPictureEncode
- Normalize naming convention: hevc_enc_ -> hevcenc_ to match h264enc_
Add meson cross-file for building a 32-bit (i386) version of the driver,
needed by Steam Remote Play which uses a 32-bit ffmpeg for VA-API encode.

Usage: meson setup build32 --cross-file cross-i386.txt && meson compile -C build32
Install: cp build32/nvidia_drv_video.so /usr/lib/i386-linux-gnu/dri/

Note: 32-bit CUDA (cuInit) fails on driver 580+ with Blackwell GPUs,
blocking the 32-bit encode path until NVIDIA fixes their 32-bit driver.
32-bit CUDA is broken on driver 580+ with Blackwell GPUs (cuInit returns
error 100). This blocks the 32-bit VA-API driver from using NVENC directly.

Add a 64-bit helper daemon (nvenc-helper) that runs as a separate process
where CUDA works. The 32-bit driver detects CUDA failure, enters
encode-only mode, and forwards encode operations to the helper via a
Unix domain socket at $XDG_RUNTIME_DIR/nvenc-helper.sock.

Architecture:
  32-bit steam → 32-bit steamui.so → 32-bit libavcodec → 32-bit libva
    → 32-bit nvidia_drv_video.so (encode-only, no CUDA)
      → Unix socket → 64-bit nvenc-helper
        → 64-bit CUDA + NVENC (works)
      ← encoded bitstream
    ← VA-API coded buffer

The helper uses NVENC's own input buffer management (nvEncCreateInputBuffer
+ nvEncLockInputBuffer) instead of CUDA memory, making the data path:
socket recv → memcpy into NVENC buffer → hardware encode → bitstream back.

The helper auto-starts on first encode and exits after 30s idle.

When CUDA is available (64-bit), the direct NVENC path is used as before
with zero overhead — the IPC path is only activated when cuInit fails.
- Remove 30s idle timeout from accept loop — helper now runs until
  SIGTERM/SIGINT (was causing premature exit before any client connects)
- Always enable logging to stderr for diagnostics
- Continue listening after accept() errors instead of exiting
- Log "Ready for next client" between sessions
- Add multi-path helper discovery in the driver (libexec, local/libexec)
- Try connect to running helper before attempting to start a new one
Guard cuCtxPushCurrent/cuCtxPopCurrent in nvCreateSurfaces2 behind
cudaAvailable check. In encode-only IPC mode, surfaces only need
host-side metadata — no GPU memory allocation required.

This fixes "Failed to create surface: 1 (operation failed)" that
Steam's 32-bit ffmpeg hit when trying to use our encode-only driver.
Steam Remote Play client needs a complete IDR keyframe with SPS/PPS/VPS
headers to start decoding. Without FORCEIDR, the first frame was encoded
as a non-IDR which the client couldn't decode, causing "Didn't get
keyframe" errors and 99% frame loss.

Also add periodic frame count logging to helper for diagnostics.
Without a timeout, the helper blocks forever in recv_all() when a client
dies without sending CMD_CLOSE. This prevents new clients from connecting
since the helper is single-threaded.

Add SO_RCVTIMEO of 5 seconds on client sockets. If no data arrives for
5s, the recv fails, the helper cleans up the encoder and goes back to
accept() for the next client.
Steam captures the desktop via OpenGL and passes GPU-resident NV12
surfaces to the VA-API encoder as DMA-BUF file descriptors through
vaCreateSurfaces attrib_list. The previous IPC path sent empty pixel
data because vaPutImage is never called in this flow.

New architecture:
1. nvCreateSurfaces2: parse attrib_list for VASurfaceAttribMemoryType
   (DRM_PRIME/DRM_PRIME_2) and VASurfaceAttribExternalBufferDescriptor.
   Extract DMA-BUF fd, dup() it, store in NVSurface.
2. nvEndPictureEncodeIPC: if surface has importedDmaBufFd, send it to
   the 64-bit helper via SCM_RIGHTS Unix socket ancillary data.
3. nvenc-helper CMD_ENCODE_DMABUF: receive the fd, import into CUDA
   via cuImportExternalMemory, map to CUdeviceptr, register with NVENC,
   encode, return bitstream. Full GPU zero-copy — no host memory touch.

This is the true GPU-accelerated path: Steam's OpenGL capture → DMA-BUF
→ CUDA import (64-bit helper) → NVENC encode → bitstream back via IPC.
No pixel data crosses the socket, only the fd and encoded output.
The 32-bit driver couldn't provide pixel data to the encoder because
Steam renders captured frames into VA-API surfaces via OpenGL/DMA-BUF,
not through vaPutImage. Without GPU-backed surfaces, the frames were
empty.

Fix: initialize the NVIDIA DRM direct backend even in IPC mode.
The DRM backend allocates GPU memory and exports DMA-BUF fds without
needing CUDA (it uses kernel DRM ioctls). This gives surfaces real
GPU backing that Steam can render into via OpenGL.

Changes:
- direct-export-buf.c: skip CUDA import in alloc_backing_image when
  cudaAvailable is false; skip CUDA calls in findGPUIndexFromFd
- vabackend.c: init DRM backend in IPC mode; realise surfaces before
  encoding; use backing image DMA-BUF fd for IPC encode; guard
  vaExportSurfaceHandle CUDA calls; clean up DRM resources on terminate
- Handle surface destroy with backing images in IPC mode
The DRM backend produces separate DMA-BUF fds per plane (Y, UV) as
tiled GPU textures. NVENC needs a single linear NV12 CUdeviceptr.

Previous approach tried cuImportExternalMemory with a single fd as a
flat buffer → CUDA error 999 (the fd is a tiled texture, not linear).

New approach matches the direct encode path:
1. Send all plane fds via SCM_RIGHTS (up to 4)
2. Helper imports each fd → CUexternalMemory → CUmipmappedArray → CUarray
3. cuMemcpy2D each plane from CUarray to a linear CUdeviceptr
4. Register linear buffer with NVENC, encode, return bitstream
5. Clean up all CUDA resources

This is the same import→copy→encode pipeline as the working 64-bit
direct path, just running in the helper process.
The DRM backend produces two types of fds per allocation:
- nvFd: NVIDIA-specific opaque handle (for CUDA import)
- drmFd: DMA-BUF fd (for DRM/EGL/OpenGL export)

cuImportExternalMemory with CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD
requires the NVIDIA opaque fd (nvFd), not the DMA-BUF fd (drmFd).
Sending drmFd caused CUDA error 999 (unknown).

Fix: store nvFd and memorySize in BackingImage when CUDA is unavailable
(IPC mode). Send dup'd nvFds to the helper for CUDA import. The helper
can now successfully import the GPU memory into its 64-bit CUDA context.
Steam's OpenGL capture pipeline renders into VA-API surfaces BEFORE
calling vaBeginPicture/vaEndPicture. If the surface has no GPU memory
at creation time, the capture renders into nothing → green screen.

Allocate backing images immediately in nvCreateSurfaces2 when in IPC
encode-only mode. This gives surfaces real GPU memory (via DRM ioctls)
that Steam can export via vaExportSurfaceHandle, import into OpenGL
as a render target, and render captured frames into.

The encode path then reads the same GPU memory via CUDA import in
the 64-bit helper.
Steam's ffmpeg calls vaDeriveImage to map VA-API surfaces to CPU memory,
then writes the captured NV12 desktop frame into the mapped buffer.
Without this, the surfaces have no pixel data → green screen.

Implement vaDeriveImage in the IPC (no-CUDA) path:
- Allocate a host-memory buffer on the surface (hostPixelData)
- Return a VAImage backed by this shared buffer
- Steam's vaMapBuffer returns the host pointer
- Steam writes captured frame → host buffer
- nvEndPictureEncodeIPC sends host buffer to helper via IPC
- Helper encodes via NVENC's own input buffer (nvEncLockInputBuffer)

The derived image buffer is marked as non-owning (sentinel offset=-1)
so nvDestroyImage doesn't free the surface's memory.

This completes the pixel data pipeline:
  Steam OpenGL capture → vaDeriveImage → vaMapBuffer → write NV12
  → vaEndPicture → IPC send pixel data → helper encodes → bitstream
vaDeriveImage writes captured pixels to surface->hostPixelData, but
the encode path was checking DMA-BUF first and finding the (empty)
GPU backing image. The GPU surface has no pixel data because Steam
writes via vaDeriveImage to host memory, not to the GPU surface.

Reverse priority: check hostPixelData first (has actual captured
pixels from vaDeriveImage), fall back to DMA-BUF only if no host
data is available.
Steam requests IDR keyframes via idr_pic_flag in picture params when
the client loses sync (packet loss, reconnection). Without forwarding
this flag, the encoder never produces new keyframes after the first
frame, and the client can't recover → "Didn't get keyframe" loop.

- Parse idr_pic_flag from H.264/HEVC picture parameter buffers
- Store as forceIDR flag on NVENCContext
- Pass through IPC protocol (new force_idr field in encode params)
- Helper's encoder_encode uses it for NV_ENC_PIC_FLAG_FORCEIDR
- Also fix the direct 64-bit encode path to respect forceIDR
Steam reuses the same surface for every frame. vaDeriveImage maps the
surface's hostPixelData, Steam writes captured pixels into it, then
vaEndPicture sends the data to the helper. But Steam can start writing
the NEXT frame while the IPC send is still transmitting the current
frame (~3MB @ 1080p) → visual tearing and overlay artifacts.

Fix: memcpy the frame into a snapshot buffer before sending via IPC.
The snapshot is a consistent image that won't be modified during
transmission. Adds ~3MB memcpy per frame (~1ms at DDR5 bandwidth)
which is negligible vs the 7ms encode time.
The encoder is initialized at MB-aligned height (e.g. 1088 for 1080p)
but the surface and vaDeriveImage host buffer contain exactly
surface_height pixels (1080). The helper was copying enc->height
(1088) lines from a buffer with only 1080 → buffer overread causing
horizontal line artifacts across the entire image.

Fix:
- vabackend.c: send surface->width/height to IPC, not nvencCtx dimensions
- nvenc-helper: encoder_encode takes explicit frame_width/frame_height,
  copies only that many lines, zero-pads the MB-aligned remainder.
  Chroma offset calculated from frame_height (actual data position),
  destination chroma at dstPitch * enc->height (encoder's full height).
install.sh handles the full build + install:
- Builds 64-bit driver + nvenc-helper
- Cross-compiles 32-bit driver (if i386 arch enabled)
- Installs both drivers to system dri paths
- Installs nvenc-helper to /usr/libexec
- Creates and enables systemd user service for nvenc-helper
- Verifies installation

No environment variables needed — libva auto-detects the NVIDIA
driver from the DRM device, and NVD_BACKEND defaults to direct.
Steam sets intra_period=3600 (60 seconds between keyframes). When a
single packet is lost, the client requests a new keyframe but has to
wait up to 60 seconds → stream freezes and Steam restarts the encoder.

Force an IDR every 60 frames (~1 second at 60fps) so the client can
recover from packet loss within 1 second. This matches the behavior
of other streaming-optimized encoders (OBS, Moonlight/Sunshine).
Replace the 3MB socket send/recv per frame with shared memory (memfd).
The helper creates a shm region on CMD_INIT and sends the fd to the
client via SCM_RIGHTS. The client mmap's it and writes frames directly.
Only a small CMD_ENCODE_SHM header (16 bytes) goes over the socket.

Before: snapshot memcpy(3MB) + send_all(3MB) + recv_all(3MB) + NVENC copy
After:  memcpy(3MB to shm)  + send(16 bytes) + NVENC copy from shm

Saves ~6ms per frame at 1080p by eliminating 2 full-frame socket
transfers. Falls back to socket path if shm creation fails.
@efortin efortin marked this pull request as draft April 3, 2026 00:48
efortin added 6 commits April 3, 2026 08:05
- tests/encoding-tests.md: 12 test cases covering 64-bit encode,
  32-bit IPC encode, Steam Remote Play, systemd service, decode
  regression, stress test, 10-bit, bitrate control, leak check
- Document B-frame limitation: ffmpeg 6.x vaapi_encode asserts on
  empty coded buffers from NV_ENC_ERR_NEED_MORE_INPUT. Verified by
  testing — enabling B-frames via ip_period>1 causes assertion failure.
  Users needing B-frames should use h264_nvenc/hevc_nvenc directly.
- Improve B-frame documentation in nvenc.c with explanation of why
  and alternative for offline transcoding
The IPC encode helper is only used when cuInit() fails, not based on
process architecture. A 32-bit process on Turing/Ampere/Ada where
cuInit works will use direct NVENC, same as 64-bit.

The decision path:
  cuInit(0) succeeds → cudaAvailable=true → direct NVENC (no IPC)
  cuInit(0) fails    → cudaAvailable=false → IPC helper bridge

Updated comments throughout to say "CUDA unavailable" instead of
"32-bit" to avoid implying the bridge is always used for 32-bit.
Full documentation covering:
- Problem statement (VA-API encode missing, 32-bit CUDA broken)
- Two encode paths: direct NVENC vs shared memory bridge
- Path selection logic (cuInit success/fail, not architecture)
- Data flow diagrams for shared memory frame transfer
- Control protocol (Unix socket commands)
- Surface management in bridge mode
- All edge cases: encoder height padding, IDR recovery, frame
  tearing, dead client detection, object ID growth, B-frame
  limitation, DMA-BUF path
- Supported profiles, installation, debugging
Critical segfault fixes:
- Check cuMemAlloc/cuMemcpy2D returns in DMABUF path (was crashing
  silently on allocation failure)
- Cap frame_size from socket to 64MB max (prevents malloc bomb from
  malicious/corrupt data)
- Use fixed drain buffer instead of malloc(untrusted_size)
- Add NULL check for buf->ptr in nvMapBuffer
- Close shm_fd when shm_fd_out is NULL (fd leak)

Leak fixes:
- Don't send fd=-1 via SCM_RIGHTS (undefined behavior) — use
  send_response() for shm fallback path
- Close unclaimed DMABUF fds on partial import failure
- Close nvFds[] in destroyBackingImage for IPC mode

Correctness:
- Zero NVENC input buffer luma (0) and chroma (128=neutral UV)
  separately instead of blanket memset that could over-zero
- Make IDR interval a #define (NVENC_HELPER_IDR_INTERVAL=60)
- Fix stale "30s idle timeout" comment in helper header
- Reduce hot-path logging (picture params only logged for first 3
  frames to avoid 60fps log flood)

Documentation:
- Add edge case table: 15 potential failure scenarios with behavior
  and mitigation
- Add known non-working scenarios table: 7 unsupported cases with
  reasons
@efortin efortin changed the title Feat/nvenc support feat(core): Add NVENC Encoding Support via VA-API Apr 3, 2026
efortin added 15 commits April 3, 2026 09:21
Before (per frame at 1080p NV12):
  Driver: memcpy 3MB hostPixelData → shmPtr (snapshot)
  Helper: memset 3MB (full buffer clear) + line-by-line memcpy 3MB
  Total: 9MB memory bandwidth per frame, 540MB/s at 60fps

After (per frame at 1080p NV12):
  Driver: zero copy (vaDeriveImage maps directly to SHM)
  Helper: bulk memcpy 3MB (when pitches match) + memset 8 rows only
  Total: 3MB memory bandwidth per frame, 180MB/s at 60fps (3x reduction)

Changes:
- vaDeriveImage: redirect surface hostPixelData to SHM region after
  encoder init. Steam writes directly to shared memory. Zero copy.
- hostPixelIsShm flag: prevents free() on mmap'd SHM pointer
- encoder_encode: fast path when srcPitch == dstPitch (single memcpy
  instead of 1080 individual line copies)
- encoder_encode: only zero MB-alignment padding rows (8 rows for
  1080→1088) instead of clearing entire 3MB buffer every frame
- Skip redundant memcpy in EndPicture when hostPixelData IS shmPtr
Parse H.264/HEVC slice_type from VA-API slice parameter buffers and
map to NVENC picture types (I/P/B/IDR). The picType field is stored
on NVENCContext for each frame.

B-frames remain disabled (frameIntervalP=1, enablePTD=1) because:
1. NVENC with enablePTD=0 requires full DPB reference frame management
   (reference picture lists, reference frame marking) which Intel's
   VA-API driver handles internally with its hardware encoder
2. NVENC with enablePTD=1 handles references but returns
   NV_ENC_ERR_NEED_MORE_INPUT for B-frames → ffmpeg 6.x asserts
3. LOW_LATENCY tuning internally overrides frameIntervalP to 1

The slice type parsing infrastructure is ready for when full DPB
management is implemented. For now, -bf 2 gracefully falls back to
IPP (no crash, no B-frames in output).

Tested: verified enablePTD=0 with explicit picture types — NVENC
encodes all frames as I-only because DPB references aren't managed.
Full DPB management is tracked as a future enhancement.
tests/test_encode.c: 11 self-contained tests covering:
- Entrypoints: H.264 + HEVC VAEntrypointEncSlice present
- Config: RTFormat YUV420, rate control CQP/CBR/VBR
- Lifecycle: create/destroy config, surfaces, context (no leak)
- H.264 encode: High, Main, ConstrainedBaseline (1 frame each)
- HEVC encode: Main profile (1 frame)
- Stress: 10 sequential create/encode/destroy cycles
- Coded buffer reuse: 5 frames with same coded buffer
- Regression: VLD decode entrypoints still present

Build: gcc -o test_encode tests/test_encode.c -lva -lva-drm -lm
Run: ./test_encode [h264|hevc]

Inspired by Intel VA-API driver's GTest-based test suite but
implemented in pure C for compatibility with the project.
Replace nvEncLockInputBuffer (host memory) + line-by-line memcpy with
a persistent CUDA device buffer registered once with NVENC.

Before (per frame):
  nvEncLockInputBuffer → host pointer
  1620× memcpy (1080 luma + 540 chroma lines, pitch conversion)
  nvEncUnlockInputBuffer → DMA upload to GPU
  Total: ~3-4ms (host memcpy + PCIe transfer)

After (per frame):
  2× cuMemcpy2D (luma + chroma, host→device, pitch conversion in HW)
  nvEncMapInputResource (already in VRAM)
  nvEncEncodePicture (reads from VRAM, no PCIe upload)
  nvEncUnmapInputResource
  Total: ~1-2ms (GPU DMA engine handles pitch + transfer)

Benefits:
- Single CUDA call replaces 1080 individual memcpy calls per plane
- GPU DMA engine handles pitch conversion in hardware
- NVENC reads from device memory (no PCIe upload at encode time)
- Persistent buffer avoids per-frame alloc/register/unregister
- Falls back to host path if CUDA alloc or NVENC register fails
…tion

Inspired by Intel's i965 test infrastructure (gtest-based), add a C test
framework with equivalent coverage:

tests/test_common.h:
  - EXPECT_STATUS, EXPECT_TRUE, EXPECT_NOT_NULL macros
  - TestTimer for performance benchmarks
  - test_has_entrypoint() helper for parametrized profile testing
  - Global VA display setup/teardown

tests/test_encode_config.c (34 tests):
  - Encode entrypoints: H264 CB/Main/High, HEVC Main/Main10 present
  - Decode entrypoints: MPEG2, AV1, JPEG, VP9 correctly reported
  - Config attributes: RTFormat, RateControl, PackedHeaders, MaxRefFrames
  - Error paths: invalid entrypoint, encode on decode-only profile
  - Config creation: all 5 encode profiles create+destroy
  - Surface creation: NV12, P010, 16x16, 4K, 16 simultaneous

meson.build:
  - test() targets for both test_encode and test_encode_config
  - 60s timeout per suite
  - Only built for native (not cross-compiled i386)

Total: 45 tests across 2 suites, all passing via `meson test`.
- Remove steps/ development notes (not for PR)
- Remove encode_handlers.h (merged into nvenc.h)
- Strip verbose block comments — project uses terse inline //
- Strip struct field comments in nvenc.h (match existing headers)
- Remove explanatory paragraphs from nvenc.c (B-frame, version, etc.)
- Remove file-level comment blocks from h264_encode.c, hevc_encode.c
- Use void* for encode handler signatures to avoid circular includes

Net: -427 lines, cleaner match to elFarto's code style.
- Fix GCC statement expression in CHECK_CUDA_RESULT_HELPER macro,
  replace with inline function (ISO C compliant)
- Fix variadic macro warnings: replace HELPER_LOG macro with proper
  va_list function (no ##__VA_ARGS__ GNU extension)
- Add const qualifiers to encode handler local variables (cppcheck)
- Remove unused variable surfObj from nvBeginPicture
- Remove stale debug LOG from nvBeginPicture encode path

Zero warnings with -Dwarning_level=3.
Zero cppcheck issues (excluding false positive unusedFunction).
Address gaps found in VA-API spec compliance audit:

- Add VAConfigAttribEncQualityRange (reports 7 levels, maps to NVENC P1-P7)
- Pass HRD buffer_size and initial_buffer_fullness to NVENC vbvBufferSize/
  vbvInitialDelay (was read but ignored, now applied in encoder init)
- Handle VAEncMiscParameterTypeHRD in HEVC path (was H.264 only)
- Add test for quality range attribute

Audit summary: 16/16 VA-API pipeline steps PASS. Remaining architectural
limitations (B-frames, packed header injection) documented in known
limitations.
Steam requests packed headers 0xd (SEQ+PIC+SLICE+MISC) but we only
reported 0x3 (SEQ+PIC), causing:
  ffmpeg warning: Driver does not support some wanted packed headers

NVENC generates all headers internally. We accept and silently skip
application-provided packed header buffers in nvRenderPictureEncode.
Advertising full support prevents the warning without changing behavior.
CUDA context: keep pushed for entire client session instead of
push/pop per frame. Eliminates GPU sync overhead (~0.5ms/frame).

Bitstream buffer: pre-allocate 4MB once in encoder_init, realloc
if needed. Eliminates 60 malloc+free per second.

Socket hardening:
- umask(0077) before bind to prevent permission race window
- listen backlog 2→8 for burst connection handling
- Remove SO_RCVTIMEO (could break large frame recv)
- Use poll(5000ms) in command loop for dead client detection
…deps

Detects driver version from dpkg (e.g. 580) and automatically installs:
- Build deps: meson, ninja, gcc, pkg-config, libva/drm/egl/ffnvcodec-dev
- 32-bit deps: gcc-multilib, i386 dev libs, libnvidia-compute/encode-XXX:i386
- Enables i386 architecture if needed

No more manual apt commands before running install.sh.
@efortin efortin marked this pull request as ready for review April 4, 2026 09:18
@efortin efortin marked this pull request as draft April 4, 2026 09:19
efortin added 6 commits April 4, 2026 11:19
- Remove reference to deleted encode_handlers.h
- Fix test count: 35 config tests (not 34)
- Fix SHM pipeline description: zero-copy, no memcpy
- Fix dead client detection: poll() not SO_RCVTIMEO
- Add CUDA context optimization to perf table (2.8ms)
- Add pre-allocated bitstream buffer to hardening list
- Clarify B-frame limitation: explain both enablePTD paths
- Add HDR limitation section
- Add cppcheck/warning_level=3 to hardening
- Fix PR elFarto#425 comparison: B-frames attempted, not fully working
- Update disclaimer wording
…ompat)

GStreamer's vaapih264enc/vaapih265enc calls vaQuerySurfaceAttributes on
encode configs and expects MinWidth/MinHeight/MaxWidth/MaxHeight with
VA_SURFACE_ATTRIB_GETTABLE flag set. Without these, GStreamer fails to
negotiate caps and refuses to encode.

Add all 5 required surface attributes with correct flags:
- VASurfaceAttribMinWidth/Height (16)
- VASurfaceAttribMaxWidth/Height (4096)
- VASurfaceAttribPixelFormat (NV12 or P010, GETTABLE+SETTABLE)

Tested: gst-launch-1.0 vaapih264enc and vaapih265enc both produce
valid 1080p output.
Ubuntu-only install script replaced by step-by-step markdown guides
for both Ubuntu and Fedora, covering 64-bit/32-bit build, nvenc-helper
service setup, and verification.
15 tests covering H.264/HEVC encode through gst-launch-1.0 pipelines:
prerequisites, file output, CBR bitrate, small/4K resolution, decode
regression, encode→decode round-trip, and stress (sequential restarts,
sustained 1080p60).
@Dravenoid

Hi, saw the Reddit post and wanted to help test on Blackwell.
Setup: RTX 5060 Ti 16GB, driver 595.58.03, CachyOS (Arch-based), KDE Plasma 6 Wayland. Using a OnePlus 12 (PJD110/CPH2573) on Android 14.

Good news: the IPC encoder initializes and runs on Blackwell:
[nvenc-helper] Client connected (fd=17)
[nvenc-helper] Init: 2560x1440 codec=1 profile=17 bitrate=29880000
[nvenc-helper] GPU buffer: 5529600 bytes, pitch=2560 (persistent CUDA+NVENC)
[nvenc-helper] Encoder initialized: 2560x1440 HEVC 8-bit (gpu=yes)
[nvenc-helper] Encoded 300 frames
[nvenc-helper] Encoded 600 frames
[nvenc-helper] Encoder closed (encoded 770 frames)
vainfo shows VAEntrypointEncSlice for H264, HEVC, and HEVC10. All good there.

The issue is the client never gets past the connecting screen.
Steam logs `Didn't get keyframe, resending lost data notification` in an endless loop and never renders anything. The slow framerate entry from that session stands out:
Slow framerate: game 0.00, capture 21.41, convert 18.76, encode 37.82, network 18.89, decode 2.52, display -13237907.00 (encode)
The encode time of 37.82ms is way higher than x264 sessions (~5ms), and display -13237907.00 looks like a timestamp overflow somewhere in the IPC path. Network packet quality was also stuck at ~46% for the entire session despite being on LAN; not sure if that's related.

steam_remote_play_logs.txt

Happy to test anything and pull more logs if it helps.
