
feat(core): Add NVENC Encoding Support via VA-API#427

Draft
efortin wants to merge 49 commits into elFarto:master from efortin:feat/nvenc-support

Conversation

@efortin efortin commented Apr 2, 2026

TL;DR

Disclaimer: I used to run Windows with a long-lived Ubuntu WSL setup, and when I switched to native Linux at home I didn't want to reintroduce it. Instead of going back to Windows, I decided to fix my Steam Remote Play setup with AI. It works, it's tested, but it carries the energy of 3AM debugging and "just one more fix". Review accordingly.

This PR adds VAEntrypointEncSlice (hardware encoding) to nvidia-vaapi-driver by wrapping NVIDIA's NVENC API. Any application using VA-API for encoding — Steam Remote Play, ffmpeg, GStreamer, OBS, Chromium — can now use NVIDIA hardware encoding on Linux.

For Blackwell GPUs (RTX 50xx) where NVIDIA dropped 32-bit CUDA support, a shared memory bridge delegates encoding to a 64-bit helper daemon. This is the exact scenario that breaks Steam Remote Play for every NVIDIA user on Linux.

What was broken

Steam Remote Play encoding pipeline on NVIDIA Linux:
1. Try NVENC direct → "NVENC - No CUDA support" (32-bit CUDA broken)
2. Try VA-API encode → fails (nvidia-vaapi-driver doesn't support it)
3. Fallback to libx264 software → 20fps, unusable

This has been open for 2+ years. Issue #116 (45+ thumbs up). Affects every NVIDIA GPU user on Linux who wants Steam Remote Play.

What this PR does

1. VA-API encode support (H.264 + HEVC)

Adds VAEntrypointEncSlice for:

  • H.264: Constrained Baseline, Main, High
  • HEVC: Main, Main10 (10-bit)

After this, vainfo shows encode entrypoints alongside the existing decode entrypoints. ffmpeg h264_vaapi and hevc_vaapi work out of the box.

2. Shared memory bridge (when CUDA is unavailable)

On Blackwell GPUs, 32-bit cuInit() fails with error 100. Steam's encoding runs in a 32-bit process (steamui.so).

Solution: a 64-bit helper daemon (nvenc-helper) that does the CUDA/NVENC work. The 32-bit driver communicates via shared memory (for frame pixels) and a Unix socket (for control and bitstream).

Steam 32-bit → vaDeriveImage → writes NV12 directly to shared memory
  → 16-byte signal via Unix socket
    → nvenc-helper 64-bit: cuMemcpy2D from SHM → persistent GPU buffer → NVENC
    ← HEVC/H.264 bitstream via socket (~10-30KB)
  ← VA-API coded buffer filled
← Steam streams to client

The bridge activates only when cuInit() fails. On systems where CUDA works (64-bit, or 32-bit pre-Blackwell), the driver uses NVENC directly — no helper, no overhead.

3. Everything else that was needed

Getting from "vainfo shows EncSlice" to "Steam Remote Play actually works" required fixing a cascade of issues:

  • vaDeriveImage implementation: Steam writes captured frames through derived images, not vaPutImage
  • DRM surface allocation without CUDA: GPU-backed surfaces via kernel DRM ioctls, no CUDA needed
  • NV12 pitch/height alignment: encoder uses 1088 (MB-aligned), surface has 1080 — copy only 1080 lines
  • Periodic IDR keyframes (every 60 frames): Steam sets intra_period=3600 — client can't recover from packet loss
  • IDR on idr_pic_flag from picture params: forward client keyframe requests to NVENC
  • Dead client detection via poll() timeout: helper was blocking forever on dead connections
  • NVIDIA opaque fds vs DMA-BUF fds: cuImportExternalMemory needs nvFd, not drmFd

Test results

71 automated tests across 4 suites, plus manual Steam validation.

Automated C test suite (meson test)

  • test_encode (full encode cycles, leak checks): 11 tests, all PASS
  • test_encode_config (capabilities, error paths, surfaces): 35 tests, all PASS
  • test_ipc_fuzz (IPC protocol robustness/security): 8 tests, all PASS
  • test_gstreamer (GStreamer VA-API encode pipelines): 17 tests, all PASS

GStreamer integration tests

End-to-end encode through gst-launch-1.0 with vaapih264enc / vaapih265enc:

  • H.264 320x240, 30 frames → fakesink: PASS
  • H.264 1080p, 60 frames → mp4 (5MB): PASS
  • H.264 720p CBR 2Mbps, 90 frames: PASS
  • H.264 CBR bitrate accuracy (5.6Mbps vs 5Mbps target, within 30%): PASS
  • H.264 VBR bitrate control (4.0Mbps, target 5Mbps): PASS
  • H.264 256x256 small resolution: PASS
  • H.264 4K, 5 frames: PASS
  • HEVC 320x240, 30 frames → fakesink: PASS
  • HEVC 1080p, 60 frames → mp4: PASS
  • HEVC 4K, 5 frames: PASS
  • H.264 encode → decode round-trip: PASS
  • vaapih264dec still available: PASS
  • vaapih265dec still available: PASS
  • 10 sequential H.264 pipeline restarts: PASS
  • H.264 1080p60, 300 frames sustained: PASS

Security testing

  • AddressSanitizer (ASAN): no buffer overflows, no use-after-free, no double-free in our code; 128-byte leak detected in the original project's list.c (pre-existing, not introduced by this PR)
  • UndefinedBehaviorSanitizer (UBSAN): no integer overflow, no null dereference, no alignment issues
  • IPC fuzz, invalid command (0xFF): PASS — helper rejects cleanly
  • IPC fuzz, zero-size init payload: PASS — rejected, no crash
  • IPC fuzz, truncated message (disconnect mid-transfer): PASS — helper survives
  • IPC fuzz, payload_size=0xFFFFFFFF (4GB malloc bomb): PASS — capped at 64MB, rejected
  • IPC fuzz, encode without init: PASS — rejected with error status
  • IPC fuzz, 50 rapid connect/disconnect cycles: PASS — helper stays alive, no fd leak
  • IPC fuzz, CMD_CLOSE without init: PASS — accepted gracefully
  • IPC fuzz, double init (re-init encoder): PASS — old encoder destroyed, new one created
  • Helper stability after all fuzz tests: 17 fds (stable), 46MB RSS (no growth), 0.2% CPU idle

Manual integration tests

  • vainfo encode entrypoints: PASS — 5 EncSlice profiles
  • H.264 1080p30 (ffmpeg): PASS — High profile, valid output
  • HEVC 1080p30 (ffmpeg): PASS — Main profile, valid output
  • HEVC Main10 10-bit: PASS — yuv420p10le
  • 1440p60 stress (60s): PASS — 3600 frames, no crash
  • Bitrate control (CBR 5Mbps): PASS — within 20% of target
  • NVDEC decode regression: PASS — unchanged
  • GPU encode (nvidia-smi): PASS — 12% encoder util, 159fps
  • Sequential encodes (leak check): PASS — 10 runs, 0 errors
  • 32-bit driver init: PASS — 5 encode, 0 decode entrypoints
  • Steam Remote Play (Mac Steam Link): PASS — VAAPI H264, 60fps, 0% loss
  • Steam Remote Play (Legion Go): PASS — VAAPI HEVC, 60fps
  • nvenc-helper systemd service: PASS — auto-start, auto-restart

Performance optimizations

The shared memory bridge went through several optimization rounds:

  • Baseline (socket transfer): ~8ms; 3MB frame sent over Unix socket per frame
  • Shared memory (memfd): ~6ms; frame data in SHM, only 16-byte signal over socket
  • SHM zero-copy redirect: ~5ms; vaDeriveImage maps directly to SHM, skip memcpy
  • Eliminate redundant memset: ~4ms; only zero 8 padding rows, not entire 3MB buffer
  • Persistent CUDA buffer + cuMemcpy2D: ~3ms; GPU DMA engine handles host→device + pitch in HW
  • CUDA context kept active per session: ~2.8ms; eliminate per-frame cuCtxPushCurrent/PopCurrent

Final pipeline (1080p NV12):

Steam writes NV12 → SHM (zero-copy via vaDeriveImage)
  → 16-byte signal via socket
  → Helper: 2× cuMemcpy2D (host→device, DMA engine) → persistent CUDA buffer
  → NVENC encodes from VRAM (no PCIe upload at encode time)
  → Bitstream back via socket (~10-30KB)

Code hardening

All code reviewed for production reliability:

  • Zero warnings at -Dwarning_level=3, zero cppcheck issues
  • All CUDA/NVENC return values checked (no silent failures)
  • Socket frame_size capped at 64MB (prevents malloc bomb from corrupt data)
  • File descriptors tracked and closed (no fd leaks, verified with /proc/pid/fd)
  • Dead client detection via poll() with 5s timeout
  • Derived image buffer ownership tracked (sentinel prevents double-free)
  • DMA-BUF fds properly closed on partial import failure
  • NVIDIA opaque fds closed in surface destroy
  • Pre-allocated bitstream output buffer (no per-frame malloc)
  • CUDA context kept pushed for entire client session (no per-frame sync)

What Steam actually uses

From streaming logs, Steam's ffmpeg VA-API encode pipeline uses:

  • Sequence params (resolution, bitrate, framerate, GOP): used by Steam; fully mapped to NVENC
  • Picture params (coded_buf, idr_pic_flag): used by Steam; working, IDR forwarded
  • Rate control misc (bits_per_second, target_percentage): used by Steam; applied to NVENC RC
  • Framerate misc: used by Steam; applied
  • HRD misc (buffer_size): used by Steam; applied to NVENC vbvBufferSize
  • Packed headers (SEQ+PIC+SLICE+MISC): used by Steam; accepted (NVENC generates its own, no warning)
  • Quality level: quality=0 (default); VAConfigAttribEncQualityRange reported
  • vaDeriveImage + vaMapBuffer: used by Steam every frame; implemented, zero-copy SHM redirect
  • vaExportSurfaceHandle: not used; implemented but Steam doesn't call it
  • vaPutImage: not used; implemented but Steam uses vaDeriveImage instead

Known limitations

No B-frames

frameIntervalP=1 always. NVENC with enablePTD=1 and B-frames returns NV_ENC_ERR_NEED_MORE_INPUT for reordered frames, producing empty coded buffers. ffmpeg 6.x vaapi_encode asserts on empty coded buffers. With enablePTD=0, NVENC requires full DPB (Decoded Picture Buffer) reference frame management which Intel drivers handle in hardware but NVENC delegates to the caller.

Not a problem for streaming (B-frames add latency). For offline transcoding with B-frames, use h264_nvenc/hevc_nvenc directly.

Packed headers

Driver advertises full packed header support (SEQ+PIC+SLICE+MISC). NVENC generates its own SPS/PPS/VPS headers internally. Application-provided packed headers are accepted and silently skipped.

32-bit encode-only

When the shared memory bridge is active (CUDA unavailable), only encoding works — no hardware decode. Steam only needs encode on the server side, so this is fine.

HDR

VA-API encode specification does not include color metadata fields (colour_primaries, transfer_characteristics) in sequence parameter structs. Intel drivers have the same limitation — HDR metadata only passes through packed headers (which NVENC generates internally). HDR encode requires direct NVENC (hevc_nvenc with -color_primaries bt2020).

How to test

git clone https://github.com/efortin/nvidia-vaapi-driver
cd nvidia-vaapi-driver && git checkout feat/nvenc-support

Then follow the step-by-step guide for your distro.

No environment variables needed — just launch Steam.

Hardware tested

  • GPU: NVIDIA GeForce RTX 5070 Ti (Blackwell, 16GB GDDR7)
  • Driver: 580.126.09 / 580.126.18 (open kernel modules)
  • OS: Ubuntu 24.04 LTS, Fedora 43
  • CUDA: 13.0
  • Steam client: 32-bit (steamui.so)
  • Clients: macOS Steam Link, SteamOS Legion Go

efortin added 22 commits April 2, 2026 22:00
Wrap NVIDIA's NVENC API behind the VA-API encoding interface, enabling
any application using VA-API for encoding (Steam Remote Play, GStreamer,
ffmpeg h264_vaapi/hevc_vaapi) to use NVIDIA hardware encoding on Linux.

Supported profiles: H.264 (Baseline/Main/High), HEVC (Main/Main10).

Uses low-latency P4 preset with no B-frames for synchronous encode,
optimal for game streaming. Gracefully degrades to decode-only if
libnvidia-encode.so is unavailable.
- Fix NVENC session leak on cuCtxPopCurrent failure in nvCreateContext
- Fix coded buffer object leak on bitstreamData allocation failure
- Fix EOS flush ordering: flush encoder before freeing output buffer
- Fix integer overflow in rate control bitrate calculation (uint32 * uint32)
- Fix nvPutImage to respect src/dest offset parameters per VA-API spec
- Add missing bitrate extraction from HEVC sequence parameters
- Remove dead NVENCInputSurface struct and unused macros
- Remove unused drv parameter from nvRenderPictureEncode
- Normalize naming convention: hevc_enc_ -> hevcenc_ to match h264enc_
Add meson cross-file for building a 32-bit (i386) version of the driver,
needed by Steam Remote Play which uses a 32-bit ffmpeg for VA-API encode.

Usage: meson setup build32 --cross-file cross-i386.txt && meson compile -C build32
Install: cp build32/nvidia_drv_video.so /usr/lib/i386-linux-gnu/dri/

Note: 32-bit CUDA (cuInit) fails on driver 580+ with Blackwell GPUs,
blocking the 32-bit encode path until NVIDIA fixes their 32-bit driver.
32-bit CUDA is broken on driver 580+ with Blackwell GPUs (cuInit returns
error 100). This blocks the 32-bit VA-API driver from using NVENC directly.

Add a 64-bit helper daemon (nvenc-helper) that runs as a separate process
where CUDA works. The 32-bit driver detects CUDA failure, enters
encode-only mode, and forwards encode operations to the helper via a
Unix domain socket at $XDG_RUNTIME_DIR/nvenc-helper.sock.

Architecture:
  32-bit steam → 32-bit steamui.so → 32-bit libavcodec → 32-bit libva
    → 32-bit nvidia_drv_video.so (encode-only, no CUDA)
      → Unix socket → 64-bit nvenc-helper
        → 64-bit CUDA + NVENC (works)
      ← encoded bitstream
    ← VA-API coded buffer

The helper uses NVENC's own input buffer management (nvEncCreateInputBuffer
+ nvEncLockInputBuffer) instead of CUDA memory, making the data path:
socket recv → memcpy into NVENC buffer → hardware encode → bitstream back.

The helper auto-starts on first encode and exits after 30s idle.

When CUDA is available (64-bit), the direct NVENC path is used as before
with zero overhead — the IPC path is only activated when cuInit fails.
- Remove 30s idle timeout from accept loop — helper now runs until
  SIGTERM/SIGINT (was causing premature exit before any client connects)
- Always enable logging to stderr for diagnostics
- Continue listening after accept() errors instead of exiting
- Log "Ready for next client" between sessions
- Add multi-path helper discovery in the driver (libexec, local/libexec)
- Try connect to running helper before attempting to start a new one
Guard cuCtxPushCurrent/cuCtxPopCurrent in nvCreateSurfaces2 behind
cudaAvailable check. In encode-only IPC mode, surfaces only need
host-side metadata — no GPU memory allocation required.

This fixes "Failed to create surface: 1 (operation failed)" that
Steam's 32-bit ffmpeg hit when trying to use our encode-only driver.
Steam Remote Play client needs a complete IDR keyframe with SPS/PPS/VPS
headers to start decoding. Without FORCEIDR, the first frame was encoded
as a non-IDR which the client couldn't decode, causing "Didn't get
keyframe" errors and 99% frame loss.

Also add periodic frame count logging to helper for diagnostics.
Without a timeout, the helper blocks forever in recv_all() when a client
dies without sending CMD_CLOSE. This prevents new clients from connecting
since the helper is single-threaded.

Add SO_RCVTIMEO of 5 seconds on client sockets. If no data arrives for
5s, the recv fails, the helper cleans up the encoder and goes back to
accept() for the next client.
Steam captures the desktop via OpenGL and passes GPU-resident NV12
surfaces to the VA-API encoder as DMA-BUF file descriptors through
vaCreateSurfaces attrib_list. The previous IPC path sent empty pixel
data because vaPutImage is never called in this flow.

New architecture:
1. nvCreateSurfaces2: parse attrib_list for VASurfaceAttribMemoryType
   (DRM_PRIME/DRM_PRIME_2) and VASurfaceAttribExternalBufferDescriptor.
   Extract DMA-BUF fd, dup() it, store in NVSurface.
2. nvEndPictureEncodeIPC: if surface has importedDmaBufFd, send it to
   the 64-bit helper via SCM_RIGHTS Unix socket ancillary data.
3. nvenc-helper CMD_ENCODE_DMABUF: receive the fd, import into CUDA
   via cuImportExternalMemory, map to CUdeviceptr, register with NVENC,
   encode, return bitstream. Full GPU zero-copy — no host memory touch.

This is the true GPU-accelerated path: Steam's OpenGL capture → DMA-BUF
→ CUDA import (64-bit helper) → NVENC encode → bitstream back via IPC.
No pixel data crosses the socket, only the fd and encoded output.
The 32-bit driver couldn't provide pixel data to the encoder because
Steam renders captured frames into VA-API surfaces via OpenGL/DMA-BUF,
not through vaPutImage. Without GPU-backed surfaces, the frames were
empty.

Fix: initialize the NVIDIA DRM direct backend even in IPC mode.
The DRM backend allocates GPU memory and exports DMA-BUF fds without
needing CUDA (it uses kernel DRM ioctls). This gives surfaces real
GPU backing that Steam can render into via OpenGL.

Changes:
- direct-export-buf.c: skip CUDA import in alloc_backing_image when
  cudaAvailable is false; skip CUDA calls in findGPUIndexFromFd
- vabackend.c: init DRM backend in IPC mode; realise surfaces before
  encoding; use backing image DMA-BUF fd for IPC encode; guard
  vaExportSurfaceHandle CUDA calls; clean up DRM resources on terminate
- Handle surface destroy with backing images in IPC mode
The DRM backend produces separate DMA-BUF fds per plane (Y, UV) as
tiled GPU textures. NVENC needs a single linear NV12 CUdeviceptr.

Previous approach tried cuImportExternalMemory with a single fd as a
flat buffer → CUDA error 999 (the fd is a tiled texture, not linear).

New approach matches the direct encode path:
1. Send all plane fds via SCM_RIGHTS (up to 4)
2. Helper imports each fd → CUexternalMemory → CUmipmappedArray → CUarray
3. cuMemcpy2D each plane from CUarray to a linear CUdeviceptr
4. Register linear buffer with NVENC, encode, return bitstream
5. Clean up all CUDA resources

This is the same import→copy→encode pipeline as the working 64-bit
direct path, just running in the helper process.
The DRM backend produces two types of fds per allocation:
- nvFd: NVIDIA-specific opaque handle (for CUDA import)
- drmFd: DMA-BUF fd (for DRM/EGL/OpenGL export)

cuImportExternalMemory with CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD
requires the NVIDIA opaque fd (nvFd), not the DMA-BUF fd (drmFd).
Sending drmFd caused CUDA error 999 (unknown).

Fix: store nvFd and memorySize in BackingImage when CUDA is unavailable
(IPC mode). Send dup'd nvFds to the helper for CUDA import. The helper
can now successfully import the GPU memory into its 64-bit CUDA context.
Steam's OpenGL capture pipeline renders into VA-API surfaces BEFORE
calling vaBeginPicture/vaEndPicture. If the surface has no GPU memory
at creation time, the capture renders into nothing → green screen.

Allocate backing images immediately in nvCreateSurfaces2 when in IPC
encode-only mode. This gives surfaces real GPU memory (via DRM ioctls)
that Steam can export via vaExportSurfaceHandle, import into OpenGL
as a render target, and render captured frames into.

The encode path then reads the same GPU memory via CUDA import in
the 64-bit helper.
Steam's ffmpeg calls vaDeriveImage to map VA-API surfaces to CPU memory,
then writes the captured NV12 desktop frame into the mapped buffer.
Without this, the surfaces have no pixel data → green screen.

Implement vaDeriveImage in the IPC (no-CUDA) path:
- Allocate a host-memory buffer on the surface (hostPixelData)
- Return a VAImage backed by this shared buffer
- Steam's vaMapBuffer returns the host pointer
- Steam writes captured frame → host buffer
- nvEndPictureEncodeIPC sends host buffer to helper via IPC
- Helper encodes via NVENC's own input buffer (nvEncLockInputBuffer)

The derived image buffer is marked as non-owning (sentinel offset=-1)
so nvDestroyImage doesn't free the surface's memory.

This completes the pixel data pipeline:
  Steam OpenGL capture → vaDeriveImage → vaMapBuffer → write NV12
  → vaEndPicture → IPC send pixel data → helper encodes → bitstream
vaDeriveImage writes captured pixels to surface->hostPixelData, but
the encode path was checking DMA-BUF first and finding the (empty)
GPU backing image. The GPU surface has no pixel data because Steam
writes via vaDeriveImage to host memory, not to the GPU surface.

Reverse priority: check hostPixelData first (has actual captured
pixels from vaDeriveImage), fall back to DMA-BUF only if no host
data is available.
Steam requests IDR keyframes via idr_pic_flag in picture params when
the client loses sync (packet loss, reconnection). Without forwarding
this flag, the encoder never produces new keyframes after the first
frame, and the client can't recover → "Didn't get keyframe" loop.

- Parse idr_pic_flag from H.264/HEVC picture parameter buffers
- Store as forceIDR flag on NVENCContext
- Pass through IPC protocol (new force_idr field in encode params)
- Helper's encoder_encode uses it for NV_ENC_PIC_FLAG_FORCEIDR
- Also fix the direct 64-bit encode path to respect forceIDR
Steam reuses the same surface for every frame. vaDeriveImage maps the
surface's hostPixelData, Steam writes captured pixels into it, then
vaEndPicture sends the data to the helper. But Steam can start writing
the NEXT frame while the IPC send is still transmitting the current
frame (~3MB @ 1080p) → visual tearing and overlay artifacts.

Fix: memcpy the frame into a snapshot buffer before sending via IPC.
The snapshot is a consistent image that won't be modified during
transmission. Adds ~3MB memcpy per frame (~1ms at DDR5 bandwidth)
which is negligible vs the 7ms encode time.
The encoder is initialized at MB-aligned height (e.g. 1088 for 1080p)
but the surface and vaDeriveImage host buffer contain exactly
surface_height pixels (1080). The helper was copying enc->height
(1088) lines from a buffer with only 1080 → buffer overread causing
horizontal line artifacts across the entire image.

Fix:
- vabackend.c: send surface->width/height to IPC, not nvencCtx dimensions
- nvenc-helper: encoder_encode takes explicit frame_width/frame_height,
  copies only that many lines, zero-pads the MB-aligned remainder.
  Chroma offset calculated from frame_height (actual data position),
  destination chroma at dstPitch * enc->height (encoder's full height).
install.sh handles the full build + install:
- Builds 64-bit driver + nvenc-helper
- Cross-compiles 32-bit driver (if i386 arch enabled)
- Installs both drivers to system dri paths
- Installs nvenc-helper to /usr/libexec
- Creates and enables systemd user service for nvenc-helper
- Verifies installation

No environment variables needed — libva auto-detects the NVIDIA
driver from the DRM device, and NVD_BACKEND defaults to direct.
Steam sets intra_period=3600 (60 seconds between keyframes). When a
single packet is lost, the client requests a new keyframe but has to
wait up to 60 seconds → stream freezes and Steam restarts the encoder.

Force an IDR every 60 frames (~1 second at 60fps) so the client can
recover from packet loss within 1 second. This matches the behavior
of other streaming-optimized encoders (OBS, Moonlight/Sunshine).
Replace the 3MB socket send/recv per frame with shared memory (memfd).
The helper creates a shm region on CMD_INIT and sends the fd to the
client via SCM_RIGHTS. The client mmap's it and writes frames directly.
Only a small CMD_ENCODE_SHM header (16 bytes) goes over the socket.

Before: snapshot memcpy(3MB) + send_all(3MB) + recv_all(3MB) + NVENC copy
After:  memcpy(3MB to shm)  + send(16 bytes) + NVENC copy from shm

Saves ~6ms per frame at 1080p by eliminating 2 full-frame socket
transfers. Falls back to socket path if shm creation fails.
@efortin efortin marked this pull request as draft April 3, 2026 00:48
efortin added 6 commits April 3, 2026 08:05
- tests/encoding-tests.md: 12 test cases covering 64-bit encode,
  32-bit IPC encode, Steam Remote Play, systemd service, decode
  regression, stress test, 10-bit, bitrate control, leak check
- Document B-frame limitation: ffmpeg 6.x vaapi_encode asserts on
  empty coded buffers from NV_ENC_ERR_NEED_MORE_INPUT. Verified by
  testing — enabling B-frames via ip_period>1 causes assertion failure.
  Users needing B-frames should use h264_nvenc/hevc_nvenc directly.
- Improve B-frame documentation in nvenc.c with explanation of why
  and alternative for offline transcoding
The IPC encode helper is only used when cuInit() fails, not based on
process architecture. A 32-bit process on Turing/Ampere/Ada where
cuInit works will use direct NVENC, same as 64-bit.

The decision path:
  cuInit(0) succeeds → cudaAvailable=true → direct NVENC (no IPC)
  cuInit(0) fails    → cudaAvailable=false → IPC helper bridge

Updated comments throughout to say "CUDA unavailable" instead of
"32-bit" to avoid implying the bridge is always used for 32-bit.
Full documentation covering:
- Problem statement (VA-API encode missing, 32-bit CUDA broken)
- Two encode paths: direct NVENC vs shared memory bridge
- Path selection logic (cuInit success/fail, not architecture)
- Data flow diagrams for shared memory frame transfer
- Control protocol (Unix socket commands)
- Surface management in bridge mode
- All edge cases: encoder height padding, IDR recovery, frame
  tearing, dead client detection, object ID growth, B-frame
  limitation, DMA-BUF path
- Supported profiles, installation, debugging
Critical segfault fixes:
- Check cuMemAlloc/cuMemcpy2D returns in DMABUF path (was crashing
  silently on allocation failure)
- Cap frame_size from socket to 64MB max (prevents malloc bomb from
  malicious/corrupt data)
- Use fixed drain buffer instead of malloc(untrusted_size)
- Add NULL check for buf->ptr in nvMapBuffer
- Close shm_fd when shm_fd_out is NULL (fd leak)

Leak fixes:
- Don't send fd=-1 via SCM_RIGHTS (undefined behavior) — use
  send_response() for shm fallback path
- Close unclaimed DMABUF fds on partial import failure
- Close nvFds[] in destroyBackingImage for IPC mode

Correctness:
- Zero NVENC input buffer luma (0) and chroma (128=neutral UV)
  separately instead of blanket memset that could over-zero
- Make IDR interval a #define (NVENC_HELPER_IDR_INTERVAL=60)
- Fix stale "30s idle timeout" comment in helper header
- Reduce hot-path logging (picture params only logged for first 3
  frames to avoid 60fps log flood)

Documentation:
- Add edge case table: 15 potential failure scenarios with behavior
  and mitigation
- Add known non-working scenarios table: 7 unsupported cases with
  reasons
@efortin efortin changed the title Feat/nvenc support feat(core): Add NVENC Encoding Support via VA-API Apr 3, 2026
efortin added 15 commits April 3, 2026 09:21
Before (per frame at 1080p NV12):
  Driver: memcpy 3MB hostPixelData → shmPtr (snapshot)
  Helper: memset 3MB (full buffer clear) + line-by-line memcpy 3MB
  Total: 9MB memory bandwidth per frame, 540MB/s at 60fps

After (per frame at 1080p NV12):
  Driver: zero copy (vaDeriveImage maps directly to SHM)
  Helper: bulk memcpy 3MB (when pitches match) + memset 8 rows only
  Total: 3MB memory bandwidth per frame, 180MB/s at 60fps (3x reduction)

Changes:
- vaDeriveImage: redirect surface hostPixelData to SHM region after
  encoder init. Steam writes directly to shared memory. Zero copy.
- hostPixelIsShm flag: prevents free() on mmap'd SHM pointer
- encoder_encode: fast path when srcPitch == dstPitch (single memcpy
  instead of 1080 individual line copies)
- encoder_encode: only zero MB-alignment padding rows (8 rows for
  1080→1088) instead of clearing entire 3MB buffer every frame
- Skip redundant memcpy in EndPicture when hostPixelData IS shmPtr
Parse H.264/HEVC slice_type from VA-API slice parameter buffers and
map to NVENC picture types (I/P/B/IDR). The picType field is stored
on NVENCContext for each frame.

B-frames remain disabled (frameIntervalP=1, enablePTD=1) because:
1. NVENC with enablePTD=0 requires full DPB reference frame management
   (reference picture lists, reference frame marking) which Intel's
   VA-API driver handles internally with its hardware encoder
2. NVENC with enablePTD=1 handles references but returns
   NV_ENC_ERR_NEED_MORE_INPUT for B-frames → ffmpeg 6.x asserts
3. LOW_LATENCY tuning internally overrides frameIntervalP to 1

The slice type parsing infrastructure is ready for when full DPB
management is implemented. For now, -bf 2 gracefully falls back to
IPP (no crash, no B-frames in output).

Tested: verified enablePTD=0 with explicit picture types — NVENC
encodes all frames as I-only because DPB references aren't managed.
Full DPB management is tracked as a future enhancement.
tests/test_encode.c: 11 self-contained tests covering:
- Entrypoints: H.264 + HEVC VAEntrypointEncSlice present
- Config: RTFormat YUV420, rate control CQP/CBR/VBR
- Lifecycle: create/destroy config, surfaces, context (no leak)
- H.264 encode: High, Main, ConstrainedBaseline (1 frame each)
- HEVC encode: Main profile (1 frame)
- Stress: 10 sequential create/encode/destroy cycles
- Coded buffer reuse: 5 frames with same coded buffer
- Regression: VLD decode entrypoints still present

Build: gcc -o test_encode tests/test_encode.c -lva -lva-drm -lm
Run: ./test_encode [h264|hevc]

Inspired by Intel VA-API driver's GTest-based test suite but
implemented in pure C for compatibility with the project.
Replace nvEncLockInputBuffer (host memory) + line-by-line memcpy with
a persistent CUDA device buffer registered once with NVENC.

Before (per frame):
  nvEncLockInputBuffer → host pointer
  1620× memcpy (1080 luma + 540 chroma lines, pitch conversion)
  nvEncUnlockInputBuffer → DMA upload to GPU
  Total: ~3-4ms (host memcpy + PCIe transfer)

After (per frame):
  2× cuMemcpy2D (luma + chroma, host→device, pitch conversion in HW)
  nvEncMapInputResource (already in VRAM)
  nvEncEncodePicture (reads from VRAM, no PCIe upload)
  nvEncUnmapInputResource
  Total: ~1-2ms (GPU DMA engine handles pitch + transfer)

Benefits:
- Single CUDA call replaces 1080 individual memcpy calls per plane
- GPU DMA engine handles pitch conversion in hardware
- NVENC reads from device memory (no PCIe upload at encode time)
- Persistent buffer avoids per-frame alloc/register/unregister
- Falls back to host path if CUDA alloc or NVENC register fails
…tion

Inspired by Intel's i965 test infrastructure (gtest-based), add a C test
framework with equivalent coverage:

tests/test_common.h:
  - EXPECT_STATUS, EXPECT_TRUE, EXPECT_NOT_NULL macros
  - TestTimer for performance benchmarks
  - test_has_entrypoint() helper for parametrized profile testing
  - Global VA display setup/teardown

tests/test_encode_config.c (34 tests):
  - Encode entrypoints: H264 CB/Main/High, HEVC Main/Main10 present
  - Decode entrypoints: MPEG2, AV1, JPEG, VP9 correctly reported
  - Config attributes: RTFormat, RateControl, PackedHeaders, MaxRefFrames
  - Error paths: invalid entrypoint, encode on decode-only profile
  - Config creation: all 5 encode profiles create+destroy
  - Surface creation: NV12, P010, 16x16, 4K, 16 simultaneous

meson.build:
  - test() targets for both test_encode and test_encode_config
  - 60s timeout per suite
  - Only built for native (not cross-compiled i386)

Total: 45 tests across 2 suites, all passing via `meson test`.
- Remove steps/ development notes (not for PR)
- Remove encode_handlers.h (merged into nvenc.h)
- Strip verbose block comments — project uses terse inline //
- Strip struct field comments in nvenc.h (match existing headers)
- Remove explanatory paragraphs from nvenc.c (B-frame, version, etc.)
- Remove file-level comment blocks from h264_encode.c, hevc_encode.c
- Use void* for encode handler signatures to avoid circular includes

Net: -427 lines, cleaner match to elFarto's code style.
- Fix GCC statement expression in CHECK_CUDA_RESULT_HELPER macro,
  replace with inline function (ISO C compliant)
- Fix variadic macro warnings: replace HELPER_LOG macro with proper
  va_list function (no ##__VA_ARGS__ GNU extension)
- Add const qualifiers to encode handler local variables (cppcheck)
- Remove unused variable surfObj from nvBeginPicture
- Remove stale debug LOG from nvBeginPicture encode path

Zero warnings with -Dwarning_level=3.
Zero cppcheck issues (excluding false positive unusedFunction).
Address gaps found in VA-API spec compliance audit:

- Add VAConfigAttribEncQualityRange (reports 7 levels, maps to NVENC P1-P7)
- Pass HRD buffer_size and initial_buffer_fullness to NVENC vbvBufferSize/
  vbvInitialDelay (was read but ignored, now applied in encoder init)
- Handle VAEncMiscParameterTypeHRD in HEVC path (was H.264 only)
- Add test for quality range attribute

Audit summary: 16/16 VA-API pipeline steps PASS. Remaining architectural
limitations (B-frames, packed header injection) documented in known
limitations.
Steam requests packed headers 0xd (SEQ+PIC+SLICE+MISC) but we only
reported 0x3 (SEQ+PIC), causing:
  ffmpeg warning: Driver does not support some wanted packed headers

NVENC generates all headers internally. We accept and silently skip
application-provided packed header buffers in nvRenderPictureEncode.
Advertising full support prevents the warning without changing behavior.
CUDA context: keep pushed for entire client session instead of
push/pop per frame. Eliminates GPU sync overhead (~0.5ms/frame).

Bitstream buffer: pre-allocate 4MB once in encoder_init, realloc
if needed. Eliminates 60 malloc+free per second.

Socket hardening:
- umask(0077) before bind to prevent permission race window
- listen backlog 2→8 for burst connection handling
- Remove SO_RCVTIMEO (could break large frame recv)
- Use poll(5000ms) in command loop for dead client detection
…deps

Detects driver version from dpkg (e.g. 580) and automatically installs:
- Build deps: meson, ninja, gcc, pkg-config, libva/drm/egl/ffnvcodec-dev
- 32-bit deps: gcc-multilib, i386 dev libs, libnvidia-compute/encode-XXX:i386
- Enables i386 architecture if needed

No more manual apt commands before running install.sh.
@efortin efortin marked this pull request as ready for review April 4, 2026 09:18
@efortin efortin marked this pull request as draft April 4, 2026 09:19
efortin added 6 commits April 4, 2026 11:19
- Remove reference to deleted encode_handlers.h
- Fix test count: 35 config tests (not 34)
- Fix SHM pipeline description: zero-copy, no memcpy
- Fix dead client detection: poll() not SO_RCVTIMEO
- Add CUDA context optimization to perf table (2.8ms)
- Add pre-allocated bitstream buffer to hardening list
- Clarify B-frame limitation: explain both enablePTD paths
- Add HDR limitation section
- Add cppcheck/warning_level=3 to hardening
- Fix PR elFarto#425 comparison: B-frames attempted, not fully working
- Update disclaimer wording
…ompat)

GStreamer's vaapih264enc/vaapih265enc calls vaQuerySurfaceAttributes on
encode configs and expects MinWidth/MinHeight/MaxWidth/MaxHeight with
VA_SURFACE_ATTRIB_GETTABLE flag set. Without these, GStreamer fails to
negotiate caps and refuses to encode.

Add all 5 required surface attributes with correct flags:
- VASurfaceAttribMinWidth/Height (16)
- VASurfaceAttribMaxWidth/Height (4096)
- VASurfaceAttribPixelFormat (NV12 or P010, GETTABLE+SETTABLE)

Tested: gst-launch-1.0 vaapih264enc and vaapih265enc both produce
valid 1080p output.
Ubuntu-only install script replaced by step-by-step markdown guides
for both Ubuntu and Fedora, covering 64-bit/32-bit build, nvenc-helper
service setup, and verification.
15 tests covering H.264/HEVC encode through gst-launch-1.0 pipelines:
prerequisites, file output, CBR bitrate, small/4K resolution, decode
regression, encode→decode round-trip, and stress (sequential restarts,
sustained 1080p60).
@Dravenoid

Hi, saw the Reddit post and wanted to help test on Blackwell.
Setup: RTX 5060 Ti 16GB, driver 595.58.03, CachyOS (Arch-based), KDE Plasma 6 Wayland. Using a OnePlus 12 (PJD110/CPH2573) on Android 14.

Good news: the IPC encoder initializes and runs on Blackwell:
[nvenc-helper] Client connected (fd=17)
[nvenc-helper] Init: 2560x1440 codec=1 profile=17 bitrate=29880000
[nvenc-helper] GPU buffer: 5529600 bytes, pitch=2560 (persistent CUDA+NVENC)
[nvenc-helper] Encoder initialized: 2560x1440 HEVC 8-bit (gpu=yes)
[nvenc-helper] Encoded 300 frames
[nvenc-helper] Encoded 600 frames
[nvenc-helper] Encoder closed (encoded 770 frames)
vainfo shows VAEntrypointEncSlice for H264, HEVC, and HEVC10. All good there.

The issue is the client never gets past the connecting screen.
Steam logs `Didn't get keyframe, resending lost data notification` in an endless loop and never renders anything. The slow framerate entry from that session stands out:
Slow framerate: game 0.00, capture 21.41, convert 18.76, encode 37.82, network 18.89, decode 2.52, display -13237907.00 (encode)
The encode time of 37.82ms is way higher than x264 sessions (~5ms), and display -13237907.00 looks like a timestamp overflow somewhere in the IPC path. Network packet quality was also stuck at ~46% for the entire session despite being on LAN; not sure if that's related.

steam_remote_play_logs.txt

Happy to test anything and pull more logs if it helps.
