[ROCm] Add a HIP build path for AMD GPUs (multi-arch wave32/64) by jeffdaily · Pull Request #89 · fixstars/libSGM

jeffdaily · 2026-06-04T06:15:26Z

This adds an opt-in ROCm/HIP build (-DUSE_HIP=ON) alongside the existing CUDA path, which is left byte-identical. The work was authored with the assistance of Claude, an AI assistant by Anthropic.

The mechanical part is a single compat header (src/cuda_to_hip.h): on a HIP build it includes <hip/hip_runtime.h> and aliases the small CUDA runtime surface the host code uses (malloc/free/memcpy/memset, device-attribute query, the eight aggregation streams, the error type/string); on a CUDA build it is a thin passthrough to <cuda_runtime.h>. The .cu and the two device-image .cpp now include it instead of <cuda_runtime.h>. CMake gains a USE_HIP option that enables the HIP language, marks the .cu as LANGUAGE HIP, and gates find_package(CUDAToolkit) off. The HIP architecture is auto-detected by enable_language(HIP) from the build machine (gfx90a fallback; -DCMAKE_HIP_ARCHITECTURES overrides), and the README documents the ROCm build.

Two substantive correctness items, both validated bit-exactly by the shipped gtest suite:

Wavefront width, correct for a multi-arch build. The design overloads one WARP_SIZE constant for shuffle-subgroup partitioning in the eight path-aggregation kernels, the winner-take-all per-lane layout (REDUCTION_PER_THREAD = MAX_DISPARITY / WARP_SIZE), and device block sizing (BLOCK_SIZE = WARP_SIZE * N). The DEVICE width is keyed per arch off the __GFX*__ macros (64 on wave64 GCN (gfx8/gfx9, e.g. gfx90a), 32 on RDNA and on CUDA), so each device slice of a gfx90a;gfx1100 build compiles with its own correct width. The HOST must not bake in a compile-time width: hipcc's host pass can target several arches at once and there is no single right answer. The previous approach injected one SGM_HOST_WARP_SIZE from a CMake arch scan, which a mixed gfx90a;gfx1100 build resolved to 64 -- the host then launched bdim=512 blocks against the gfx1100 device kernel built for 32-lane geometry, an out-of-range warp/lane mismatch on the gfx1100 slice. Now every host launcher derives its block/grid dims from a runtime warpSize query (host_utility.h device_warp_size(), via hipDeviceGetAttribute, cached per device), so host and device agree on every arch in the build. WARP_SIZE retains a host-pass value only so hipcc can parse the __global__ bodies into launch stubs; it never drives runtime launch geometry.
SIMD video intrinsics. median_filter.cu's vectorized 2x/4x median path uses __vcmpgtu2/4, __vminu2/4, __vmaxu2/4, which ROCm 7.x HIP does not provide. They are software-emulated under USE_HIP with the exact CUDA per-lane semantics (two 16-bit halfwords / four 8-bit bytes; the compare yields an all-ones mask per lane on greater-than), so the median selection network is bit-identical to the scalar reference.

Also, HIP leaves CUDA_VERSION undefined, which routed the WTA inter-lane smem handoff to the weaker __threadfence_block() fallback; on wave64 the warp lanes write smem_cost_sum and read each other's writes, so it now uses the __syncwarp() barrier (which HIP provides) on the HIP path.

Review order: src/cuda_to_hip.h, then src/constants.h (device per-arch WARP_SIZE) and src/host_utility.h (the runtime warpSize query the launchers use), then the launchers in src/cost_aggregation.cu and src/winner_takes_all.cu, then src/median_filter.cu (the intrinsic emulation), then the CMake wiring.

Test Plan:
Built multi-arch and run on a real AMD gfx90a (CDNA, wave64), ROCm 7.2.1:

git submodule update --init
cmake -S . -B build-hip -DUSE_HIP=ON \
  -DCMAKE_HIP_ARCHITECTURES="gfx90a;gfx1100" \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
  -DENABLE_TESTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-hip -j
roc-obj-ls build-hip/test/sgm-test   # emits both gfx90a and gfx1100 objects
HIP_VISIBLE_DEVICES=1 ./build-hip/test/sgm-test

The two-arch binary compiles clean and carries both gfx90a and gfx1100 code objects. On gfx90a the runtime query selects warpSize=64; AMD_LOG_LEVEL=3 confirms the native gfx90a code object loads (no JIT). The gtest suite computes a warp-size-agnostic CPU reference per kernel and asserts bit-exact equality. All 67 tests across 9 suites PASS, including the wave-sensitive CostAggregationTest (18 params: SGM_32U/64U x disp {64,128,256} x min_disp {0,+/-16}, 8-path), WinnerTakesAllTestP (12 params), MedianFilterTest (the emulated intrinsics), and IntegrationTest.RandomU8 (the full census -> aggregation -> WTA -> median -> consistency pipeline). The run is deterministic: two back-to-back runs are identical (67/67 both times). The gfx1100 run validates separately on RDNA3 hardware (gfx1100, wave32).

This adds an opt-in ROCm/HIP build (-DUSE_HIP=ON) alongside the existing CUDA path, which is left byte-identical. The work was authored with the assistance of Claude, an AI assistant by Anthropic. The mechanical part is a single compat header (src/cuda_to_hip.h): on a HIP build it includes <hip/hip_runtime.h> and aliases the small CUDA runtime surface the host code uses (malloc/free/memcpy/memset, device-attribute query, the eight aggregation streams, the error type/string); on a CUDA build it is a thin passthrough to <cuda_runtime.h>. The .cu and the two device-image .cpp now include it instead of <cuda_runtime.h>. CMake gains a USE_HIP option that enables the HIP language, marks the .cu as LANGUAGE HIP, and gates find_package(CUDAToolkit) off. The HIP architecture is auto-detected by enable_language(HIP) from the build machine (gfx90a fallback; -DCMAKE_HIP_ARCHITECTURES overrides), and the README documents the ROCm build. Two substantive correctness items, both validated bit-exactly by the shipped gtest suite: 1. Wavefront width, correct for a multi-arch build. The design overloads one WARP_SIZE constant for shuffle-subgroup partitioning in the eight path-aggregation kernels, the winner-take-all per-lane layout (REDUCTION_PER_THREAD = MAX_DISPARITY / WARP_SIZE), and device block sizing (BLOCK_SIZE = WARP_SIZE * N). The DEVICE width is keyed per arch off the __GFX*__ macros (64 on wave64 GCN (gfx8/gfx9, e.g. gfx90a), 32 on RDNA and on CUDA), so each device slice of a gfx90a;gfx1100 build compiles with its own correct width. The HOST must not bake in a compile-time width: hipcc's host pass can target several arches at once and there is no single right answer. The previous approach injected one SGM_HOST_WARP_SIZE from a CMake arch scan, which a mixed gfx90a;gfx1100 build resolved to 64 -- the host then launched bdim=512 blocks against the gfx1100 device kernel built for 32-lane geometry, an out-of-range warp/lane mismatch on the gfx1100 slice. Now every host launcher derives its block/grid dims from a runtime warpSize query (host_utility.h device_warp_size(), via hipDeviceGetAttribute, cached per device), so host and device agree on every arch in the build. WARP_SIZE retains a host-pass value only so hipcc can parse the __global__ bodies into launch stubs; it never drives runtime launch geometry. 2. SIMD video intrinsics. median_filter.cu's vectorized 2x/4x median path uses __vcmpgtu2/4, __vminu2/4, __vmaxu2/4, which ROCm 7.x HIP does not provide. They are software-emulated under USE_HIP with the exact CUDA per-lane semantics (two 16-bit halfwords / four 8-bit bytes; the compare yields an all-ones mask per lane on greater-than), so the median selection network is bit-identical to the scalar reference. Also, HIP leaves CUDA_VERSION undefined, which routed the WTA inter-lane smem handoff to the weaker __threadfence_block() fallback; on wave64 the warp lanes write smem_cost_sum and read each other's writes, so it now uses the __syncwarp() barrier (which HIP provides) on the HIP path. Review order: src/cuda_to_hip.h, then src/constants.h (device per-arch WARP_SIZE) and src/host_utility.h (the runtime warpSize query the launchers use), then the launchers in src/cost_aggregation.cu and src/winner_takes_all.cu, then src/median_filter.cu (the intrinsic emulation), then the CMake wiring. Test Plan: Built multi-arch and run on a real AMD gfx90a (CDNA, wave64), ROCm 7.2.1: ``` git submodule update --init cmake -S . -B build-hip -DUSE_HIP=ON \ -DCMAKE_HIP_ARCHITECTURES="gfx90a;gfx1100" \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \ -DENABLE_TESTS=ON -DCMAKE_BUILD_TYPE=Release cmake --build build-hip -j roc-obj-ls build-hip/test/sgm-test # emits both gfx90a and gfx1100 objects HIP_VISIBLE_DEVICES=1 ./build-hip/test/sgm-test ``` The two-arch binary compiles clean and carries both gfx90a and gfx1100 code objects. On gfx90a the runtime query selects warpSize=64; AMD_LOG_LEVEL=3 confirms the native gfx90a code object loads (no JIT). The gtest suite computes a warp-size-agnostic CPU reference per kernel and asserts bit-exact equality. All 67 tests across 9 suites PASS, including the wave-sensitive CostAggregationTest (18 params: SGM_32U/64U x disp {64,128,256} x min_disp {0,+/-16}, 8-path), WinnerTakesAllTestP (12 params), MedianFilterTest (the emulated intrinsics), and IntegrationTest.RandomU8 (the full census -> aggregation -> WTA -> median -> consistency pipeline). The run is deterministic: two back-to-back runs are identical (67/67 both times). The gfx1100 run validates separately on RDNA3 hardware (gfx1100, wave32).

…ch wave32/64)

jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 4, 2026

libSGM: open upstream PR fixstars/libSGM#89 (HIP build path, multi-ar…

bc3d1e0

…ch wave32/64)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ROCm] Add a HIP build path for AMD GPUs (multi-arch wave32/64)#89

[ROCm] Add a HIP build path for AMD GPUs (multi-arch wave32/64)#89
jeffdaily wants to merge 1 commit into
fixstars:masterfrom
jeffdaily:moat-port

jeffdaily commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeffdaily commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant