Skip to content

[ROCm] Add a HIP build path for AMD GPUs (multi-arch wave32/64)#89

Open
jeffdaily wants to merge 1 commit into
fixstars:masterfrom
jeffdaily:moat-port
Open

[ROCm] Add a HIP build path for AMD GPUs (multi-arch wave32/64)#89
jeffdaily wants to merge 1 commit into
fixstars:masterfrom
jeffdaily:moat-port

Conversation

@jeffdaily

Copy link
Copy Markdown

This adds an opt-in ROCm/HIP build (-DUSE_HIP=ON) alongside the existing CUDA path, which is left byte-identical. The work was authored with the assistance of Claude, an AI assistant by Anthropic.

The mechanical part is a single compat header (src/cuda_to_hip.h): on a HIP build it includes <hip/hip_runtime.h> and aliases the small CUDA runtime surface the host code uses (malloc/free/memcpy/memset, device-attribute query, the eight aggregation streams, the error type/string); on a CUDA build it is a thin passthrough to <cuda_runtime.h>. The .cu and the two device-image .cpp now include it instead of <cuda_runtime.h>. CMake gains a USE_HIP option that enables the HIP language, marks the .cu as LANGUAGE HIP, and gates find_package(CUDAToolkit) off. The HIP architecture is auto-detected by enable_language(HIP) from the build machine (gfx90a fallback; -DCMAKE_HIP_ARCHITECTURES overrides), and the README documents the ROCm build.

Two substantive correctness items, both validated bit-exactly by the shipped gtest suite:

  1. Wavefront width, correct for a multi-arch build. The design overloads one WARP_SIZE constant for shuffle-subgroup partitioning in the eight path-aggregation kernels, the winner-take-all per-lane layout (REDUCTION_PER_THREAD = MAX_DISPARITY / WARP_SIZE), and device block sizing (BLOCK_SIZE = WARP_SIZE * N). The DEVICE width is keyed per arch off the __GFX*__ macros (64 on wave64 GCN (gfx8/gfx9, e.g. gfx90a), 32 on RDNA and on CUDA), so each device slice of a gfx90a;gfx1100 build compiles with its own correct width. The HOST must not bake in a compile-time width: hipcc's host pass can target several arches at once and there is no single right answer. The previous approach injected one SGM_HOST_WARP_SIZE from a CMake arch scan, which a mixed gfx90a;gfx1100 build resolved to 64 -- the host then launched bdim=512 blocks against the gfx1100 device kernel built for 32-lane geometry, an out-of-range warp/lane mismatch on the gfx1100 slice. Now every host launcher derives its block/grid dims from a runtime warpSize query (host_utility.h device_warp_size(), via hipDeviceGetAttribute, cached per device), so host and device agree on every arch in the build. WARP_SIZE retains a host-pass value only so hipcc can parse the __global__ bodies into launch stubs; it never drives runtime launch geometry.

  2. SIMD video intrinsics. median_filter.cu's vectorized 2x/4x median path uses __vcmpgtu2/4, __vminu2/4, __vmaxu2/4, which ROCm 7.x HIP does not provide. They are software-emulated under USE_HIP with the exact CUDA per-lane semantics (two 16-bit halfwords / four 8-bit bytes; the compare yields an all-ones mask per lane on greater-than), so the median selection network is bit-identical to the scalar reference.

Also, HIP leaves CUDA_VERSION undefined, which routed the WTA inter-lane smem handoff to the weaker __threadfence_block() fallback; on wave64 the warp lanes write smem_cost_sum and read each other's writes, so it now uses the __syncwarp() barrier (which HIP provides) on the HIP path.

Review order: src/cuda_to_hip.h, then src/constants.h (device per-arch WARP_SIZE) and src/host_utility.h (the runtime warpSize query the launchers use), then the launchers in src/cost_aggregation.cu and src/winner_takes_all.cu, then src/median_filter.cu (the intrinsic emulation), then the CMake wiring.

Test Plan:
Built multi-arch and run on a real AMD gfx90a (CDNA, wave64), ROCm 7.2.1:

git submodule update --init
cmake -S . -B build-hip -DUSE_HIP=ON \
  -DCMAKE_HIP_ARCHITECTURES="gfx90a;gfx1100" \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
  -DENABLE_TESTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-hip -j
roc-obj-ls build-hip/test/sgm-test   # emits both gfx90a and gfx1100 objects
HIP_VISIBLE_DEVICES=1 ./build-hip/test/sgm-test

The two-arch binary compiles clean and carries both gfx90a and gfx1100 code objects. On gfx90a the runtime query selects warpSize=64; AMD_LOG_LEVEL=3 confirms the native gfx90a code object loads (no JIT). The gtest suite computes a warp-size-agnostic CPU reference per kernel and asserts bit-exact equality. All 67 tests across 9 suites PASS, including the wave-sensitive CostAggregationTest (18 params: SGM_32U/64U x disp {64,128,256} x min_disp {0,+/-16}, 8-path), WinnerTakesAllTestP (12 params), MedianFilterTest (the emulated intrinsics), and IntegrationTest.RandomU8 (the full census -> aggregation -> WTA -> median -> consistency pipeline). The run is deterministic: two back-to-back runs are identical (67/67 both times). The gfx1100 run validates separately on RDNA3 hardware (gfx1100, wave32).

This adds an opt-in ROCm/HIP build (-DUSE_HIP=ON) alongside the existing
CUDA path, which is left byte-identical. The work was authored with the
assistance of Claude, an AI assistant by Anthropic.

The mechanical part is a single compat header (src/cuda_to_hip.h): on a HIP
build it includes <hip/hip_runtime.h> and aliases the small CUDA runtime
surface the host code uses (malloc/free/memcpy/memset, device-attribute query,
the eight aggregation streams, the error type/string); on a CUDA build it is a
thin passthrough to <cuda_runtime.h>. The .cu and the two device-image .cpp now
include it instead of <cuda_runtime.h>. CMake gains a USE_HIP option that
enables the HIP language, marks the .cu as LANGUAGE HIP, and gates
find_package(CUDAToolkit) off. The HIP architecture is auto-detected by
enable_language(HIP) from the build machine (gfx90a fallback;
-DCMAKE_HIP_ARCHITECTURES overrides), and the README documents the ROCm build.

Two substantive correctness items, both validated bit-exactly by the shipped
gtest suite:

1. Wavefront width, correct for a multi-arch build. The design overloads one
   WARP_SIZE constant for shuffle-subgroup partitioning in the eight
   path-aggregation kernels, the winner-take-all per-lane layout
   (REDUCTION_PER_THREAD = MAX_DISPARITY / WARP_SIZE), and device block sizing
   (BLOCK_SIZE = WARP_SIZE * N). The DEVICE width is keyed per arch off the
   __GFX*__ macros (64 on wave64 GCN (gfx8/gfx9, e.g. gfx90a), 32 on RDNA and on CUDA), so each
   device slice of a gfx90a;gfx1100 build compiles with its own correct width.
   The HOST must not bake in a compile-time width: hipcc's host pass can target
   several arches at once and there is no single right answer. The previous
   approach injected one SGM_HOST_WARP_SIZE from a CMake arch scan, which a
   mixed gfx90a;gfx1100 build resolved to 64 -- the host then launched bdim=512
   blocks against the gfx1100 device kernel built for 32-lane geometry, an
   out-of-range warp/lane mismatch on the gfx1100 slice. Now every host launcher
   derives its block/grid dims from a runtime warpSize query
   (host_utility.h device_warp_size(), via hipDeviceGetAttribute, cached per
   device), so host and device agree on every arch in the build. WARP_SIZE
   retains a host-pass value only so hipcc can parse the __global__ bodies into
   launch stubs; it never drives runtime launch geometry.

2. SIMD video intrinsics. median_filter.cu's vectorized 2x/4x median path uses
   __vcmpgtu2/4, __vminu2/4, __vmaxu2/4, which ROCm 7.x HIP does not provide.
   They are software-emulated under USE_HIP with the exact CUDA per-lane
   semantics (two 16-bit halfwords / four 8-bit bytes; the compare yields an
   all-ones mask per lane on greater-than), so the median selection network is
   bit-identical to the scalar reference.

Also, HIP leaves CUDA_VERSION undefined, which routed the WTA inter-lane
smem handoff to the weaker __threadfence_block() fallback; on wave64 the warp
lanes write smem_cost_sum and read each other's writes, so it now uses the
__syncwarp() barrier (which HIP provides) on the HIP path.

Review order: src/cuda_to_hip.h, then src/constants.h (device per-arch
WARP_SIZE) and src/host_utility.h (the runtime warpSize query the launchers
use), then the launchers in src/cost_aggregation.cu and src/winner_takes_all.cu,
then src/median_filter.cu (the intrinsic emulation), then the CMake wiring.

Test Plan:
Built multi-arch and run on a real AMD gfx90a (CDNA, wave64), ROCm 7.2.1:

```
git submodule update --init
cmake -S . -B build-hip -DUSE_HIP=ON \
  -DCMAKE_HIP_ARCHITECTURES="gfx90a;gfx1100" \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
  -DENABLE_TESTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-hip -j
roc-obj-ls build-hip/test/sgm-test   # emits both gfx90a and gfx1100 objects
HIP_VISIBLE_DEVICES=1 ./build-hip/test/sgm-test
```

The two-arch binary compiles clean and carries both gfx90a and gfx1100 code
objects. On gfx90a the runtime query selects warpSize=64; AMD_LOG_LEVEL=3
confirms the native gfx90a code object loads (no JIT). The gtest suite computes
a warp-size-agnostic CPU reference per kernel and asserts bit-exact equality.
All 67 tests across 9 suites PASS, including the wave-sensitive
CostAggregationTest (18 params: SGM_32U/64U x disp {64,128,256} x min_disp
{0,+/-16}, 8-path), WinnerTakesAllTestP (12 params), MedianFilterTest (the
emulated intrinsics), and IntegrationTest.RandomU8 (the full census ->
aggregation -> WTA -> median -> consistency pipeline). The run is
deterministic: two back-to-back runs are identical (67/67 both times). The
gfx1100 run validates separately on RDNA3 hardware (gfx1100, wave32).
jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant