[ROCm] Add a HIP build path for AMD GPUs (multi-arch wave32/64)#89
Open
jeffdaily wants to merge 1 commit into
Open
[ROCm] Add a HIP build path for AMD GPUs (multi-arch wave32/64)#89jeffdaily wants to merge 1 commit into
jeffdaily wants to merge 1 commit into
Conversation
This adds an opt-in ROCm/HIP build (-DUSE_HIP=ON) alongside the existing
CUDA path, which is left byte-identical. The work was authored with the
assistance of Claude, an AI assistant by Anthropic.
The mechanical part is a single compat header (src/cuda_to_hip.h): on a HIP
build it includes <hip/hip_runtime.h> and aliases the small CUDA runtime
surface the host code uses (malloc/free/memcpy/memset, device-attribute query,
the eight aggregation streams, the error type/string); on a CUDA build it is a
thin passthrough to <cuda_runtime.h>. The .cu and the two device-image .cpp now
include it instead of <cuda_runtime.h>. CMake gains a USE_HIP option that
enables the HIP language, marks the .cu as LANGUAGE HIP, and gates
find_package(CUDAToolkit) off. The HIP architecture is auto-detected by
enable_language(HIP) from the build machine (gfx90a fallback;
-DCMAKE_HIP_ARCHITECTURES overrides), and the README documents the ROCm build.
Two substantive correctness items, both validated bit-exactly by the shipped
gtest suite:
1. Wavefront width, correct for a multi-arch build. The design overloads one
WARP_SIZE constant for shuffle-subgroup partitioning in the eight
path-aggregation kernels, the winner-take-all per-lane layout
(REDUCTION_PER_THREAD = MAX_DISPARITY / WARP_SIZE), and device block sizing
(BLOCK_SIZE = WARP_SIZE * N). The DEVICE width is keyed per arch off the
__GFX*__ macros (64 on wave64 GCN (gfx8/gfx9, e.g. gfx90a), 32 on RDNA and on CUDA), so each
device slice of a gfx90a;gfx1100 build compiles with its own correct width.
The HOST must not bake in a compile-time width: hipcc's host pass can target
several arches at once and there is no single right answer. The previous
approach injected one SGM_HOST_WARP_SIZE from a CMake arch scan, which a
mixed gfx90a;gfx1100 build resolved to 64 -- the host then launched bdim=512
blocks against the gfx1100 device kernel built for 32-lane geometry, an
out-of-range warp/lane mismatch on the gfx1100 slice. Now every host launcher
derives its block/grid dims from a runtime warpSize query
(host_utility.h device_warp_size(), via hipDeviceGetAttribute, cached per
device), so host and device agree on every arch in the build. WARP_SIZE
retains a host-pass value only so hipcc can parse the __global__ bodies into
launch stubs; it never drives runtime launch geometry.
2. SIMD video intrinsics. median_filter.cu's vectorized 2x/4x median path uses
__vcmpgtu2/4, __vminu2/4, __vmaxu2/4, which ROCm 7.x HIP does not provide.
They are software-emulated under USE_HIP with the exact CUDA per-lane
semantics (two 16-bit halfwords / four 8-bit bytes; the compare yields an
all-ones mask per lane on greater-than), so the median selection network is
bit-identical to the scalar reference.
Also, HIP leaves CUDA_VERSION undefined, which routed the WTA inter-lane
smem handoff to the weaker __threadfence_block() fallback; on wave64 the warp
lanes write smem_cost_sum and read each other's writes, so it now uses the
__syncwarp() barrier (which HIP provides) on the HIP path.
Review order: src/cuda_to_hip.h, then src/constants.h (device per-arch
WARP_SIZE) and src/host_utility.h (the runtime warpSize query the launchers
use), then the launchers in src/cost_aggregation.cu and src/winner_takes_all.cu,
then src/median_filter.cu (the intrinsic emulation), then the CMake wiring.
Test Plan:
Built multi-arch and run on a real AMD gfx90a (CDNA, wave64), ROCm 7.2.1:
```
git submodule update --init
cmake -S . -B build-hip -DUSE_HIP=ON \
-DCMAKE_HIP_ARCHITECTURES="gfx90a;gfx1100" \
-DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
-DENABLE_TESTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-hip -j
roc-obj-ls build-hip/test/sgm-test # emits both gfx90a and gfx1100 objects
HIP_VISIBLE_DEVICES=1 ./build-hip/test/sgm-test
```
The two-arch binary compiles clean and carries both gfx90a and gfx1100 code
objects. On gfx90a the runtime query selects warpSize=64; AMD_LOG_LEVEL=3
confirms the native gfx90a code object loads (no JIT). The gtest suite computes
a warp-size-agnostic CPU reference per kernel and asserts bit-exact equality.
All 67 tests across 9 suites PASS, including the wave-sensitive
CostAggregationTest (18 params: SGM_32U/64U x disp {64,128,256} x min_disp
{0,+/-16}, 8-path), WinnerTakesAllTestP (12 params), MedianFilterTest (the
emulated intrinsics), and IntegrationTest.RandomU8 (the full census ->
aggregation -> WTA -> median -> consistency pipeline). The run is
deterministic: two back-to-back runs are identical (67/67 both times). The
gfx1100 run validates separately on RDNA3 hardware (gfx1100, wave32).
jeffdaily
added a commit
to jeffdaily/moat
that referenced
this pull request
Jun 4, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This adds an opt-in ROCm/HIP build (
-DUSE_HIP=ON) alongside the existing CUDA path, which is left byte-identical. The work was authored with the assistance of Claude, an AI assistant by Anthropic.The mechanical part is a single compat header (
src/cuda_to_hip.h): on a HIP build it includes<hip/hip_runtime.h>and aliases the small CUDA runtime surface the host code uses (malloc/free/memcpy/memset, device-attribute query, the eight aggregation streams, the error type/string); on a CUDA build it is a thin passthrough to<cuda_runtime.h>. The.cuand the two device-image.cppnow include it instead of<cuda_runtime.h>. CMake gains aUSE_HIPoption that enables the HIP language, marks the.cuasLANGUAGE HIP, and gatesfind_package(CUDAToolkit)off. The HIP architecture is auto-detected byenable_language(HIP)from the build machine (gfx90a fallback;-DCMAKE_HIP_ARCHITECTURESoverrides), and the README documents the ROCm build.Two substantive correctness items, both validated bit-exactly by the shipped gtest suite:
Wavefront width, correct for a multi-arch build. The design overloads one
WARP_SIZEconstant for shuffle-subgroup partitioning in the eight path-aggregation kernels, the winner-take-all per-lane layout (REDUCTION_PER_THREAD = MAX_DISPARITY / WARP_SIZE), and device block sizing (BLOCK_SIZE = WARP_SIZE * N). The DEVICE width is keyed per arch off the__GFX*__macros (64 on wave64 GCN (gfx8/gfx9, e.g. gfx90a), 32 on RDNA and on CUDA), so each device slice of agfx90a;gfx1100build compiles with its own correct width. The HOST must not bake in a compile-time width: hipcc's host pass can target several arches at once and there is no single right answer. The previous approach injected oneSGM_HOST_WARP_SIZEfrom a CMake arch scan, which a mixedgfx90a;gfx1100build resolved to 64 -- the host then launchedbdim=512blocks against the gfx1100 device kernel built for 32-lane geometry, an out-of-range warp/lane mismatch on the gfx1100 slice. Now every host launcher derives its block/grid dims from a runtimewarpSizequery (host_utility.hdevice_warp_size(), viahipDeviceGetAttribute, cached per device), so host and device agree on every arch in the build.WARP_SIZEretains a host-pass value only so hipcc can parse the__global__bodies into launch stubs; it never drives runtime launch geometry.SIMD video intrinsics.
median_filter.cu's vectorized 2x/4x median path uses__vcmpgtu2/4,__vminu2/4,__vmaxu2/4, which ROCm 7.x HIP does not provide. They are software-emulated underUSE_HIPwith the exact CUDA per-lane semantics (two 16-bit halfwords / four 8-bit bytes; the compare yields an all-ones mask per lane on greater-than), so the median selection network is bit-identical to the scalar reference.Also, HIP leaves
CUDA_VERSIONundefined, which routed the WTA inter-lane smem handoff to the weaker__threadfence_block()fallback; on wave64 the warp lanes writesmem_cost_sumand read each other's writes, so it now uses the__syncwarp()barrier (which HIP provides) on the HIP path.Review order:
src/cuda_to_hip.h, thensrc/constants.h(device per-archWARP_SIZE) andsrc/host_utility.h(the runtimewarpSizequery the launchers use), then the launchers insrc/cost_aggregation.cuandsrc/winner_takes_all.cu, thensrc/median_filter.cu(the intrinsic emulation), then the CMake wiring.Test Plan:
Built multi-arch and run on a real AMD gfx90a (CDNA, wave64), ROCm 7.2.1:
The two-arch binary compiles clean and carries both gfx90a and gfx1100 code objects. On gfx90a the runtime query selects warpSize=64; AMD_LOG_LEVEL=3 confirms the native gfx90a code object loads (no JIT). The gtest suite computes a warp-size-agnostic CPU reference per kernel and asserts bit-exact equality. All 67 tests across 9 suites PASS, including the wave-sensitive CostAggregationTest (18 params: SGM_32U/64U x disp {64,128,256} x min_disp {0,+/-16}, 8-path), WinnerTakesAllTestP (12 params), MedianFilterTest (the emulated intrinsics), and IntegrationTest.RandomU8 (the full census -> aggregation -> WTA -> median -> consistency pipeline). The run is deterministic: two back-to-back runs are identical (67/67 both times). The gfx1100 run validates separately on RDNA3 hardware (gfx1100, wave32).