Commit de873d1

Fix race condition in mxfp8 CUDA kernels (#4278)
A race condition was present in two kernels due to incorrect ordering between a `__syncthreads()` and an async-proxy fence. The fence is needed to make the calling thread's shared-memory writes visible to the async proxy. However, the operation we are synchronizing with is the TMA store issued by thread 0, so we must establish a causality link between _all_ the fences performed by _all_ threads and thread 0's issuing of the TMA store. Thus the `__syncthreads()` must be inserted between the fences and the TMA issue, not before the fence. The CUDA programming guide is very explicit about this, in [section 10.29.1. Using TMA to transfer one-dimensional arrays](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-one-dimensional-arrays).
1 parent 00ef369 commit de873d1

1 file changed: torchao/csrc/cuda/mx_kernels/mx_block_rearrange_2d_M_groups.cu (2 additions, 3 deletions)
```diff
@@ -316,10 +316,9 @@ __global__ void mx_blocked_layout_2d_M_groups_kernel(
       ) = data;
     }
   }
-  __syncthreads();
-
   // Fence to ensure SMEM writes are visible to TMA async proxy
   ptx::fence_proxy_async_shared_cta();
+  __syncthreads();

   // Compute output tile coordinates
   const int chunk_sf_row_tile = (superblock_idx_in_group * CHUNKS_PER_TB) + chunk_idx;
@@ -785,8 +784,8 @@ __global__ void mx_blocked_layout_2d_simple_kernel(
   }

   // Ensure threads finish their smem writes and use explicit fence to ensure visibility to async proxy for TMA
-  __syncthreads();
   ptx::fence_proxy_async_shared_cta();
+  __syncthreads();

   if (is_master_thread) {
     // Issue separate 1D TMA stores for each valid SF tile
```
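The ordering the fix enforces can be sketched as a minimal standalone kernel. This is an illustration only: the kernel name and shared-memory buffer below are hypothetical, and the raw PTX `fence.proxy.async.shared::cta` instruction stands in for the repo's `ptx::fence_proxy_async_shared_cta()` wrapper.

```cuda
// Hypothetical sketch of the fence-then-sync ordering; not the actual kernel.
__global__ void tma_store_ordering_sketch() {
  __shared__ alignas(128) float smem_tile[256];

  // 1. Every thread writes its portion of the tile into shared memory.
  smem_tile[threadIdx.x] = static_cast<float>(threadIdx.x);

  // 2. Each thread fences so that its own generic-proxy smem writes become
  //    visible to the async proxy (the unit that performs the TMA copy).
  //    The fence only covers the *calling* thread's writes.
  asm volatile("fence.proxy.async.shared::cta;" ::: "memory");

  // 3. Synchronize AFTER the fence: __syncthreads() orders thread 0 after
  //    every thread's fence, completing the causality chain
  //    writes -> per-thread fences -> barrier -> TMA issue.
  __syncthreads();

  if (threadIdx.x == 0) {
    // 4. Thread 0 alone issues the bulk async (TMA) store from shared
    //    memory to global memory, e.g. via cp.async.bulk (elided here).
  }
}
```

Placing `__syncthreads()` before the fence (as the old code did) only guarantees that the writes happened, not that every thread's fence is ordered before thread 0's TMA issue, which is exactly the race the commit removes.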
