Commit de873d1
Fix race condition in mxfp8 CUDA kernels (#4278)
A race condition was present in two kernels due to an incorrect ordering between a syncthreads and an async-proxy fence. The fence is needed because it makes the calling thread's writes to shmem visible in the async proxy. However, the operation we're synchronizing with is the TMA write issued by thread 0, hence we need to establish a causality link between _all_ the fences performed by _all_ threads and the issuing of the TMA write by thread 0. Thus the syncthreads must be inserted between these two operations: after the fence and before the TMA write.
The CUDA programming guide is very explicit about this, in [section 10.29.1. Using TMA to transfer one-dimensional arrays](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-one-dimensional-arrays).

1 parent 00ef369 commit de873d1
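The ordering described in the commit message can be sketched as follows. This is a minimal illustration modeled on the pattern in the CUDA programming guide section cited above, not the mxfp8 kernels touched by this commit. The kernel name `store_tile_via_tma`, the tile size `kTileElems`, and the values written are hypothetical. It assumes a Hopper-class GPU (compiled with `-arch=sm_90` or newer) and a CUDA toolkit that ships the `<cuda/ptx>` header.

```cuda
#include <cuda/ptx>

namespace ptx = cuda::ptx;

// Hypothetical tile size: 1024 floats = 4 KiB, a multiple of 16 bytes as bulk copies require.
constexpr int kTileElems = 1024;

// Assumption: `out` is 16-byte-aligned global memory holding
// gridDim.x * kTileElems floats.
__global__ void store_tile_via_tma(float* out, float scale) {
  __shared__ alignas(16) float tile[kTileElems];

  // 1. Every thread writes its share of the tile through the generic proxy.
  for (int i = threadIdx.x; i < kTileElems; i += blockDim.x) {
    tile[i] = scale * static_cast<float>(i);
  }

  // 2. Each thread makes its *own* shared-memory writes visible to the
  //    async proxy. The fence says nothing about other threads' writes.
  ptx::fence_proxy_async(ptx::space_shared);

  // 3. The barrier comes after the fence, so every thread's fence is
  //    ordered before anything thread 0 does next.
  __syncthreads();

  // 4. Only now does thread 0 issue the bulk (TMA) store from shared to
  //    global memory.
  if (threadIdx.x == 0) {
    ptx::cp_async_bulk(ptx::space_global, ptx::space_shared,
                       out + blockIdx.x * kTileElems, tile, sizeof(tile));
    // Group the copy and wait until it has finished reading shared memory,
    // so the tile is not released while the copy is still in flight.
    ptx::cp_async_bulk_commit_group();
    ptx::cp_async_bulk_wait_group_read(ptx::n32_t<0>());
  }
}
```

Swapping steps 2 and 3 reproduces the kind of bug described above: the barrier would order the plain writes, but thread 0 could issue the store before the other threads' fences had run, so their writes might not yet be visible to the async proxy.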
1 file changed
Lines changed: 2 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| 316 | 316 | | |
| 317 | 317 | | |
| 318 | 318 | | |
| 319 | | - | |
| 320 | | - | |
| 321 | 319 | | |
| 322 | 320 | | |
| | 321 | + | |
| 323 | 322 | | |
| 324 | 323 | | |
| 325 | 324 | | |
| | | | |
| 785 | 784 | | |
| 786 | 785 | | |
| 787 | 786 | | |
| 788 | | - | |
| 789 | 787 | | |
| | 788 | + | |
| 790 | 789 | | |
| 791 | 790 | | |
| 792 | 791 | | |