Releases: Dao-AILab/flash-attention
fa4-v4.0.0.beta8
What's Changed
- fix noisy logger by @drisspg in #2414
- [AMD ROCm] Fix NaN in FMHA BWD when seq_q=0 by @rocking5566 in #2421
- Add FA4 CI: GitHub Actions workflow with Apptainer on B200 runner by @Johnsonms in #2393
- Fix some bugs of CI by @Johnsonms in #2423
- [ROCM] Fix windows issues by @micmelesse in #2385
- fix: add [cu13] extra to dev install instructions for CUDA 13 / B200 systems by @Johnsonms in #2430
- Fix: disable 2-CTA backward mode when block_sparse_tensors is used by @jduprat in #2433
- CI: extend FA4 test matrix with causal/non-causal correctness and fwd+bwd benchmark seqlen 1K-32K by @Johnsonms in #2428
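The cu13 entry above (#2430) concerns the dev install instructions. A hedged sketch of what that looks like, assuming an editable install from a repo checkout; the extra name [cu13] comes from the changelog, the rest of the command is an assumption and may differ on your system:

```shell
# Dev install with the CUDA 13 extra on a CUDA 13 / B200 box (illustrative;
# check the repo's install instructions for the authoritative command).
pip install -e ".[cu13]"
```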
Full Changelog: fa4-v4.0.0.beta7...fa4-v4.0.0.beta8
fa4-v4.0.0.beta7
What's Changed
- fix: use LSE accum strides from params instead of hardcoded ones by @ZeronSix in #2388
- [Sm75] Add README link for initial Turing support by @ssiu in #2379
- [Cute,Sm100,Bwd] refine bwd swizzle for deterministic by @jayhshah in #2390
- Fix edge case when tag has no delta from previous by @drisspg in #2394
- [AMD ROCm] Update CK and add RDNA 3/4 support by @rocking5566 in #2400
- [Ai-assisted] CLC work stealing by @drisspg in #2218
- Various bug fixes / enable subtile > 2 by @drisspg in #2411
- Add to varlen by @drisspg in #2346
- Allow compact block sparse index tensors by @jduprat in #2417
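Several entries above touch varlen (variable-length, "ragged") batching. Varlen entry points such as flash_attn_varlen_func describe a ragged batch with cumulative sequence lengths; a minimal plain-Python sketch of building that tensor's contents, with illustrative names:

```python
# Build cu_seqlens (prefix sums delimiting each sequence in a packed batch)
# from per-sequence lengths. Illustrative sketch, not the repo's code.

def make_cu_seqlens(seqlens):
    """[3, 5, 2] -> [0, 3, 8, 10]: sequence i occupies rows cu[i]:cu[i+1]."""
    cu = [0]
    for n in seqlens:
        cu.append(cu[-1] + n)
    return cu

cu = make_cu_seqlens([3, 5, 2])  # [0, 3, 8, 10]
```

In practice this would be an int32 tensor on device; the point is only the prefix-sum convention.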
Full Changelog: fa4-v4.0.0.beta5...fa4-v4.0.0.beta7
fa4-v4.0.0.beta6
What's Changed
- [Cute][Testing] Minor improvements on pytest-xdist workflow by @Alkaid-Benetnash in #2311
- Nicer headdim error message by @drisspg in #2227
- [Fwd,Sm100] Extract named barriers by @drisspg in #2309
- Change 2cta opt in to have min seqlen > 2*m_block_size by @drisspg in #2320
- [CuteDSL][SM90] varlen bwd works by @KareemMusleh in #2275
- Add Logging helper by @drisspg in #2327
- [CuTeDSL][Sm80] basic fix for new api by @zhuochenKIDD in #2297
- fix: duplicate softmax_scale param by @NanoCode012 in #2328
- Fix FA2 + FA4 co-existence by @drisspg in #2331
- [Cute,Sm100] Introduce a flexible lambda-based R2P masking by @Alkaid-Benetnash in #2313
- [Cute, SM90, bwd] Wire seqused_q/k through backward pass by @NJX-njx in #2315
- SM120 forward pass (Blackwell GeForce / DGX Spark) by @blake-snc in #2329
- [cutlass] Allow compilation of cutlass FA3 for sm100 via enable_sm90 by @henrylhtsang in #2332
- [Cute] fix: rename logging module to avoid circular import at building by @Luosuu in #2335
- BUG: SeqlenInfo.create has a tile parameter that defaults to 128 by @risan-raja in #2337
- [Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch by @MatthewBonanni in #2338
- [AMD] Migrate Triton Backend to Aiter by @micmelesse in #2230
- [Bwd,Sm120] Add SM120 backward pass support by @blake-snc in #2330
- [Bwd, SM80] Fix tdKrdS typo by @henrylhtsang in #2341
- Add SM120 varlen attention support by @blake-snc in #2333
- fix the create_ragged_tensor_for_tma issue by @rainj-me in #2345
- [Sm90] Fix test_mask_mod and bwd block-sparse kwarg mismatch by @henrylhtsang in #2365
- [Cute, Testing] Fix aot + tvm-ffi EnvStream related parameter mismatch by @Alkaid-Benetnash in #2369
- [Cute, Testing] Bump cutedsl to 4.4.2 and remove prior aot cache management workarounds by @Alkaid-Benetnash in #2370
- [Cute] fix: FA4 paged attention kv load for DeepSeek (192,128) on SM100 by @Luosuu in #2368
- [AMD ROCm] Update ROCm/CK backend to align with latest ComposableKernel API changes by @rocking5566 in #2363
- [ROCm] Auto-detect Triton backend if C++ extension is missing by @Soddentrough in #2343
- [Fwd,Sm90] Add paged KV attention support (tma and cp.async) by @henrylhtsang in #2360
- [CuTe,Flex] limit vec_size to 2 for score mod when not on Sm100 by @reubenconducts in #2371
- Support 2CTA for sliding window hdim 192 by @Inodayy in #2347
- [Cute,Fwd,Sm100] support irregular qhead / kvhead ratios by @timmy-feng in #2186
- benchmarks: add MFU% column to benchmark output by @Johnsonms in #2377
- Update flow to enable beta weekly releases by @drisspg in #2378
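The MFU benchmark entry above (#2377) reports achieved versus peak throughput. A back-of-envelope sketch of the underlying arithmetic, assuming the common convention of 4 * seqlen_q * seqlen_k * headdim FLOPs per head for the forward pass (two matmuls, 2 FLOPs per multiply-accumulate), halved for causal; this is illustrative, not the repo's benchmark code:

```python
# Estimate attention forward FLOPs and Model FLOPs Utilization (MFU).

def attn_fwd_flops(batch, nheads, seqlen_q, seqlen_k, headdim, causal=False):
    flops = 4 * batch * nheads * seqlen_q * seqlen_k * headdim
    return flops // 2 if causal else flops  # causal masks ~half the tiles

def mfu_percent(achieved_tflops, peak_tflops):
    """MFU: achieved throughput over hardware peak, as a percentage."""
    return 100.0 * achieved_tflops / peak_tflops

flops = attn_fwd_flops(batch=1, nheads=16, seqlen_q=4096, seqlen_k=4096, headdim=128)
# A kernel sustaining 700 TFLOP/s on hardware with a 1000 TFLOP/s peak:
print(f"{flops / 1e9:.1f} GFLOPs, MFU {mfu_percent(700, 1000):.0f}%")
```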
New Contributors
- @NJX-njx made their first contribution in #2315
- @blake-snc made their first contribution in #2329
- @Luosuu made their first contribution in #2335
- @risan-raja made their first contribution in #2337
- @MatthewBonanni made their first contribution in #2338
- @rainj-me made their first contribution in #2345
- @Soddentrough made their first contribution in #2343
- @Inodayy made their first contribution in #2347
- @Johnsonms made their first contribution in #2377
Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta6
fa4-v4.0.0.beta5
Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta5
fa4-v4.0.0.beta4
Full Changelog: fa4-v4.0.0.beta2...fa4-v4.0.0.beta4
fa4-v4.0.0.beta2
Full Changelog: fa4-v4.0.0.beta1...fa4-v4.0.0.beta2
fa4-v4.0.0.beta1
Full Changelog: fa4-v4.0.0.beta0...fa4-v4.0.0.beta1
fa4-v4.0.0.beta0
What's Changed
- [BugFix] Fix flash_attn_with_kvcache with scalar cache_seqlen by @stepinto in #1795
- Add sorting and head swizzle to varlen scheduler by @jayhshah in #1823
- Fixes incorrect variable reference in comment by @LoserCheems in #1775
- Update the initialization of dk/dv_semaphore by @y-sq in #1839
- FA3 tensor size parameter fix for long context len (seqlen >=4M) by @ghadiaravi13 in #1841
- ci: Move build job to workflow template by @ko3n1g in #1835
- ci: Build via workflow template by @ko3n1g in #1844
- ci: Switch to workflow_dispatch by @ko3n1g in #1847
- [FA3] Allow returning LSE via kwarg by @vasqu in #1851
- [BugFix] fix flash_fwd.FlashAttentionForwardSm80 bugs by @mingyangHao in #1856
- [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL by @reubenconducts in #1858
- [BUG] CUDA 13: make FA3 compatible with CUDA 13 Builds by @johnnynunez in #1860
- [BUILD] SBSA wheels + CUDA 13 Support by @johnnynunez in #1865
- benchmark: qualify all attention backends by methods list by @rajesh-s in #1881
- ABI stable fa3 by @mikaylagawarecki in #1791
- [NVIDIA] Enable Blackwell Family Specific by @johnnynunez in #1882
- Fix typo in flops calculation for local attention by @henrylhtsang in #1883
- flash-attn-cute bwd sm90 by @tzadouri in #1868
- [Cute] Make testing utils standalone for cute by @drisspg in #1892
- [Cute] Bump pin for CuTeDSL by @drisspg in #1891
- Improve causal backward determinism perf with SPT schedule by @jayhshah in #1893
- Upgrade to cutlass v4.2.1 by @johnnynunez in #1905
- Switch to use cutlass.utils.get_smem_capacity_in_bytes by @brandon-yujie-sun in #1906
- Add Missing None Gradient in FA3 QKVPacked by @JackCharlesZhang in #1908
- C++11 fix warnings by @johnnynunez in #1904
- [CuteDSL] Explicitly cast for Flash Combine by @drisspg in #1925
- Refactors to enable FlexAttention by @drisspg in #1840
- feat: Adding varlen support to cute-dsl sm80 bwd by @imbr92 in #1934
- Remove self refs in softmax for-loop by @kevin-tong-augment in #1924
- [AMD] Torch Compile Issues by @micmelesse in #1756
- [CUTE] Enable Pack GQA for score mods by @drisspg in #1937
- Add precommit list and then uncomment in chunks by @drisspg in #1941
- [ROCm] prepare CK sources for pytorch hipify v2 APIs by @jeffdaily in #1944
- Blackwell FlashAttention-BWD (v1.0) by @tzadouri in #1945
- Sm100 BWD (barrier) by @tzadouri in #1946
- Fix hopper cuda 13 build by @kevmo314 in #1949
- [CuteDSL] Fix hash function for cute.jit decorator by @drisspg in #1953
- Block Sparsity and Flex Attention mask mod support by @reubenconducts in #1942
- [NVIDIA] cutlass v4.3.0 by @johnnynunez in #1952
- [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs by @reubenconducts in #1961
- Fix FA3 segfault with custom CUDA streams in ABI stable build by @kevmo314 in #1957
- [Cute] Blocks tweaks by @drisspg in #1964
- BlockSparse Tweaks by @drisspg in #1970
- [Cute] Fix main by @drisspg in #1982
- [Cute,Fwd,Sm100] Implement SplitKV by @timmy-feng in #1940
- [Cute] Extract block-sparse utilities from SM80/90 by @drisspg in #1984
- Enable python-3.10+ by @drisspg in #1998
- [Cute, Bwd, Sm100] Add GQA support by @jayhshah in #2004
- [Cute,Fwd,Sm100] fix major regression with split kv by @jayhshah in #2006
- [CuTe DSL] Block sparsity computation kernel by @reubenconducts in #1983
- [NVIDIA] bump github actions by @johnnynunez in #1996
- [Cute,Fwd,Sm100] Support paged attention by @timmy-feng in #1999
- [Cute] Block sparse support Sm100 by @drisspg in #1985
- [Cute,Sm100,Fwd] use correction warps for epi when not using TMA by @jayhshah in #2014
- add fastdivmod for oob reads in mask_mods by @drisspg in #2020
- [Cute,Fwd,Sm100] don't pass mask_fn to softmax_step generically by @jayhshah in #2026
- [CuTeDSL] Swap order of decorators by @anakinxc in #2029
- [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions by @jayhshah in #2033
- [NFC] Trivial fix to silence linter by @jduprat in #1928
- Add LICENSE and AUTHORS to flash_attn/cute by @jduprat in #2032
- [Cute,Fwd] enable mask mod without blocksparsity by @reubenconducts in #2031
- Bump pin by @drisspg in #2025
- ruff all the smaller files by @drisspg in #2040
- [Cute] Fix head dim 64 bwd by @drisspg in #2035
- Add headdim64 tests to race condition by @drisspg in #2041
- Add torch.compile support to flash attention 3 by @guilhermeleobas in #1769
- [Cute,Bwd,Sm100] Add local for sm100 bwd by @jayhshah in #2046
- Add hash attr to shortcut expensive check by @drisspg in #2048
- [AMD ROCm] Update to latest composable_kernel to improve performance by @rocking5566 in #2052
- fixing cute bwd func def by @liangel-02 in #2056
- Fix use-after-free in FA3 deterministic mode. by @skarupke in #2063
- [CUTE] Allow grads to be preallocated by @drisspg in #2065
- [Cute,Fwd] Extend score_mod to variable sequence length by @reubenconducts in #2043
- [CUTE] Enabling TVM-FFI to reduce cpu overhead by @drisspg in #2042
- Fix softcap scoremod kwargs typo. by @LeoZDong in #2072
- Add score-mod bwd support by @drisspg in #2070
- Add blocksparse support for bwd on blackwell by @drisspg in #2085
- Fix IMA in fwd on m boundary by @drisspg in #2091
- cutedsl 4.3.4 by @drisspg in #2092
- README for AMD ROCm by @seungrokj in #2068
- [Cute] Fix shuffle sync and enable pack gqa for varlen sm100 by @jayhshah in #2097
- [NVIDIA] Enable Jetson Thor FA4 by @johnnynunez in #2108
- Add pack-gqa fwd support for sparse impl w/ broadcasted H dim by @drisspg in #2098
- [Cute,Fwd] improved block sparsity by @reubenconducts in #2100
- Misc tests that should be xfailed for now by @drisspg in #2127
- Update CUTLASS to fix undefined symbol : cuDriverGetVersion by @HydraQYH in #2142
- [Cute,Fwd,Sm100] Support q_stage=1 for inference by @Timmy...
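Several entries in this release add block-sparse mask support (#1942, #1976-era block sparsity work, #1983, #1985). The core idea is to keep, per query tile, only the key tiles that contain at least one unmasked element. A minimal sketch for a plain causal mask; all names here are illustrative, not the repo's API:

```python
# For each row tile of queries, list the column tiles of keys that the
# causal mask leaves at least partially visible. Illustrative sketch only.

def causal_active_kv_blocks(seqlen, block_m, block_n):
    """Return {q_block: [k_blocks with any visible element]} for a causal mask."""
    active = {}
    n_q = (seqlen + block_m - 1) // block_m
    for qb in range(n_q):
        q_max = min((qb + 1) * block_m, seqlen) - 1  # last query row in tile
        # causal: key k is visible iff k <= q, so every key tile whose first
        # index is <= q_max contains at least one visible element
        last_kb = q_max // block_n
        active[qb] = list(range(last_kb + 1))
    return active

blocks = causal_active_kv_blocks(seqlen=512, block_m=128, block_n=128)
```

A real kernel would store these indices as compact tensors and skip the pruned tiles entirely, which is where the speedup comes from.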
v2.8.3
Bump to v2.8.3
v2.8.2
Bump to v2.8.2