Releases: Dao-AILab/flash-attention
fa4-v4.0.0.beta8
What's Changed
- fix noisy logger by @drisspg in #2414
- [AMD ROCm] Fix NaN in FMHA BWD when seq_q=0 by @rocking5566 in #2421
- Add FA4 CI: GitHub Actions workflow with Apptainer on B200 runner by @Johnsonms in #2393
- Fix some bugs of CI by @Johnsonms in #2423
- [ROCM] Fix windows issues by @micmelesse in #2385
- fix: add [cu13] extra to dev install instructions for CUDA 13 / B200 systems by @Johnsonms in #2430
- Fix: disable 2-CTA backward mode when block_sparse_tensors is used by @jduprat in #2433
- CI: extend FA4 test matrix with causal/non-causal correctness and fwd+bwd benchmark seqlen 1K-32K by @Johnsonms in #2428
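The cu13 entry above (#2430) concerns the dev install instructions. A hedged sketch of what that looks like, assuming an editable install from a repo checkout; the extra name [cu13] comes from the changelog, the rest of the command is an assumption and may differ on your system:

```shell
# Dev install with the CUDA 13 extra on a CUDA 13 / B200 box (illustrative;
# check the repo's install instructions for the authoritative command).
pip install -e ".[cu13]"
```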
Full Changelog: fa4-v4.0.0.beta7...fa4-v4.0.0.beta8
fa4-v4.0.0.beta7
What's Changed
- fix: use LSE accum strides from params instead of hardcoded ones by @ZeronSix in #2388
- [Sm75] Add README link for initial Turing support by @ssiu in #2379
- [Cute,Sm100,Bwd] refine bwd swizzle for deterministic by @jayhshah in #2390
- Fix edge case when tag has no delta from previous by @drisspg in #2394
- [AMD ROCm] Update CK and add RDNA 3/4 support by @rocking5566 in #2400
- [Ai-assisted] CLC work stealing by @drisspg in #2218
- Various bug fixes / enable subtile > 2 by @drisspg in #2411
- Add to varlen by @drisspg in #2346
- Allow compact block sparse index tensors by @jduprat in #2417
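Several entries above touch varlen (variable-length, "ragged") batching. Varlen entry points such as flash_attn_varlen_func describe a ragged batch with cumulative sequence lengths; a minimal plain-Python sketch of building that tensor's contents, with illustrative names:

```python
# Build cu_seqlens (prefix sums delimiting each sequence in a packed batch)
# from per-sequence lengths. Illustrative sketch, not the repo's code.

def make_cu_seqlens(seqlens):
    """[3, 5, 2] -> [0, 3, 8, 10]: sequence i occupies rows cu[i]:cu[i+1]."""
    cu = [0]
    for n in seqlens:
        cu.append(cu[-1] + n)
    return cu

cu = make_cu_seqlens([3, 5, 2])  # [0, 3, 8, 10]
```

In practice this would be an int32 tensor on device; the point is only the prefix-sum convention.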
Full Changelog: fa4-v4.0.0.beta5...fa4-v4.0.0.beta7
fa4-v4.0.0.beta6
What's Changed
- [Cute][Testing] Minor improvements on pytest-xdist workflow by @Alkaid-Benetnash in #2311
- Nicer headdim error message by @drisspg in #2227
- [Fwd,Sm100] Extract named barriers by @drisspg in #2309
- Change 2cta opt in to have min seqlen > 2*m_block_size by @drisspg in #2320
- [CuteDSL][SM90] varlen bwd works by @KareemMusleh in #2275
- Add Logging helper by @drisspg in #2327
- [CuTeDSL][Sm80] basic fix for new api by @zhuochenKIDD in #2297
- fix: duplicate softmax_scale param by @NanoCode012 in #2328
- Fix FA2 + FA4 co-existence by @drisspg in #2331
- [Cute,Sm100] Introduce a flexible lambda-based R2P masking by @Alkaid-Benetnash in #2313
- [Cute, SM90, bwd] Wire seqused_q/k through backward pass by @NJX-njx in #2315
- SM120 forward pass (Blackwell GeForce / DGX Spark) by @blake-snc in #2329
- [cutlass] Allow compilation of cutlass FA3 for sm100 via enable_sm90 by @henrylhtsang in #2332
- [Cute] fix: rename logging module to avoid circular import at building by @Luosuu in #2335
- BUG: SeqlenInfo.create has a tile parameter that defaults to 128 by @risan-raja in #2337
- [Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch by @MatthewBonanni in #2338
- [AMD] Migrate Triton Backend to Aiter by @micmelesse in #2230
- [Bwd,Sm120] Add SM120 backward pass support by @blake-snc in #2330
- [Bwd, SM80] Fix tdKrdS typo by @henrylhtsang in #2341
- Add SM120 varlen attention support by @blake-snc in #2333
- fix the create_ragged_tensor_for_tma issue by @rainj-me in #2345
- [Sm90] Fix test_mask_mod and bwd block-sparse kwarg mismatch by @henrylhtsang in #2365
- [Cute, Testing] Fix aot + tvm-ffi EnvStream related parameter mismatch by @Alkaid-Benetnash in #2369
- [Cute, Testing] Bump cutedsl to 4.4.2 and remove prior aot cache management workarounds by @Alkaid-Benetnash in #2370
- [Cute] fix: FA4 paged attention kv load for DeepSeek (192,128) on SM100 by @Luosuu in #2368
- [AMD ROCm] Update ROCm/CK backend to align with latest ComposableKernel API changes by @rocking5566 in #2363
- [ROCm] Auto-detect Triton backend if C++ extension is missing by @Soddentrough in #2343
- [Fwd,Sm90] Add paged KV attention support (tma and cp.async) by @henrylhtsang in #2360
- [CuTe,Flex] limit vec_size to 2 for score mod when not on Sm100 by @reubenconducts in #2371
- Support 2CTA for sliding window hdim 192 by @Inodayy in #2347
- [Cute,Fwd,Sm100] support irregular qhead / kvhead ratios by @timmy-feng in #2186
- benchmarks: add MFU% column to benchmark output by @Johnsonms in #2377
- Update flow to enable beta weekly releases by @drisspg in #2378
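The MFU benchmark entry above (#2377) reports achieved versus peak throughput. A back-of-envelope sketch of the underlying arithmetic, assuming the common convention of 4 * seqlen_q * seqlen_k * headdim FLOPs per head for the forward pass (two matmuls, 2 FLOPs per multiply-accumulate), halved for causal; this is illustrative, not the repo's benchmark code:

```python
# Estimate attention forward FLOPs and Model FLOPs Utilization (MFU).

def attn_fwd_flops(batch, nheads, seqlen_q, seqlen_k, headdim, causal=False):
    flops = 4 * batch * nheads * seqlen_q * seqlen_k * headdim
    return flops // 2 if causal else flops  # causal masks ~half the tiles

def mfu_percent(achieved_tflops, peak_tflops):
    """MFU: achieved throughput over hardware peak, as a percentage."""
    return 100.0 * achieved_tflops / peak_tflops

flops = attn_fwd_flops(batch=1, nheads=16, seqlen_q=4096, seqlen_k=4096, headdim=128)
# A kernel sustaining 700 TFLOP/s on hardware with a 1000 TFLOP/s peak:
print(f"{flops / 1e9:.1f} GFLOPs, MFU {mfu_percent(700, 1000):.0f}%")
```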
New Contributors
- @NJX-njx made their first contribution in #2315
- @blake-snc made their first contribution in #2329
- @Luosuu made their first contribution in #2335
- @risan-raja made their first contribution in #2337
- @MatthewBonanni made their first contribution in #2338
- @rainj-me made their first contribution in #2345
- @Soddentrough made their first contribution in #2343
- @Inodayy made their first contribution in #2347
- @Johnsonms made their first contribution in #2377
Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta6
fa4-v4.0.0.beta5
Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta5
fa4-v4.0.0.beta4
Full Changelog: fa4-v4.0.0.beta2...fa4-v4.0.0.beta4
fa4-v4.0.0.beta2
Full Changelog: fa4-v4.0.0.beta1...fa4-v4.0.0.beta2
fa4-v4.0.0.beta1
Full Changelog: fa4-v4.0.0.beta0...fa4-v4.0.0.beta1
fa4-v4.0.0.beta0
What's Changed
- [BugFix] Fix flash_attn_with_kvcache with scalar cache_seqlen by @stepinto in #1795
- Add sorting and head swizzle to varlen scheduler by @jayhshah in #1823
- Fixes incorrect variable reference in comment by @LoserCheems in #1775
- Update the initialization of dk/dv_semaphore by @y-sq in #1839
- FA3 tensor size parameter fix for long context len (seqlen >=4M) by @ghadiaravi13 in #1841
- ci: Move build job to workflow template by @ko3n1g in #1835
- ci: Build via workflow template by @ko3n1g in #1844
- ci: Switch to workflow_dispatch by @ko3n1g in #1847
- [FA3] Allow returning LSE via kwarg by @vasqu in #1851
- [BugFix] fix flash_fwd.FlashAttentionForwardSm80 bugs by @mingyangHao in #1856
- [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL by @reubenconducts in #1858
- [BUG] CUDA 13: make FA3 compatible with CUDA 13 Builds by @johnnynunez in #1860
- [BUILD] SBSA wheels + CUDA 13 Support by @johnnynunez in #1865
- benchmark: qualify all attention backends by methods list by @rajesh-s in #1881
- ABI stable fa3 by @mikaylagawarecki in #1791
- [NVIDIA] Enable Blackwell Family Specific by @johnnynunez in #1882
- Fix typo in flops calculation for local attention by @henrylhtsang in #1883
- flash-attn-cute bwd sm90 by @tzadouri in #1868
- [Cute] Make testing utils standalone for cute by @drisspg in #1892
- [Cute] Bump pin for CuTeDSL by @drisspg in #1891
- Improve causal backward determinism perf with SPT schedule by @jayhshah in #1893
- Upgrade to cutlass v4.2.1 by @johnnynunez in #1905
- Switch to use cutlass.utils.get_smem_capacity_in_bytes by @brandon-yujie-sun in #1906
- Add Missing None Gradient in FA3 QKVPacked by @JackCharlesZhang in #1908
- C++11 fix warnings by @johnnynunez in #1904
- [CuteDSL] Explicitly cast for Flash Combine by @drisspg in #1925
- Refactors to enable FlexAttention by @drisspg in #1840
- feat: Adding varlen support to cute-dsl sm80 bwd by @imbr92 in #1934
- Remove self refs in softmax for-loop by @kevin-tong-augment in #1924
- [AMD] Torch Compile Issues by @micmelesse in #1756
- [CUTE] Enable Pack GQA for score mods by @drisspg in #1937
- Add precommit list and then uncomment in chunks by @drisspg in #1941
- [ROCm] prepare CK sources for pytorch hipify v2 APIs by @jeffdaily in #1944
- Blackwell FlashAttention-BWD (v1.0) by @tzadouri in #1945
- Sm100 BWD (barrier) by @tzadouri in #1946
- Fix hopper cuda 13 build by @kevmo314 in #1949
- [CuteDSL] Fix hash function for cute.jit decorator by @drisspg in #1953
- Block Sparsity and Flex Attention mask mod support by @reubenconducts in #1942
- [NVIDIA] cutlass v4.3.0 by @johnnynunez in #1952
- [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs by @reubenconducts in #1961
- Fix FA3 segfault with custom CUDA streams in ABI stable build by @kevmo314 in #1957
- [Cute] Blocks tweaks by @drisspg in #1964
- BlockSparse Tweaks by @drisspg in #1970
- [Cute] Fix main by @drisspg in #1982
- [Cute,Fwd,Sm100] Implement SplitKV by @timmy-feng in #1940
- [Cute] Extract block-sparse utilities from SM80/90 by @drisspg in #1984
- Enable python-3.10+ by @drisspg in #1998
- [Cute, Bwd, Sm100] Add GQA support by @jayhshah in #2004
- [Cute,Fwd,Sm100] fix major regression with split kv by @jayhshah in #2006
- [CuTe DSL] Block sparsity computation kernel by @reubenconducts in #1983
- [NVIDIA] bump github actions by @johnnynunez in #1996
- [Cute,Fwd,Sm100] Support paged attention by @timmy-feng in #1999
- [Cute] Block sparse support Sm100 by @drisspg in #1985
- [Cute,Sm100,Fwd] use correction warps for epi when not using TMA by @jayhshah in #2014
- add fastdivmod for oob reads in mask_mods by @drisspg in #2020
- [Cute,Fwd,Sm100] don't pass mask_fn to softmax_step generically by @jayhshah in #2026
- [CuTeDSL] Swap order of decorators by @anakinxc in #2029
- [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions by @jayhshah in #2033
- [NFC] Trivial fix to silence linter by @jduprat in #1928
- Add LICENSE and AUTHORS to flash_attn/cute by @jduprat in #2032
- [Cute,Fwd] enable mask mod without blocksparsity by @reubenconducts in #2031
- Bump pin by @drisspg in #2025
- ruff all the smaller files by @drisspg in #2040
- [Cute] Fix head dim 64 bwd by @drisspg in #2035
- Add headdim64 tests to race condition by @drisspg in #2041
- Add torch.compile support to flash attention 3 by @guilhermeleobas in #1769
- [Cute,Bwd,Sm100] Add local for sm100 bwd by @jayhshah in #2046
- Add hash attr to shortcut expensive check by @drisspg in #2048
- [AMD ROCm] Update to latest composable_kernel to improve performance by @rocking5566 in #2052
- fixing cute bwd func def by @liangel-02 in #2056
- Fix use-after-free in FA3 deterministic mode. by @skarupke in #2063
- [CUTE] Allow grads to be preallocated by @drisspg in #2065
- [Cute,Fwd] Extend score_mod to variable sequence length by @reubenconducts in #2043
- [CUTE] Enabling TVM-FFI to reduce cpu overhead by @drisspg in #2042
- Fix softcap scoremod kwargs typo. by @LeoZDong in #2072
- Add score-mod bwd support by @drisspg in #2070
- Add blocksparse support for bwd on blackwell by @drisspg in #2085
- Fix IMA in fwd on m boundary by @drisspg in #2091
- cutedsl 4.3.4 by @drisspg in #2092
- README for AMD ROCm by @seungrokj in #2068
- [Cute] Fix shuffle sync and enable pack gqa for varlen sm100 by @jayhshah in #2097
- [NVIDIA] Enable Jetson Thor FA4 by @johnnynunez in #2108
- Add pack-gqa fwd support for sparse impl w/ broadcasted H dim by @drisspg in #2098
- [Cute,Fwd] improved block sparsity by @reubenconducts in #2100
- Misc tests that should be xfailed for now by @drisspg in #2127
- Update CUTLASS to fix undefined symbol : cuDriverGetVersion by @HydraQYH in #2142
- [Cute,Fwd,Sm100] Support q_stage=1 for inference by @Timmy...
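Several entries in this release add block-sparse mask support (#1942, #1976-era block sparsity work, #1983, #1985). The core idea is to keep, per query tile, only the key tiles that contain at least one unmasked element. A minimal sketch for a plain causal mask; all names here are illustrative, not the repo's API:

```python
# For each row tile of queries, list the column tiles of keys that the
# causal mask leaves at least partially visible. Illustrative sketch only.

def causal_active_kv_blocks(seqlen, block_m, block_n):
    """Return {q_block: [k_blocks with any visible element]} for a causal mask."""
    active = {}
    n_q = (seqlen + block_m - 1) // block_m
    for qb in range(n_q):
        q_max = min((qb + 1) * block_m, seqlen) - 1  # last query row in tile
        # causal: key k is visible iff k <= q, so every key tile whose first
        # index is <= q_max contains at least one visible element
        last_kb = q_max // block_n
        active[qb] = list(range(last_kb + 1))
    return active

blocks = causal_active_kv_blocks(seqlen=512, block_m=128, block_n=128)
```

A real kernel would store these indices as compact tensors and skip the pruned tiles entirely, which is where the speedup comes from.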
v2.8.3
Bump to v2.8.3
v2.8.2
Bump to v2.8.2