Added softmax_scale in attentions by Sirorezka · Pull Request #2828 · PrimeIntellect-ai/prime-rl

Sirorezka · 2026-06-16T23:08:57Z

Added softmax_scale support to the attention implementations (including ring and Ulysses).
Refactored the Attention function to separate the Q, K, V calculations from the core attention call.
Added the option to use flash attention when batch > 1 (non-CP).
Added tests for the flash attention layer.

This change is required before DeepSeek_V3 (#2681) can be merged.

Note

High Risk
Touches the core transformer attention and distributed ring/Ulysses kernels; incorrect scaling or path selection would silently change model numerics during training.

Overview
Extends attention so custom softmax_scale (via AttentionConfig.scaling) flows through Flash Attention (FA2/3/4), SDPA, ring, and Ulysses varlen paths instead of always using head_dim**-0.5.
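To make the effect of the scale concrete, here is a minimal pure-Python sketch of single-query attention. It is not code from this PR: `attention_row` and its `softmax_scale` argument are hypothetical names used only for illustration, and the fallback mirrors the `head_dim**-0.5` default described above.

```python
import math

def attention_row(q, keys, values, softmax_scale=None):
    """Reference single-query attention over plain lists (illustrative only)."""
    head_dim = len(q)
    # Fall back to the conventional 1/sqrt(head_dim) when no scale is given.
    scale = head_dim ** -0.5 if softmax_scale is None else softmax_scale
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

q = [1.0, 0.0, 0.0, 0.0]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

default_out = attention_row(q, keys, values)                     # scale = 4 ** -0.5
explicit_out = attention_row(q, keys, values, softmax_scale=0.5)  # same scale, passed explicitly
sharp_out = attention_row(q, keys, values, softmax_scale=4.0)     # larger scale, sharper softmax
```

A larger scale concentrates more weight on the best-matching key, which is why a path that silently ignores a custom scale changes the numerics without raising any error.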

Refactors FlashAttention / SDPAAttention into core modules (kernel only) plus wrapper modules that own Q/K/V projections, so ring/Ulysses can keep patching _compute_attention while shared logic lives in one place.
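The core/wrapper split can be pictured with a toy sketch. Everything here is an assumption made for illustration: the class names, the scalar "projections", and the one-float-per-token layout are not the PR's modules. Only the `_compute_attention` hook name comes from the description above.

```python
import math

class CoreAttention:
    """Kernel-only stage: consumes already-projected q, k, v (one float per token)."""

    def __init__(self, softmax_scale=1.0):
        self.softmax_scale = softmax_scale

    def __call__(self, q, k, v):
        out = []
        for qi in q:
            scores = [self.softmax_scale * qi * kj for kj in k]
            m = max(scores)
            w = [math.exp(s - m) for s in scores]
            out.append(sum(wj * vj for wj, vj in zip(w, v)) / sum(w))
        return out

class Attention:
    """Wrapper stage: owns the (here scalar) Q/K/V projections."""

    def __init__(self, wq, wk, wv, softmax_scale=1.0):
        self.wq, self.wk, self.wv = wq, wk, wv
        self.core = CoreAttention(softmax_scale)

    def _compute_attention(self, q, k, v):
        # Hook that a context-parallel backend could replace.
        return self.core(q, k, v)

    def forward(self, x):
        q = [self.wq * t for t in x]
        k = [self.wk * t for t in x]
        v = [self.wv * t for t in x]
        return self._compute_attention(q, k, v)

layer = Attention(1.0, 1.0, 2.0)
baseline = layer.forward([0.0, 1.0])

calls = []

def patched_compute(self, q, k, v):
    # Stand-in for a ring/Ulysses override; it records the call and
    # delegates to the same core, so the projections are untouched.
    calls.append(len(q))
    return self.core(q, k, v)

Attention._compute_attention = patched_compute
patched_out = layer.forward([0.0, 1.0])
```

Because the projections live in the wrapper and the kernel lives in the core, replacing the hook changes only how attention is computed, not how Q, K, and V are produced.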

Flash path behavior: when cu_seqlens is omitted, attention uses the dense flash_attn_func path for batched inputs (batch > 1); varlen still requires batch == 1. Adds window_size_left on config for sliding-window flash kernels.
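The path selection described above can be sketched as a small dispatch function. The names `select_flash_path` and `build_cu_seqlens` are hypothetical, and treating every call without `cu_seqlens` as dense (including batch == 1) is an assumption of this sketch rather than something the PR states.

```python
from itertools import accumulate

def build_cu_seqlens(seq_lens):
    """Cumulative sequence boundaries for packed input, with a leading 0."""
    return [0] + list(accumulate(seq_lens))

def select_flash_path(batch_size, cu_seqlens=None):
    """Return a label for the path that would handle this input."""
    if cu_seqlens is None:
        # No packing metadata: regular batched (dense) attention.
        return "dense"
    if batch_size != 1:
        # Packed sequences are concatenated along one row, so batch must be 1.
        raise ValueError("varlen path expects packed input with batch == 1")
    return "varlen"

cu = build_cu_seqlens([3, 4])           # two sequences packed into 7 tokens
dense_choice = select_flash_path(4)      # batched input, no cu_seqlens
varlen_choice = select_flash_path(1, cu)
```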

Adds CUDA unit tests comparing FA variants, ring, and Ulysses varlen outputs against SDPA (including custom scale and sliding window).
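The comparison pattern behind such tests can be shown on CPU with the standard library alone. This is a sketch, not the PR's tests: the function names are hypothetical, and the blockwise variant uses the generic running-max (online softmax) merge as a stand-in for processing K/V in chunks. The point is that the same custom scale must be applied in every block for the result to match the dense reference.

```python
import math
import random

def _scores(q, keys, scale):
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def dense_attention(q, keys, values, scale):
    """Reference: softmax over all keys at once."""
    s = _scores(q, keys, scale)
    m = max(s)
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z
            for d in range(len(values[0]))]

def blockwise_attention(q, keys, values, scale, block_size):
    """Process K/V in chunks, rescaling partial sums by a running max."""
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        s = _scores(q, k_blk, scale)
        new_max = max(running_max, max(s))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        w = [math.exp(x - new_max) for x in s]
        denom = denom * correction + sum(w)
        acc = [a * correction + sum(wi * v[d] for wi, v in zip(w, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

rng = random.Random(0)
head_dim, seq_len = 4, 10
q = [rng.uniform(-1, 1) for _ in range(head_dim)]
keys = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(seq_len)]
values = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(seq_len)]

# Check both the default scale and a custom one against the dense reference.
for scale in (head_dim ** -0.5, 0.3):
    ref = dense_attention(q, keys, values, scale)
    out = blockwise_attention(q, keys, values, scale, block_size=3)
    assert max_abs_diff(ref, out) < 1e-9
```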

Reviewed by Cursor Bugbot for commit 0b83f36. Bugbot is set up for automated code reviews on this repo.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 3429290.

Sirorezka · 2026-06-19T21:03:25Z

I appreciate you're very busy, but I would be exceedingly grateful for your feedback on this pull request whenever you get a chance. Thank you! @samsja @mikasenghaas @S1ro1.

Added softmax_scale in attentions. Separated attention_core call from…

83415b9

… attention with Q,K,V layers. Added tests for attentions.

cursor Bot reviewed Jun 16, 2026

Comment thread src/prime_rl/trainer/models/layers/attn.py

added torch dynamo disable

b1708fd

cursor Bot reviewed Jun 16, 2026

Comment thread src/prime_rl/trainer/models/layers/ring_attn.py

Sirorezka added 2 commits June 17, 2026 02:22

fixing returns in RingFA

40d9bc5

fixing backward return in RingFA

7157f86

cursor Bot reviewed Jun 17, 2026

Comment thread src/prime_rl/trainer/models/layers/attn.py

added sliding window support into core attention

39f3a82

cursor Bot reviewed Jun 17, 2026

Comment thread src/prime_rl/trainer/models/layers/attn.py Outdated

reversed 'window_size_left' to 'sliding_window'

3429290

cursor Bot reviewed Jun 17, 2026

Comment thread src/prime_rl/trainer/models/layers/ulysses_attn.py Outdated

reversed 'window_size_left' to 'sliding_window' in ulysses

0b83f36

Sirorezka wants to merge 7 commits into PrimeIntellect-ai:main from Sirorezka:feat_attn_scalling