Added softmax_scale in attentions #2828

Open
Sirorezka wants to merge 7 commits into PrimeIntellect-ai:main from Sirorezka:feat_attn_scalling

@Sirorezka Sirorezka commented Jun 16, 2026

  1. Added softmax_scale to the attention layers (including ring and Ulysses).
  2. Refactored the attention modules to separate the Q, K, V computation from the core attention call.
  3. Added support for flash attention when batch > 1 (non-CP).
  4. Added tests for the flash attention layer.

This change is required before DeepSeek_V3 (#2681) can be merged.


Note

High Risk
Touches the core transformer attention and distributed ring/Ulysses kernels; incorrect scaling or path selection would silently change model numerics during training.

Overview
Extends attention so that a custom softmax_scale (via AttentionConfig.scaling) flows through the Flash Attention (FA2/3/4), SDPA, ring, and Ulysses varlen paths instead of always using head_dim**-0.5.
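To make the effect of the scale concrete, here is a minimal, framework-free sketch of how a custom scale changes the attention weights for a single query. It is not the code from this PR: `attention_weights` and its arguments are hypothetical names, and the only thing taken from the description above is the fallback to `head_dim**-0.5` when no scale is given.

```python
import math


def attention_weights(q, keys, softmax_scale=None):
    """Softmax attention weights of one query vector over a list of key vectors.

    Hypothetical helper for illustration; not part of prime_rl.
    """
    head_dim = len(q)
    # Fall back to the conventional 1/sqrt(head_dim) when no scale is passed.
    scale = head_dim ** -0.5 if softmax_scale is None else softmax_scale
    logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


q = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

default_w = attention_weights(q, keys)                    # scale = 4 ** -0.5 = 0.5
custom_w = attention_weights(q, keys, softmax_scale=1.0)  # sharper distribution
```

A larger scale sharpens the distribution toward the best-matching key, which is why a silently ignored scale would change model numerics without raising any error.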

Refactors FlashAttention / SDPAAttention into core modules (kernel only) plus wrapper modules that own Q/K/V projections, so ring/Ulysses can keep patching _compute_attention while shared logic lives in one place.
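The core/wrapper split can be sketched roughly as follows. This is a pure-Python illustration under stated assumptions, not the PR's modules: `CoreAttention` and `AttentionWrapper` are hypothetical names, the attention is single-head and non-causal, and only the `_compute_attention` hook name comes from the description above.

```python
import math


def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def _matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class CoreAttention:
    """Kernel-only piece: receives already projected q, k, v."""

    def __init__(self, head_dim, softmax_scale=None):
        self.scale = head_dim ** -0.5 if softmax_scale is None else softmax_scale

    def __call__(self, q, k, v):
        out = []
        for qi in q:
            logits = [self.scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
            weights = _softmax(logits)
            out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                        for d in range(len(v[0]))])
        return out


class AttentionWrapper:
    """Owns the Q/K/V projections and delegates to a patchable hook."""

    def __init__(self, core, wq, wk, wv):
        self.core, self.wq, self.wk, self.wv = core, wq, wk, wv

    def _compute_attention(self, q, k, v):
        return self.core(q, k, v)

    def forward(self, x):
        q = [_matvec(self.wq, t) for t in x]
        k = [_matvec(self.wk, t) for t in x]
        v = [_matvec(self.wv, t) for t in x]
        return self._compute_attention(q, k, v)


identity = [[1.0, 0.0], [0.0, 1.0]]
layer = AttentionWrapper(CoreAttention(head_dim=2, softmax_scale=1.0),
                         identity, identity, identity)
tokens = [[1.0, 0.0], [0.0, 1.0]]
baseline = layer.forward(tokens)

# A distributed variant can replace only the hook; the projections are untouched.
seen = []


def traced_compute_attention(self, q, k, v):
    seen.append(len(q))  # stand-in for ring/Ulysses communication
    return self.core(q, k, v)


AttentionWrapper._compute_attention = traced_compute_attention
patched = layer.forward(tokens)
```

Because the hook receives projected tensors and the scale lives in the core object, a patched implementation sees the same scale as the unpatched path.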

Flash path behavior: when cu_seqlens is omitted, attention uses the dense flash_attn_func path for batched inputs (batch > 1); varlen still requires batch == 1. Adds window_size_left on config for sliding-window flash kernels.
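The path selection described above can be summarized with a small sketch. The function names here are hypothetical and the dispatch is a simplification: it assumes the dense path is taken whenever `cu_seqlens` is omitted (including batch == 1), which the description does not state explicitly.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative sequence boundaries for packed (varlen) inputs.

    For lengths [3, 5, 2] the packed buffer holds 10 tokens and the
    boundaries are [0, 3, 8, 10].
    """
    return [0, *accumulate(seq_lens)]


def select_flash_path(batch_size, cu_seqlens=None):
    """Return which flash path would be used; hypothetical helper."""
    if cu_seqlens is None:
        # No packing metadata: treat the input as a regular dense batch.
        return "dense"
    if batch_size != 1:
        # Packed sequences are laid out along one row, so batch must be 1.
        raise ValueError("varlen attention requires batch == 1")
    return "varlen"


boundaries = build_cu_seqlens([3, 5, 2])
dense_choice = select_flash_path(batch_size=4)
varlen_choice = select_flash_path(batch_size=1, cu_seqlens=boundaries)
```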

Adds CUDA unit tests comparing FA variants, ring, and Ulysses varlen outputs against SDPA (including custom scale and sliding window).
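The general shape of such a comparison test can be shown without a GPU. The sketch below is not the PR's test suite: it checks a blockwise online-softmax accumulation (the style of merging that ring attention relies on) against a dense reference for one query, using both the default and a custom scale. All function names are hypothetical, and sliding windows are not covered.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_attention(q, keys, values, scale):
    """Reference: softmax over all keys at once."""
    logits = [scale * _dot(q, k) for k in keys]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]


def blockwise_attention(q, keys, values, scale, block_size):
    """Process keys in blocks, rescaling partial sums as the running max changes."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        kb = keys[start:start + block_size]
        vb = values[start:start + block_size]
        logits = [scale * _dot(q, k) for k in kb]
        new_max = max(running_max, max(logits))
        # Rescale earlier partial sums; exp(-inf) is 0.0 for the first block.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(l - new_max) for l in logits]
        denom = denom * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, vb))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
head_dim, seq_len = 8, 13
q = [rng.uniform(-1, 1) for _ in range(head_dim)]
keys = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(seq_len)]
values = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(seq_len)]

errors = {
    scale: max_abs_diff(dense_attention(q, keys, values, scale),
                        blockwise_attention(q, keys, values, scale, block_size=4))
    for scale in (head_dim ** -0.5, 0.3)
}
```

Running the same check at more than one scale matters here: a path that ignored `softmax_scale` would still pass at the default value and only fail at the custom one.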

Reviewed by Cursor Bugbot for commit 0b83f36. Bugbot is set up for automated code reviews on this repo.

… attention with Q,K,V layers. Added tests for attentions.
Comment thread src/prime_rl/trainer/models/layers/attn.py
Comment thread src/prime_rl/trainer/models/layers/ring_attn.py
Comment thread src/prime_rl/trainer/models/layers/attn.py
Comment thread src/prime_rl/trainer/models/layers/attn.py Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 3429290.

Comment thread src/prime_rl/trainer/models/layers/ulysses_attn.py Outdated
@Sirorezka

Author

I appreciate you're very busy, but I would be exceedingly grateful for your feedback on this pull request whenever you get a chance. Thank you! @samsja @mikasenghaas @S1ro1.
