[NPU, A3] Add NPU kernel support for A3 machines by zheliuyu · Pull Request #1220 · linkedin/Liger-Kernel

zheliuyu · 2026-05-11T07:31:26Z

Motivation

This work follows the roadmap in linkedin/Liger-Kernel#969. The goal is to exercise the NPU kernel on Atlas 800T A3 (64G) and report how the test suite behaves on that hardware.

Details

Fixed the Ascend implementation of attn_res so it passes on A3 alongside the rest of the suite.
After the fix, the full test run completes successfully. 🍾

Why `attn_res` failed on A3

The failure surfaced as vector-core / ACL errors (e.g. error 507035, with device sync failing) rather than as an ordinary atol/rtol tolerance mismatch.

The Ascend attn_res kernel uses wide masked loads along the feature dim (e.g. BLOCK_D = next_power_of_2(D)).
The tests include awkward sizes such as D = 123 with float32, so the row pitch is 123 × 4 = 492 bytes. That is neither 32-byte nor 64-byte aligned, which is unfriendly to many vectorized paths on this stack and can trigger vector-core faults for that lowering.
Fix: pad the last dim to a multiple of 16, pass d_stride as the real memory pitch, keep D for math and masks, and slice/pad tensors so callers still see the logical D.
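
The arithmetic behind the fix can be sketched in plain Python. This is only an illustration of the padding and pitch calculation described above, not the kernel code from this PR: the helper names (`pad_to_multiple`, `pad_rows`, `slice_rows`) are hypothetical, and nested lists stand in for device tensors.

```python
# Illustrative sketch only: plain Python stand-in for the padding logic.
# Helper names are hypothetical and do not come from the PR.

def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

def pad_to_multiple(n: int, multiple: int = 16) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

def pad_rows(rows, d_stride, fill=0.0):
    """Right-pad every row with `fill` so each row holds d_stride elements."""
    return [list(r) + [fill] * (d_stride - len(r)) for r in rows]

def slice_rows(rows, D):
    """Drop the padding so callers see the logical width D again."""
    return [r[:D] for r in rows]

D = 123          # logical feature dim used by the awkward test case
ITEMSIZE = 4     # bytes per float32 element

d_stride = pad_to_multiple(D)        # padded row length in elements
BLOCK_D = next_power_of_2(D)         # width of the masked load

unpadded_pitch = D * ITEMSIZE        # 492 bytes, not 32B/64B aligned
padded_pitch = d_stride * ITEMSIZE   # 512 bytes, aligned to both

# The mask still uses the logical D, so padded lanes never enter the math.
mask = [i < D for i in range(BLOCK_D)]

print(d_stride, unpadded_pitch % 32, padded_pitch % 32, padded_pitch % 64)
```

For D = 123 the padded stride is 128 elements, so every row starts on a 512-byte boundary relative to the previous one, while the mask keeps only the first 123 lanes active.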

Benchmark results for the 4 most frequently used kernels


Benchmark plots (images not reproduced here), each measured against token length:
cross_entropy: memory (full); speed (forward, backward, full, no-grad forward)
rms_norm: memory (full); speed (forward, backward, full)
rope: memory (full); speed (forward, backward, full)
swiglu: memory (full); speed (forward, backward, full)