
perf: SIMD auto-vectorization — HADDPD reduction + AVX2 f64×4 widening #11

Merged: jailalawat merged 3 commits into main from issue-1-simd-reduction on May 18, 2026
@jailalawat

Contributor

Closes #1

Summary

  • Task A (HADD reduction): slp_reduce_pass detects acc += a[i] + a[i+1] loop patterns (two chained OP_FADD with adjacent fixed-offset LOAD_MEM operands) and fuses them into OP_F64X2_HADD — emitting MOVUPD xmm, [base+off] + HADDPD xmm, xmm + ADDSD xmm0, xmm1. An OP_I2F guard prevents applying raw-bit loads to converted-integer values.
  • Task B (AVX2 widening): slp_widen4 detects adjacent OP_F64X2_BIN pairs with compatible base/offset (offset₂ = offset₁ + 16) and fuses them into OP_F64X4_BIN — emitting VMOVUPD ymm, [mem] + VADDPD/VMULPD/VSUBPD/VDIVPD ymm, ymm, ymm + VEXTRACTF128 + VZEROUPPER, then scattering 4 scalar results to GPRs.
  • Task C (AVX-512): Out of scope — Rosetta 2 does not support AVX-512.
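
The Task A pattern match can be sketched as a small peephole pass. This is a minimal illustration in Python, not the compiler's code: the `Ins` record, the four-instruction window, and the reading of the OP_I2F guard (reject the fusion when either loaded value feeds an I2F) are all assumptions made for the example. It also assumes each temporary has a single use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Ins:
    """One IR instruction (hypothetical shape, not the compiler's real IR)."""
    op: str                  # "LOAD_MEM", "I2F", "FADD" or "F64X2_HADD"
    dst: str                 # name of the result value
    a: Optional[str] = None  # first operand (base register for LOAD_MEM)
    b: Optional[str] = None  # second operand
    off: int = 0             # byte offset (LOAD_MEM and F64X2_HADD only)


def _is_reduction(w: List[Ins], code: List[Ins]) -> bool:
    l0, l1, pair, acc = w
    if (l0.op, l1.op, pair.op, acc.op) != ("LOAD_MEM", "LOAD_MEM", "FADD", "FADD"):
        return False
    # a[i] and a[i+1] as f64: same base, offsets 8 bytes apart.
    adjacent = l0.a == l1.a and l1.off == l0.off + 8
    # t = a[i] + a[i+1]; acc' = acc + t
    chained = (pair.a, pair.b) == (l0.dst, l1.dst) and acc.b == pair.dst
    # Assumed I2F guard: a loaded value that is later converted from integer
    # means memory holds integer bits, so a raw 128-bit float load is wrong.
    loaded = {l0.dst, l1.dst}
    guarded = any(x.op == "I2F" and x.a in loaded for x in code)
    return adjacent and chained and not guarded


def slp_reduce(code: List[Ins]) -> List[Ins]:
    out: List[Ins] = []
    i = 0
    while i < len(code):
        w = code[i:i + 4]
        if len(w) == 4 and _is_reduction(w, code):
            l0, _, _, acc = w
            # acc.a is the incoming accumulator; l0.a / l0.off address the pair.
            out.append(Ins("F64X2_HADD", acc.dst, a=acc.a, b=l0.a, off=l0.off))
            i += 4
        else:
            out.append(code[i])
            i += 1
    return out
```

For instance, two loads at offsets 0 and 8 from the same base followed by the two chained adds collapse into a single `F64X2_HADD`, while loads at offsets 0 and 16 are left untouched.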

New opcodes

| Opcode | Value | Description |
|---|---|---|
| OP_F64X2_HADD | 103 | SSE3 horizontal add: acc += a[i] + a[i+1] via MOVUPD + HADDPD + ADDSD |
| OP_F64X4_BIN | 104 | AVX2 4-wide f64 binary op via VMOVUPD ymm + VADDPD/VMULPD/etc ymm |
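
The fusion condition behind OP_F64X4_BIN (two 2-lane ops whose operands sit 16 bytes apart) might look roughly like the sketch below. The `VecBin` record and its field names are hypothetical, and the requirement that both operands advance by 16 bytes is an assumption about what "compatible base/offset" means.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class VecBin:
    """A packed f64 binary op (hypothetical IR shape for illustration)."""
    lanes: int    # 2 for OP_F64X2_BIN, 4 for OP_F64X4_BIN
    op: str       # "add", "sub", "mul" or "div"
    base_a: str   # base register of the first operand
    off_a: int    # byte offset of the first operand
    base_b: str   # base register of the second operand
    off_b: int    # byte offset of the second operand


def can_widen(x: VecBin, y: VecBin) -> bool:
    # Two f64 lanes occupy 16 bytes, so the second op must start 16 bytes on.
    return (x.lanes == 2 and y.lanes == 2 and x.op == y.op
            and x.base_a == y.base_a and x.base_b == y.base_b
            and y.off_a == x.off_a + 16 and y.off_b == x.off_b + 16)


def slp_widen4(code: List[VecBin]) -> List[VecBin]:
    out: List[VecBin] = []
    i = 0
    while i < len(code):
        if i + 1 < len(code) and can_widen(code[i], code[i + 1]):
            out.append(replace(code[i], lanes=4))  # one 4-lane op replaces the pair
            i += 2
        else:
            out.append(code[i])
            i += 1
    return out
```

Pairs that differ in operator, base register, or stride are passed through unchanged.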

Test plan

  • simd_hadd_basic — HADDPD sum-of-4 = 10 ✓
  • simd_avx2_basic — AVX2 dot4 (1×2+2×2+3×2+4×2) = 20 ✓
  • 395/398 tests pass (the 3 failures are pre-existing f64 formatting failures unrelated to this change)
  • Self-host converged: jda1_sh2 == jda1_sh3 (2,025,361 bytes)
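
The expected values in the two SIMD tests can be checked with a lane-level model of the instruction sequences. This is a sketch of the arithmetic only: the function names are made up for the example, and the input arrays are inferred from the expressions in the test plan rather than taken from the test sources.

```python
import operator
from typing import List, Sequence, Tuple

OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}


def haddpd(dst: Tuple[float, float], src: Tuple[float, float]) -> Tuple[float, float]:
    # HADDPD dst, src: low lane = dst[0] + dst[1], high lane = src[0] + src[1]
    return (dst[0] + dst[1], src[0] + src[1])


def f64x2_hadd(acc: float, mem: Sequence[float], i: int) -> float:
    v = (mem[i], mem[i + 1])   # MOVUPD xmm1, [base+off]
    v = haddpd(v, v)           # HADDPD xmm1, xmm1
    return acc + v[0]          # ADDSD  xmm0, xmm1 (low lane only)


def sum4(a: Sequence[float]) -> float:
    acc = 0.0
    for i in (0, 2):           # two elements per iteration
        acc = f64x2_hadd(acc, a, i)
    return acc


def f64x4_bin(op: str, a: Sequence[float], b: Sequence[float], i: int) -> List[float]:
    # One 4-lane packed op; the four results are then scattered as scalars.
    return [OPS[op](a[i + k], b[i + k]) for k in range(4)]


def dot4(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(f64x4_bin("mul", a, b, 0))
```

With `a = [1.0, 2.0, 3.0, 4.0]`, `sum4(a)` gives 10.0, and `dot4(a, [2.0] * 4)` gives 20.0, matching the two test results listed above.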

🤖 Generated with Claude Code

…g (Issue #1)

Task A: slp_reduce_pass detects acc+=a[i]+a[i+1] patterns and emits
OP_F64X2_HADD (MOVUPD + HADDPD + ADDSD) for horizontal f64 reduction.
I2F guard prevents mismatch between raw-bit loads and CVTSI2SD paths.

Task B: slp_widen4 fuses adjacent OP_F64X2_BIN pairs into OP_F64X4_BIN
(VMOVUPD ymm + VADDPD/VMULPD/etc ymm), with VEXTRACTF128 + VZEROUPPER
to extract 4 scalar results and avoid AVX-SSE transition penalties.

VEX helpers: emit_vmovupd_ymm_mem (2-byte and 3-byte VEX), emit_vop_ymm3,
emit_vextractf128, emit_vzeroupper, emit_vinsertf128.

Tests: simd_hadd_basic (HADDPD fires, sum=10) and simd_avx2_basic
(AVX2 dot4, 1*2+2*2+3*2+4*2=20) both pass. 395/398 pass (3 pre-existing
f64 formatting failures unrelated to this change). Self-host converged.
…Task C)

slp_widen8 fuses adjacent OP_F64X4_BIN pairs (off_a2=off_a+32) into
OP_F64X8_BIN. lower_f64x8_bin emits inline CPUID leaf-7 check (EBX bit
16 = AVX-512F): AVX-512 path uses EVEX VMOVUPD zmm + VADDPD/VMULPD/etc
zmm + VEXTRACTF64X4 + VZEROUPPER; AVX2 fallback emits two ymm sequences.
Both paths push 8 scalar results to stack and merge at a common pop/alloc
epilogue — correct on any x86-64 CPU regardless of AVX-512 support.
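
The dispatch described above can be modelled in a few lines. This is an illustrative sketch only: the real check is emitted inline as machine code, the function names here are invented, and the CPUID EBX value is passed in as a plain integer rather than read from hardware.

```python
import operator
from typing import List, Sequence, Tuple

AVX512F_BIT = 16  # CPUID leaf 7, EBX bit 16, as stated in the commit message

OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}


def has_avx512f(leaf7_ebx: int) -> bool:
    return bool((leaf7_ebx >> AVX512F_BIT) & 1)


def _lanes(op: str, a: Sequence[float], b: Sequence[float],
           start: int, width: int) -> List[float]:
    f = OPS[op]
    return [f(a[start + k], b[start + k]) for k in range(width)]


def f64x8_bin(op: str, a: Sequence[float], b: Sequence[float],
              leaf7_ebx: int) -> Tuple[str, List[float]]:
    if has_avx512f(leaf7_ebx):
        # AVX-512 path: one 8-lane zmm operation.
        return "zmm", _lanes(op, a, b, 0, 8)
    # Fallback: two 4-lane ymm operations, 32 bytes (4 lanes) apart.
    return "ymm+ymm", _lanes(op, a, b, 0, 4) + _lanes(op, a, b, 4, 4)
```

Both branches yield the same eight scalars, which is what lets them share a common epilogue. The commit message mentions only the CPUID bit; whether the emitted check also verifies that the OS has enabled ZMM state is not stated.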

Extras: emit_evex_vmovupd_zmm_mem, emit_evex_vop_zmm3,
emit_evex_vbroadcastsd, emit_vextractf64x4, emit_push4_f64_from_xmm01.
r6/r7 IDs stored in g_avx8_r6/r7 arrays indexed by g_avx8_cnt.

Test: simd_avx512_dot8 (dot8 with all-2 b-vector = 72) passes.
396/399 pass (3 pre-existing). Self-host converged (2,040,018 bytes).
…tion

Replace (frac_val << 8) | frac_digits with frac_val * 256 + frac_digits in
lexer, and (fl_fpk >> 8) with fl_fpk / 256 in both codegen paths. jda0 may
silently no-op << 8 / >> 8 (same class of bug as the known << 32 issue).
Fixes f64_type_inference, float_fmt_basic, float_literals. 399/399 pass, converged.
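
Why the arithmetic form is a safe substitute can be shown with a short check. The sketch assumes `frac_val` is non-negative and `frac_digits` fits in 8 bits (0 to 255); under those assumptions the shift/or packing and the multiply/add packing produce the same integer. The function names are made up for the example, and Python's `//` stands in for integer division.

```python
def pack_shift(frac_val: int, frac_digits: int) -> int:
    # Original form: relies on << 8 being compiled correctly.
    return (frac_val << 8) | frac_digits


def pack_mul(frac_val: int, frac_digits: int) -> int:
    # Replacement form: same value when 0 <= frac_digits < 256.
    return frac_val * 256 + frac_digits


def frac_val_shift(fl_fpk: int) -> int:
    return fl_fpk >> 8


def frac_val_div(fl_fpk: int) -> int:
    # Integer division by 256 equals >> 8 for non-negative values.
    return fl_fpk // 256
```

For example, packing `frac_val = 25` with `frac_digits = 2` gives 6402 either way, and dividing 6402 by 256 recovers 25.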
@jailalawat merged commit 93bae79 into main on May 18, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

Perf: SIMD Auto-Vectorization (SLP Phase 1+2+3 + I2FX2)
