perf: SIMD auto-vectorization — HADDPD reduction + AVX2 f64×4 widening by jailalawat · Pull Request #11 · jdalang/jda-lang

jailalawat · 2026-05-18T03:39:26Z

Closes #1

Summary

Task A (HADD reduction): slp_reduce_pass detects acc += a[i] + a[i+1] loop patterns (two chained OP_FADD with adjacent fixed-offset LOAD_MEM operands) and fuses them into OP_F64X2_HADD, which lowers to MOVUPD xmm, [base+off] + HADDPD xmm, xmm + ADDSD xmm0, xmm1. An OP_I2F guard skips the fusion when an operand comes from an integer-to-float conversion, so the raw-bit MOVUPD load is never substituted for a CVTSI2SD result.
Task B (AVX2 widening): slp_widen4 detects adjacent OP_F64X2_BIN pairs with compatible base/offset (offset₂ = offset₁ + 16) and fuses them into OP_F64X4_BIN — emitting VMOVUPD ymm, [mem] + VADDPD/VMULPD/VSUBPD/VDIVPD ymm, ymm, ymm + VEXTRACTF128 + VZEROUPPER, then scattering 4 scalar results to GPRs.
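Task C (AVX-512): Out of scope for the initial change, because Rosetta 2 does not support AVX-512. The second commit listed below adds it behind a runtime CPUID check with an AVX2 fallback.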
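
The adjacency test in Task B can be sketched as a small model. This is an illustration only: the `VecOp` record and the `widen4` function are hypothetical stand-ins, since the real jda-lang IR layout is not shown in this PR, and each op is reduced to a single base/offset pair for brevity.

```python
from dataclasses import dataclass


# Hypothetical IR node; field names are assumptions, not the jda-lang IR.
@dataclass(frozen=True)
class VecOp:
    opcode: str  # "F64X2_BIN" or "F64X4_BIN"
    binop: str   # "add", "sub", "mul" or "div"
    base: str    # name of the base address value
    offset: int  # byte offset from base


def widen4(ops):
    """Fuse neighbouring 2-wide ops into one 4-wide op when the second
    starts exactly 16 bytes (two f64 lanes) after the first."""
    out = []
    i = 0
    while i < len(ops):
        a = ops[i]
        b = ops[i + 1] if i + 1 < len(ops) else None
        if (b is not None
                and a.opcode == b.opcode == "F64X2_BIN"
                and a.binop == b.binop
                and a.base == b.base
                and b.offset == a.offset + 16):
            # Compatible pair: emit a single 4-wide op at the lower offset.
            out.append(VecOp("F64X4_BIN", a.binop, a.base, a.offset))
            i += 2
        else:
            out.append(a)
            i += 1
    return out
```

A pair at offsets 0 and 16 fuses, while an op at offset 48 with a different operator is left as a 2-wide op.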

New opcodes

| Opcode | Value | Description |
|---|---|---|
| `OP_F64X2_HADD` | 103 | SSE3 horizontal add: `acc += a[i] + a[i+1]` via MOVUPD + HADDPD + ADDSD |
| `OP_F64X4_BIN` | 104 | AVX2 4-wide f64 binary op via VMOVUPD ymm + VADDPD/VMULPD/etc. ymm |
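
To make the lane arithmetic behind `OP_F64X2_HADD` concrete, the following sketch models the three instructions on 2-tuples of floats. It is a software model, not the emitted machine code: the helper names are made up for illustration, and the element-indexed `movupd` stands in for the byte-offset load.

```python
def movupd(mem, idx):
    """Model of MOVUPD: load two adjacent f64 lanes starting at element idx."""
    return (mem[idx], mem[idx + 1])


def haddpd(dst, src):
    """Model of HADDPD dst, src: low lane = dst[0] + dst[1],
    high lane = src[0] + src[1]."""
    return (dst[0] + dst[1], src[0] + src[1])


def addsd(dst, src):
    """Model of ADDSD: add the low lanes, keep dst's high lane."""
    return (dst[0] + src[0], dst[1])


def hadd_reduce(a):
    """acc += a[i] + a[i+1] for i = 0, 2, 4, ... (len(a) must be even)."""
    acc = (0.0, 0.0)
    for i in range(0, len(a), 2):
        pair = movupd(a, i)
        pair = haddpd(pair, pair)  # both lanes now hold a[i] + a[i+1]
        acc = addsd(acc, pair)
    return acc[0]
```

With the input `[1.0, 2.0, 3.0, 4.0]` the model returns `10.0`, which matches the sum-of-4 value checked by `simd_hadd_basic`.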

Test plan

simd_hadd_basic — HADDPD sum-of-4 = 10 ✓
simd_avx2_basic — AVX2 dot4 (1×2+2×2+3×2+4×2) = 20 ✓
395/398 pass (3 pre-existing f64 formatting failures unrelated)
Self-host converged: jda1_sh2 == jda1_sh3 (2,025,361 bytes)

🤖 Generated with Claude Code

…g (Issue #1)

Task A: slp_reduce_pass detects acc+=a[i]+a[i+1] patterns and emits OP_F64X2_HADD (MOVUPD + HADDPD + ADDSD) for horizontal f64 reduction. I2F guard prevents mismatch between raw-bit loads and CVTSI2SD paths.

Task B: slp_widen4 fuses adjacent OP_F64X2_BIN pairs into OP_F64X4_BIN (VMOVUPD ymm + VADDPD/VMULPD/etc ymm), with VEXTRACTF128 + VZEROUPPER to extract 4 scalar results and avoid AVX-SSE transition penalties.

VEX helpers: emit_vmovupd_ymm_mem (2-byte and 3-byte VEX), emit_vop_ymm3, emit_vextractf128, emit_vzeroupper, emit_vinsertf128.

Tests: simd_hadd_basic (HADDPD fires, sum=10) and simd_avx2_basic (AVX2 dot4, 1*2+2*2+3*2+4*2=20) both pass. 395/398 pass (3 pre-existing f64 formatting failures unrelated to this change). Self-host converged.

…Task C)

slp_widen8 fuses adjacent OP_F64X4_BIN pairs (off_a2=off_a+32) into OP_F64X8_BIN. lower_f64x8_bin emits an inline CPUID leaf-7 check (EBX bit 16 = AVX-512F): the AVX-512 path uses EVEX VMOVUPD zmm + VADDPD/VMULPD/etc zmm + VEXTRACTF64X4 + VZEROUPPER; the AVX2 fallback emits two ymm sequences. Both paths push 8 scalar results to the stack and merge at a common pop/alloc epilogue, so the code is correct on any x86-64 CPU regardless of AVX-512 support.

Extras: emit_evex_vmovupd_zmm_mem, emit_evex_vop_zmm3, emit_evex_vbroadcastsd, emit_vextractf64x4, emit_push4_f64_from_xmm01. r6/r7 IDs are stored in the g_avx8_r6/r7 arrays, indexed by g_avx8_cnt.

Test: simd_avx512_dot8 (dot8 with all-2 b-vector = 72) passes. 396/399 pass (3 pre-existing). Self-host converged (2,040,018 bytes).
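
The runtime dispatch described in this commit can be modelled as follows. This is a sketch under stated assumptions: the EBX value is passed in as a plain integer because Python cannot execute CPUID, and `has_avx512f` and `f64x8_bin` are hypothetical names rather than functions from the PR. Both branches are expected to yield the same 8 results, which is the property the commit message relies on.

```python
import operator

AVX512F_BIT = 16  # CPUID leaf 7, sub-leaf 0: EBX bit 16 reports AVX-512F


def has_avx512f(ebx):
    """Test the AVX-512F feature bit in a CPUID leaf-7 EBX value."""
    return bool((ebx >> AVX512F_BIT) & 1)


def f64x8_bin(op, a, b, ebx):
    """Apply op lane-wise to two 8-element vectors, choosing the path
    from the feature bit the way the emitted code is described to."""
    assert len(a) == len(b) == 8
    if has_avx512f(ebx):
        # Models the single 8-lane zmm operation.
        return [op(x, y) for x, y in zip(a, b)]
    # Models the fallback: two independent 4-lane ymm operations.
    lo = [op(x, y) for x, y in zip(a[:4], b[:4])]
    hi = [op(x, y) for x, y in zip(a[4:], b[4:])]
    return lo + hi


a = [float(i) for i in range(1, 9)]
b = [2.0] * 8
dot8_wide = sum(f64x8_bin(operator.mul, a, b, 1 << AVX512F_BIT))
dot8_fallback = sum(f64x8_bin(operator.mul, a, b, 0))
```

Both `dot8_wide` and `dot8_fallback` evaluate to `72.0`, the value reported for `simd_avx512_dot8`.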

…tion

Replace (frac_val << 8) | frac_digits with frac_val * 256 + frac_digits in the lexer, and (fl_fpk >> 8) with fl_fpk / 256 in both codegen paths. jda0 may silently no-op << 8 / >> 8 (same class of bug as the known << 32 issue).

Fixes f64_type_inference, float_fmt_basic, float_literals. 399/399 pass, converged.
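
The replacement is only safe if the arithmetic form packs and unpacks the same values as the shift form. The sketch below checks that, assuming `frac_val` is non-negative and `frac_digits` fits in one byte (0 to 255). The function names are illustrative, and the low-byte extraction with `% 256` is an assumption, as the commit message only mentions the `/ 256` side.

```python
def pack_shift(frac_val, frac_digits):
    """Original packing: value in the high bits, digit count in the low byte."""
    return (frac_val << 8) | frac_digits


def pack_mul(frac_val, frac_digits):
    """Shift-free packing used as the replacement."""
    return frac_val * 256 + frac_digits


def unpack(fl_fpk):
    """Recover (frac_val, frac_digits) without shifts.
    Integer division models fl_fpk / 256 on integers; the % 256 step
    is assumed for illustration."""
    return fl_fpk // 256, fl_fpk % 256


# The two packings agree for every in-range pair tried here.
samples = [(0, 0), (5, 1), (25, 2), (141592, 6), (999999999, 9), (1, 255)]
all_equal = all(pack_shift(v, d) == pack_mul(v, d) for v, d in samples)
round_trips = all(unpack(pack_mul(v, d)) == (v, d) for v, d in samples)
```

The equivalence breaks down if `frac_digits` reaches 256 or more, because the addition then carries into the bits that hold `frac_val`.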

jailalawat added 3 commits May 18, 2026 09:08

jailalawat merged commit 93bae79 into main May 18, 2026
6 checks passed

jailalawat merged 3 commits into main from issue-1-simd-reduction
