perf: SIMD auto-vectorization - HADDPD reduction + AVX2 f64×4 widening #11
Merged
…g (Issue #1)
Task A: slp_reduce_pass detects acc+=a[i]+a[i+1] patterns and emits OP_F64X2_HADD (MOVUPD + HADDPD + ADDSD) for horizontal f64 reduction. I2F guard prevents mismatch between raw-bit loads and CVTSI2SD paths.
Task B: slp_widen4 fuses adjacent OP_F64X2_BIN pairs into OP_F64X4_BIN (VMOVUPD ymm + VADDPD/VMULPD/etc ymm), with VEXTRACTF128 + VZEROUPPER to extract 4 scalar results and avoid AVX-SSE transition penalties.
VEX helpers: emit_vmovupd_ymm_mem (2-byte and 3-byte VEX), emit_vop_ymm3, emit_vextractf128, emit_vzeroupper, emit_vinsertf128.
Tests: simd_hadd_basic (HADDPD fires, sum=10) and simd_avx2_basic (AVX2 dot4, 1*2+2*2+3*2+4*2=20) both pass. 395/398 pass (3 pre-existing f64 formatting failures unrelated to this change). Self-host converged.
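To illustrate why a helper such as emit_vmovupd_ymm_mem needs both a 2-byte and a 3-byte VEX form, here is a minimal Python model of the prefix selection. It is a sketch, not the compiler's code: the function names vex_prefix and vmovupd_ymm_load are hypothetical, only the [base + disp32] addressing form is handled, and rsp/r12 bases (which need a SIB byte) are rejected.

```python
def vex_prefix(r, x, b, w, vvvv, l, pp, mmmmm):
    """Return VEX prefix bytes (hypothetical helper, not from the PR).

    r, x, b are the register-extension bits (0 or 1), vvvv is the extra
    source register (0 when unused), l selects 128-bit (0) or 256-bit (1).
    """
    inv_vvvv = (~vvvv) & 0xF  # vvvv is stored inverted
    if x == 0 and b == 0 and w == 0 and mmmmm == 1:
        # 2-byte form: C5 [~R | ~vvvv | L | pp]
        return bytes([0xC5, ((r ^ 1) << 7) | (inv_vvvv << 3) | (l << 2) | pp])
    # 3-byte form: C4 [~R ~X ~B mmmmm] [W | ~vvvv | L | pp]
    return bytes([
        0xC4,
        ((r ^ 1) << 7) | ((x ^ 1) << 6) | ((b ^ 1) << 5) | mmmmm,
        (w << 7) | (inv_vvvv << 3) | (l << 2) | pp,
    ])


def vmovupd_ymm_load(dst, base, disp):
    """Encode VMOVUPD ymm<dst>, [base + disp32] (VEX.256.66.0F 10 /r)."""
    if base & 7 == 4:
        raise ValueError("rsp/r12 as base needs a SIB byte; not handled here")
    prefix = vex_prefix(r=dst >> 3, x=0, b=base >> 3, w=0,
                        vvvv=0, l=1, pp=1, mmmmm=1)
    modrm = 0x80 | ((dst & 7) << 3) | (base & 7)  # mod=10: disp32 follows
    return prefix + bytes([0x10, modrm]) + disp.to_bytes(4, "little", signed=True)


# VZEROUPPER is VEX.128.0F 77 with no operands.
VZEROUPPER = vex_prefix(r=0, x=0, b=0, w=0, vvvv=0, l=0, pp=0, mmmmm=1) + b"\x77"
```

A base register in r8..r15 sets the B bit, which only the 3-byte form can carry; that is the case where the shorter C5 prefix is not available.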
…Task C)
slp_widen8 fuses adjacent OP_F64X4_BIN pairs (off_a2=off_a+32) into OP_F64X8_BIN. lower_f64x8_bin emits inline CPUID leaf-7 check (EBX bit 16 = AVX-512F): AVX-512 path uses EVEX VMOVUPD zmm + VADDPD/VMULPD/etc zmm + VEXTRACTF64X4 + VZEROUPPER; AVX2 fallback emits two ymm sequences. Both paths push 8 scalar results to stack and merge at a common pop/alloc epilogue, so the result is correct on any x86-64 CPU regardless of AVX-512 support.
Extras: emit_evex_vmovupd_zmm_mem, emit_evex_vop_zmm3, emit_evex_vbroadcastsd, emit_vextractf64x4, emit_push4_f64_from_xmm01. r6/r7 IDs stored in g_avx8_r6/r7 arrays indexed by g_avx8_cnt.
Test: simd_avx512_dot8 (dot8 with all-2 b-vector = 72) passes. 396/399 pass (3 pre-existing). Self-host converged (2,040,018 bytes).
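The fusion rule and the feature test described above can be modelled in a few lines of Python. This is a sketch under assumptions: the op records are plain dicts with hypothetical field names (kind, fn, base_a, off_a, base_b, off_b), and requiring the second operand's offset to advance by 32 as well is an assumption, since the commit message only states off_a2=off_a+32.

```python
AVX512F_BIT = 1 << 16  # CPUID.(EAX=7, ECX=0):EBX bit 16


def has_avx512f(cpuid7_ebx):
    """True if the EBX value returned by CPUID leaf 7 reports AVX-512F."""
    return bool(cpuid7_ebx & AVX512F_BIT)


def widen8(ops):
    """Fuse adjacent OP_F64X4_BIN pairs into one OP_F64X8_BIN.

    Two ops fuse when they apply the same function to the same bases and
    the second op's offsets are exactly 32 bytes (four f64) higher.
    """
    out = []
    i = 0
    while i < len(ops):
        cur = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if (nxt is not None
                and cur["kind"] == "OP_F64X4_BIN"
                and nxt["kind"] == "OP_F64X4_BIN"
                and cur["fn"] == nxt["fn"]
                and cur["base_a"] == nxt["base_a"]
                and cur["base_b"] == nxt["base_b"]
                and nxt["off_a"] == cur["off_a"] + 32
                and nxt["off_b"] == cur["off_b"] + 32):
            # Keep the first op's bases and offsets; the fused op covers 64 bytes.
            out.append({**cur, "kind": "OP_F64X8_BIN"})
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

In the PR the CPUID test is emitted inline in the generated code and decides at run time between the zmm path and the two-ymm fallback; has_avx512f only shows which bit is being tested.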
…tion
Replace (frac_val << 8) | frac_digits with frac_val * 256 + frac_digits in lexer, and (fl_fpk >> 8) with fl_fpk / 256 in both codegen paths. jda0 may silently no-op << 8 / >> 8 (same class of bug as the known << 32 issue).
Fixes f64_type_inference, float_fmt_basic, float_literals. 399/399 pass, converged.
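The rewrite is value-preserving as long as frac_val is non-negative and frac_digits fits in the low 8 bits, which the following Python check illustrates. The function names are hypothetical, and recovering the digit count with % 256 is an assumption; the commit message only mentions the / 256 side of the unpacking.

```python
def pack_shift(frac_val, frac_digits):
    """Original packing: value in the high bits, digit count in the low 8."""
    return (frac_val << 8) | frac_digits


def pack_mul(frac_val, frac_digits):
    """Shift-free packing used as the replacement."""
    return frac_val * 256 + frac_digits


def unpack_div(packed):
    """Shift-free unpacking; integer division replaces >> 8."""
    return packed // 256, packed % 256


def forms_agree(frac_val, frac_digits):
    """Check both packings match and that unpacking round-trips."""
    packed = pack_mul(frac_val, frac_digits)
    return (packed == pack_shift(frac_val, frac_digits)
            and unpack_div(packed) == (frac_val, frac_digits))
```

The OR and the addition coincide because the shifted value has zeros in its low 8 bits, so no carry can occur when frac_digits is below 256.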
Closes #1
Summary
slp_reduce_pass detects acc += a[i] + a[i+1] loop patterns (two chained OP_FADD with adjacent fixed-offset LOAD_MEM operands) and fuses them into OP_F64X2_HADD, emitting MOVUPD xmm, [base+off] + HADDPD xmm, xmm + ADDSD xmm0, xmm1. An OP_I2F guard prevents applying raw-bit loads to converted-integer values.
slp_widen4 detects adjacent OP_F64X2_BIN pairs with compatible base/offset (offset₂ = offset₁ + 16) and fuses them into OP_F64X4_BIN, emitting VMOVUPD ymm, [mem] + VADDPD/VMULPD/VSUBPD/VDIVPD ymm, ymm, ymm + VEXTRACTF128 + VZEROUPPER, then scattering 4 scalar results to GPRs.
New opcodes
OP_F64X2_HADD: acc += a[i] + a[i+1] via MOVUPD + HADDPD + ADDSD
OP_F64X4_BIN
Test plan
simd_hadd_basic: HADDPD sum-of-4 = 10 ✓
simd_avx2_basic: AVX2 dot4 (1×2+2×2+3×2+4×2) = 20 ✓
🤖 Generated with Claude Code
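For readers unfamiliar with HADDPD, the reduction described in the Summary can be modelled in plain Python. This is a semantic sketch only; the function names are hypothetical and the register assignments in the comments are illustrative rather than taken from the generated code.

```python
def haddpd(x, y):
    """Model HADDPD x, y on 2-lane f64 vectors: (x0 + x1, y0 + y1)."""
    return (x[0] + x[1], y[0] + y[1])


def sum_pairs(a):
    """Compute acc += a[i] + a[i+1] for i = 0, 2, 4, ... lane by lane."""
    acc = 0.0
    for i in range(0, len(a) - 1, 2):
        pair = (a[i], a[i + 1])       # MOVUPD xmm1, [base + 8*i]
        low = haddpd(pair, pair)[0]   # HADDPD xmm1, xmm1 (low lane = a[i] + a[i+1])
        acc = acc + low               # ADDSD xmm0, xmm1
    return acc
```

Calling sum_pairs([1.0, 2.0, 3.0, 4.0]) gives 10.0, matching the value expected by simd_hadd_basic. Note that the pair is summed before it is added to the accumulator, so the result may differ in the last bits from a scalar chain that adds the elements one at a time.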