perf(operators): AVX2 pow(x,c) kernel for the ONNX Pow op #104

Merged: kolkov merged 1 commit into born-ml:main from tphakala:perf/simd-pow-op on Jun 17, 2026
@tphakala

Contributor

Summary

The ONNX Pow operator accounted for 6.58% of single-thread BirdNET v2.4 inference, all of it spent in scalar math.Pow (per-element exp/log) on the mel spectrogram. Vectorize the constant-exponent case with a vendored AVX2+FMA kernel computing out[i] = exp(c·log(in[i])): Cephes single-precision logf (bitwise frexp + minimax polynomial) composed with the Cephes expf already used by the sigmoid kernel, plus a positive-lane mask that flushes x<=0 to 0 (pow(0, c>0) == 0; the bitwise frexp cannot represent 0 or negatives). The elementwise-exponent case and non-AVX2 CPUs keep the scalar path.
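The contract described above can be modelled in scalar Go. This is a minimal sketch, not the vendored kernel: `powConstRef` and `relErr` are hypothetical names, and the standard library's `math.Exp` and `math.Log` stand in for the Cephes polynomials. It only illustrates the exp(c·log(x)) composition and the flush of non-positive lanes to 0.

```go
package main

import (
	"fmt"
	"math"
)

// powConstRef is a scalar model of the constant-exponent contract:
// out[i] = exp(c*log(in[i])) for in[i] > 0, and 0 for every other lane.
// It uses math.Exp and math.Log, not the Cephes polynomials (assumption).
func powConstRef(out, in []float32, c float32) {
	for i, x := range in {
		if !(x > 0) { // true for 0, negatives, and NaN
			out[i] = 0
			continue
		}
		out[i] = float32(math.Exp(float64(c) * math.Log(float64(x))))
	}
}

// relErr returns |got-want| / |want|, or |got| when want is 0.
func relErr(got, want float64) float64 {
	if want == 0 {
		return math.Abs(got)
	}
	return math.Abs(got-want) / math.Abs(want)
}

func main() {
	in := []float32{0, 0.25, 1, 9, -4}
	out := make([]float32, len(in))
	powConstRef(out, in, 0.5)
	for i, x := range in {
		fmt.Printf("pow(%g, 0.5) -> %g\n", x, out[i])
	}
}
```

The `!(x > 0)` test is written that way on purpose: a plain `x <= 0` comparison would let NaN through, whereas the negated form sends NaN to the flushed branch as well.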

Same always-on dispatch and layout as the depthwise (#103) and sigmoid (#97) kernels: package-local _gen/pow avo module (separate go.mod so avo never enters born's module graph), nil-func dispatch wired in init, plain amd64 tag, scalar fallback.
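The nil-func dispatch pattern mentioned above might look roughly like the following. This is a sketch under assumptions: `powConstSIMD`, `powConstScalar`, and `powConst` are hypothetical names, build tags are omitted so the block runs anywhere, and the real dispatch variable lives in born's operators package.

```go
package main

import (
	"fmt"
	"math"
)

// powConstSIMD is a hypothetical dispatch variable. In the layout described
// above it would be assigned in init() of an amd64-tagged file when the CPU
// reports AVX2 and FMA; everywhere else it stays nil.
var powConstSIMD func(out, in []float32, c float32)

// powConstScalar is the portable fallback built on math.Pow.
func powConstScalar(out, in []float32, c float32) {
	for i, x := range in {
		out[i] = float32(math.Pow(float64(x), float64(c)))
	}
}

// powConst uses the SIMD kernel when one has been wired and the scalar
// path otherwise. The return value reports which path ran, for illustration.
func powConst(out, in []float32, c float32) (usedSIMD bool) {
	if powConstSIMD != nil {
		powConstSIMD(out, in, c)
		return true
	}
	powConstScalar(out, in, c)
	return false
}

func main() {
	in := []float32{1, 4, 9}
	out := make([]float32, len(in))
	usedSIMD := powConst(out, in, 0.5)
	fmt.Println("simd:", usedSIMD, "out:", out)
}
```

Keeping the variable nil by default means the hot path pays one pointer comparison, and a test can assert the init wiring simply by checking whether the variable is non-nil on a capable CPU.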

Benchmarks

BenchmarkPowConst (i7-1260P, single core via taskset, n=8, benchstat scalar -> simd):

| size | scalar | simd | delta |
|---|---|---|---|
| mel_511x96 (49056) | 1635µs | 66.2µs | -95.9% |
| n4096 | 131µs | 5.5µs | -95.8% |

geomean -95.9% (~24x). CPU profile (300 inferences): handlePow 6.58% -> 0.29% of inference.

Correctness

  • Kernel max relative error vs math.Pow: 5.3e-7 (parity test asserts < 1e-4).
  • Full-model parity vs onnxruntime: max abs diff 1.068e-04, top-1 and top-5 match.
  • The Cephes logf constants are Moshier's single-precision reference values.
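A parity measurement of the kind quoted above could be sketched as follows. The names `maxRelErr`, `candidate`, and `sweepInputs` are hypothetical, `candidate` is a stand-in for the AVX2 kernel built on the standard library, and the exponents 0.5 and 2 are placeholders because the BirdNET exponents are not listed here.

```go
package main

import (
	"fmt"
	"math"
)

// maxRelErr runs kernel over in and returns the largest relative error
// against math.Pow. Lanes whose reference value is 0 are compared absolutely.
func maxRelErr(kernel func(out, in []float32, c float32), in []float32, c float32) float64 {
	out := make([]float32, len(in))
	kernel(out, in, c)
	worst := 0.0
	for i, x := range in {
		want := math.Pow(float64(x), float64(c))
		err := math.Abs(float64(out[i]) - want)
		if want != 0 {
			err /= math.Abs(want)
		}
		if err > worst {
			worst = err
		}
	}
	return worst
}

// candidate stands in for the kernel under test (assumption): it computes
// exp(c*log(x)) for x > 0 and flushes every other lane to 0.
func candidate(out, in []float32, c float32) {
	for i, x := range in {
		if !(x > 0) {
			out[i] = 0
			continue
		}
		out[i] = float32(math.Exp(float64(c) * math.Log(float64(x))))
	}
}

// sweepInputs builds n non-negative inputs starting at 0, so the flushed
// lane is exercised along with ordinary positive values.
func sweepInputs(n int) []float32 {
	in := make([]float32, n)
	for i := range in {
		in[i] = float32(i) * 0.37
	}
	return in
}

func main() {
	in := sweepInputs(8*4 + 5) // four full 8-lane blocks plus a 5-element tail
	for _, c := range []float32{0.5, 2} {
		fmt.Printf("c=%g maxRelErr=%.3g\n", c, maxRelErr(candidate, in, c))
	}
}
```

Choosing a length that is not a multiple of 8 matters: it is the only way a sweep like this reaches both the vector blocks and the scalar tail.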

Domain: the kernel requires a non-negative base (mel-spectrogram power). It flushes x<=0 to 0, so for a negative base it returns 0 where math.Pow returns NaN (non-integer exponent) or a signed result (integer exponent); it is therefore not a drop-in replacement for arbitrary Pow inputs. No born model feeds a negative base into a scalar-exponent Pow; the restriction is documented on the dispatch var.
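To make the domain restriction concrete, the snippet below shows what math.Pow does with a negative base, together with a guard a caller could run before choosing a flush-to-zero kernel. `allNonNegative` is a hypothetical helper for illustration; the PR itself relies on documentation rather than a runtime check.

```go
package main

import (
	"fmt"
	"math"
)

// allNonNegative reports whether every element is >= 0. NaN fails the
// check because every comparison with NaN is false. Hypothetical helper.
func allNonNegative(in []float32) bool {
	for _, x := range in {
		if !(x >= 0) {
			return false
		}
	}
	return true
}

func main() {
	// math.Pow on a negative base: NaN for a non-integer exponent,
	// a signed value for an integer exponent.
	fmt.Println(math.Pow(-8, 0.5)) // prints NaN
	fmt.Println(math.Pow(-2, 3))   // prints -8

	// A kernel that flushes x<=0 to 0 would return 0 in both cases, so a
	// caller with untrusted inputs would need a check like this one.
	fmt.Println(allNonNegative([]float32{0, 0.5, 3}))
	fmt.Println(allNonNegative([]float32{0, -1, 3}))
}
```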

Test plan

  • AVX2-vs-math.Pow parity across the BirdNET exponents and non-negative inputs (zeros, small, unit, large) spanning vector blocks + sub-8 tails
  • init-wiring contract test
  • go test -race ./..., both default and GOEXPERIMENT=simd builds, go vet, golangci-lint clean

The Pow operator was 6.58% of single-thread BirdNET v2.4 inference, all in
scalar math.Pow (per-element exp/log) on the mel spectrogram. Vectorize the
constant-exponent case with a vendored AVX2+FMA kernel computing
out[i] = exp(c*log(in[i])): Cephes single-precision logf (bitwise frexp +
minimax polynomial) composed with the Cephes expf already used by the sigmoid
kernel, then a positive-lane mask that flushes x<=0 to 0 (pow(0,c>0) == 0; the
bitwise frexp cannot represent 0 or negatives). The elementwise-exponent case
and non-AVX2 CPUs keep the scalar path.

Same always-on dispatch and layout as the depthwise/sigmoid kernels:
package-local _gen/pow avo module (separate go.mod so avo never enters born's
module graph), nil-func dispatch wired in init, plain amd64 tag, scalar
fallback.

BenchmarkPowConst (i7-1260P, single core, n=8, benchstat scalar -> simd):
  mel_511x96 (49056)  1635us -> 66.2us  -95.9%
  n4096                131us ->  5.5us  -95.8%   (geomean -95.9%, ~24x)

Profile (300 inferences): handlePow 6.58% -> 0.29% of inference. Full-model
parity vs onnxruntime: max abs diff 1.068e-04, top-1 and top-5 match. Kernel
max relative error vs math.Pow is 5.3e-7 (test asserts < 1e-4). The Cephes
logf constants are Moshier's single-precision reference values.

Tests: AVX2-vs-math.Pow parity across the BirdNET exponents and non-negative
inputs (zeros, small, unit, large) spanning vector blocks plus sub-8 tails,
and an init-wiring contract test.
tphakala requested a review from kolkov as a code owner on June 17, 2026 at 13:25
@codecov

codecov Bot commented Jun 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@kolkov kolkov left a comment

Contributor

Correct AVX2 pow(x,c) via exp(c·ln(x)). Cephes logf + expf polynomials verified against Moshier source. Measured max relative error 5.3e-7 (~4 ULP) — well within ML inference tolerance.

Math verified

  • Cephes logf: frexp → SQRTHF branchless adjust → degree-8 Horner → Cody-Waite ln2 correction
  • Cephes expf: same kernel as sigmoid PR #97 — verified
  • 9 logP coefficients match Moshier source exactly
  • SQRTHF masking via VCMPPS 0x1e (GT_OQ) + VANDPS — correct branchless selection
  • Domain: x>0 correct, x=0 flushed to 0, x<0/NaN flushed to 0 (documented)
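The logf pipeline in the list above (bitwise frexp, SQRTHF adjust, degree-8 Horner, Cody-Waite ln2 correction) can be followed in a scalar Go sketch. The constants are the published Cephes single-precision logf values; the function names are hypothetical, the SQRTHF step uses a branch where the kernel uses a mask, and only normal, positive, finite inputs are handled.

```go
package main

import (
	"fmt"
	"math"
)

const (
	sqrtHalf = 0.707106781186547524 // Cephes SQRTHF
	ln2Hi    = 0.693359375          // Cody-Waite high part of ln 2
	ln2Lo    = -2.12194440e-4       // Cody-Waite low part of ln 2
)

// logP holds the nine Cephes logf coefficients, highest degree first.
var logP = [9]float32{
	7.0376836292e-2, -1.1514610310e-1, 1.1676998740e-1,
	-1.2420140846e-1, 1.4249322787e-1, -1.6668057665e-1,
	2.0000714765e-1, -2.4999993993e-1, 3.3333331174e-1,
}

// logf32 approximates ln(x) for normal, positive, finite x.
func logf32(x float32) float32 {
	bits := math.Float32bits(x)
	// Bitwise frexp: x = m * 2^e with m in [0.5, 1).
	e := float32(int32((bits>>23)&0xff) - 126)
	m := math.Float32frombits((bits & 0x807fffff) | 0x3f000000)

	// SQRTHF adjust: move m into [sqrt(1/2), sqrt(2)), then subtract 1.
	if m < sqrtHalf {
		e--
		m += m
	}
	m -= 1

	// Degree-8 Horner evaluation, scaled by m^3.
	z := m * m
	y := logP[0]
	for _, c := range logP[1:] {
		y = y*m + c
	}
	y *= m * z

	// Cody-Waite reconstruction: add e*ln2 as a low part and a high part.
	y += ln2Lo * e
	y -= 0.5 * z
	return m + y + ln2Hi*e
}

// logErr returns |logf32(x) - math.Log(x)|.
func logErr(x float32) float64 {
	return math.Abs(float64(logf32(x)) - math.Log(float64(x)))
}

func main() {
	for _, x := range []float32{0.001, 0.5, 1, 2, 10, 1e6} {
		fmt.Printf("x=%g logf32=%g abs err=%.3g\n", x, logf32(x), logErr(x))
	}
}
```

The mantissa mask is also why zero and negative inputs need separate handling: forcing the exponent field to 126 turns any bit pattern into a value in [0.5, 1), so the polynomial alone has no way to signal an out-of-domain input.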

Assembly

  • VZEROUPPER at sole exit ✓
  • Frame $0-60 = 2 slices (48) + int (8) + float32 (4) = 60 ✓
  • 8/16 YMM registers used, no spills
  • Scalar tail for n%8 remainder
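The block-plus-tail loop shape noted in the last bullet can be sketched in plain Go. `applyBlocked` and `lanes` are hypothetical names; in the real kernel the 8-wide step is a YMM register operation inside the assembly routine, not a Go callback.

```go
package main

import "fmt"

const lanes = 8 // one YMM register holds eight float32 values

// applyBlocked processes in eight elements at a time with block, then
// finishes the n%8 remainder one element at a time with scalar.
// It returns how many full blocks and tail elements were handled.
// out must be at least as long as in.
func applyBlocked(out, in []float32,
	block func(dst, src []float32),
	scalar func(float32) float32) (blocks, tail int) {
	i := 0
	for ; i+lanes <= len(in); i += lanes {
		block(out[i:i+lanes], in[i:i+lanes])
		blocks++
	}
	for ; i < len(in); i++ {
		out[i] = scalar(in[i])
		tail++
	}
	return blocks, tail
}

func main() {
	in := make([]float32, 19)
	for i := range in {
		in[i] = float32(i)
	}
	out := make([]float32, len(in))
	square := func(x float32) float32 { return x * x }
	blocks, tail := applyBlocked(out, in,
		func(dst, src []float32) {
			for j := range src {
				dst[j] = square(src[j])
			}
		},
		square)
	fmt.Println("blocks:", blocks, "tail:", tail)
}
```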

Integration

Hooks into handlePow in math_ops.go for scalar exponent case. Element-wise pow stays scalar. break after SIMD path prevents fallthrough — correct.

Suggestions (non-blocking)

  1. Document that x=+inf returns ~exp(88.3) rather than +inf (clamping artifact)
  2. Consider tightening test tolerance from 1e-4 to 1e-5 (measured max is 5.3e-7)
  3. TODO for fast paths: c=2 (VMULPS), c=0.5 (VSQRTPS) — avoids 2 transcendentals for common SE-Net exponents
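The third suggestion could take a shape like the one below. This is a scalar sketch of the selection logic only: `powConstFast` is a hypothetical name, `x*x` and `math.Sqrt` stand in for VMULPS and VSQRTPS, and math.Pow stands in for the general exp/log kernel.

```go
package main

import (
	"fmt"
	"math"
)

// powConstFast picks a cheap path for the two exponents named in the review
// and falls back to the general routine for everything else.
// It returns the name of the path taken, for illustration.
func powConstFast(out, in []float32, c float32) string {
	switch c {
	case 2:
		for i, x := range in {
			out[i] = x * x // stand-in for VMULPS
		}
		return "square"
	case 0.5:
		for i, x := range in {
			out[i] = float32(math.Sqrt(float64(x))) // stand-in for VSQRTPS
		}
		return "sqrt"
	default:
		for i, x := range in {
			out[i] = float32(math.Pow(float64(x), float64(c)))
		}
		return "general"
	}
}

func main() {
	in := []float32{0, 4, 9}
	out := make([]float32, len(in))
	for _, c := range []float32{2, 0.5, 3} {
		path := powConstFast(out, in, c)
		fmt.Println("c:", c, "path:", path, "out:", out)
	}
}
```

One design point to settle before adopting such fast paths: for a negative base, squaring returns a positive value and the square root returns NaN, so neither matches the flush-to-zero behavior of the general kernel unless the same positive-lane mask is applied.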

Approved.

@kolkov kolkov merged commit 5b921f9 into born-ml:main Jun 17, 2026
10 checks passed