perf(operators): AVX2 pow(x,c) kernel for the ONNX Pow op by tphakala · Pull Request #104 · born-ml/born

tphakala · 2026-06-17T13:25:47Z

Summary

The ONNX Pow operator was 6.58% of single-thread BirdNET v2.4 inference, entirely in scalar math.Pow (per-element exp/log) on the mel spectrogram. Vectorize the constant-exponent case with a vendored AVX2+FMA kernel computing out[i] = exp(c·log(in[i])): Cephes single-precision logf (bitwise frexp + minimax polynomial) composed with the Cephes expf already used by the sigmoid kernel, plus a positive-lane mask that flushes x<=0 to 0 (pow(0, c>0) == 0). The elementwise-exponent case and non-AVX2 CPUs keep the scalar path.

Same always-on dispatch and layout as the depthwise (#103) and sigmoid (#97) kernels: package-local _gen/pow avo module (separate go.mod so avo never enters born's module graph), nil-func dispatch wired in init, plain amd64 tag, scalar fallback.
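The nil-func dispatch pattern can be sketched roughly as follows. All identifiers here (`powConstSIMD`, `powConstScalar`, `powConst`) are hypothetical and not taken from the born source; in the real layout the assignment would presumably happen in an `init` inside an amd64-tagged file after a CPU feature check, which this sketch leaves out.

```go
package main

import (
	"fmt"
	"math"
)

// powConstSIMD is the dispatch variable (hypothetical name). It stays nil
// unless a platform-specific init installs a vectorized kernel.
var powConstSIMD func(out, in []float32, c float32)

// powConstScalar is the portable fallback built on math.Pow.
func powConstScalar(out, in []float32, c float32) {
	for i, x := range in {
		out[i] = float32(math.Pow(float64(x), float64(c)))
	}
}

// powConst uses the SIMD kernel when one was wired in, else the fallback.
func powConst(out, in []float32, c float32) {
	if powConstSIMD != nil {
		powConstSIMD(out, in, c)
		return
	}
	powConstScalar(out, in, c)
}

func main() {
	in := []float32{4, 9}
	out := make([]float32, len(in))
	powConst(out, in, 0.5) // powConstSIMD is nil here, so the scalar path runs
	fmt.Println(out)
}
```

Keeping the variable nil by default means non-amd64 builds and CPUs without AVX2 need no extra code: the nil check is the whole fallback mechanism.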

Benchmarks

BenchmarkPowConst (i7-1260P, single core via taskset, n=8, benchstat scalar -> simd):

size	scalar	simd	delta
mel_511x96 (49056)	1635µs	66.2µs	-95.9%
n4096	131µs	5.5µs	-95.8%

geomean -95.9% (~24x). CPU profile (300 inferences): handlePow 6.58% -> 0.29% of inference.

Correctness

Kernel max relative error vs math.Pow: 5.3e-7 (parity test asserts < 1e-4).
Full-model parity vs onnxruntime: max abs diff 1.068e-04, top-1 and top-5 match.
The Cephes logf constants are Moshier's single-precision reference values.

Domain: the kernel requires a non-negative base (mel-spectrogram power). It flushes x<=0 to 0, so for a negative base it returns 0 where math.Pow returns NaN (non-integer exponent) or a signed result (integer exponent). It is therefore not a drop-in for arbitrary Pow inputs. No born model feeds a negative base into a scalar-exponent Pow; this is documented on the dispatch var.

Test plan

AVX2-vs-math.Pow parity across the BirdNET exponents and non-negative inputs (zeros, small, unit, large) spanning vector blocks + sub-8 tails
init-wiring contract test
go test -race ./..., both default and GOEXPERIMENT=simd builds, go vet, golangci-lint clean
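A parity check along the lines of the first item might look like the sketch below. The kernel under test is a float32 stand-in (`kernelStandIn`), the exponents 0.5 and 2 are placeholders rather than the actual BirdNET exponents, and `maxRelErr` and `sampleInputs` are hypothetical helpers. Lengths 1 through 17 are used so that both full 8-wide blocks and sub-8 tails are exercised.

```go
package main

import (
	"fmt"
	"math"
)

// kernelStandIn plays the role of the SIMD kernel in this sketch.
func kernelStandIn(out, in []float32, c float32) {
	for i, x := range in {
		if !(x > 0) {
			out[i] = 0
			continue
		}
		out[i] = float32(math.Exp(float64(c * float32(math.Log(float64(x))))))
	}
}

// sampleInputs cycles through zero, small, unit, and large values.
func sampleInputs(n int) []float32 {
	vals := []float32{0, 1e-6, 0.5, 1, 3, 1e4}
	in := make([]float32, n)
	for i := range in {
		in[i] = vals[i%len(vals)]
	}
	return in
}

// maxRelErr returns the worst relative error against math.Pow
// (absolute error where the reference is exactly 0).
func maxRelErr(kernel func(out, in []float32, c float32), in []float32, c float32) float64 {
	out := make([]float32, len(in))
	kernel(out, in, c)
	worst := 0.0
	for i, x := range in {
		ref := math.Pow(float64(x), float64(c))
		diff := math.Abs(float64(out[i]) - ref)
		if ref != 0 {
			diff /= math.Abs(ref)
		}
		if diff > worst {
			worst = diff
		}
	}
	return worst
}

func main() {
	for _, c := range []float32{0.5, 2} { // placeholder exponents
		for n := 1; n <= 17; n++ {
			if e := maxRelErr(kernelStandIn, sampleInputs(n), c); e >= 1e-4 {
				fmt.Printf("c=%v n=%d: max rel err %g exceeds 1e-4\n", c, n, e)
			}
		}
	}
	fmt.Println("parity sweep done")
}
```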

The Pow operator was 6.58% of single-thread BirdNET v2.4 inference, all in scalar math.Pow (per-element exp/log) on the mel spectrogram. Vectorize the constant-exponent case with a vendored AVX2+FMA kernel computing out[i] = exp(c*log(in[i])): Cephes single-precision logf (bitwise frexp + minimax polynomial) composed with the Cephes expf already used by the sigmoid kernel, then a positive-lane mask that flushes x<=0 to 0 (pow(0,c>0) == 0; the bitwise frexp cannot represent 0 or negatives). The elementwise-exponent case and non-AVX2 CPUs keep the scalar path.

Same always-on dispatch and layout as the depthwise/sigmoid kernels: package-local _gen/pow avo module (separate go.mod so avo never enters born's module graph), nil-func dispatch wired in init, plain amd64 tag, scalar fallback.

BenchmarkPowConst (i7-1260P, single core, n=8, benchstat scalar -> simd):

mel_511x96 (49056)	1635us -> 66.2us	-95.9%
n4096	131us -> 5.5us	-95.8%
(geomean -95.9%, ~24x)

Profile (300 inferences): handlePow 6.58% -> 0.29% of inference.

Full-model parity vs onnxruntime: max abs diff 1.068e-04, top-1 and top-5 match. Kernel max relative error vs math.Pow is 5.3e-7 (test asserts < 1e-4). The Cephes logf constants are Moshier's single-precision reference values.

Tests: AVX2-vs-math.Pow parity across the BirdNET exponents and non-negative inputs (zeros, small, unit, large) spanning vector blocks plus sub-8 tails, and an init-wiring contract test.

codecov · 2026-06-17T13:30:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


kolkov

Correct AVX2 pow(x,c) via exp(c·ln(x)). Cephes logf + expf polynomials verified against Moshier source. Measured max relative error 5.3e-7 (~4 ULP) — well within ML inference tolerance.

Math verified

Cephes logf: frexp → SQRTHF branchless adjust → degree-8 Horner → Cody-Waite ln2 correction
Cephes expf: same kernel as sigmoid PR #97 — verified
9 logP coefficients match Moshier source exactly
SQRTHF masking via VCMPPS 0x1e (GT_OQ) + VANDPS — correct branchless selection
Domain: x>0 correct, x=0 flushed to 0, x<0/NaN flushed to 0 (documented)

Assembly

VZEROUPPER at sole exit ✓
Frame $0-60 = 2 slices (48) + int (8) + float32 (4) = 60 ✓
8/16 YMM registers used, no spills
Scalar tail for n%8 remainder

Integration

Hooks into handlePow in math_ops.go for scalar exponent case. Element-wise pow stays scalar. break after SIMD path prevents fallthrough — correct.

Suggestions (non-blocking)

Document x=+inf → ~exp(88.3) rather than +inf (clamping artifact)
Consider tightening test tolerance from 1e-4 to 1e-5 (measured max is 5.3e-7)
TODO for fast paths: c=2 (VMULPS), c=0.5 (VSQRTPS) — avoids 2 transcendentals for common SE-Net exponents
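The fast-path suggestion could be prototyped at the Go level before writing any assembly. The sketch below is one possible shape, with hypothetical names (`powConstFast`, `general`); it is not part of the PR. Note that these shortcuts change the negative-base behaviour: `x*x` is positive for negative x and the square root is NaN, whereas the general kernel flushes such inputs to 0, so a real implementation would need to decide which semantics to keep.

```go
package main

import (
	"fmt"
	"math"
)

// powConstFast special-cases c == 2 and c == 0.5 and defers everything
// else to the supplied general kernel.
func powConstFast(out, in []float32, c float32, general func(out, in []float32, c float32)) {
	switch c {
	case 2:
		for i, x := range in {
			out[i] = x * x // one multiply instead of log + exp
		}
	case 0.5:
		for i, x := range in {
			out[i] = float32(math.Sqrt(float64(x))) // one sqrt instead of log + exp
		}
	default:
		general(out, in, c)
	}
}

func main() {
	general := func(out, in []float32, c float32) {
		for i, x := range in {
			out[i] = float32(math.Pow(float64(x), float64(c)))
		}
	}
	in := []float32{1, 4, 9}
	out := make([]float32, len(in))
	powConstFast(out, in, 2, general)
	fmt.Println(out)
	powConstFast(out, in, 0.5, general)
	fmt.Println(out)
}
```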

Approved.

tphakala requested a review from kolkov as a code owner June 17, 2026 13:25

kolkov approved these changes Jun 17, 2026

kolkov merged commit 5b921f9 into born-ml:main Jun 17, 2026
10 checks passed

kolkov merged 1 commit into born-ml:main from tphakala:perf/simd-pow-op
