perf(operators): AVX2 pow(x,c) kernel for the ONNX Pow op #104
Merged
The Pow operator was 6.58% of single-thread BirdNET v2.4 inference, all in scalar math.Pow (per-element exp/log) on the mel spectrogram. Vectorize the constant-exponent case with a vendored AVX2+FMA kernel computing out[i] = exp(c*log(in[i])): Cephes single-precision logf (bitwise frexp + minimax polynomial) composed with the Cephes expf already used by the sigmoid kernel, then a positive-lane mask that flushes x<=0 to 0 (pow(0,c>0) == 0; the bitwise frexp cannot represent 0 or negatives). The elementwise-exponent case and non-AVX2 CPUs keep the scalar path.

Same always-on dispatch and layout as the depthwise/sigmoid kernels: package-local _gen/pow avo module (separate go.mod so avo never enters born's module graph), nil-func dispatch wired in init, plain amd64 tag, scalar fallback.

BenchmarkPowConst (i7-1260P, single core, n=8, benchstat scalar -> simd):

| case | scalar | simd | delta |
| --- | --- | --- | --- |
| mel_511x96 (49056) | 1635us | 66.2us | -95.9% |
| n4096 | 131us | 5.5us | -95.8% |

geomean -95.9% (~24x)

Profile (300 inferences): handlePow 6.58% -> 0.29% of inference.

Full-model parity vs onnxruntime: max abs diff 1.068e-04, top-1 and top-5 match. Kernel max relative error vs math.Pow is 5.3e-7 (test asserts < 1e-4). The Cephes logf constants are Moshier's single-precision reference values.

Tests: AVX2-vs-math.Pow parity across the BirdNET exponents and non-negative inputs (zeros, small, unit, large) spanning vector blocks plus sub-8 tails, and an init-wiring contract test.
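To make the kernel's contract concrete, here is a minimal scalar sketch of the same semantics. It is not the PR's code: `powConstRef` is a hypothetical name, and it leans on float64 `math.Exp`/`math.Log` rather than the Cephes polynomials, so it only models the result (positive lanes get exp(c*log(x)), every other lane is flushed to 0).

```go
package main

import (
	"fmt"
	"math"
)

// powConstRef is a hypothetical scalar model of the kernel's contract:
// dst[i] = exp(c*log(src[i])) when src[i] > 0, and 0 for every other lane.
// dst must be at least as long as src.
func powConstRef(dst, src []float32, c float32) {
	for i, x := range src {
		if !(x > 0) { // true for 0, negatives, and NaN
			dst[i] = 0
			continue
		}
		dst[i] = float32(math.Exp(float64(c) * math.Log(float64(x))))
	}
}

func main() {
	src := []float32{-1, 0, 0.25, 1, 4}
	dst := make([]float32, len(src))
	powConstRef(dst, src, 0.5)
	fmt.Println(dst) // negative and zero lanes come out as 0
}
```

Note the difference from `math.Pow`: a negative base yields 0 here rather than NaN, which is why the kernel is only safe for non-negative inputs.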
Codecov Report: ✅ All modified and coverable lines are covered by tests.
kolkov
approved these changes
Jun 17, 2026
Correct AVX2 pow(x,c) via exp(c·ln(x)). Cephes logf + expf polynomials verified against Moshier source. Measured max relative error 5.3e-7 (~4 ULP) — well within ML inference tolerance.
Math verified
- Cephes logf: frexp → SQRTHF branchless adjust → degree-8 Horner → Cody-Waite ln2 correction
- Cephes expf: same kernel as sigmoid PR #97 — verified
- 9 logP coefficients match Moshier source exactly
- SQRTHF masking via `VCMPPS 0x1e` (GT_OQ) + `VANDPS`: correct branchless selection
- Domain: x>0 correct, x=0 flushed to 0, x<0/NaN flushed to 0 (documented)
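The logf steps listed above can be sketched in scalar Go as follows. This is an illustration, not the vendored kernel: `logfCephes` is a hypothetical name, the coefficients are the values as commonly published for Cephes single-precision `logf` (verify them against the vendored constants before relying on them), and the bitwise frexp assumes a positive, normal input.

```go
package main

import (
	"fmt"
	"math"
)

const sqrtHalf = 0.70710678118654752440 // SQRTHF

// Coefficients as commonly published for Cephes single-precision logf,
// highest degree first (assumption: check against the vendored table).
var logP = [9]float32{
	7.0376836292e-2, -1.1514610310e-1, 1.1676998740e-1,
	-1.2420140846e-1, 1.4249322787e-1, -1.6668057665e-1,
	2.0000714765e-1, -2.4999993993e-1, 3.3333331174e-1,
}

// logfCephes follows the reviewed steps for a positive, normal x.
func logfCephes(x float32) float32 {
	// Bitwise frexp: x = m * 2^e with m in [0.5, 1).
	bits := math.Float32bits(x)
	e := int32(bits>>23) - 126
	m := math.Float32frombits(bits&0x007fffff | 0x3f000000)

	// SQRTHF adjust: recenter so the polynomial argument is near 0.
	if m < sqrtHalf {
		e--
		m = m + m - 1
	} else {
		m = m - 1
	}

	// Degree-8 Horner evaluation, then scale by m^3.
	z := m * m
	y := logP[0]
	for _, p := range logP[1:] {
		y = y*m + p
	}
	y *= m * z

	// Two-part ln2 correction: small part first, large part last.
	fe := float32(e)
	y += -2.12194440e-4 * fe
	y -= 0.5 * z
	return m + y + 0.693359375*fe
}

func main() {
	for _, x := range []float32{0.001, 0.5, 1, 2, 10, 1e6} {
		fmt.Println(x, logfCephes(x), math.Log(float64(x)))
	}
}
```

In the AVX2 kernel the `if m < sqrtHalf` branch becomes a compare mask plus `VANDPS`, so all eight lanes take the same instruction path.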
Assembly
- VZEROUPPER at sole exit ✓
- Frame `$0-60` = 2 slices (48) + int (8) + float32 (4) = 60 ✓
- 8/16 YMM registers used, no spills
- Scalar tail for n%8 remainder
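The frame arithmetic can be checked with `unsafe.Sizeof`. In a Go assembly `TEXT` directive, `$0-60` means a 0-byte local frame and 60 bytes of arguments. The sketch below assumes a 64-bit target such as amd64 and a hypothetical argument list of two `[]float32` slices, an `int`, and a `float32`; the actual parameter names and order are not given in the review.

```go
package main

import (
	"fmt"
	"unsafe"
)

// argBytes sums the argument sizes for an assumed signature
// func(dst, src []float32, n int, c float32) on a 64-bit target.
func argBytes() uintptr {
	var s []float32 // slice header: pointer + len + cap
	var n int
	var c float32
	return 2*unsafe.Sizeof(s) + unsafe.Sizeof(n) + unsafe.Sizeof(c)
}

func main() {
	fmt.Println(argBytes()) // 60 on 64-bit targets
}
```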
Integration
Hooks into handlePow in math_ops.go for scalar exponent case. Element-wise pow stays scalar. break after SIMD path prevents fallthrough — correct.
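A minimal sketch of the nil-func dispatch this integration relies on, under assumed names: `powConstSIMD`, `powScalar`, and `powConst` are hypothetical and do not reflect the identifiers in math_ops.go. The idea is that an `init` function on supporting CPUs assigns the function variable, and the caller returns immediately after the accelerated path so it never falls through to the scalar loop.

```go
package main

import (
	"fmt"
	"math"
)

// powConstSIMD stays nil unless an accelerated kernel is installed,
// e.g. from an init function after a CPU feature check (assumed name).
var powConstSIMD func(dst, src []float32, c float32)

// powScalar is the always-available fallback.
func powScalar(dst, src []float32, c float32) {
	for i, x := range src {
		dst[i] = float32(math.Pow(float64(x), float64(c)))
	}
}

// powConst dispatches the constant-exponent case.
func powConst(dst, src []float32, c float32) {
	if powConstSIMD != nil {
		powConstSIMD(dst, src, c)
		return // do not fall through to the scalar path
	}
	powScalar(dst, src, c)
}

func main() {
	src := []float32{1, 2, 3}
	dst := make([]float32, len(src))
	powConst(dst, src, 2)
	fmt.Println(dst)
}
```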
Suggestions (non-blocking)
- Document `x=+inf` → `~exp(88.3)` rather than `+inf` (clamping artifact)
- Consider tightening test tolerance from `1e-4` to `1e-5` (measured max is 5.3e-7)
- TODO for fast paths: c=2 (`VMULPS`), c=0.5 (`VSQRTPS`), which avoids 2 transcendentals for common SE-Net exponents
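The fast-path suggestion could look roughly like this in scalar form. It is a sketch only: `powConstFast` is a hypothetical name, and it keeps the flush-to-zero contract for non-positive bases so the special cases agree with the general path. A vectorized version would use `VMULPS` and `VSQRTPS` for the two cases.

```go
package main

import (
	"fmt"
	"math"
)

// powConstFast special-cases c == 2 and c == 0.5 before falling back
// to exp(c*log(x)). Non-positive bases are flushed to 0 in every case.
func powConstFast(dst, src []float32, c float32) {
	for i, x := range src {
		if !(x > 0) {
			dst[i] = 0
			continue
		}
		switch c {
		case 2:
			dst[i] = x * x // one multiply, no transcendentals
		case 0.5:
			dst[i] = float32(math.Sqrt(float64(x)))
		default:
			dst[i] = float32(math.Exp(float64(c) * math.Log(float64(x))))
		}
	}
}

func main() {
	src := []float32{0, 3, 16}
	dst := make([]float32, len(src))
	powConstFast(dst, src, 2)
	fmt.Println(dst)
	powConstFast(dst, src, 0.5)
	fmt.Println(dst)
}
```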
Approved.
Summary

The ONNX `Pow` operator was 6.58% of single-thread BirdNET v2.4 inference, entirely in scalar `math.Pow` (per-element exp/log) on the mel spectrogram. Vectorize the constant-exponent case with a vendored AVX2+FMA kernel computing `out[i] = exp(c·log(in[i]))`: Cephes single-precision `logf` (bitwise frexp + minimax polynomial) composed with the Cephes `expf` already used by the sigmoid kernel, plus a positive-lane mask that flushes `x<=0` to 0 (`pow(0, c>0) == 0`). The elementwise-exponent case and non-AVX2 CPUs keep the scalar path.

Same always-on dispatch and layout as the depthwise (#103) and sigmoid (#97) kernels: package-local `_gen/pow` avo module (separate `go.mod` so avo never enters born's module graph), nil-func dispatch wired in `init`, plain `amd64` tag, scalar fallback.

Benchmarks

`BenchmarkPowConst` (i7-1260P, single core via `taskset`, n=8, benchstat scalar -> simd):

| case | scalar | simd | delta |
| --- | --- | --- | --- |
| mel_511x96 (49056) | 1635us | 66.2us | -95.9% |
| n4096 | 131us | 5.5us | -95.8% |

geomean -95.9% (~24x). CPU profile (300 inferences): `handlePow` 6.58% -> 0.29% of inference.

Correctness

- Full-model parity vs onnxruntime: max abs diff 1.068e-04, top-1 and top-5 match.
- Kernel max relative error vs `math.Pow`: 5.3e-7 (parity test asserts < 1e-4).
- The Cephes `logf` constants are Moshier's single-precision reference values.

Domain: the kernel requires a non-negative base (mel-spectrogram power). It flushes `x<=0` to 0; for a negative base it returns 0 rather than `math.Pow`'s NaN / signed root, so it is not a drop-in for arbitrary Pow inputs. No born model feeds a negative base into a scalar-exponent Pow; this is documented on the dispatch var.

Test plan

- AVX2-vs-`math.Pow` parity across the BirdNET exponents and non-negative inputs (zeros, small, unit, large) spanning vector blocks + sub-8 tails
- `go test -race ./...`, both default and `GOEXPERIMENT=simd` builds, `go vet`, `golangci-lint` clean
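A parity check of the kind described in the test plan needs a way to measure the worst-case relative error between two outputs. The helper below is a hedged sketch: `maxRelErr` is a hypothetical name and not the PR's test code. It falls back to absolute error where the reference is 0, since relative error is undefined there and flushed lanes are expected to be exactly 0.

```go
package main

import (
	"fmt"
	"math"
)

// maxRelErr returns the largest |got-want|/|want| over all elements,
// using the absolute difference wherever want is 0.
// got and want must have the same length.
func maxRelErr(got, want []float32) float64 {
	worst := 0.0
	for i := range want {
		diff := math.Abs(float64(got[i]) - float64(want[i]))
		if want[i] != 0 {
			diff /= math.Abs(float64(want[i]))
		}
		if diff > worst {
			worst = diff
		}
	}
	return worst
}

func main() {
	// Lengths chosen to cover whole 8-lane blocks and sub-8 tails.
	for _, n := range []int{1, 7, 8, 9, 17} {
		want := make([]float32, n)
		got := make([]float32, n)
		for i := range want {
			x := float64(i) * 0.5 // includes a zero at i == 0
			want[i] = float32(math.Pow(x, 0.3))
			got[i] = want[i] // stand-in for the kernel output
		}
		fmt.Println(n, maxRelErr(got, want))
	}
}
```

In a real test `got` would come from the kernel under test and the result would be compared against the chosen tolerance.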