perf(cpu): pool conv im2col scratch buffers#101
Merged
Conversation
Recycle the two large per-convolution scratch buffers in the float32 and float64 im2col conv path through sync.Pool instead of allocating and zeroing them on every call: the im2col column buffer and the matmul output buffer. The matmul now writes the pooled output scratch directly and rearrange permutes it straight into the output tensor, which also removes the old copy(outputData -> tempBuf) memmove. Both buffers are fully overwritten before any read (im2col writes every column entry including padding zeros; matMulColBuf* writes every output element, and cOut*colHeight == len(outputData)), so recycling an un-zeroed buffer is safe. This mirrors the existing colBufT transpose pool and the GEMM packing scratch. A small generic poolScratch helper replaces the get/grow/reslice dance, and the existing colBufT pool is folded onto it with a deferred Put for panic safety. BenchmarkConv2D_Im2col_* (i7-1260P, single core, n=12, benchstat): sec/op geomean -7.5% (-10.8% / -6.2% / -5.3%, p<0.01) B/op -84.6% allocs/op -2 per conv Parity vs onnxruntime on BirdNET v2.4 unchanged: max abs diff 1.135e-04, top-1 and top-5 match. Tests: alloc-free assertion (skipped under -short and -race where AllocsPerRun over sync.Pool is unreliable), bit-exact reuse determinism, poisoned-buffer overwrite proving the full-overwrite contract, and an oracle parity test across the regular-conv shapes covering both the scalar and SIMD-GEMM routing for both dtypes. BenchmarkConv2D_Im2col_* added.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
kolkov
approved these changes
Jun 17, 2026
kolkov
left a comment
Contributor
There was a problem hiding this comment.
Clean memory optimization. Un-zeroed reuse is safe — proven by code analysis and the poisoned-overwrite test.
Correctness
- im2col writes every element including padding zeros — full overwrite guaranteed
- matMulColBufFloat32 writes all
cOut * colHeightpositions — no stale reads - matOut and outputData are always distinct backing arrays — no aliasing
- Pointwise (1x1) paths not affected — bypass im2col entirely
Pool pattern
poolScratch[T] generic helper — correct grow-in-place, correct Get/Put lifecycle. defer Put migration of colBufTPool is a minor safety improvement. All four pools (colBuf/matOut × float32/float64) follow the same pattern.
Tests
Excellent coverage:
TestConv2DScratchAllocFree— zero-alloc verification (skipped under -race, correct tradeoff)TestConv2DPooledReuseDeterministic— 8 consecutive calls produce bit-identical resultsTestConv2DPooledPoisonedOverwrite— sentinel poison proves full overwrite before readTestConv2DPooledMatchesMock— correctness against naive oracle (1e-4 float32, 1e-12 float64)
Suggestion (non-blocking)
Consider adding one sentence to poolScratch doc: "Slices Put back retain their underlying array's full capacity; future Gets reuse that array for requests up to that capacity." Makes the grow-only contract explicit.
Approved.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Recycle the two large per-convolution scratch buffers in the float32 and float64 im2col conv path through
sync.Poolinstead of allocating and zeroing them on every call:colBuf), andmatOut).The matmul now writes the pooled output scratch directly and
rearrangepermutes it straight into the output tensor, which also removes the oldcopy(outputData -> tempBuf)memmove.Both buffers are fully overwritten before any read (im2col writes every column entry including padding zeros;
matMulColBuf*writes every output element, andcOut*colHeight == len(outputData)), so recycling an un-zeroed buffer is safe. This mirrors the existingcolBufTtranspose pool (#99) and the GEMM packing scratch (#96). A small genericpoolScratchhelper replaces the get/grow/reslice dance, and the existingcolBufTpool is folded onto it with a deferredPutfor panic safety.This targets the
runtime.memclrNoHeapPointers+runtime.memmovecost that the CPU profile attributes to per-conv scratch allocation.Benchmarks
BenchmarkConv2D_Im2col_*(i7-1260P, single core viataskset, n=12, benchstat):All p < 0.01. The remaining B/op is the output tensor allocation, which is out of scope here.
Correctness
Full-model parity vs onnxruntime on BirdNET v2.4 is unchanged: max abs diff 1.135e-04, top-1 and top-5 match.
Test plan
go test -race ./...(the alloc-counting test skips under-shortand-race, whereAllocsPerRunover async.Poolis unreliable)GOEXPERIMENT=simd;go vetandgolangci-lintcleanBenchmarkConv2D_Im2col_*added