perf(cpu): pool conv im2col scratch buffers by tphakala · Pull Request #101 · born-ml/born

tphakala · 2026-06-17T11:56:28Z

Summary

Recycle the two large per-convolution scratch buffers in the float32 and float64 im2col conv path through sync.Pool instead of allocating and zeroing them on every call:

the im2col column buffer (colBuf), and
the matmul output buffer (matOut).

The matmul now writes the pooled output scratch directly and rearrange permutes it straight into the output tensor, which also removes the old copy(outputData -> tempBuf) memmove.
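As a rough illustration of that permute step, the sketch below assumes the matmul scratch is laid out row-major as [cOut, batch*spatial] and the output tensor is flattened NCHW. The PR text does not spell out either layout, and `rearrangeInto` and its parameters are hypothetical names, not the repository's code.

```go
package main

import "fmt"

// rearrangeInto permutes matOut, assumed row-major [cOut, batch*spatial],
// straight into out, assumed NCHW-flattened [batch, cOut, spatial].
// Hypothetical helper: both layouts are assumptions, not taken from the PR.
func rearrangeInto(out, matOut []float32, batch, cOut, spatial int) {
	colHeight := batch * spatial // so cOut*colHeight == len(out)
	for c := 0; c < cOut; c++ {
		row := matOut[c*colHeight : (c+1)*colHeight]
		for n := 0; n < batch; n++ {
			dst := out[(n*cOut+c)*spatial : (n*cOut+c+1)*spatial]
			copy(dst, row[n*spatial:(n+1)*spatial])
		}
	}
}

func main() {
	const batch, cOut, spatial = 2, 2, 3
	matOut := make([]float32, cOut*batch*spatial)
	for i := range matOut {
		matOut[i] = float32(i)
	}
	out := make([]float32, len(matOut))
	rearrangeInto(out, matOut, batch, cOut, spatial)
	fmt.Println(out) // [0 1 2 6 7 8 3 4 5 9 10 11]
}
```

Because every element of out is written exactly once from the scratch, no intermediate copy of the output is needed in this arrangement.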

Both buffers are fully overwritten before any read (im2col writes every column entry including padding zeros; matMulColBuf* writes every output element, and cOut*colHeight == len(outputData)), so recycling an un-zeroed buffer is safe. This mirrors the existing colBufT transpose pool (#99) and the GEMM packing scratch (#96). A small generic poolScratch helper replaces the get/grow/reslice dance, and the existing colBufT pool is folded onto it with a deferred Put for panic safety.
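The get/grow/reslice pattern described above can be sketched roughly as follows. This is not the PR's poolScratch (its signature is not shown here); `scratchFromPool` and `demoPool` are hypothetical names, and the sketch assumes the pool stores `*[]T` so that Put does not allocate.

```go
package main

import (
	"fmt"
	"sync"
)

// demoPool is a hypothetical pool holding *[]float32 / *[]float64 values.
var demoPool sync.Pool

// scratchFromPool returns a pointer to a slice of length n taken from pool,
// growing the backing array only when its capacity is too small.
// The contents are NOT zeroed: the caller must overwrite every element
// before reading it.
func scratchFromPool[T any](pool *sync.Pool, n int) *[]T {
	p, _ := pool.Get().(*[]T) // nil when the pool is empty or holds another type
	if p == nil {
		p = new([]T)
	}
	if cap(*p) < n {
		*p = make([]T, n) // grow: replace the backing array
	}
	*p = (*p)[:n] // reslice to the requested length
	return p
}

func main() {
	p := scratchFromPool[float32](&demoPool, 8)
	defer demoPool.Put(p) // deferred Put returns the buffer even on panic

	buf := *p
	for i := range buf {
		buf[i] = float32(i) // full overwrite before any read
	}
	fmt.Println(len(buf), cap(buf) >= 8) // 8 true
}
```

Note that sync.Pool may drop items at any time, so callers cannot rely on getting the same backing array back; the helper only guarantees the requested length.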

This targets the runtime.memclrNoHeapPointers + runtime.memmove cost that the CPU profile attributes to per-conv scratch allocation.

Benchmarks

BenchmarkConv2D_Im2col_* (i7-1260P, single core via taskset, n=12, benchstat):

metric	GEMM	Deep	Strided	geomean
sec/op	-10.8%	-6.2%	-5.3%	-7.5%
B/op	-84.6%	-84.6%	-84.6%	-84.6%
allocs/op	8 -> 5	8 -> 5	7 -> 5

All p < 0.01. The remaining B/op is the output tensor allocation, which is out of scope here.

Correctness

Full-model parity vs onnxruntime on BirdNET v2.4 is unchanged: max abs diff 1.135e-04, top-1 and top-5 match.

Test plan

go test -race ./... (the alloc-counting test skips under -short and -race, where AllocsPerRun over a sync.Pool is unreliable)
New tests: alloc-free assertion, bit-exact reuse determinism, poisoned-buffer overwrite proving the full-overwrite contract, and an oracle parity test across the regular-conv shapes covering both scalar and SIMD-GEMM routing for both dtypes
Builds clean on default and GOEXPERIMENT=simd; go vet and golangci-lint clean
BenchmarkConv2D_Im2col_* added
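For readers unfamiliar with allocation counting, here is a minimal sketch of how an alloc-counting check might compare a fresh-allocation path with a pooled one using testing.AllocsPerRun. The functions `convFresh`, `convPooled`, and `measure`, and the buffer size, are hypothetical stand-ins and do not reproduce the PR's tests. A real test would skip under -short and -race, as noted above.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

const scratchLen = 1 << 16 // large enough that make() is heap-allocated

var (
	scratchPool sync.Pool // holds *[]float32
	sink        float32   // keeps the work observable
)

// fill overwrites every element, standing in for im2col/matmul writes.
func fill(buf []float32) {
	for i := range buf {
		buf[i] = float32(i)
	}
	sink += buf[len(buf)-1]
}

// convFresh allocates (and zeroes) a new scratch buffer on every call.
func convFresh() {
	fill(make([]float32, scratchLen))
}

// convPooled reuses a buffer from the pool when one is available.
func convPooled() {
	p, _ := scratchPool.Get().(*[]float32)
	if p == nil || cap(*p) < scratchLen {
		s := make([]float32, scratchLen)
		p = &s
	}
	defer scratchPool.Put(p)
	fill((*p)[:scratchLen])
}

// measure reports average allocations per call for both variants.
func measure() (fresh, pooled float64) {
	fresh = testing.AllocsPerRun(100, convFresh)
	pooled = testing.AllocsPerRun(100, convPooled)
	return fresh, pooled
}

func main() {
	fresh, pooled := measure()
	fmt.Printf("fresh=%.0f allocs/op pooled=%.0f allocs/op\n", fresh, pooled)
}
```

Because the garbage collector may empty a sync.Pool between runs, the pooled figure is a typical value rather than a guarantee.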

codecov · 2026-06-17T12:01:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

kolkov

Clean memory optimization. Un-zeroed reuse is safe — proven by code analysis and the poisoned-overwrite test.

Correctness

im2col writes every element including padding zeros — full overwrite guaranteed
matMulColBufFloat32 writes all cOut * colHeight positions — no stale reads
matOut and outputData are always distinct backing arrays — no aliasing
Pointwise (1x1) paths not affected — bypass im2col entirely
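To make the full-overwrite argument concrete, the following simplified single-channel im2col writes an explicit zero for every padded position, so a poisoned (NaN-filled) buffer comes out clean. It is a sketch under stated assumptions: the function names, the column layout, and the single-channel restriction are illustrative and are not the repository's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// im2col fills col for a single-channel h x w input. Every entry is written:
// in-bounds taps copy the input, out-of-bounds taps store an explicit zero.
// col must have length kH*kW*outH*outW.
func im2col(col, in []float32, h, w, kH, kW, pad, stride int) (outH, outW int) {
	outH = (h+2*pad-kH)/stride + 1
	outW = (w+2*pad-kW)/stride + 1
	idx := 0
	for ky := 0; ky < kH; ky++ {
		for kx := 0; kx < kW; kx++ {
			for oy := 0; oy < outH; oy++ {
				for ox := 0; ox < outW; ox++ {
					iy := oy*stride + ky - pad
					ix := ox*stride + kx - pad
					if iy >= 0 && iy < h && ix >= 0 && ix < w {
						col[idx] = in[iy*w+ix]
					} else {
						col[idx] = 0 // padding is written, not assumed
					}
					idx++
				}
			}
		}
	}
	return outH, outW
}

// poison fills buf with NaN to simulate a dirty recycled buffer.
func poison(buf []float32) {
	nan := float32(math.NaN())
	for i := range buf {
		buf[i] = nan
	}
}

// hasPoison reports whether any NaN sentinel survived.
func hasPoison(buf []float32) bool {
	for _, v := range buf {
		if math.IsNaN(float64(v)) {
			return true
		}
	}
	return false
}

func main() {
	in := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9} // 3x3 input
	col := make([]float32, 3*3*3*3)            // 3x3 kernel, pad 1, stride 1
	poison(col)
	im2col(col, in, 3, 3, 3, 3, 1, 1)
	fmt.Println(hasPoison(col)) // false
}
```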

Pool pattern

poolScratch[T] generic helper — correct grow-in-place, correct Get/Put lifecycle. defer Put migration of colBufTPool is a minor safety improvement. All four pools (colBuf/matOut × float32/float64) follow the same pattern.

Tests

Excellent coverage:

TestConv2DScratchAllocFree — zero-alloc verification (skipped under -race, correct tradeoff)
TestConv2DPooledReuseDeterministic — 8 consecutive calls produce bit-identical results
TestConv2DPooledPoisonedOverwrite — sentinel poison proves full overwrite before read
TestConv2DPooledMatchesMock — correctness against naive oracle (1e-4 float32, 1e-12 float64)

Suggestion (non-blocking)

Consider adding one sentence to poolScratch doc: "Slices Put back retain their underlying array's full capacity; future Gets reuse that array for requests up to that capacity." Makes the grow-only contract explicit.
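The capacity-retention behaviour behind that suggested sentence can be shown with plain slices. `resliceOrGrow` below is a hypothetical helper for illustration only, not part of the PR.

```go
package main

import "fmt"

// resliceOrGrow returns a slice of length n. It reuses buf's backing array
// when n fits within cap(buf) and allocates a new array otherwise.
// Hypothetical helper illustrating a grow-only contract.
func resliceOrGrow[T any](buf []T, n int) (out []T, reused bool) {
	if n <= cap(buf) {
		return buf[:n], true
	}
	return make([]T, n), false
}

func main() {
	pooled := make([]float32, 4, 16) // put back with len 4 but cap 16

	a, reusedA := resliceOrGrow(pooled, 12)
	b, reusedB := resliceOrGrow(pooled, 32)

	fmt.Println(len(a), cap(a), reusedA) // 12 16 true
	fmt.Println(len(b), reusedB)         // 32 false
}
```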

Approved.

tphakala requested a review from kolkov as a code owner June 17, 2026 11:56

kolkov approved these changes Jun 17, 2026

kolkov merged commit f562268 into born-ml:main Jun 17, 2026
10 checks passed

kolkov merged 1 commit into born-ml:main from tphakala:perf/cpu-conv-im2col-pooling
