
perf(cpu): pool conv im2col scratch buffers #101

Merged
kolkov merged 1 commit into born-ml:main from tphakala:perf/cpu-conv-im2col-pooling
Jun 17, 2026

Conversation

@tphakala

Contributor

Summary

Recycle the two large per-convolution scratch buffers in the float32 and float64 im2col conv path through sync.Pool instead of allocating and zeroing them on every call:

  • the im2col column buffer (colBuf), and
  • the matmul output buffer (matOut).

The matmul now writes the pooled output scratch directly and rearrange permutes it straight into the output tensor, which also removes the old copy(outputData -> tempBuf) memmove.

Both buffers are fully overwritten before any read (im2col writes every column entry including padding zeros; matMulColBuf* writes every output element, and cOut*colHeight == len(outputData)), so recycling an un-zeroed buffer is safe. This mirrors the existing colBufT transpose pool (#99) and the GEMM packing scratch (#96). A small generic poolScratch helper replaces the get/grow/reslice dance, and the existing colBufT pool is folded onto it with a deferred Put for panic safety.
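The get/grow/reslice pattern described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the signature of `poolScratch`, the pool variable `f32Pool`, and the caller `sumSquares` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// f32Pool is a hypothetical pool of *[]float32. Pooling a pointer to the
// slice (rather than the slice itself) keeps Put from allocating.
var f32Pool = sync.Pool{New: func() any { return new([]float32) }}

// poolScratch is an assumed shape for the helper, not the merged code.
// It returns a length-n slice backed by a pooled array together with the
// pointer to hand back to the pool. The contents are NOT zeroed, so the
// caller must write every element before reading any.
func poolScratch[T any](p *sync.Pool, n int) ([]T, *[]T) {
	bp := p.Get().(*[]T)
	if cap(*bp) < n {
		*bp = make([]T, n) // grow: swap in a larger backing array
	}
	*bp = (*bp)[:n] // reslice within the retained capacity
	return *bp, bp
}

// sumSquares uses the scratch as a fully overwritten temporary.
func sumSquares(xs []float32) float32 {
	buf, bp := poolScratch[float32](&f32Pool, len(xs))
	defer f32Pool.Put(bp) // returned to the pool even if the body panics
	for i, x := range xs {
		buf[i] = x * x // every element is written before the read loop
	}
	var s float32
	for _, v := range buf {
		s += v
	}
	return s
}

func main() {
	fmt.Println(sumSquares([]float32{1, 2, 3})) // 1 + 4 + 9
}
```

The deferred `Put` is what gives the panic safety mentioned above: the buffer goes back to the pool on every exit path.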

This targets the runtime.memclrNoHeapPointers + runtime.memmove cost that the CPU profile attributes to per-conv scratch allocation.

Benchmarks

BenchmarkConv2D_Im2col_* (i7-1260P, single core via taskset, n=12, benchstat):

metric      GEMM      Deep      Strided   geomean
sec/op      -10.8%    -6.2%     -5.3%     -7.5%
B/op        -84.6%    -84.6%    -84.6%    -84.6%
allocs/op   8 -> 5    8 -> 5    7 -> 5

All p < 0.01. The remaining B/op is the output tensor allocation, which is out of scope here.

Correctness

Full-model parity vs onnxruntime on BirdNET v2.4 is unchanged: max abs diff 1.135e-04, top-1 and top-5 match.

Test plan

  • go test -race ./... (the alloc-counting test skips under -short and -race, where AllocsPerRun over a sync.Pool is unreliable)
  • New tests: alloc-free assertion, bit-exact reuse determinism, poisoned-buffer overwrite proving the full-overwrite contract, and an oracle parity test across the regular-conv shapes covering both scalar and SIMD-GEMM routing for both dtypes
  • Builds clean on default and GOEXPERIMENT=simd; go vet and golangci-lint clean
  • BenchmarkConv2D_Im2col_* added
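As a rough illustration of what an alloc-counting check measures, the sketch below compares a freshly allocated scratch buffer with a pooled one using `testing.AllocsPerRun`. The names (`benchPool`, `unpooled`, `pooled`, `measure`) and the buffer size are hypothetical; the PR's actual test is not reproduced here.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

var (
	benchPool = sync.Pool{New: func() any { return new([]float32) }}
	sink      []float32 // forces the unpooled buffer to escape to the heap
)

const scratchLen = 4096 // arbitrary size chosen for the example

// fill writes every element, standing in for the conv kernel's work.
func fill(buf []float32) {
	for i := range buf {
		buf[i] = float32(i)
	}
}

// unpooled allocates (and zeroes) a new scratch buffer on every call.
func unpooled() {
	buf := make([]float32, scratchLen)
	fill(buf)
	sink = buf
}

// pooled recycles the scratch buffer through sync.Pool.
func pooled() {
	bp := benchPool.Get().(*[]float32)
	if cap(*bp) < scratchLen {
		*bp = make([]float32, scratchLen)
	}
	*bp = (*bp)[:scratchLen]
	fill(*bp)
	benchPool.Put(bp)
}

// measure returns the average allocations per call for both variants.
func measure() (plain, recycled float64) {
	return testing.AllocsPerRun(100, unpooled), testing.AllocsPerRun(100, pooled)
}

func main() {
	plain, recycled := measure()
	fmt.Printf("unpooled: %.0f allocs/op, pooled: %.0f allocs/op\n", plain, recycled)
}
```

Because `sync.Pool` may drop items (notably under the race detector), a strict zero-alloc assertion is only meaningful in a normal build, which is consistent with the skip conditions listed in the test plan.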

Recycle the two large per-convolution scratch buffers in the float32 and
float64 im2col conv path through sync.Pool instead of allocating and
zeroing them on every call: the im2col column buffer and the matmul
output buffer. The matmul now writes the pooled output scratch directly
and rearrange permutes it straight into the output tensor, which also
removes the old copy(outputData -> tempBuf) memmove.

Both buffers are fully overwritten before any read (im2col writes every
column entry including padding zeros; matMulColBuf* writes every output
element, and cOut*colHeight == len(outputData)), so recycling an
un-zeroed buffer is safe. This mirrors the existing colBufT transpose
pool and the GEMM packing scratch. A small generic poolScratch helper
replaces the get/grow/reslice dance, and the existing colBufT pool is
folded onto it with a deferred Put for panic safety.

BenchmarkConv2D_Im2col_* (i7-1260P, single core, n=12, benchstat):
  sec/op     geomean -7.5%  (-10.8% / -6.2% / -5.3%, p<0.01)
  B/op       -84.6%
  allocs/op  -2 per conv

Parity vs onnxruntime on BirdNET v2.4 unchanged: max abs diff 1.135e-04,
top-1 and top-5 match.

Tests: alloc-free assertion (skipped under -short and -race where
AllocsPerRun over sync.Pool is unreliable), bit-exact reuse determinism,
poisoned-buffer overwrite proving the full-overwrite contract, and an
oracle parity test across the regular-conv shapes covering both the
scalar and SIMD-GEMM routing for both dtypes. BenchmarkConv2D_Im2col_*
added.
@tphakala tphakala requested a review from kolkov as a code owner June 17, 2026 11:56
@codecov

codecov Bot commented Jun 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@kolkov kolkov left a comment

Contributor

Clean memory optimization. Un-zeroed reuse is safe — proven by code analysis and the poisoned-overwrite test.

Correctness

  • im2col writes every element including padding zeros — full overwrite guaranteed
  • matMulColBufFloat32 writes all cOut * colHeight positions — no stale reads
  • matOut and outputData are always distinct backing arrays — no aliasing
  • Pointwise (1x1) paths not affected — bypass im2col entirely
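The full-overwrite argument can be illustrated with a simplified single-channel im2col and a poisoned buffer. This is a sketch under assumptions: the function `im2col`, its parameter layout, and the helper `poisonedRun` are hypothetical and much simpler than the library's multi-channel, batched implementation.

```go
package main

import (
	"fmt"
	"math"
)

// im2col unrolls a single-channel h x w image into a (k*k) x (outH*outW)
// column matrix for a k x k kernel with the given stride and zero padding.
// Every entry of col is assigned, including explicit zeros for taps that
// fall in the padding, so col may hold arbitrary stale data on entry.
func im2col(img []float32, h, w, k, stride, pad int, col []float32) {
	outH := (h+2*pad-k)/stride + 1
	outW := (w+2*pad-k)/stride + 1
	i := 0
	for ky := 0; ky < k; ky++ {
		for kx := 0; kx < k; kx++ {
			for oy := 0; oy < outH; oy++ {
				for ox := 0; ox < outW; ox++ {
					y := oy*stride + ky - pad
					x := ox*stride + kx - pad
					if y < 0 || y >= h || x < 0 || x >= w {
						col[i] = 0 // padding tap: written explicitly
					} else {
						col[i] = img[y*w+x]
					}
					i++
				}
			}
		}
	}
}

// poisonedRun fills the column buffer with NaN, runs im2col on an all-ones
// 3x3 image, and reports how many NaNs survived plus the sum of the result.
func poisonedRun() (stale int, sum float32) {
	const h, w, k, stride, pad = 3, 3, 3, 1, 1
	img := make([]float32, h*w)
	for i := range img {
		img[i] = 1
	}
	outH := (h+2*pad-k)/stride + 1
	outW := (w+2*pad-k)/stride + 1
	col := make([]float32, k*k*outH*outW)
	for i := range col {
		col[i] = float32(math.NaN()) // poison: any unwritten slot stays NaN
	}
	im2col(img, h, w, k, stride, pad, col)
	for _, v := range col {
		if math.IsNaN(float64(v)) {
			stale++
		}
		sum += v
	}
	return stale, sum
}

func main() {
	stale, sum := poisonedRun()
	fmt.Println("stale entries:", stale, "sum:", sum)
}
```

If any column entry were skipped, the poison would survive and the sum would become NaN, which is the property a poisoned-overwrite test relies on.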

Pool pattern

poolScratch[T] generic helper — correct grow-in-place, correct Get/Put lifecycle. defer Put migration of colBufTPool is a minor safety improvement. All four pools (colBuf/matOut × float32/float64) follow the same pattern.

Tests

Excellent coverage:

  • TestConv2DScratchAllocFree — zero-alloc verification (skipped under -race, correct tradeoff)
  • TestConv2DPooledReuseDeterministic — 8 consecutive calls produce bit-identical results
  • TestConv2DPooledPoisonedOverwrite — sentinel poison proves full overwrite before read
  • TestConv2DPooledMatchesMock — correctness against naive oracle (1e-4 float32, 1e-12 float64)

Suggestion (non-blocking)

Consider adding one sentence to poolScratch doc: "Slices Put back retain their underlying array's full capacity; future Gets reuse that array for requests up to that capacity." Makes the grow-only contract explicit.
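The capacity behaviour behind this suggestion follows from ordinary Go slice semantics, as the small sketch below shows. The helper `reuse` is hypothetical and only stands in for the capacity check a pool helper would perform.

```go
package main

import "fmt"

// reuse reports whether a request for n elements can be served from buf's
// existing backing array, and returns the resliced view when it can.
func reuse(buf []float32, n int) ([]float32, bool) {
	if n > cap(buf) {
		return nil, false // a larger backing array would be needed
	}
	return buf[:n], true // reslicing up to cap keeps the same array
}

func main() {
	big := make([]float32, 8)
	small := big[:3] // what a pool might hold after a small request
	fmt.Println(len(small), cap(small))

	again, ok := reuse(small, 8)
	fmt.Println(ok, len(again), &again[0] == &big[0])

	_, ok = reuse(small, 9)
	fmt.Println(ok)
}
```

A slice that has been shortened still carries its full capacity, so a later, larger request can be satisfied without a new allocation as long as it fits.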

Approved.

@kolkov kolkov merged commit f562268 into born-ml:main Jun 17, 2026
10 checks passed
2 participants