
Netty memory allocation issues #10597

@zilm13

Description

We see exceptions like this in our logs after bumping Netty from 4.2.10.Final to 4.2.12.Final:

{"@timestamp":"2026-04-22T03:43:25,124","level":"ERROR","thread":"beaconchain-async-4","class":"teku-status-log","message":"PLEASE FIX OR REPORT | Unexpected exception thrown for beaconchain-async-4","throwable":"java.util.concurrent.CompletionException: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 131072 byte(s) of direct memory (used: 33554432, max: 33554432)\n\tat java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)\n\tat java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)\n\tat java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936)\n\tat java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)\n\tat tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$propagateResult$2(SafeFuture.java:150)\n\tat java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\tat java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)\n\tat tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$propagateResult$2(SafeFuture.java:150)\n\tat java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\tat java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)\n\tat 
tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$propagateToAsync$29(SafeFuture.java:461)\n\tat tech.pegasys.teku.infrastructure.async.SafeFuture.of(SafeFuture.java:82)\n\tat tech.pegasys.teku.infrastructure.async.AsyncRunner.lambda$runAsync$2(AsyncRunner.java:47)\n\tat tech.pegasys.teku.infrastructure.async.SafeFuture.of(SafeFuture.java:74)\n\tat tech.pegasys.teku.infrastructure.async.ScheduledExecutorAsyncRunner.lambda$createRunnableForAction$1(ScheduledExecutorAsyncRunner.java:124)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 131072 byte(s) of direct memory (used: 33554432, max: 33554432)\n\tat io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:1102)\n\tat io.netty.util.internal.CleanerJava9$CleanableDirectBufferImpl.<init>(CleanerJava9.java:136)\n\tat io.netty.util.internal.CleanerJava9$CleanableDirectBufferImpl.<init>(CleanerJava9.java:131)\n\tat io.netty.util.internal.CleanerJava9.allocate(CleanerJava9.java:86)\n\tat io.netty.util.internal.PlatformDependent.allocateDirect(PlatformDependent.java:633)\n\tat io.netty.buffer.UnpooledDirectByteBuf.allocateDirectBuffer(UnpooledDirectByteBuf.java:129)\n\tat io.netty.buffer.UnpooledDirectByteBuf.<init>(UnpooledDirectByteBuf.java:70)\n\tat io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:50)\n\tat io.netty.buffer.UnsafeByteBufUtil.newDirectByteBuf(UnsafeByteBufUtil.java:734)\n\tat io.netty.buffer.AdaptiveByteBufAllocator$DirectChunkAllocator.allocate(AdaptiveByteBufAllocator.java:114)\n\tat io.netty.buffer.AdaptivePoolingAllocator$SizeClassChunkController.newChunkAllocation(AdaptivePoolingAllocator.java:734)\n\tat 
io.netty.buffer.AdaptivePoolingAllocator$Magazine.allocate(AdaptivePoolingAllocator.java:957)\n\tat io.netty.buffer.AdaptivePoolingAllocator$Magazine.tryAllocate(AdaptivePoolingAllocator.java:854)\n\tat io.netty.buffer.AdaptivePoolingAllocator$MagazineGroup.allocate(AdaptivePoolingAllocator.java:421)\n\tat io.netty.buffer.AdaptivePoolingAllocator.allocate(AdaptivePoolingAllocator.java:269)\n\tat io.netty.buffer.AdaptivePoolingAllocator.allocate(AdaptivePoolingAllocator.java:255)\n\tat io.netty.buffer.AdaptiveByteBufAllocator.newDirectBuffer(AdaptiveByteBufAllocator.java:67)\n\tat io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)\n\tat io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:154)\n\tat io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:88)\n\tat tech.pegasys.teku.networking.p2p.libp2p.rpc.LibP2PRpcStream.writeBytes(LibP2PRpcStream.java:47)\n\tat tech.pegasys.teku.networking.eth2.rpc.core.RpcResponseCallback.completeWithErrorResponse(RpcResponseCallback.java:64)\n\tat tech.pegasys.teku.networking.eth2.rpc.core.RpcResponseCallback.completeWithUnexpectedError(RpcResponseCallback.java:81)\n\tat tech.pegasys.teku.networking.eth2.rpc.beaconchain.methods.LoggingResponseCallback.completeWithUnexpectedError(LoggingResponseCallback.java:52)\n\tat tech.pegasys.teku.networking.eth2.rpc.beaconchain.methods.CompletionAwareResponseCallback.completeWithUnexpectedError(CompletionAwareResponseCallback.java:75)\n\tat tech.pegasys.teku.networking.eth2.rpc.core.PeerRequiredLocalMessageHandler.handleError(PeerRequiredLocalMessageHandler.java:73)\n\tat tech.pegasys.teku.networking.eth2.rpc.beaconchain.methods.DataColumnSidecarsByRangeMessageHandler.lambda$onIncomingMessage$2(DataColumnSidecarsByRangeMessageHandler.java:200)\n\tat tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$finish$39(SafeFuture.java:503)\n\tat 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)\n\t... 21 more\n"}

By default Netty has a 32 MB limit for direct memory allocation.
Options to fix it:

  • Force the pooled allocator at runtime
    Netty 4.2.x added the adaptive allocator as the default; the old PooledByteBufAllocator is still shipped and is battle-tested. Flip the default back for the Teku process
    only:
    // build.gradle applicationDefaultJvmArgs
    "-Dio.netty.allocator.type=pooled",

    This avoids the adaptive allocator entirely. You keep 4.2.12's CVE fixes / bug fixes, but sidestep the allocator regression. Recommended first step.

  • Disable the adaptive path directly
    There's also a narrower switch that only affects whether adaptive is used:

    "-Dio.netty.allocator.useAdaptiveAllocator=false",

  • Increase the direct buffer size limit from 32 MB to some greater value, say 64 MB
    // in build.gradle applicationDefaultJvmArgs:
    // 64 MB for Netty direct ByteBufs
    "-Dio.netty.maxDirectMemory=67108864",

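Pulled together, the flags above might land in Teku's build.gradle roughly like this (a sketch, not a proposed patch; only the -D property names come from the options above, and the two allocator-related flags are alternatives rather than both required):

```groovy
// build.gradle (Gradle application plugin)
applicationDefaultJvmArgs = [
    // Option 1: force the pre-4.2 pooled allocator instead of adaptive
    "-Dio.netty.allocator.type=pooled",
    // Option 3: raise Netty's direct-memory cap from 32 MB to 64 MB
    "-Dio.netty.maxDirectMemory=67108864",
]
```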

The difference between the pooling strategies:

Adaptive (AdaptiveByteBufAllocator) — Netty 4.2's default, what bit us

  • Pools buffers in per-thread magazines. Each magazine holds a handful of large chunks (up to 4 MB each). Allocations carve slices out of a chunk; releases return the slice
    to the magazine.
  • Self-tunes. The magazine decides its chunk size dynamically based on recent allocation patterns — hot threads get bigger chunks, quiet threads shrink.
  • Fast path is lock-free. Most allocations are thread-local, no contention.
  • Downside: retention. A magazine keeps its chunks even after buffers are released, so idle memory sits allocated. With many event-loop threads and bursty traffic, total
    retained direct memory can be far larger than the sum of currently-live buffers. That's exactly what hit our 32 MB ceiling.
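To make the retention point concrete, here's a toy, stdlib-only model of magazine-style pooling (all names here are mine and slice bookkeeping is elided — this is not Netty's AdaptivePoolingAllocator): after the only buffer is released, live bytes drop to zero but the magazine still holds its whole chunk.

```java
import java.util.ArrayDeque;

// Toy model of magazine-style pooling. Illustration only, not Netty code.
class ToyMagazine {
    static final int CHUNK = 4 * 1024 * 1024; // 4 MB, like adaptive's largest chunks
    private final ArrayDeque<byte[]> chunks = new ArrayDeque<>();
    long live = 0;      // bytes handed out and not yet released
    long retained = 0;  // bytes the magazine holds on to, live or not

    byte[] allocate(int size) {
        if (chunks.isEmpty()) {        // no cached chunk: grab a fresh one
            chunks.push(new byte[CHUNK]);
            retained += CHUNK;
        }
        live += size;
        return chunks.peek();          // callers get a slice of this chunk
    }

    void release(int size) {
        live -= size;                  // the chunk stays cached in the magazine
    }
}

public class RetentionDemo {
    public static void main(String[] args) {
        ToyMagazine m = new ToyMagazine();
        m.allocate(131072);            // one 128 KiB buffer, like the failed allocation
        m.release(131072);
        // Live bytes are zero, yet the full 4 MB chunk is still retained.
        System.out.println("live=" + m.live + " retained=" + m.retained);
    }
}
```

Multiply one retained-but-idle chunk by the number of event-loop threads and a 32 MB cap is easy to blow through.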

Unpooled (UnpooledByteBufAllocator)

  • No pool. Every allocate() does a fresh native direct-buffer creation; every release() frees it (via the JDK Cleaner).
  • No retention. Direct-memory usage ≈ currently-live buffers, near-zero steady-state overhead.
  • Slow allocation/free. Each allocation involves a syscall-ish malloc; each free creates a PhantomReference for the Cleaner, so you also get more GC pressure from reference
    processing.
  • Not viable for Teku's hot paths. Gossip, RPC streaming, and discovery allocate buffers at high rates. Unpooled in those paths will tank throughput.
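For contrast, the unpooled model in stdlib-only terms (plain java.nio, not Netty's UnpooledByteBufAllocator, though the mechanism is the same family): each allocation is a fresh native buffer, and the JDK reclaims the native memory through a Cleaner once the buffer becomes unreachable.

```java
import java.nio.ByteBuffer;

public class UnpooledSketch {
    // Every call performs a fresh native allocation. The JDK registers the
    // buffer with a Cleaner, so the native memory is only reclaimed after GC
    // notices the buffer is unreachable -- this is the reference-processing
    // cost an unpooled strategy pays on every buffer.
    static ByteBuffer freshDirect(int size) {
        return ByteBuffer.allocateDirect(size);
    }

    public static void main(String[] args) {
        ByteBuffer buf = freshDirect(131072); // same size as the failed allocation
        System.out.println(buf.isDirect() + " " + buf.capacity());
        // buf goes unreachable here; its native memory returns only after cleaning.
    }
}
```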

PooledByteBufAllocator sits between the two:

  • Pools, like adaptive, but with fixed-shape arenas (pre-Netty-4.2's tried-and-true design).
  • One arena per event-loop thread, buffers come from size classes.
  • Lower per-thread retention than adaptive because arenas don't inflate chunk size dynamically.
  • Well-understood memory profile — the one pre-4.2 Teku was implicitly using for years.
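As a sketch of the "buffers come from size classes" idea (a deliberate simplification — Netty's real size-class table is finer-grained than plain powers of two): requests are rounded up to a fixed slot size, so released buffers slot back into predictable free lists instead of chunks growing to fit demand.

```java
public class SizeClasses {
    // Toy normalization: round a request up to the next power of two.
    // Real pooled arenas use a finer size-class table; this is a simplification.
    static int normalize(int reqCapacity) {
        if (reqCapacity <= 16) return 16; // illustrative minimum slot size
        return Integer.highestOneBit(reqCapacity - 1) << 1;
    }

    public static void main(String[] args) {
        System.out.println(normalize(131072));  // exact power of two: unchanged
        System.out.println(normalize(100_000)); // rounded up to the next slot
    }
}
```

Because the slot sizes are fixed up front, per-thread retention is bounded by the arena shape rather than by recent traffic, which is the profile the issue's pooled option falls back to.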

I've tried "-Dio.netty.allocator.type=pooled" (PooledByteBufAllocator) on one of the nightly nodes and the exceptions went away. But increasing the direct memory limit could be a better long-term strategy. Either way, we should commit something to the default args so it applies to any Teku run.
