Optimize batch-read to avoid wasted disk IO by dao-jun · Pull Request #4741 · apache/bookkeeper

dao-jun · 2026-04-09T06:10:33Z

Motivation

Batch reads currently fetch the next entry before checking whether it can still fit within the maxSize budget. When the last entry in the batch exceeds the remaining space, the read result is discarded, but the disk IO has already happened. This creates wasted IO on the critical read path.

This change avoids that extra read while preserving the existing batch-read behavior, including returning the first entry even when a single entry alone exceeds the requested maxSize.

Changes

Add a bounded-read path (readEntryIfFits / getEntryIfFits) through the batch-read stack, from BatchedReadEntryProcessor down to Bookie, LedgerDescriptor, LedgerStorage, and EntryLogger.
Update batch-read logic to:
- read the first entry as before
- compute the remaining response budget for subsequent entries
- stop without reading the next entry when it cannot fit
Implement size-aware entry reads in both DefaultEntryLogger and DirectEntryLogger by checking entry size from metadata before loading the full payload.
Wire the bounded-read path through all storage implementations, including:
- InterleavedLedgerStorage
- SortedLedgerStorage
- DbLedgerStorage / SingleDirectoryDbLedgerStorage
Keep framing semantics consistent by treating the per-entry 4-byte delimiter as part of the size budget.
Update mock batch-read behavior to match the production semantics.
Add regression and boundary tests covering:
- first entry larger than maxSize
- exact-fit remaining budget
- oversized subsequent entries
- default entry logger and direct entry logger paths
- interleaved, sorted, and DB ledger storage paths
- mock and client-facing batch-read behavior

In order to uphold a high standard for quality for code contributions, Apache BookKeeper runs various precommit
checks for pull requests. A pull request can only be merged when it passes precommit checks.

Be sure to do all the following to help us incorporate your contribution
quickly and easily:

If this PR is a BookKeeper Proposal (BP):

Make sure the PR title is formatted like:
<BP-#>: Description of bookkeeper proposal
e.g. BP-1: 64 bits ledger is support

Attach the master issue link in the description of this PR.

Attach the google doc link if the BP is written in Google Doc.

Otherwise:

Make sure the PR title is formatted like:
<Issue #>: Description of pull request
e.g. Issue 123: Description ...

Make sure tests pass via mvn clean apache-rat:check install spotbugs:check.

Replace <Issue #> in the title with the actual Issue number.

eolivelli

I can't find tests about failure scenarios.

Memory leaks can happen in case of failures (reading from storage...), client may see the wrong error code.....

dao-jun · 2026-04-13T13:04:37Z

@eolivelli I added 3 tests in BatchedReadEntryProcessorTest, PTAL

# Conflicts: # bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/DefaultEntryLogger.java

merlimat · 2026-05-12T22:17:36Z

@dao-jun Thanks for the optimization. I went through the PR end-to-end — overall the direction is good and the framing-budget contract is clean and consistent across the storage stack. A few things I'd like to discuss before merge.

Overall

The core idea is sound: today the batch-read path reads each entry from disk and only then checks whether it fits in maxSize, throwing the result away on overflow. This PR threads the remaining response budget down through Bookie → LedgerDescriptor → LedgerStorage → EntryLogger, lets DefaultEntryLogger/DirectEntryLogger short-circuit by reading just the size header, and updates BatchedReadEntryProcessor to use the bounded API for entries 2..N (keeping the "always return the first entry" semantics).

The framing budget contract is consistent end-to-end (entry.readableBytes() + 4 > maxEntrySize everywhere, matching what BatchedReadEntryProcessor adds to frameSize). Tests are thorough — exact-fit, off-by-one budgets, oversized-first-entry, missing entries, IOException propagation, all the storage flavors. Reference counting on the new paths checks out: WriteCache.get()/ReadCache.get() return freshly-allocated buffers, readEntrySize returns a thread-local, and ByteBufList.deallocate releases members.

Substantive concerns

1. Undocumented behavior change in exception handling — BatchedReadEntryProcessor.readData

} catch (Bookie.NoEntryException e) {
    if (data == null) { throw e; }
    break;
} catch (Throwable e) {
    if (data != null) { data.release(); }
    throw e;
}

The old code's catch (Throwable) silently broke and returned partial data on any error after the first entry — including transient IOExceptions. The new code only swallows NoEntryException; any other exception now propagates as EIO to the client, even if some entries were already read successfully. This is arguably more correct (hiding I/O errors is bad), but it's a real behavior change that isn't called out in the description and may surprise clients that relied on the lenient behavior. Could you call this out explicitly in the PR description?

2. Stat pollution in BookieImpl.readEntryIfFits

When entry == null (size-rejected), entrySize is 0, but the method still calls registerSuccessfulValue(0) on readBytesStats and records latency on readEntryStats. That mixes "actually read N bytes" and "size-rejected, didn't read" into the same histograms — the bytes histogram gets zero-spikes, and the latency histogram skews artificially low because size-rejected reads are much faster than full reads. Either skip recording on the null branch, or add a separate counter for size-rejected reads.

3. Significant code duplication in SingleDirectoryDbLedgerStorage.doGetEntryIfFits

doGetEntryIfFits is ~90 lines that mostly mirror doGetEntry, with the same if (entry.readableBytes() + Integer.BYTES > maxEntrySize) { release; return null; } block repeated four times (write cache, write-cache-being-flushed, read cache, after entry-log read). Could be reduced significantly by extracting a small checkBudget(ByteBuf, long) helper, or by adding a budget parameter to doGetEntry with a sentinel for "no limit". Smaller diff, easier to keep in sync going forward.

4. MockBookies.batchReadEntries no longer releases the skipped entry

The old code did entry.release() before breaking on overflow; the new code drops that call. This is actually a fix — MockLedgerData.getEntry() returns the same shared ByteBuf it stores in its map, so the old release() could free the only ref and corrupt subsequent reads (which is exactly what testBatchReadDoesNotReleaseOversizedSkippedEntry verifies). But the production BatchedReadEntryProcessor does release in the analogous overflow path (because Bookie.readEntry returns a freshly-allocated buffer), so the mock and the real bookie now have intentionally different ref-counting contracts. A short comment on the mock explaining "we don't own the buffer here" would prevent future confusion.

Smaller things

5. Inconsistent + 4 vs + Integer.BYTES — BatchedReadEntryProcessor uses literal 4, every other module uses Integer.BYTES. Pick one.

6. DefaultEntryLogger.readEntryIfFits returns null on oversized entries before calling validateEntry. If an entry's size header is corrupted to a huge value, this silently returns null ("doesn't fit") instead of catching the corruption. Probably acceptable — a later legitimate read would catch it — but worth a one-line comment so a future reader doesn't think it's an oversight.

7. When entryLogger.readEntryIfFits returns null in SingleDirectoryDbLedgerStorage.doGetEntryIfFits, the entry isn't put in the read cache. A subsequent call with a larger budget will hit disk again. Probably fine (we never actually read the data), just noting it.

Verdict

Direction is good, contract is clean, tests are solid. Before merging I'd ask for: (a) the exception-handling behavior change explicitly called out in the PR description, (b) stat recording tightened for size-rejected reads, and (c) the doGetEntryIfFits duplication reduced. The rest are nice-to-have.

dao-jun · 2026-05-14T02:25:25Z

/bkbot rerun-failure

dao-jun · 2026-05-14T11:55:45Z

@merlimat review comment addressed, PTAL

void-ptr974 · 2026-05-17T04:28:12Z

Thanks for working on this optimization. The direction makes sense to me.

One compatibility concern: this PR seems to change the batch-read behavior when a later entry fails after some entries have already been read. My understanding is that the previous behavior was closer to a best-effort prefix read: once the first entry was read successfully, later failures would stop the batch and return the accumulated entries.

I think it would be safer to preserve that behavior in this PR, so the change stays focused on avoiding unnecessary disk IO rather than changing error semantics. The failed entry would still be exposed when the client later starts reading from that entry.

dao-jun · 2026-05-18T09:37:33Z

Thanks for working on this optimization. The direction makes sense to me.

One compatibility concern: this PR seems to change the batch-read behavior when a later entry fails after some entries have already been read. My understanding is that the previous behavior was closer to a best-effort prefix read: once the first entry was read successfully, later failures would stop the batch and return the accumulated entries.

I think it would be safer to preserve that behavior in this PR, so the change stays focused on avoiding unnecessary disk IO rather than changing error semantics. The failed entry would still be exposed when the client later starts reading from that entry.

addressed

hangc0276

Thanks for tackling this — the wasted work on batched-read overflow is real. Before this lands I'd like to push back on both the framing and the scope of the API change, and propose a simpler
alternative.

The disk-I/O savings are smaller than the title suggests

The PR title says "avoid wasted disk IO", but the size-peek doesn't actually save disk I/O on the buffered path:

DefaultEntryLogger uses BufferedReadChannel over a regular FileChannel. The 4-byte readEntrySize call (bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/DefaultEntryLogger.java:856) pulls
a full read-buffer's worth of data into BufferedReadChannel's internal buffer, and the kernel's sequential-readahead heuristic pulls the following pages into the page cache. The body bytes are resident
in memory before we even decide whether to read them. What readEntryIfFits actually saves on this path is the user-space pread syscall, the allocator.buffer(entrySize, entrySize) allocation, and the
Netty/GC pressure from the discarded ByteBuf — real CPU/allocation wins, but not disk I/O.
DirectEntryLogger uses O_DIRECT via DirectReader, which bypasses the page cache and reads one aligned block at a time into nativeBuffer
(bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/directentrylogger/DirectReader.java:200). Here the peek does save real disk I/O — but only for entries that span more than one
block (block size is governed by dbStorage_directIOEntryLoggerReadBufferSizeBytes, typically 4KB–1MB). If the block size is large enough to hold the entry, the peek has already loaded everything.

It would help to either narrow the PR title, or post a measurement that distinguishes wall-clock improvement from actual disk-I/O reduction (e.g. iostat / read_bytes deltas under representative
workloads).

Where the savings actually come from

The PR adds readEntryIfFits / getEntryIfFits across Bookie, LedgerDescriptor, LedgerStorage, EntryLogger, plus the concrete implementations (BookieImpl, LedgerDescriptorImpl, DbLedgerStorage,
SingleDirectoryDbLedgerStorage, InterleavedLedgerStorage, SortedLedgerStorage, DefaultEntryLogger, DirectEntryLogger) and a new LogReader.readEntrySizeAt. But the substantive change is just two
methods:

DefaultEntryLogger.readEntryIfFits — peeks the size header before reading the body.
DirectEntryLogger.readEntryIfFits — same, via DirectReader.readEntrySizeAt.

The rest is plumbing. The default interface implementations in LedgerStorage.getEntryIfFits and EntryLogger.readEntryIfFits do "read then check & release" — i.e. they preserve correctness but provide
none of the optimization. And the cache-layer short-circuits in SingleDirectoryDbLedgerStorage.doGetEntryIfFits against the write/read caches save zero I/O (the entry is already fully in memory) — they
only save a small amount of memory traffic and network bytes.

Proposed alternative: predict from the first entry

For typical BookKeeper/Pulsar workloads, entries within a single ledger come from the same producer and are nearly uniform in size. Using the first entry as the size estimate, we can stop the batch
before issuing the wasted read — no storage API changes required:

long frameSize = 24 + 8 + 4;
ByteBufList data = null;
for (int i = 0; i < maxCount; i++) {
try {
if (data == null) {
ByteBuf first = requestProcessor.getBookie()
.readEntry(request.getLedgerId(), request.getEntryId());
frameSize += first.readableBytes() + 4;
data = ByteBufList.get(first);

          long perEntry = first.readableBytes() + 4;
          long remaining = (maxSize - frameSize) / Math.max(perEntry, 1);
          maxCount = (int) Math.min(maxCount, 1 + remaining);
          continue;
      }
      ByteBuf entry = requestProcessor.getBookie()
              .readEntry(request.getLedgerId(), request.getEntryId() + i);
      if (frameSize + entry.readableBytes() + 4 > maxSize) {
          entry.release();
          break;
      }
      frameSize += entry.readableBytes() + 4;
      data.add(entry);
  } catch (Bookie.NoEntryException e) {
      if (data == null) {
          throw e;
      }
      break;
  } catch (Throwable e) {
      if (data != null) {
          data.release();
      }
      throw e;
  }

}

Comparison

Case	This PR	First-entry prediction
Uniform entry sizes (typical)	Saves the tail-overflow read	Same — prediction is exact, zero waste
Sizes drift up within batch	Eliminates all overflow reads	One over-read possible, bounded to 1 per batch
Sizes drift down within batch	Fully fills budget	Under-fills budget — client does extra round-trips
Highly variable sizes	Optimal	Pessimistic or one wasted read

Code cost: ~10 LOC in one file vs. ~1000 LOC across 25 files.

Combined with the OS-prefetch point above: for DefaultEntryLogger the PR doesn't save disk I/O anyway, so the first-entry approach captures essentially all the achievable wins (skips the body read
entirely → same allocator/syscall savings, similar disk impact). The PR's edge only shows up on DirectEntryLogger with large entries and a highly-variable size distribution.

dao-jun added 2 commits April 9, 2026 13:39

Optimize batch-read to avoid wasted disk IO

2b9db24

Fix checkstyle

0c58386

dao-jun closed this Apr 9, 2026

dao-jun reopened this Apr 9, 2026

eolivelli requested changes Apr 13, 2026

View reviewed changes

Address review comments

b00b020

dao-jun requested a review from eolivelli April 13, 2026 13:04

Address review comments

0bbd2d1

dao-jun mentioned this pull request Apr 24, 2026

[fix] Upgrade to AsyncHTTP client 3.0.4 apache/pulsar#25023

Open

15 tasks

dao-jun added 2 commits May 8, 2026 16:18

Merge branch 'master' into dev/optimize_batch_read

cbac383

# Conflicts: # bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/DefaultEntryLogger.java

fix conflicts

1f1d168

dao-jun added 3 commits May 14, 2026 06:11

Address review comments

9031fdc

Address review comments

e4fd5eb

fix checkstyle

fcecb6b

fix code

e5a302b

hangc0276 requested changes May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize batch-read to avoid wasted disk IO#4741

Optimize batch-read to avoid wasted disk IO#4741
dao-jun wants to merge 10 commits into
apache:masterfrom
dao-jun:dev/optimize_batch_read

dao-jun commented Apr 9, 2026 •

edited

Loading

Uh oh!

eolivelli left a comment

Uh oh!

dao-jun commented Apr 13, 2026

Uh oh!

merlimat commented May 12, 2026

Uh oh!

dao-jun commented May 14, 2026

Uh oh!

dao-jun commented May 14, 2026

Uh oh!

void-ptr974 commented May 17, 2026

Uh oh!

dao-jun commented May 18, 2026

Uh oh!

hangc0276 left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

dao-jun commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Uh oh!

eolivelli left a comment

Choose a reason for hiding this comment

Uh oh!

dao-jun commented Apr 13, 2026

Uh oh!

merlimat commented May 12, 2026

Overall

Substantive concerns

Smaller things

Verdict

Uh oh!

dao-jun commented May 14, 2026

Uh oh!

dao-jun commented May 14, 2026

Uh oh!

void-ptr974 commented May 17, 2026

Uh oh!

dao-jun commented May 18, 2026

Uh oh!

hangc0276 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

dao-jun commented Apr 9, 2026 •

edited

Loading

hangc0276 left a comment •

edited

Loading