Optimize batch-read to avoid wasted disk IO#4741
Conversation
eolivelli
left a comment
There was a problem hiding this comment.
I can't find tests about failure scenarios.
Memory leaks can happen in case of failures (reading from storage...), client may see the wrong error code.....
|
@eolivelli I added 3 tests in BatchedReadEntryProcessorTest, PTAL |
# Conflicts: # bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/DefaultEntryLogger.java
|
@dao-jun Thanks for the optimization. I went through the PR end-to-end — overall the direction is good and the framing-budget contract is clean and consistent across the storage stack. A few things I'd like to discuss before merge. OverallThe core idea is sound: today the batch-read path reads each entry from disk and only then checks whether it fits in The framing budget contract is consistent end-to-end ( Substantive concerns1. Undocumented behavior change in exception handling — } catch (Bookie.NoEntryException e) {
if (data == null) { throw e; }
break;
} catch (Throwable e) {
if (data != null) { data.release(); }
throw e;
}The old code's 2. Stat pollution in When 3. Significant code duplication in
4. The old code did Smaller things5. Inconsistent 6. 7. When VerdictDirection is good, contract is clean, tests are solid. Before merging I'd ask for: (a) the exception-handling behavior change explicitly called out in the PR description, (b) stat recording tightened for size-rejected reads, and (c) the |
|
/bkbot rerun-failure |
|
@merlimat review comment addressed, PTAL |
|
Thanks for working on this optimization. The direction makes sense to me. One compatibility concern: this PR seems to change the batch-read behavior when a later entry fails after some entries have already been read. My understanding is that the previous behavior was closer to a best-effort prefix read: once the first entry was read successfully, later failures would stop the batch and return the accumulated entries. I think it would be safer to preserve that behavior in this PR, so the change stays focused on avoiding unnecessary disk IO rather than changing error semantics. The failed entry would still be exposed when the client later starts reading from that entry. |
addressed |
There was a problem hiding this comment.
Thanks for tackling this — the wasted work on batched-read overflow is real. Before this lands I'd like to push back on both the framing and the scope of the API change, and propose a simpler
alternative.
The disk-I/O savings are smaller than the title suggests
The PR title says "avoid wasted disk IO", but the size-peek doesn't actually save disk I/O on the buffered path:
- DefaultEntryLogger uses BufferedReadChannel over a regular FileChannel. The 4-byte readEntrySize call (bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/DefaultEntryLogger.java:856) pulls
a full read-buffer's worth of data into BufferedReadChannel's internal buffer, and the kernel's sequential-readahead heuristic pulls the following pages into the page cache. The body bytes are resident
in memory before we even decide whether to read them. What readEntryIfFits actually saves on this path is the user-space pread syscall, the allocator.buffer(entrySize, entrySize) allocation, and the
Netty/GC pressure from the discarded ByteBuf — real CPU/allocation wins, but not disk I/O. - DirectEntryLogger uses O_DIRECT via DirectReader, which bypasses the page cache and reads one aligned block at a time into nativeBuffer
(bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/directentrylogger/DirectReader.java:200). Here the peek does save real disk I/O — but only for entries that span more than one
block (block size is governed by dbStorage_directIOEntryLoggerReadBufferSizeBytes, typically 4KB–1MB). If the block size is large enough to hold the entry, the peek has already loaded everything.
It would help to either narrow the PR title, or post a measurement that distinguishes wall-clock improvement from actual disk-I/O reduction (e.g. iostat / read_bytes deltas under representative
workloads).
Where the savings actually come from
The PR adds readEntryIfFits / getEntryIfFits across Bookie, LedgerDescriptor, LedgerStorage, EntryLogger, plus the concrete implementations (BookieImpl, LedgerDescriptorImpl, DbLedgerStorage,
SingleDirectoryDbLedgerStorage, InterleavedLedgerStorage, SortedLedgerStorage, DefaultEntryLogger, DirectEntryLogger) and a new LogReader.readEntrySizeAt. But the substantive change is just two
methods:
- DefaultEntryLogger.readEntryIfFits — peeks the size header before reading the body.
- DirectEntryLogger.readEntryIfFits — same, via DirectReader.readEntrySizeAt.
The rest is plumbing. The default interface implementations in LedgerStorage.getEntryIfFits and EntryLogger.readEntryIfFits do "read then check & release" — i.e. they preserve correctness but provide
none of the optimization. And the cache-layer short-circuits in SingleDirectoryDbLedgerStorage.doGetEntryIfFits against the write/read caches save zero I/O (the entry is already fully in memory) — they
only save a small amount of memory traffic and network bytes.
Proposed alternative: predict from the first entry
For typical BookKeeper/Pulsar workloads, entries within a single ledger come from the same producer and are nearly uniform in size. Using the first entry as the size estimate, we can stop the batch
before issuing the wasted read — no storage API changes required:
long frameSize = 24 + 8 + 4;
ByteBufList data = null;
for (int i = 0; i < maxCount; i++) {
try {
if (data == null) {
ByteBuf first = requestProcessor.getBookie()
.readEntry(request.getLedgerId(), request.getEntryId());
frameSize += first.readableBytes() + 4;
data = ByteBufList.get(first);
long perEntry = first.readableBytes() + 4;
long remaining = (maxSize - frameSize) / Math.max(perEntry, 1);
maxCount = (int) Math.min(maxCount, 1 + remaining);
continue;
}
ByteBuf entry = requestProcessor.getBookie()
.readEntry(request.getLedgerId(), request.getEntryId() + i);
if (frameSize + entry.readableBytes() + 4 > maxSize) {
entry.release();
break;
}
frameSize += entry.readableBytes() + 4;
data.add(entry);
} catch (Bookie.NoEntryException e) {
if (data == null) {
throw e;
}
break;
} catch (Throwable e) {
if (data != null) {
data.release();
}
throw e;
}
}
Comparison
| Case | This PR | First-entry prediction |
|---|---|---|
| Uniform entry sizes (typical) | Saves the tail-overflow read | Same — prediction is exact, zero waste |
| Sizes drift up within batch | Eliminates all overflow reads | One over-read possible, bounded to 1 per batch |
| Sizes drift down within batch | Fully fills budget | Under-fills budget — client does extra round-trips |
| Highly variable sizes | Optimal | Pessimistic or one wasted read |
Code cost: ~10 LOC in one file vs. ~1000 LOC across 25 files.
Combined with the OS-prefetch point above: for DefaultEntryLogger the PR doesn't save disk I/O anyway, so the first-entry approach captures essentially all the achievable wins (skips the body read
entirely → same allocator/syscall savings, similar disk impact). The PR's edge only shows up on DirectEntryLogger with large entries and a highly-variable size distribution.
Motivation
Batch reads currently fetch the next entry before checking whether it can still fit within the maxSize budget. When the last entry in the batch exceeds the remaining space, the read result is discarded, but the disk IO has already happened. This creates wasted IO on the critical read path.
This change avoids that extra read while preserving the existing batch-read behavior, including returning the first entry even when a single entry alone exceeds the requested maxSize.
Changes