feat: seekable reader #530

Draft
splix wants to merge 4 commits into apache:main from splix:feat/reader-seek

@splix
Contributor

@splix splix commented Apr 5, 2026

Added a feature to seek to a particular block when reading an Avro file.

The Reader now exposes the current block position, and for a seekable (Seek) input it can seek to a given position, provided that position is a valid block start.
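To make the idea concrete, here is a minimal std-only sketch of what seeking to a block start involves. It is not the code from this PR: it only assumes the Avro object container layout, where each block begins with two zigzag-varint longs (object count, then byte size of the serialized objects). The `BlockPosition` fields mirror the review snippet, while the `seek_to_block` signature and the `read_long` helper are hypothetical, and the sketch does not validate sync markers.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical mirror of the PR's `BlockPosition` (field names from the review snippet).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockPosition {
    pub offset: u64,
    pub message_count: usize,
}

/// Decode one Avro `long`: a little-endian base-128 varint with zigzag encoding.
fn read_long<R: Read>(r: &mut R) -> io::Result<i64> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        let mut byte = [0u8; 1];
        r.read_exact(&mut byte)?;
        value |= u64::from(byte[0] & 0x7f) << shift;
        if byte[0] & 0x80 == 0 {
            break;
        }
        shift += 7;
        if shift > 63 {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "varint too long"));
        }
    }
    // Zigzag decode: even values are non-negative, odd values are negative.
    Ok(((value >> 1) as i64) ^ -((value & 1) as i64))
}

/// Seek to `offset` and read the block header found there.
/// Returns the block position plus the byte size of the block's serialized objects.
/// Assumption: the caller passes an offset that really is a block start.
pub fn seek_to_block<R: Read + Seek>(r: &mut R, offset: u64) -> io::Result<(BlockPosition, u64)> {
    r.seek(SeekFrom::Start(offset))?;
    let count = read_long(r)?;
    let size = read_long(r)?;
    if count < 0 || size < 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "negative block header value"));
    }
    Ok((BlockPosition { offset, message_count: count as usize }, size as u64))
}
```

If the offset is not a real block start, the two decoded longs are garbage, which is why the description insists on a valid block start position.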

Copilot AI review requested due to automatic review settings April 5, 2026 23:55
Contributor

Copilot AI left a comment

Pull request overview

Adds seek support for Avro container readers by tracking block boundaries (offset + record count) during iteration and exposing an API to seek back to a previously-seen block for Read + Seek inputs.

Changes:

  • Introduced BlockPosition and internal position tracking to record block start offsets as blocks are loaded.
  • Added Reader::{current_block,data_start,seek_to_block} (seek API gated on Seek) plus tests validating seeking between blocks.
  • Added a new error detail variant for seek failures.
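One way to record block start offsets on an input that only implements `Read` is to count the bytes consumed. The sketch below shows that idea with a small wrapper. The name `PositionTracker` is borrowed from the summary above, but the fields and methods here are assumptions and may not match the PR's actual type.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a position tracker: wraps any reader
/// and counts how many bytes have been consumed through it.
pub struct PositionTracker<R> {
    inner: R,
    position: u64,
}

impl<R> PositionTracker<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, position: 0 }
    }

    /// Number of bytes read so far, i.e. the offset of the next unread byte.
    pub fn position(&self) -> u64 {
        self.position
    }
}

impl<R: Read> Read for PositionTracker<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        // Only count bytes that were actually delivered to the caller.
        self.position += n as u64;
        Ok(n)
    }
}
```

Calling `position()` just before a block header is decoded gives that block's start offset, without requiring `Seek` during normal iteration.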

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| avro/src/reader/mod.rs | Exposes BlockPosition and adds reader-level block position + seek APIs with new tests. |
| avro/src/reader/block.rs | Implements BlockPosition, PositionTracker, and block-level seek + block-boundary tracking. |
| avro/src/lib.rs | Re-exports BlockPosition from the crate root. |
| avro/src/error.rs | Adds Details::SeekToBlock for I/O seek errors. |

Comment thread avro/src/reader/block.rs Outdated
Comment thread avro/src/reader/mod.rs
Comment thread avro/src/reader/mod.rs Outdated
Comment thread avro/src/reader/block.rs Outdated
@martin-g
Member

martin-g commented Apr 6, 2026

> Added a feature to seek to a particular block when reading an Avro file.

Do you have a use case for this functionality?

Comment thread avro/src/reader/block.rs
Comment thread avro/src/reader/mod.rs Outdated
Comment thread avro/src/reader/block.rs Outdated
Comment thread avro/src/reader/block.rs Outdated

self.current_block_info = Some(BlockPosition {
    offset: block_start,
    message_count: block_len as usize,
Member

Suggested change
- message_count: block_len as usize,
+ message_count: self.message_count,

Contributor Author

I actually wrote this part this way on purpose, to ensure the value would not be lost if a refactoring or other change is applied. It cannot rely on the meaning of `self.message_count` and its current value if those two end up separated into different places in the code.

Contributor

You can move `self.message_count = block_len as usize` down next to `self.current_block_info` to reduce that risk, as I do think Martin's suggestion is reasonable.

Contributor Author

I'm not arguing; any of this would work and I can change it, that's no problem. But I'm curious to understand why you think this would be better.

My idea is just to use the source-of-truth variable, so there is no way it gets a wrong number. Using `self.message_count` works with the current code, but it gives no guarantee if the code changes. So I'm wondering why the second approach is better?
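The trade-off in this thread can be illustrated with a small standalone sketch. It is not the PR's code: the `Block` struct, its methods, and the assumption that `message_count` is decremented as records are consumed are all hypothetical. It shows the variant where both fields are assigned side by side from the same decoded `block_len`, so neither one depends on the other's current value.

```rust
/// Hypothetical mirror of the PR's `BlockPosition`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockPosition {
    pub offset: u64,
    pub message_count: usize,
}

/// Simplified stand-in for the block reader state discussed in the thread.
#[derive(Default)]
pub struct Block {
    /// Records remaining in the current block (assumed to shrink while reading).
    message_count: usize,
    /// Position and total record count of the current block.
    current_block_info: Option<BlockPosition>,
}

impl Block {
    /// Called once the block header has been decoded.
    /// Both fields are set next to each other from the same source value.
    pub fn on_block_header(&mut self, block_start: u64, block_len: i64) {
        let count = block_len as usize;
        self.message_count = count;
        self.current_block_info = Some(BlockPosition {
            offset: block_start,
            message_count: count,
        });
    }

    /// Consuming a record changes the remaining counter only.
    pub fn consume_one(&mut self) {
        self.message_count = self.message_count.saturating_sub(1);
    }

    pub fn remaining(&self) -> usize {
        self.message_count
    }

    pub fn current_block(&self) -> Option<BlockPosition> {
        self.current_block_info
    }
}
```

Under that assumption, reading `self.message_count` at any point after a record has been consumed would record a smaller number than the block really holds, which is the failure mode the author wants to rule out.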

@splix
Contributor Author

splix commented Apr 6, 2026

> > Added a feature to seek to a particular block when reading an Avro file.
>
> Do you have a use case for this functionality?

I need to read just a few records from a large Avro file, and without this it's incredibly inefficient as I need to read the whole file from the start each time.

@splix
Contributor Author

splix commented Apr 27, 2026

I'm wondering if you guys expect me to make some changes to the PR, or if we're waiting for something else?

@Kriskras99
Contributor

I don't have further comments, @martin-g how about you?

@martin-g martin-g marked this pull request as draft April 29, 2026 05:09
Comment thread avro/src/reader/block.rs
Comment thread avro/src/reader/mod.rs
Comment thread avro/src/reader/block.rs Outdated
Comment thread avro/src/reader/block.rs Outdated
Comment thread avro/src/reader/mod.rs
Comment thread avro/src/reader/block.rs Outdated
@martin-g
Member

> > > Added a feature to seek to a particular block when reading an Avro file.
> >
> > Do you have a use case for this functionality?
>
> I need to read just a few records from a large Avro file, and without this it's incredibly inefficient as I need to read the whole file from the start each time.

AFAIU you need to read the Avro data once to collect the offsets to be able to seek later, right?
So, this helps for the following reads?

@splix
Contributor Author

splix commented May 24, 2026

> AFAIU you need to read the Avro data once to collect the offsets to be able to seek later, right?
> So, this helps for the following reads?

Right.

The first time I needed this was when we had a separate index of records, so we could open the Avro file directly at the needed position: very large files (audit and traces) and a small index of positions. That was in Java, where seeking is supported.

Now I need to show an Avro file on screen, using Rust. The user can scroll through the file back and forth. One option is to keep the whole file in memory, which is not optimal considering the usual sizes; the other is to keep just the list of block positions in the file. The file is opened from the start, so I always have all the previous positions.
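A rough sketch of the "list of positions" option for the scrolling viewer, under stated assumptions: `BlockIndex` and its methods are hypothetical names, and `BlockPosition` only mirrors the fields shown in the review snippet. The index stores one entry per block seen so far and maps a global record number to the block that contains it, so only that block has to be re-read after seeking.

```rust
/// Hypothetical mirror of the PR's `BlockPosition`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockPosition {
    pub offset: u64,
    pub message_count: usize,
}

/// In-memory index of the blocks seen so far while reading from the start.
#[derive(Default)]
pub struct BlockIndex {
    blocks: Vec<BlockPosition>,
    /// `first_record[i]` is the global index of the first record in `blocks[i]`.
    first_record: Vec<usize>,
    total: usize,
}

impl BlockIndex {
    /// Register the next block, in file order.
    pub fn push(&mut self, pos: BlockPosition) {
        self.first_record.push(self.total);
        self.total += pos.message_count;
        self.blocks.push(pos);
    }

    /// Find the block holding global record number `record`,
    /// together with the record's index inside that block.
    pub fn locate(&self, record: usize) -> Option<(BlockPosition, usize)> {
        if record >= self.total {
            return None;
        }
        // Binary search: count the blocks whose first record is <= `record`,
        // then step back one to get the last such block.
        let i = self.first_record.partition_point(|&first| first <= record) - 1;
        Some((self.blocks[i], record - self.first_record[i]))
    }
}
```

The memory cost is a couple of machine words per block rather than the whole file, and scrolling back to any record already seen is a binary search followed by one seek.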
