richat: fall back to live tip when source cannot replay and another can by mindrunner · Pull Request #208 · lamports-dev/richat

mindrunner · 2026-04-21T19:16:48Z

When richat resumes after a restart, every source is asked to stream starting from the last finalized slot persisted in storage. If a source's backlog does not reach back to that slot, `subscribe()` returns `ReplayFromSlotNotAvailable`, and the existing code simply sleeps and retries the same unreachable slot. As long as any other source can replay and fill the gap, the short-window source stays parked forever and never delivers live messages, even though no data is actually missing from the channel.

Behavior change: when a replay error is reported and `report_replay_failed(name)` returns `false` (meaning at least one other source is still expected to replay), the source flips a per-source flag and reconnects with `replay_from_slot = None`, i.e. at live tip. The channel still has no gap because another source is covering the history, and this source resumes delivering future messages.

The flag is cleared after every successful subscribe, so a later disconnect retries the normal replay path first. By that point `global_replay_from_slot` is typically close to live and the source's backlog will cover it, keeping reconnects gap-free when possible.
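The flag handling described above can be sketched as a small per-source state machine. This is a minimal illustration, not the code in this PR: `SourceState`, `SubscribeOutcome`, `next_replay_from_slot`, and `on_outcome` are hypothetical names, and the `u64` slot type is assumed. Only `report_replay_failed`, `replay_from_slot`, `global_replay_from_slot`, and `ReplayFromSlotNotAvailable` come from the description.

```rust
/// Hypothetical per-source reconnect state (illustrative names, not richat's).
#[derive(Debug, Default)]
struct SourceState {
    /// Set when this source could not replay but another source still can.
    fallback_to_live: bool,
}

/// Outcome of one subscribe attempt, reduced to the two cases discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SubscribeOutcome {
    Connected,
    ReplayFromSlotNotAvailable,
}

impl SourceState {
    /// Slot to request on the next subscribe; `None` means "start at live tip".
    fn next_replay_from_slot(&self, global_replay_from_slot: Option<u64>) -> Option<u64> {
        if self.fallback_to_live {
            None
        } else {
            global_replay_from_slot
        }
    }

    /// Update the flag after a subscribe attempt.
    ///
    /// `report_replay_failed_result` stands in for the return value of
    /// `report_replay_failed(name)`: `false` means at least one other source
    /// is still expected to replay. What happens on `true` is not covered by
    /// the description, so this sketch assumes the flag is left unchanged.
    fn on_outcome(&mut self, outcome: SubscribeOutcome, report_replay_failed_result: bool) {
        match outcome {
            // A successful subscribe always clears the flag, so the next
            // disconnect tries the normal replay path first.
            SubscribeOutcome::Connected => self.fallback_to_live = false,
            SubscribeOutcome::ReplayFromSlotNotAvailable => {
                if !report_replay_failed_result {
                    self.fallback_to_live = true;
                }
            }
        }
    }
}
```

In this sketch, a source that fails to replay while another source can still replay requests `None` on its next attempt, and goes back to requesting `global_replay_from_slot` once it has connected successfully.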

Observed on richat-ewr-frontend-5, where QuickNode's Yellowstone-grpc backlog does not reach back to the last finalized slot after a richat restart. Before this change the QN source logged "failed to replay, waiting for other sources" once per second indefinitely. After this change it reconnects at live tip and participates in message races normally.

No config surface change.


mindrunner wants to merge 1 commit into lamports-dev:master from mindrunner:feat/source-replay-fallback-live