richat: fall back to live tip when source cannot replay and another can#208
Draft
mindrunner wants to merge 1 commit into
When richat resumes after a restart, every source is asked to stream starting from the last finalized slot persisted in storage. If one source has a shorter backlog than that slot, `subscribe()` returns `ReplayFromSlotNotAvailable` and the existing code simply slept and retried the same unreachable slot. As long as any other source could replay and fill the gap, the short-window source stayed parked forever and never delivered live messages, even though no data was actually missing from the channel.

Behavior change. When a replay error is reported and `report_replay_failed(name)` returns `false` (meaning at least one other source is still expected to replay), flip a per-source flag and reconnect with `replay_from_slot = None`, i.e. at live tip. The channel still has no gap because another source is covering the history, and this source resumes delivering future messages.

The flag is cleared after every successful subscribe, so a later disconnect will retry the normal replay path first. By that point `global_replay_from_slot` is typically close to live and the source's backlog will cover it, keeping reconnects gap free when possible.

Observed on richat-ewr-frontend-5, where QuickNode's Yellowstone-grpc backlog does not reach back to the last finalized slot after a richat restart. Before this change the QN source logged "failed to replay, waiting for other sources" once per second indefinitely. After this change it reconnects at live tip and participates in message races normally.

No config surface change.
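For reviewers who want the decision logic in one place, here is a minimal, self-contained sketch of the fallback described above. It is not the code in this PR. `ReplayTracker`, `Source`, `skip_replay_once`, `oldest_available_slot`, and `connect_once` are hypothetical names invented for illustration, and `subscribe()` is simulated with a simple slot comparison instead of a real gRPC call. Only `report_replay_failed`, `ReplayFromSlotNotAvailable`, and `replay_from_slot` are taken from the description, and their exact signatures are assumed.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the shared bookkeeping of which sources
/// are still expected to replay.
struct ReplayTracker {
    pending: HashSet<String>,
}

impl ReplayTracker {
    fn new(names: &[&str]) -> Self {
        Self { pending: names.iter().map(|n| n.to_string()).collect() }
    }

    /// Assumed semantics: returns `false` while at least one other
    /// source is still expected to replay, `true` once none is left.
    fn report_replay_failed(&mut self, name: &str) -> bool {
        self.pending.remove(name);
        self.pending.is_empty()
    }
}

#[derive(Debug, PartialEq)]
enum SubscribeError {
    ReplayFromSlotNotAvailable,
}

struct Source {
    name: String,
    /// Oldest slot the simulated upstream can still replay (illustrative).
    oldest_available_slot: u64,
    /// Per-source flag: connect at live tip on the next attempt.
    skip_replay_once: bool,
}

impl Source {
    fn new(name: &str, oldest_available_slot: u64) -> Self {
        Self { name: name.to_string(), oldest_available_slot, skip_replay_once: false }
    }

    /// Simulated subscribe: fails when the requested slot is older
    /// than the upstream backlog; `None` means "start at live tip".
    fn subscribe(&self, replay_from_slot: Option<u64>) -> Result<(), SubscribeError> {
        match replay_from_slot {
            Some(slot) if slot < self.oldest_available_slot => {
                Err(SubscribeError::ReplayFromSlotNotAvailable)
            }
            _ => Ok(()),
        }
    }

    /// One connection attempt. On success, returns the slot that was
    /// actually requested (`None` for live tip).
    fn connect_once(
        &mut self,
        global_replay_from_slot: Option<u64>,
        tracker: &mut ReplayTracker,
    ) -> Result<Option<u64>, SubscribeError> {
        let from = if self.skip_replay_once { None } else { global_replay_from_slot };
        match self.subscribe(from) {
            Ok(()) => {
                // Cleared after every successful subscribe, so a later
                // disconnect retries the normal replay path first.
                self.skip_replay_once = false;
                Ok(from)
            }
            Err(err) => {
                // Fall back to live tip only if another source can
                // still cover the history.
                if !tracker.report_replay_failed(&self.name) {
                    self.skip_replay_once = true;
                }
                Err(err)
            }
        }
    }
}
```

In this model, a source whose backlog starts at slot 1000 fails its first attempt at slot 900, sets the flag because a second source is still pending, and succeeds at live tip on the next attempt. A source that is the only one registered never sets the flag and keeps the previous wait-and-retry behavior.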