richat: fall back to live tip when source cannot replay and another can #208

Draft
mindrunner wants to merge 1 commit into lamports-dev:master from mindrunner:feat/source-replay-fallback-live
@mindrunner
Collaborator

When richat resumes after a restart, every source is asked to stream starting from the last finalized slot persisted in storage. If a source's backlog does not reach back to that slot, `subscribe()` returns `ReplayFromSlotNotAvailable`, and the existing code simply sleeps and retries the same unreachable slot. As long as any other source can replay and fill the gap, the short-window source stays parked forever and never delivers live messages, even though no data is actually missing from the channel.

Behavior change. When a replay error is reported and `report_replay_failed(name)` returns `false` (meaning at least one other source is still expected to replay), set a per-source flag and reconnect with `replay_from_slot = None`, i.e. at live tip. The channel still has no gap because another source is covering the history, and this source resumes delivering future messages.
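The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SubscribeError`, `NextAttempt`, and `next_attempt` are hypothetical names, and the boolean argument only models the return value of `report_replay_failed(name)`.

```rust
/// Hypothetical stand-in for the error kinds `subscribe()` can return.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscribeError {
    ReplayFromSlotNotAvailable,
    Other,
}

/// What the source does on its next connection attempt (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextAttempt {
    /// Reconnect with `replay_from_slot = None`, i.e. at live tip.
    LiveTip,
    /// Keep the pre-existing handling for this error.
    Unchanged,
}

/// `report_replay_failed_result` models the boolean returned by
/// `report_replay_failed(name)`: `false` means at least one other
/// source is still expected to replay.
pub fn next_attempt(err: SubscribeError, report_replay_failed_result: bool) -> NextAttempt {
    match err {
        // Only a replay error with another source still covering history
        // triggers the fall back to live tip.
        SubscribeError::ReplayFromSlotNotAvailable if !report_replay_failed_result => {
            NextAttempt::LiveTip
        }
        _ => NextAttempt::Unchanged,
    }
}
```

In this sketch, every case other than "replay error while another source can still replay" is left to the existing handling, which the description does not change.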

The flag is cleared after every successful subscribe, so a later disconnect will retry the normal replay path first. By that point `global_replay_from_slot` is typically close to live and the source's backlog will cover it, keeping reconnects gap-free when possible.
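The flag lifecycle can be pictured with a small state holder. This is a sketch under assumptions: `SourceState`, its field, and its method names are hypothetical, and slots are modeled as plain `u64` values; only `replay_from_slot`, `global_replay_from_slot`, and `report_replay_failed` come from the description.

```rust
/// Hypothetical per-source state; the real field name may differ.
#[derive(Debug, Default)]
pub struct SourceState {
    fallback_to_live: bool,
}

impl SourceState {
    /// Slot to request on the next subscribe: `None` means live tip.
    pub fn replay_from_slot(&self, global_replay_from_slot: Option<u64>) -> Option<u64> {
        if self.fallback_to_live {
            None
        } else {
            global_replay_from_slot
        }
    }

    /// Called after a replay error. `report_replay_failed_result` models the
    /// return value of `report_replay_failed(name)`; `false` means another
    /// source is still expected to replay.
    pub fn on_replay_failed(&mut self, report_replay_failed_result: bool) {
        if !report_replay_failed_result {
            self.fallback_to_live = true;
        }
    }

    /// Called after every successful subscribe: the next reconnect tries
    /// the normal replay path again.
    pub fn on_subscribed(&mut self) {
        self.fallback_to_live = false;
    }
}
```

With this shape, a source that fell back to live tip requests `None` exactly once per failed replay, then returns to requesting `global_replay_from_slot` on any later reconnect.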

Observed on richat-ewr-frontend-5, where QuickNode's Yellowstone-grpc backlog does not reach back to the last finalized slot after a richat restart. Before this change the QN source logged "failed to replay, waiting for other sources" once per second indefinitely. After this change it reconnects at live tip and participates in message races normally.

No config surface change.
