Fix intermittent IllegalGeneration error in spk publish#1391
Open
jrray wants to merge 1 commit into
Open
Conversation
24e08e6 to
679ce7f
Compare
679ce7f to
50472c3
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
7e8b6f3 to
904eca6
Compare
spk publish's listen_to_index_status_updates() subscribed to the
index-updates topic via Kafka consumer-group management and then
immediately committed starting offsets, before the consumer had
joined the group / finished rebalancing (the join only happens on
the first poll). Committing offsets with no valid group generation
intermittently failed with:
failed to commit new starting offsets for the kafka index
updates consumer: Consumer commit error: IllegalGeneration
(Broker: Specified group generation id is not valid)
The commit served no purpose anyway: this consumer uses a fresh,
ephemeral Ulid group id every run, so the committed offsets are
never read back.
Switch this read-only consumer from subscribe() + commit() to
manual partition assignment (assign()) with the starting offsets
carried in the TopicPartitionList. Manual assignment has no group
generation, so the race cannot occur. The subscribe() call is
removed since a consumer cannot mix subscribe and assign.
The indexer's long-lived consumer is untouched; it legitimately
uses a stable group id and offset commits.
Add production diagnostics that explain the related failure where the
consumer reads stale messages: instead of silently falling back to
offset 0 when a per-partition watermark fetch fails (which replays the
whole partition), log a warning and skip the partition; and while
consuming, warn (once per partition) when the broker delivers a
message from before the offset that was assigned -- the signature of
an assigned offset being discarded (out-of-range falling back to
auto.offset.reset, or a failed watermark fetch). The detection is a
pure helper with a unit test.
Add an integration test and container harness to exercise the
offset/assignment behavior against a real broker. The test reproduces
production conditions (a 12-partition topic with messages in only some
partitions) and asserts the consumer is positioned at the last message
of each non-empty partition without replaying from the start. It is
#[ignore]d and only runs when SPK_TEST_KAFKA_BROKERS is set.
test-support/kafka/ stands up Kafka 2.8.1 (Confluent CP 6.2.1) in a
container; run it with 'make test-kafka'.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
904eca6 to
a1e1e62
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
spk publishsometimes fails with:Root cause
listen_to_index_status_updates()(run byspk publishto wait for the index to update)subscribe()d to the index-updates topic via Kafka consumer-group management, then immediatelycommit()ed the starting offsets — before the consumer had joined the group / finished rebalancing. The group join only happens on the first poll (stream.next()), so the commit was sent with no valid group generation, yieldingIllegalGeneration. It's intermittent because it's a timing race against the join/rebalance.The commit also served no purpose: this consumer uses a fresh, ephemeral
Ulidgroup id every run, so the committed offsets are never read back.Fix
Switch this read-only, ephemeral consumer from
subscribe()+commit()to manual partitionassign(), carrying the starting offsets in theTopicPartitionList. Manual assignment has no group generation, so the race cannot occur. Thesubscribe()call is removed (a consumer can't mixsubscribeandassign).The indexer's long-lived consumer (
listen_to_package_events_and_run_index_updates) is untouched — it legitimately uses a stable group id and offset commits.Testing
Verified by
cargo check -p spk-storage. Not reproduced end-to-end against a live Kafka broker.🤖 Generated with Claude Code