Skip to content

Fix intermittent IllegalGeneration error in spk publish#1391

Open
jrray wants to merge 1 commit into
mainfrom
fix-kafka-publish-illegal-generation
Open

Fix intermittent IllegalGeneration error in spk publish#1391
jrray wants to merge 1 commit into
mainfrom
fix-kafka-publish-illegal-generation

Conversation

@jrray

@jrray jrray commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Problem

spk publish sometimes fails with:

ERROR   × failed to commit new starting offsets for the kafka index updates
  │ consumer: Consumer commit error: IllegalGeneration (Broker: Specified
  │ group generation id is not valid)

Root cause

listen_to_index_status_updates() (run by spk publish to wait for the index to update) subscribe()d to the index-updates topic via Kafka consumer-group management, then immediately commit()ed the starting offsets — before the consumer had joined the group / finished rebalancing. The group join only happens on the first poll (stream.next()), so the commit was sent with no valid group generation, yielding IllegalGeneration. It's intermittent because it's a timing race against the join/rebalance.

The commit also served no purpose: this consumer uses a fresh, ephemeral Ulid group id every run, so the committed offsets are never read back.

Fix

Switch this read-only, ephemeral consumer from subscribe() + commit() to manual partition assign(), carrying the starting offsets in the TopicPartitionList. Manual assignment has no group generation, so the race cannot occur. The subscribe() call is removed (a consumer can't mix subscribe and assign).

The indexer's long-lived consumer (listen_to_package_events_and_run_index_updates) is untouched — it legitimately uses a stable group id and offset commits.

Testing

Verified by cargo check -p spk-storage. Not reproduced end-to-end against a live Kafka broker.

🤖 Generated with Claude Code

@jrray jrray force-pushed the fix-kafka-publish-illegal-generation branch from 24e08e6 to 679ce7f Compare June 24, 2026 18:49
@jrray jrray added bug Something isn't working AI Code authored with AI assistance. labels Jun 24, 2026
@jrray jrray force-pushed the fix-kafka-publish-illegal-generation branch from 679ce7f to 50472c3 Compare June 24, 2026 18:53
@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 15.08380% with 152 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
crates/spk-storage/src/storage/messaging/kafka.rs 15.08% 152 Missing ⚠️

📢 Thoughts on this report? Let us know!

@jrray jrray force-pushed the fix-kafka-publish-illegal-generation branch 3 times, most recently from 7e8b6f3 to 904eca6 Compare June 24, 2026 20:31
spk publish's listen_to_index_status_updates() subscribed to the
index-updates topic via Kafka consumer-group management and then
immediately committed starting offsets, before the consumer had
joined the group / finished rebalancing (the join only happens on
the first poll). Committing offsets with no valid group generation
intermittently failed with:

    failed to commit new starting offsets for the kafka index
    updates consumer: Consumer commit error: IllegalGeneration
    (Broker: Specified group generation id is not valid)

The commit served no purpose anyway: this consumer uses a fresh,
ephemeral Ulid group id every run, so the committed offsets are
never read back.

Switch this read-only consumer from subscribe() + commit() to
manual partition assignment (assign()) with the starting offsets
carried in the TopicPartitionList. Manual assignment has no group
generation, so the race cannot occur. The subscribe() call is
removed since a consumer cannot mix subscribe and assign.

The indexer's long-lived consumer is untouched; it legitimately
uses a stable group id and offset commits.

Add production diagnostics that explain the related failure where the
consumer reads stale messages: instead of silently falling back to
offset 0 when a per-partition watermark fetch fails (which replays the
whole partition), log a warning and skip the partition; and while
consuming, warn (once per partition) when the broker delivers a
message from before the offset that was assigned -- the signature of
an assigned offset being discarded (out-of-range falling back to
auto.offset.reset, or a failed watermark fetch). The detection is a
pure helper with a unit test.

Add an integration test and container harness to exercise the
offset/assignment behavior against a real broker. The test reproduces
production conditions (a 12-partition topic with messages in only some
partitions) and asserts the consumer is positioned at the last message
of each non-empty partition without replaying from the start. It is
#[ignore]d and only runs when SPK_TEST_KAFKA_BROKERS is set.
test-support/kafka/ stands up Kafka 2.8.1 (Confluent CP 6.2.1) in a
container; run it with 'make test-kafka'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jrray jrray force-pushed the fix-kafka-publish-illegal-generation branch from 904eca6 to a1e1e62 Compare June 24, 2026 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI Code authored with AI assistance. bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants