Skip to content

Use GetLatestSnapshot in CleanupRecordExists#8594

Open
sfc-gh-mslot wants to merge 1 commit into
citusdata:mainfrom
sfc-gh-mslot:marcoslot/cleanup-snapshot
Open

Use GetLatestSnapshot in CleanupRecordExists#8594
sfc-gh-mslot wants to merge 1 commit into
citusdata:mainfrom
sfc-gh-mslot:marcoslot/cleanup-snapshot

Conversation

@sfc-gh-mslot
Copy link
Copy Markdown

DESCRIPTION: Prevent a possible race condition in shard cleanup

systable_beginscan without a snapshot calls GetCatalogSnapshot, which might be cached from a previous call. That means that if a shard move succeeded and the deletion of the cleanup record completed just before the cleaner tries to acquire a lock, CleanupRecordExists might still see the record in its snapshot and shard cleanup will incorrectly proceed.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.75%. Comparing base (810e7fb) to head (2d018eb).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8594      +/-   ##
==========================================
- Coverage   88.75%   88.75%   -0.01%     
==========================================
  Files         288      288              
  Lines       64303    64304       +1     
  Branches     8090     8090              
==========================================
- Hits        57075    57072       -3     
- Misses       4892     4893       +1     
- Partials     2336     2339       +3     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ScanKeyInit(&scanKey[0], Anum_pg_dist_cleanup_record_id,
BTEqualStrategyNumber, F_INT8EQ, Int64GetDatum(recordId));

/* make sure we can always see deletion after acquiring the operation ID lock */
Copy link
Copy Markdown
Contributor

@colm-mchugh colm-mchugh May 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment format strays from convention. Maybe add this text to the function comment at line 1179 ?

Copy link
Copy Markdown
Contributor

@colm-mchugh colm-mchugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR, it looks like it fixes a real race condition by ensuring the catalog is read at a fresh snapshot. Is it possible to add a regress test ? An isolation test where the maintenance daemon contends over pg_dist_cleanup with a shard move/split operation, for example. The main concern is that without regress test(s), a future code change could compromise this fix.

@sfc-gh-mslot
Copy link
Copy Markdown
Author

Is it possible to add a regress test ? An isolation test where the maintenance daemon contends over pg_dist_cleanup with a shard move/split operation

Since cleanup does not block waiting for the lock, the isolation testing framework cannot intercept it. It would require injection points, but Citus does not use those yet.

@colm-mchugh
Copy link
Copy Markdown
Contributor

colm-mchugh commented May 24, 2026

Since cleanup does not block waiting for the lock, the isolation testing framework cannot intercept it. It would require injection points, but Citus does not use those yet.

How did you encounter this bug ? Curious what prompted the PR. Another test approach is a superuser only GUC that reflects a millisec delay, and enables a test to create a race and assert there's only one drop. But that may be overkill for this PR - possibly requires a TAP test - so for this but can you update the function's comment with fix details before we merge it ? And also a note about a future test using Postgres injection points. It's worth emphasizing since comments are a useful (and in this case, the only) defense again a silent regression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants