feat: add distributed cleanup old versions #5160

Open
LuciferYang wants to merge 3 commits into lance-format:main from LuciferYang:feat/distributed-cleanup-old-versions

@LuciferYang

Summary

  • add cleanup_old_versions for single-table old-version cleanup with URI or namespace table resolution
  • add cleanup_database_old_versions to run old-version cleanup across namespace tables with a Ray Pool
  • return serializable cleanup stats from distributed workers and skip declared-only namespace tables
  • document the cleanup APIs alongside distributed compaction
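The per-table aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it uses the standard library's `ThreadPool` as a stand-in for a Ray Pool, and `_cleanup_one`, `cleanup_all`, the table names, and the stat keys are all hypothetical.

```python
from multiprocessing.pool import ThreadPool


def _cleanup_one(table_id):
    """Hypothetical worker: returns (table_id, stats_dict, error_message)."""
    try:
        if table_id == "broken":
            # Simulate one table failing during cleanup.
            raise RuntimeError("simulated cleanup failure")
        # Plain dict of ints so the result is trivially serializable.
        # The key names here are placeholders, not confirmed field names.
        return table_id, {"old_versions": 2, "bytes_removed": 1024}, None
    except Exception as exc:
        # Report the failure instead of raising, so other tables still run.
        return table_id, None, str(exc)


def cleanup_all(table_ids, num_workers=4):
    """Run the worker over all tables and aggregate successes and failures."""
    pool = ThreadPool(processes=num_workers)
    try:
        results = pool.map(_cleanup_one, table_ids)
    finally:
        # Always release the pool, including on the failure path.
        pool.close()
        pool.join()
    stats = {t: s for t, s, err in results if err is None}
    failures = {t: err for t, s, err in results if err is not None}
    return stats, failures


stats, failures = cleanup_all(["a", "broken", "b"])
```

Here one failing table does not stop the others from being processed; how the real API reports the collected failures to the caller is not shown in this sketch.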

Testing

  • python -m ruff check .
  • python -m ruff format --check lance_ray/cleanup.py tests/test_cleanup.py lance_ray/__init__.py
  • python -m pytest tests/test_cleanup.py tests/test_distributed_compaction.py::TestCompactDatabase -q
  • manual smoke test of cleanup against a real dir namespace with two tables and a Ray Pool

Closes #98

github-actions bot added the enhancement label on Jun 4, 2026

Follow-up review hardening on top of the distributed cleanup feature.

- Reject retain_versions <= 0 in both cleanup_old_versions and
  cleanup_database_old_versions. Without this check, retain_versions=0
  triggers a Rust PanicException in Lance core. PanicException derives
  from BaseException, so it escapes the worker's `except Exception`,
  aborts the whole batch with a generic error, and breaks the documented
  per-table aggregation contract.
- Expand docstrings to match the standard set in compaction.py: full
  Args/Returns/Raises, the deliberate parallel-aggregate vs
  serial-fail-fast contrast, num_workers semantics, the
  non-atomic/destructive Warning, the 7-day delete_unverified guard, and
  the conditional two-week older_than default.
- Extract the page-size literal into the named _LIST_TABLES_PAGE_SIZE,
  annotate _handle_cleanup_table's return type, and tighten
  _cleanup_stats_to_dict to dict[str, int].
- Document the parallel/aggregate behavior, non-atomic warning, and stat
  field names in docs/src/compaction.md.
- Strengthen tests: uri-vs-namespace validation, retain_versions rejection,
  safe-default pinning (incl. num_workers=4), 3-page pagination, exact
  partial-failure message, failure-path pool close/join and __cause__
  chaining, and a CleanupStats field drift guard.
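
The retain_versions guard can be illustrated with a small sketch. `FakePanic` is a stand-in for a Rust panic surfaced to Python, and `validate_retain_versions` and `worker_without_guard` are hypothetical helpers; the sketch also assumes retain_versions may be None, which the commit message does not confirm.

```python
class FakePanic(BaseException):
    """Stand-in for a Rust panic; derives from BaseException, not Exception."""


def validate_retain_versions(retain_versions):
    """Hypothetical up-front check: reject non-positive values early."""
    if retain_versions is not None and retain_versions <= 0:
        raise ValueError(
            f"retain_versions must be a positive integer, got {retain_versions}"
        )


def worker_without_guard(retain_versions):
    """Worker that only catches Exception, as in the failure mode described."""
    try:
        if retain_versions == 0:
            raise FakePanic("simulated panic")
        return "ok"
    except Exception as exc:
        # Never reached for FakePanic: it is not an Exception subclass.
        return f"handled: {exc}"


def panic_escapes():
    """Return True if the simulated panic bypasses `except Exception`."""
    try:
        worker_without_guard(0)
    except BaseException as exc:
        return isinstance(exc, FakePanic)
    return False
```

Validating before dispatching work turns the bad argument into an ordinary ValueError at the call site, so the worker's `except Exception` never has to deal with a panic.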