feat: support planning cleanup #7147

Draft
yanghua wants to merge 9 commits into lance-format:main from yanghua:cleanup-plan

Conversation

@yanghua yanghua commented Jun 8, 2026

Collaborator

Motivation

Adds a plan + execute flow for cleanup so deletions become auditable. Per reviewer feedback on the original design, the plan now follows SQL EXPLAIN semantics: it is a read-only audit report, not a frozen deletion contract.

Design

  • Plan is an EXPLAIN-style report. CleanupPlan exposes:

    • dataset_uri, base_path, policy, created_at — authoritative inputs reused by execution.
    • files, stats, referenced_branches, tagged_old_versions — informational; describe the dataset state observed at planning time.
  • Execute re-plans against the current dataset state. cleanup_with_plan:

    1. Validates dataset_uri / base_path match the current dataset.
    2. Re-runs planning with plan.policy.
    3. Deletes from the freshly computed set.

    Tags, branches, and commits added after planning are naturally honored — including the cases where they do not advance the manifest version.

  • Drops the read_version enforcement. latest_version_id does not move when new tags or branch refs are created, so the previous check did not protect against the TOCTOU case the reviewer raised. Re-planning is strictly stronger.
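
The three execution steps can be illustrated with a small in-memory model. This is a minimal sketch, not the code in this PR: `ToyDataset`, `ToyPlan`, and the synchronous signatures are hypothetical stand-ins, the policy is omitted, and each file is assumed to belong to exactly one version (real datasets share files across versions). It only shows why re-planning honors a tag that was added after planning without moving the latest version.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy stand-in for a dataset (hypothetical, for illustration only).
#[derive(Debug, Clone)]
pub struct ToyDataset {
    pub uri: String,
    pub latest_version: u64,
    pub files_by_version: BTreeMap<u64, Vec<String>>,
    pub tagged_versions: BTreeSet<u64>,
}

/// Toy plan: `dataset_uri` is an authoritative input, `files` is informational.
#[derive(Debug, Clone)]
pub struct ToyPlan {
    pub dataset_uri: String,
    pub files: Vec<String>,
}

/// Planning: files of every version that is neither latest nor tagged.
pub fn plan_cleanup(ds: &ToyDataset) -> ToyPlan {
    let files = ds
        .files_by_version
        .iter()
        .filter(|(v, _)| **v != ds.latest_version && !ds.tagged_versions.contains(*v))
        .flat_map(|(_, fs)| fs.iter().cloned())
        .collect();
    ToyPlan { dataset_uri: ds.uri.clone(), files }
}

/// Execution: validate, re-plan, delete from the fresh set only.
pub fn cleanup_with_plan(ds: &mut ToyDataset, plan: &ToyPlan) -> Result<Vec<String>, String> {
    // 1. The plan must target this dataset.
    if plan.dataset_uri != ds.uri {
        return Err(format!("plan is for {}, not {}", plan.dataset_uri, ds.uri));
    }
    // 2. Re-plan against the current state; `plan.files` is never consulted.
    let fresh = plan_cleanup(ds);
    // 3. Delete only what the fresh plan names.
    for files in ds.files_by_version.values_mut() {
        files.retain(|f| !fresh.files.contains(f));
    }
    Ok(fresh.files)
}

/// Returns (files listed at planning time, files actually removed).
pub fn demo_tag_added_after_planning() -> (Vec<String>, Vec<String>) {
    let mut ds = ToyDataset {
        uri: "memory://toy".to_string(),
        latest_version: 3,
        files_by_version: BTreeMap::from([
            (1, vec!["v1.data".to_string()]),
            (2, vec!["v2.data".to_string()]),
            (3, vec!["v3.data".to_string()]),
        ]),
        tagged_versions: BTreeSet::new(),
    };
    let plan = plan_cleanup(&ds);
    // A tag lands after planning; `latest_version` stays at 3.
    ds.tagged_versions.insert(1);
    let removed = cleanup_with_plan(&mut ds, &plan).expect("same dataset");
    (plan.files, removed)
}
```

In this model a guard of the form "latest version unchanged since planning" would pass, because the tag does not advance the version, yet executing the stored file list would delete `v1.data`. Re-planning keeps it.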

API

pub struct CleanupPlan {
    pub dataset_uri: String,
    pub base_path: String,
    pub policy: CleanupPolicy,
    pub created_at: DateTime<Utc>,
    pub files: Vec<CleanupFile>,
    pub stats: CleanupPlanStats,
    pub referenced_branches: Vec<CleanupReferencedBranch>,
    pub tagged_old_versions: Vec<u64>,
}

pub async fn plan_cleanup(dataset: &Dataset, policy: CleanupPolicy) -> Result<CleanupPlan>;
pub async fn cleanup_with_plan(dataset: &Dataset, plan: CleanupPlan) -> Result<RemovalStats>;

Impact

  • Audit value is preserved: policy, referenced branches, tag-protected versions, and per-category stats are still surfaced.
  • The actual delete set may differ from plan.files if the dataset changes between plan and execute. This upper-bound semantics matches SQL EXPLAIN and is documented on CleanupPlan and cleanup_with_plan.
  • Public API shape is unchanged aside from removing the read_version field, which has not been released yet.
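
A caller who wants to audit how far execution drifted from the report could compare the two path sets. The helper below is a hypothetical sketch: `plan_drift` and `PlanDrift` are not part of this PR, and it assumes the caller has both the planned paths and the paths actually removed as plain string lists, which `RemovalStats` may or may not expose.

```rust
use std::collections::BTreeSet;

/// Hypothetical audit result comparing a plan with what execution removed.
pub struct PlanDrift {
    /// Listed in the plan but kept, for example protected by a newer tag.
    pub spared: Vec<String>,
    /// Removed although absent from the plan. Expected to be empty if
    /// `plan.files` is an upper bound, as the description states.
    pub unplanned: Vec<String>,
}

/// Computes both set differences; output is sorted because BTreeSet is ordered.
pub fn plan_drift(planned: &[String], removed: &[String]) -> PlanDrift {
    let planned: BTreeSet<&String> = planned.iter().collect();
    let removed: BTreeSet<&String> = removed.iter().collect();
    PlanDrift {
        spared: planned.difference(&removed).map(|p| (**p).clone()).collect(),
        unplanned: removed.difference(&planned).map(|p| (**p).clone()).collect(),
    }
}
```

Reporting `unplanned` separately keeps the upper-bound claim checkable rather than assumed.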

Tests

  • cleanup_with_plan_reevaluates_against_concurrent_commit — execute after a concurrent commit succeeds and removes the correct set.
  • cleanup_with_plan_handles_encoded_paths — re-planning still discovers temp manifests with URL-unsafe names.
  • Replaced tests that asserted on read_version with behavior-level coverage of the new semantics.

@github-actions github-actions Bot added the enhancement New feature or request label Jun 8, 2026
codecov Bot commented Jun 8, 2026


Codecov Report

❌ Patch coverage is 87.43169% with 46 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/dataset/cleanup.rs | 88.88% | 26 Missing and 14 partials ⚠️ |
| rust/lance/src/dataset.rs | 0.00% | 6 Missing ⚠️ |

yanghua commented Jun 8, 2026

Collaborator Author

@claude review

Comment thread rust/lance/src/dataset/cleanup.rs
Comment thread rust/lance/src/dataset/cleanup.rs Outdated
@yanghua yanghua marked this pull request as ready for review June 8, 2026 11:38

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@yanghua yanghua requested a review from Xuanwo June 8, 2026 13:57

@Xuanwo Xuanwo left a comment

Collaborator

I think this should follow SQL EXPLAIN semantics: the cleanup plan should be a dry-run/audit report, not a materialized deletion plan.

Execution should re-evaluate cleanup from the current dataset/ref state instead of trusting an old file list. For example, a tag or branch can be added after planning without advancing the manifest version, so the current read_version check can still pass while the old plan deletes files that are now protected.

yanghua commented Jun 9, 2026

Collaborator Author

> I think this should follow SQL EXPLAIN semantics: the cleanup plan should be a dry-run/audit report, not a materialized deletion plan.
>
> Execution should re-evaluate cleanup from the current dataset/ref state instead of trusting an old file list. For example, a tag or branch can be added after planning without advancing the manifest version, so the current read_version check can still pass while the old plan deletes files that are now protected.

Sounds reasonable. My original idea was also a dry run. I think I misunderstood what you meant by "plan" when we talked offline earlier. I will refactor later.

@yanghua yanghua marked this pull request as draft June 9, 2026 04:45