feat: support planning cleanup #7147

Draft
yanghua wants to merge 3 commits into lance-format:main from yanghua:cleanup-plan
Collaborator

@yanghua yanghua commented Jun 8, 2026

Background / Motivation

cleanup_old_versions today behaves as a black box: callers hand in a policy
and either get a RemovalStats back or, on failure, a partially mutated
dataset with no record of which files were inspected, kept, or deleted.

That opacity becomes painful in three scenarios we hit in production:

  1. Operational dry-run. Operators want to know exactly which files an
    upcoming cleanup will remove (and how many bytes that frees) before
    actually running it, especially on tables with 100k+ fragments where a
    mistaken policy could remove tens of GB.
  2. Auditing and reproducibility. When a cleanup is triggered automatically
    (commit hooks, schedulers), there is no artifact we can inspect afterwards
    to answer "why did this file go away?". The tracing audit log helps, but
    only if you were already capturing it.
  3. Two-phase execution. Some deployments want to plan on one node and
    execute on another (or in a maintenance window), which the current API
    does not support at all.

This PR splits cleanup into an explicit plan and execute pair, while
keeping the existing cleanup_old_versions entry point byte-for-byte
compatible. The plan is a serializable description of every file the cleanup
intends to delete, the reason it qualifies, and the dataset snapshot it was
built from.

What's in this PR

  • New public APIs:
    • plan_cleanup(&Dataset, CleanupPolicy) -> CleanupPlan
    • cleanup_with_plan(&Dataset, CleanupPlan) -> RemovalStats
  • CleanupPlan / CleanupFile / CleanupFileKind / CleanupFileReason /
    CleanupPlanStats / CleanupReferencedBranch data types.
  • Internal refactor of CleanupTask into three explicit execution paths
    (cleanup_old_versions / cleanup_with_plan / commit hooks), each with
    its own trust model documented in-source.
  • cleanup_with_plan validates: dataset URI, base path, that every planned
    path stays under the dataset base, and that the plan's read_version
    still matches the storage-resolved latest version. A residual TOCTOU
    window between the version check and the deletes is documented in the
    rustdoc; callers running concurrent writers must serialize externally.
  • Plan creation resolves the latest version from storage rather than from
    the in-memory dataset handle, so plans built from a stale handle are
    still safe.
  • Listing-consistency guard: planning fails if list_manifest_locations
    did not return the storage-resolved latest version (defends against
    eventual-consistency or racing list output).
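
The path-containment check listed above ("every planned path stays under the dataset base") could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `stays_under_base` is a hypothetical helper, and it assumes paths can be compared lexically with `std::path`, whereas the real code works on object-store paths.

```rust
use std::path::{Component, Path};

/// Hypothetical helper (not from the PR): returns true when `candidate`
/// lies under `base` and contains no `..` segment that could escape it.
fn stays_under_base(base: &str, candidate: &str) -> bool {
    let candidate = Path::new(candidate);
    // `Path::starts_with` compares whole components lexically, so a `..`
    // segment must be rejected explicitly before the prefix check.
    let has_parent_dir = candidate
        .components()
        .any(|component| matches!(component, Component::ParentDir));
    !has_parent_dir && candidate.starts_with(base)
}
```

Because the comparison is component-wise, a sibling such as `data/ds2/...` does not count as being under `data/ds`, which a plain string-prefix check would get wrong.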

Behavior changes worth flagging

  • RemovalStats returned by cleanup_old_versions now includes the stats
    of cascaded clean_referenced_branches cleanups. Previously those were
    silently dropped. Monitoring/dashboards that compared against the old
    numbers will see an increase.
  • cleanup_with_plan will reject a plan if any commit lands between
    plan_cleanup and cleanup_with_plan on the same dataset. This is by
    design — see rustdoc. The internal cleanup_old_versions path is
    unaffected.

Tests

  • plan_cleanup_does_not_delete_files
  • plan_cleanup_uses_latest_version_with_stale_handle
  • cleanup_with_plan_rejects_stale_version
  • cleanup_with_plan_rejects_toctou_commit_with_stale_handle
  • internal_cleanup_plan_allows_toctou_commit_before_delete
  • process_manifests_rejects_listing_missing_latest_version
  • All existing cleanup_old_versions / cleanup_with_policy tests continue
    to pass unmodified.

github-actions bot added the enhancement (New feature or request) label Jun 8, 2026
codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 90.30023% with 42 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
rust/lance/src/dataset/cleanup.rs | 92.43% | 19 Missing and 13 partials ⚠️
rust/lance/src/dataset.rs | 0.00% | 10 Missing ⚠️


Collaborator Author

yanghua commented Jun 8, 2026

@claude review

Comment on lines +289 to +296
async fn plan_with_referenced_branches(
    &self,
    referenced_branches: Vec<(String, u64)>,
) -> Result<CleanupPlan> {
    let latest_version = self.dataset.latest_version_id().await?;
    self.plan_with_referenced_branches_at_version(referenced_branches, latest_version)
        .await
}

🔴 plan.read_version is captured before manifest listing, causing spurious cleanup_with_plan rejections when a commit lands during planning. plan_with_referenced_branches (cleanup.rs:289-296) resolves latest_version via latest_version_id() before listing manifests, but the listing-consistency guard (cleanup.rs:377-378) accepts max_listed_version >= latest_version, and process_manifest_file treats any manifest with version >= latest_version as part of the working set and protects its references. So a commit V+1 landing during planning is absorbed into the plan, yet plan.read_version is still stamped with the older V, and validate_plan_read_version rejects with "plan was created from version V, but latest dataset version is V+1" even though no commit landed between plan completion and execute. Fix: set plan.read_version = inspection.max_listed_version (or re-resolve latest_version_id after listing) so the recorded read-version matches what the planner actually observed.

Extended reasoning...

What the bug is

plan_with_referenced_branches resolves latest_version from storage at the start of planning (cleanup.rs:293), then passes that value into plan_with_referenced_branches_at_version → process_manifests → process_manifest_file. While planning, two things happen with that captured latest_version:

  1. process_manifest_file (cleanup.rs:414) computes is_latest = latest_version <= manifest.version. Any newly-listed manifest with a version >= latest_version is treated as in the working set, and its referenced data/transaction/index/deletion files are added to inspection.referenced_files (i.e. protected from deletion).
  2. process_manifests (cleanup.rs:377-378) accepts the listing as consistent whenever max_listed_version >= latest_version. The PR description explicitly calls this a guard against "eventual-consistency or racing list output".

Then build_cleanup_plan stamps plan.read_version = latest_version — the pre-listing snapshot, not what the planner actually observed.
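
The two planner behaviours described above can be modelled with a small toy. This is a sketch under stated assumptions: `ToyInspection` and `toy_process_manifests` are hypothetical stand-ins operating on bare version numbers, and only the `is_latest` comparison and the listing-consistency guard are modelled, not the real manifest handling.

```rust
/// Toy summary of one manifest listing; names are illustrative, not the PR's types.
#[derive(Debug, PartialEq)]
struct ToyInspection {
    max_listed_version: Option<u64>,
    /// Versions for which `latest_version <= manifest.version` held.
    treated_as_latest: Vec<u64>,
}

/// Hypothetical model of the guard: fail unless the listing reached `latest_version`.
fn toy_process_manifests(listed: &[u64], latest_version: u64) -> Result<ToyInspection, String> {
    let mut inspection = ToyInspection {
        max_listed_version: None,
        treated_as_latest: Vec::new(),
    };
    for &version in listed {
        // Mirrors `is_latest = latest_version <= manifest.version`.
        if latest_version <= version {
            inspection.treated_as_latest.push(version);
        }
        let max_so_far = inspection
            .max_listed_version
            .map_or(version, |current| current.max(version));
        inspection.max_listed_version = Some(max_so_far);
    }
    let max_listed = inspection.max_listed_version;
    match max_listed {
        // A listing that saw a newer version than expected is still accepted.
        Some(max) if max >= latest_version => Ok(inspection),
        Some(max) => Err(format!(
            "listing stopped at version {}, but storage resolved version {}",
            max, latest_version
        )),
        None => Err("no manifests were listed".to_string()),
    }
}
```

In this model a listing of versions 1, 2, 3 against a captured latest_version of 2 passes the guard and treats both 2 and 3 as latest, which is the tolerance that the stamped read_version then fails to reflect.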

The race window

There is real I/O between latest_version_id() (a store request via resolve_latest_location) and the start of list_manifest_locations. Suppose:

  1. Planner calls latest_version_id() → returns V (e.g. 2).
  2. A concurrent writer commits → storage latest is now V+1 (3).
  3. Planner runs list_manifest_locations. It returns v1, v2, and v3.
  4. For v3: is_latest = V (2) <= 3 = true → in_working_set = true → v3's references are added to verified_files. max_listed_version becomes 3.
  5. Listing-consistency check: 3 >= 2 → passes.
  6. build_cleanup_plan writes plan.read_version = 2.
  7. Operator immediately calls cleanup_with_plan(plan). No commit happens between step 6 and step 7.
  8. validate_plan_read_version (cleanup.rs:722) re-resolves latest_version_id() → 3. 2 != 3 → Error::Cleanup: "cleanup plan was created from version 2, but latest dataset version is 3".

The plan was internally consistent with V+1 — the planner saw V+1, treated it as live, and kept its files protected. The plan would have safely executed. The rejection is spurious.

Why existing code doesn't prevent it

The PR's rustdoc on cleanup_with_plan documents: "cleanup_with_plan will reject a plan if any commit lands between plan_cleanup and cleanup_with_plan on the same dataset." That contract is what validate_plan_read_version is implementing. But here the commit landed during plan_cleanup, not between plan and execute, and the planner already absorbed it. The strict-equality check immediately undoes the planner's TOCTOU tolerance.

The existing test internal_cleanup_plan_allows_toctou_commit_before_delete (added in this PR, lines around 2210) explicitly demonstrates that the author already considers "plan saw a newer manifest than read_version claims" to be a legal, safe state — that test has to call execute_plan_unchecked to skip the version check. The new public cleanup_with_plan API has no such escape hatch.

Impact

For the PR's stated motivation (operators planning cleanups on tables with 100k+ fragments in busy production environments), every concurrent commit that happens to land during the latest_version_id() → list_manifest_locations window forces a full re-plan, even though the planner already observed and accepted the newer state. The user is shown an error that says "a commit landed between plan and execute" when in fact no such commit happened. This is precisely the operational scenario the listing-consistency guard was designed to tolerate.

No data is at risk — the failure mode is rejection, not deletion of live files.

Fix

Set plan.read_version from what the planner actually observed rather than the pre-listing snapshot. Either:

// In build_cleanup_plan, instead of latest_version:
read_version: inspection.max_listed_version
    .expect("process_manifests would have errored if no manifests were listed"),

or re-resolve latest_version_id() after process_manifests returns and use that. inspection.max_listed_version is guaranteed Some here because process_manifests (cleanup.rs:387-390) errors out if it is None. With this change, the post-plan equality check in validate_plan_read_version accepts the V+1-absorbed plan, while still rejecting any genuine commit landing between plan completion and cleanup_with_plan.
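
The effect of the proposed change can be checked against the race-window example with a toy model. This is only a sketch: `ToyPlan`, `build_plan_pre_listing`, and `build_plan_from_listing` are hypothetical names, and the validator is a simplified stand-in for the strict-equality check described for validate_plan_read_version.

```rust
/// Toy plan: only the field relevant to the version check is modelled.
struct ToyPlan {
    read_version: u64,
}

/// Current behaviour as described above: stamp the pre-listing snapshot.
fn build_plan_pre_listing(latest_at_start: u64, _max_listed_version: Option<u64>) -> ToyPlan {
    ToyPlan { read_version: latest_at_start }
}

/// Proposed behaviour: stamp the newest version the listing actually observed.
fn build_plan_from_listing(max_listed_version: Option<u64>) -> ToyPlan {
    ToyPlan {
        read_version: max_listed_version
            .expect("process_manifests would have errored if no manifests were listed"),
    }
}

/// Simplified stand-in for the strict-equality check at execute time.
fn toy_validate_read_version(plan: &ToyPlan, storage_latest: u64) -> Result<(), String> {
    if plan.read_version == storage_latest {
        Ok(())
    } else {
        Err(format!(
            "cleanup plan was created from version {}, but latest dataset version is {}",
            plan.read_version, storage_latest
        ))
    }
}
```

With V = 2 captured up front and a listing that reached 3, the pre-listing plan is rejected against a storage latest of 3, while the listing-based plan is accepted and is still rejected once a later commit moves storage to 4.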

Comment on lines +1385 to +1388
pub async fn cleanup_with_plan(dataset: &Dataset, plan: CleanupPlan) -> Result<RemovalStats> {
    let cleanup = CleanupTask::new(dataset, plan.policy.clone());
    cleanup.execute_plan(plan).await
}

🔴 cleanup_with_plan silently skips the clean_referenced_branches cascade. When a caller builds a plan with CleanupPolicy { clean_referenced_branches: true, .. } and then runs cleanup_with_plan, child branches are never visited and the returned RemovalStats omits their contribution — even though the rustdoc on plan_cleanup_with_policy (dataset.rs:1208-1213) promises parity with cleanup_with_policy for the same policy. Fix by cascading on plan.referenced_branches inside execute_plan when policy.clean_referenced_branches is true (or rejecting/documenting that the plan path does not support that policy).

Extended reasoning...

What is broken

Dataset::cleanup_with_plan (dataset.rs:1230) calls cleanup::cleanup_with_plan → CleanupTask::execute_plan (cleanup.rs:356) → execute_plan_unchecked. execute_plan_unchecked only iterates plan.files and deletes the listed paths. The plan.referenced_branches field, which build_cleanup_plan (cleanup.rs:622-625) faithfully populates from the planning-time branch scan, is read by no code path during execution. By contrast, the equivalent direct entry point cleanup_old_versions → CleanupTask::run → run_at_version (cleanup.rs:255-267) explicitly does:

if self.policy.clean_referenced_branches {
    let branch_stats = self.clean_referenced_branches(&referenced_branches).await?;
    final_stats.bytes_removed += branch_stats.bytes_removed;
    // ... accumulates the rest of branch_stats into final_stats
}

clean_referenced_branches cascades into cleanup_cascade_branch for each child branch, which itself runs a full file cleanup against that branch. Skipping it means child-branch garbage is never collected.

Why nothing protects against this

plan_cleanup accepts any CleanupPolicy — including one with clean_referenced_branches=true — and produces a populated referenced_branches Vec. cleanup_with_plan validates the dataset URI, base path, file paths, and read_version, but does not inspect plan.policy.clean_referenced_branches. There is no warning, no error, no log line.

Impact

This is the exact use case the PR motivates: a dry-run + execute pair. An operator who currently relies on cleanup_old_versions(policy{clean_referenced_branches:true}) and wants to migrate to plan/execute will get silently divergent behaviour:

  1. plan_cleanup returns stats for the current branch only (the CleanupPlanStats field counts files in plan.files, which never includes child-branch files).
  2. cleanup_with_plan deletes only those files; child-branch storage continues to accumulate.
  3. The returned RemovalStats is missing the branch contribution that cleanup_old_versions would have reported (and that the PR description explicitly highlights as a new behavior: "RemovalStats returned by cleanup_old_versions now includes the stats of cascaded clean_referenced_branches cleanups").

This directly contradicts the rustdoc on plan_cleanup_with_policy (dataset.rs:1208-1213): "The returned plan contains the concrete files that would be removed by Self::cleanup_with_policy for the same dataset snapshot and policy." For clean_referenced_branches=true, that statement is false.

Step-by-step proof

Setup: a dataset with one child branch feature/x, both with old versions eligible for cleanup. Policy: CleanupPolicy { before_timestamp: Some(now), clean_referenced_branches: true, .. }.

Path A, dataset.cleanup_with_policy(policy):

  1. cleanup_old_versions → CleanupTask::run → run_at_version.
  2. Line 259: self.find_referenced_branches() returns [("feature/x", root_version)].
  3. Line 261: policy flag is true → self.clean_referenced_branches(&referenced_branches) runs, which calls cleanup_cascade_branch against branch feature/x. Suppose that frees 500 MB across 30 data files.
  4. Lines 262-267: those numbers are folded into final_stats.
  5. Main-branch plan is built and executed; suppose that frees 100 MB across 10 files.
  6. Returned RemovalStats: bytes_removed = 600 MB, data_files_removed = 40.

Path B, let p = dataset.plan_cleanup_with_policy(policy).await?; dataset.cleanup_with_plan(p).await?:

  1. plan_cleanup → CleanupTask::plan → plan_with_referenced_branches. find_referenced_branches runs and the result is only attached to plan.referenced_branches (cleanup.rs:622-625). build_cleanup_plan does not scan child-branch files.
  2. plan.files contains only main-branch entries; plan.stats reflects only 100 MB / 10 files.
  3. cleanup_with_plan → execute_plan → validate_plan + validate_plan_read_version + execute_plan_unchecked.
  4. execute_plan_unchecked iterates plan.files. plan.referenced_branches is in scope but never read.
  5. Returned RemovalStats: bytes_removed = 100 MB, data_files_removed = 10. Branch feature/x is untouched.

500 MB / 30 files silently leaked, no error, no log. Two API paths the rustdoc promises to be equivalent diverge.

Suggested fix

In execute_plan (cleanup.rs:356-360), before calling execute_plan_unchecked, check plan.policy.clean_referenced_branches. If true, build the Vec<(String, u64)> from plan.referenced_branches, call self.clean_referenced_branches(...), and fold those stats into the result returned by execute_plan_unchecked (mirroring run_at_version lines 261-267). Alternatively, plan_cleanup could reject policies with clean_referenced_branches=true and the rustdoc should be updated to call that out. The first option preserves the documented equivalence; the second is a smaller change but is a feature gap.
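
The first option can be illustrated with a toy model of the stats folding, using the figures from the step-by-step proof (100 MB / 10 files on the main branch, 500 MB / 30 files on feature/x). Everything here is a hypothetical stand-in: `ToyRemovalStats`, `ToyCascadePlan`, and `execute_plan_with_cascade` are not the PR's types, and the branch stats are precomputed rather than produced by running a real branch cleanup.

```rust
/// Toy stats: only the two fields quoted in the review are modelled (bytes in MB).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct ToyRemovalStats {
    bytes_removed: u64,
    data_files_removed: u64,
}

impl ToyRemovalStats {
    /// Accumulate another cleanup's stats into this one.
    fn fold(&mut self, other: ToyRemovalStats) {
        self.bytes_removed += other.bytes_removed;
        self.data_files_removed += other.data_files_removed;
    }
}

/// Toy plan: what the main-branch deletes would free, plus what each
/// referenced branch's cascade would free (an assumption for illustration).
struct ToyCascadePlan {
    clean_referenced_branches: bool,
    main_stats: ToyRemovalStats,
    branch_stats: Vec<ToyRemovalStats>,
}

/// Sketch of the suggested fix: cascade only when the policy flag is set,
/// then fold the branch stats into the result.
fn execute_plan_with_cascade(plan: &ToyCascadePlan) -> ToyRemovalStats {
    let mut final_stats = plan.main_stats;
    if plan.clean_referenced_branches {
        for branch in &plan.branch_stats {
            final_stats.fold(*branch);
        }
    }
    final_stats
}
```

With the flag set, the model reports 600 MB and 40 files, matching Path A; with the flag cleared it reports only the main-branch 100 MB and 10 files, which is what Path B returns today regardless of the policy.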
