Prevent checkpointed DTLs from restarting resilvers #18592

Open
favilances wants to merge 1 commit into openzfs:master from favilances:fix-checkpoint-resilver-loop

@favilances

Motivation and Context

Pool checkpoints intentionally preserve DTL_MISSING entries after a scan because checkpoint-only blocks are not traversed. That retained DTL is needed if the pool is rewound, but it does not mean a new resilver was deferred.

Deferred resilvers rely on vdev_resilver_deferred to mark vdevs that missed txgs outside the active scan range. Treating any remaining DTL as deferred work breaks that distinction, so a completed resilver with a checkpoint can immediately queue another resilver and repeat indefinitely.

This fixes #11434 and fixes #17109. It follows the deferred-resilver intent from #7732 and keeps the restart avoidance added by #9588: a follow-up scan is only needed when a vdev was actually deferred.

Description

The follow-up resilver request is now gated on whether a leaf vdev had vdev_resilver_deferred set before it was cleared. Checkpoint-retained DTLs remain intact, while real deferred resilvers still run after the current scan completes.
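The gating described above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not the code in module/zfs/vdev.c: `struct model_vdev` and both `model_clear_deferred_*` functions are hypothetical names, the three booleans stand in for the real vdev state, and combining the deferred flag with the availability and DTL checks is an assumption about how the gate is applied.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a vdev; not the real vdev_t. */
struct model_vdev {
	bool resilver_deferred;	/* models vdev_resilver_deferred */
	bool has_dtl_missing;	/* models a non-empty DTL_MISSING */
	bool available;		/* models "not dead and not offline" */
	struct model_vdev **child;
	size_t children;
};

/*
 * Models the behavior described as the bug: after clearing the flag,
 * any remaining DTL on an available leaf requests another resilver.
 */
static bool
model_clear_deferred_old(struct model_vdev *vd)
{
	bool needed = false;

	for (size_t c = 0; c < vd->children; c++)
		needed |= model_clear_deferred_old(vd->child[c]);
	if (vd->children != 0)
		return (needed);

	vd->resilver_deferred = false;
	return (vd->available && vd->has_dtl_missing);
}

/*
 * Models the behavior described as the fix: remember whether the leaf
 * was actually marked deferred before clearing the flag, and request
 * the follow-up resilver only in that case (assumed to be combined
 * with the existing availability and DTL checks).
 */
static bool
model_clear_deferred_new(struct model_vdev *vd)
{
	bool needed = false;

	for (size_t c = 0; c < vd->children; c++)
		needed |= model_clear_deferred_new(vd->child[c]);
	if (vd->children != 0)
		return (needed);

	bool was_deferred = vd->resilver_deferred;
	vd->resilver_deferred = false;
	return (was_deferred && vd->available && vd->has_dtl_missing);
}
```

In this model, a leaf that still has a checkpoint-retained DTL but was never marked deferred makes the old variant return true, which corresponds to the repeated restart, while the new variant returns false. A leaf that was marked deferred returns true in both variants.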

A regression test creates a checkpoint during an attach resilver, lets the resilver finish, and verifies that no second resilver_start event is generated.

How Has This Been Tested?

  • scripts/cstyle.pl module/zfs/vdev.c
  • git diff --check
  • bash -n tests/zfs-tests/tests/functional/replacement/resilver_restart_003.ksh
  • scripts/commitcheck.sh HEAD

I did not run the ZFS Test Suite in this environment because this checkout is not configured (Makefile and config.status are absent) and ksh is not installed.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Pool checkpoints intentionally keep DTL_MISSING entries after a scan
because checkpoint-only blocks are not traversed. Those retained DTLs
are needed if the pool is rewound, but they are not evidence of a
deferred resilver.

vdev_clear_resilver_deferred() treated any remaining DTL on an
available leaf vdev as a reason to start another resilver, even when
the vdev had not been marked vdev_resilver_deferred. With a checkpoint
present, a completed resilver therefore queued another resilver
indefinitely.

Only request the follow-up scan when a deferred flag was actually
cleared. Retained checkpoint DTLs stay preserved, while real deferred
resilvers still run after the current scan completes.

Add a regression test for the checkpoint case.

Closes openzfs#11434
Closes openzfs#17109

Signed-off-by: Favilances <78090594+favilances@users.noreply.github.com>
@behlendorf added the "Status: Code Review Needed" label (Ready for review and testing) on May 28, 2026

Development

Successfully merging this pull request may close these issues.

  • Resilver restarts repeatedly when checkpoint is present
  • infinit loop in resilver if checkpoint present