implement thorough scrub support (zpool scrub -t)#18474
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds support for a new thorough scrub mode (zpool scrub -t) that forces blocks to be decrypted and decompressed as they are scrubbed, enabling detection of corruption cases that standard scrub can miss.
Changes:
- Adds
-tflag plumbing fromzpool(8)CLI → libzfs → ioctl/kernel scan code, including new scrub command bit(s) and scan-status reporting. - Updates scrub I/O behavior to drop
ZIO_FLAG_RAWunder thorough scrub and to tolerate missing encryption keys (fallback to normal scrub behavior for those blocks). - Adds a functional zfs-test for thorough scrub and updates manpages to document the new option.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_scrub_thorough.ksh | Adds a new functional test covering thorough scrub across compression/encryption and pause/resume scenarios. |
| tests/zfs-tests/tests/Makefile.am | Installs the new thorough scrub test script. |
| tests/zfs-tests/cmd/libzfs_input_check.c | Adjusts ioctl input-validation test expectations for the updated scrub command encoding. |
| tests/runfiles/common.run | Adds the new thorough scrub test to the default functional runfile. |
| module/zfs/zio.c | Allows scrub to tolerate missing keys (EACCES) without posting I/O/data ereports, and relaxes decrypt-time assertions for scrub. |
| module/zfs/zfs_ioctl.c | Introduces stricter scrub ioctl validation and threads scrub command bits into scan start/range paths. |
| module/zfs/spa_misc.c | Exposes persisted scrub flags in scan stats for zpool status reporting. |
| module/zfs/spa.c | Extends scan entry points to accept scrub flags and forwards thorough mode into dsl_scan(). |
| module/zfs/dsl_scan.c | Persists thorough flag in scan phys state, adjusts scrub zio flags/buffer sizing, and handles missing-key EACCES during scrub completion. |
| lib/libzfs/libzfs_pool.c | Updates scrub command handling logic for new command bit semantics (incl. start detection). |
| lib/libzfs/libzfs.abi | Records the ABI change to pool_scrub_cmd enum values and new POOL_SCRUB_THOROUGH. |
| include/sys/spa.h | Updates scan function prototypes to accept scrub flags. |
| include/sys/fs/zfs.h | Converts scrub commands to a bitmask enum and extends scan stats struct for thorough reporting. |
| include/sys/dsl_scan.h | Adds DSF_SCRUB_THOROUGH scan flag and threads scan setup args through. |
| cmd/ztest.c | Updates internal scrub invocations for the new spa_scan() signature. |
| cmd/zpool/zpool_main.c | Adds -t CLI parsing and prints “thorough scrub” status when the kernel reports that mode. |
| man/man8/zpool-scrub.8 | Documents zpool scrub -t and its constraints/behavior. |
| man/man4/zfs.4 | Minor documentation wording fix (“2 hours”). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # Start scrub (async) | ||
| log_must zpool scrub -t $TESTPOOL | ||
|
|
There was a problem hiding this comment.
The test unloads the dataset key immediately after starting an async scrub, without ensuring the scrub has actually begun reading those blocks. This can be timing-dependent and flaky on fast pools. Consider waiting until zpool status shows the (thorough) scrub is in progress (or using an existing helper) before unloading the key.
| # Wait until the thorough scrub is reported in progress before | |
| # unloading the key, otherwise this test can race on fast pools. | |
| typeset scrub_started=0 | |
| typeset i=0 | |
| while (( i < 30 )); do | |
| if zpool status $TESTPOOL | grep -q "scrub in progress" && | |
| zpool status $TESTPOOL | grep -q "thorough"; then | |
| scrub_started=1 | |
| break | |
| fi | |
| sleep 1 | |
| (( i = i + 1 )) | |
| done | |
| log_must test $scrub_started -eq 1 |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Start scrub (async) | ||
| log_must zpool scrub -t $TESTPOOL | ||
|
|
||
| # Unload key while scrub is running | ||
| log_must zfs unload-key $TESTPOOL/comp_enc_unloading |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Start scrub (async) | ||
| log_must zpool scrub -t $TESTPOOL | ||
|
|
||
| # Unload key while scrub is running | ||
| log_must zfs unload-key $TESTPOOL/comp_enc_unloading | ||
|
|
||
| # Wait for scrub | ||
| log_must zpool wait -t scrub $TESTPOOL | ||
|
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
module/zfs/dsl_scan.c:5006
- For thorough scrub,
scan_exec_io()allocatesdatasized byBP_GET_LSIZE(bp)whenZIO_FLAG_RAWis not set, but the in-flight throttling (spa_scrub_inflight/q_inflight_bytes) is still accounting withBP_GET_PSIZE(bp). On highly-compressed blocks this can significantly under-account memory and allow much more RAM to be pinned thanscn_maxinflight_bytesintends. Consider tracking the actual allocated buffer size (e.g.,sizeorzio->io_size) in the in-flight counters and subtracting the same value indsl_scan_scrub_done().
size_t size = (zio_flags & ZIO_FLAG_RAW) ?
BP_GET_PSIZE(bp) : BP_GET_LSIZE(bp);
abd_t *data = abd_alloc_for_io(size, B_FALSE);
zio_t *pio;
if (queue == NULL) {
ASSERT3U(scn->scn_maxinflight_bytes, >, 0);
mutex_enter(&spa->spa_scrub_lock);
while (spa->spa_scrub_inflight >= scn->scn_maxinflight_bytes)
cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
spa->spa_scrub_inflight += BP_GET_PSIZE(bp);
mutex_exit(&spa->spa_scrub_lock);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
took a while to get the tests and manpage to reflect reality but I think this is ready for review now. Looks like the manpage currently in tree is wrong as it describes cli arg combinations that are not allowed. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
module/zfs/dsl_scan.c:5006
- In scan_exec_io(), the I/O buffer size is now BP_GET_LSIZE() for non-RAW (thorough) scrubs, but the in-flight throttling counters (spa_scrub_inflight / q_inflight_bytes) are still incremented/decremented using BP_GET_PSIZE(). For highly-compressible blocks this can allow far more memory to be allocated than the throttling intends, potentially causing large RAM spikes or OOM during thorough scrubs. Consider accounting using the allocated buffer size (or a separate memory-based counter/limit) so throttling matches actual memory consumption for non-RAW scrubs.
size_t size = (zio_flags & ZIO_FLAG_RAW) ?
BP_GET_PSIZE(bp) : BP_GET_LSIZE(bp);
abd_t *data = abd_alloc_for_io(size, B_FALSE);
zio_t *pio;
if (queue == NULL) {
ASSERT3U(scn->scn_maxinflight_bytes, >, 0);
mutex_enter(&spa->spa_scrub_lock);
while (spa->spa_scrub_inflight >= scn->scn_maxinflight_bytes)
cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
spa->spa_scrub_inflight += BP_GET_PSIZE(bp);
mutex_exit(&spa->spa_scrub_lock);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Start scrub (async) | ||
| log_must zpool scrub -t $TESTPOOL | ||
| log_mustnot zpool scrub $TESTPOOL | ||
|
|
||
| # Unload key while scrub is running | ||
| log_must zfs unload-key $TESTPOOL/comp_enc_unloading | ||
|
|
||
| # Wait for scrub | ||
| log_must zpool wait -t scrub $TESTPOOL |
| nvlist_t *required = fnvlist_alloc(); | ||
| fnvlist_add_uint64(required, "scan_type", POOL_SCAN_FUNCS + 1); | ||
| fnvlist_add_uint64(required, "scan_command", POOL_SCRUB_FLAGS_END + 1); | ||
| fnvlist_add_uint64(required, "scan_command", 0xDEADBEEF); | ||
| IOC_INPUT_TEST(ZFS_IOC_POOL_SCRUB, pool, required, NULL, EINVAL); |
| typeset -a args=("-?" "blah blah" "-%" "--?" "-*" "-=" \ | ||
| "-b" "-c" "-d" "-f" "-g" "-h" "-i" "-j" "-k" "-l" \ | ||
| "-m" "-n" "-o" "-q" "-r" "-u" "-v" "-x" "-y" "-z" \ | ||
| "-A" "-B" "-D" "-E" "-F" "-G" "-H" "-I" "-J" "-K" "-L" \ | ||
| "-M" "-N" "-O" "-P" "-Q" "-R" "-S" "-T" "-U" "-V" "-W" "-X" "-W" "-Z") | ||
|
|
||
|
|
||
| log_assert "Execute 'zpool scrub' using invalid parameters." | ||
|
|
||
| typeset -i i=0 | ||
| while [[ $i -lt ${#args[*]} ]]; do | ||
| log_mustnot zpool scrub ${args[i]} | ||
|
|
||
| ((i = i + 1)) | ||
| for arg in "${args[@]}"; do | ||
| log_mustnot zpool scrub "$arg" "$TESTPOOL" | ||
| done |
980622e to
0ff3555
Compare
| { | ||
| nvlist_t *required = fnvlist_alloc(); | ||
| fnvlist_add_uint64(required, "scan_type", POOL_SCAN_FUNCS + 1); | ||
| fnvlist_add_uint64(required, "scan_command", POOL_SCRUB_FLAGS_END + 1); | ||
| fnvlist_add_uint64(required, "scan_command", 0xDEADBEEF); | ||
| IOC_INPUT_TEST(ZFS_IOC_POOL_SCRUB, pool, required, NULL, EINVAL); | ||
| nvlist_free(required); | ||
| } |
This introduces the -t (thorough) flag to 'zpool scrub' command. A thorough scrub decrypts and decompresses blocks as they are read, allowing ZFS to catch rare corruption scenarios where the block's checksum matches the data on disk, but the block fails to decrypt or decompress. For encrypted datasets, the keys must be loaded to perform a thorough scrub. If the keys are not loaded, or are unloaded while the scrub is in progress, the scrub will fall back to a normal scrub for those encrypted blocks with key unloaded. Signed-off-by: Alek Pinchuk <Alek.Pinchuk@connectwise.com>
| resumes as a normal scrub. | ||
| Specify | ||
| .Fl t | ||
| again to resume thorough scrubbing. |
There was a problem hiding this comment.
Why require the -t to resume this as as thorough scrub? It seems like that should be the default behavior, similar to how a errlog, last_txg, or date range scrub pick up where they left off.
There was a problem hiding this comment.
i wanted to preserve the ability to switch back and forth between thorough and normal scrub without reseting scrub progress which with the current code is achived by choosing which scrub you want when resuming
There was a problem hiding this comment.
Okay, that's what I was guessing. From a testing standing point I get it, but from a user perspective it's not what I'd expect. If you're doing a thorough scrub, then it's probably because you have some reason to suspect there's some subtle damage/corruption. In which case, I'd like some assurance that when a thorough scrub is started it will check the whole pool, or date range, even if it was paused and restarted. Operationally, accidentally converting a paused thorough scrub to a normal scrub would be an easy mistake to make. I'd prefer to change this behavior. DSF_SCRUB_THOROUGH is persisted in the scn_flags so the information is there on disk.
| if (pause == 0) { | ||
| (void) printf(gettext("scrub in progress since %s"), | ||
| (void) printf(gettext("%sscrub in progress since %s"), | ||
| is_thorough ? "thorough " : "", |
There was a problem hiding this comment.
Keeping gettext() for the main phrase, but not for "thorough " is odd, unless it is an international term.
| uint64_t sio_birth; | ||
| uint64_t sio_salt; | ||
| uint64_t sio_iv1; | ||
| uint64_t sio_iv2; |
There was a problem hiding this comment.
It seems iv2 is 32bit in block pointers.
|
|
||
| int | ||
| spa_scan(spa_t *spa, pool_scan_func_t func) | ||
| spa_scan(spa_t *spa, pool_scan_func_t func, uint64_t flags) |
There was a problem hiding this comment.
I wonder if the argument should be pool_scrub_cmd_t, not abstract flags?
This introduces the -t (thorough) flag to 'zpool scrub' command. A thorough scrub decrypts and decompresses blocks as they are read, allowing ZFS to catch rare corruption scenarios where the block's checksum matches the data on disk, but the block fails to decrypt or decompress.
For encrypted datasets, the keys must be loaded to perform a thorough scrub. If the keys are not loaded, or are unloaded while the scrub is in progress, the scrub will fall back to a normal scrub for those encrypted blocks with key unloaded.
Motivation and Context
See #17630 for original motivation and context. This is an enhanced version of what was presented in that PR. Specifically this version adds support for decryption during a thorough scrub, uses the existing Z/O pipeline through unsetting the
ZIO_FLAG_RAWflag, and wires in the functionality to be accessed with a zpool scrub cli flag (-t) instead of a sysctl.Description
This PR implements thorough scrub support, which can be invoked using
zpool scrub -t.Even though this patch adds RAM overhead, in practive I didn't see any meaninful impact on RAM usage as the existing
zfs_scan_mem_lim_*tunables still apply to throttle and control memory usage.For CPU overhead I've seen about a 10% increase in average CPU usage during a through scrub on a compressed pool vs a normal scrub on the same pool.
Key features and implementation details:
ZIO_FLAG_RAWflag from the scrub ZIOs when theDSF_SCRUB_THOROUGHflag is set.scan_io_tstruct, which now includessio_salt,sio_iv1, andsio_iv2fields (an additional 24 bytes per IO) needed to decrypt blocks when required.scan_exec_io()must allocate data buffers based on the block's logical size (BP_GET_LSIZE) rather than its compressed physical size (BP_GET_PSIZE). This increases memory consumption for in-flight I/O compared to a normal scrub.EACCESerrors returned during decryption and MAC verification (since the block's checksum was successfully verified beforehand).zpool statusand thepool_scan_stat_tstruct have been updated to track and correctly report when a "thorough scrub" is in progress, paused, completed, or canceled. This uses an extra word in the scan stats (pss_pass_scrub_flags) to persist the thorough scrub state.How Has This Been Tested?
zfs-tests including the added through-scrub-specfic test
some light performance testing
Types of changes
Checklist:
Signed-off-by.