dsl_scan: detect 'unencrypted block in encrypted objset' during scrub#18587
dsl_scan: detect 'unencrypted block in encrypted objset' during scrub#18587jkolo wants to merge 1 commit into
Conversation
Add a BP-level check in dsl_scan_visitbp(): when a dataset is encrypted yet a block pointer lacks BP_USES_CRYPT, count it as a scan error, push the offending bookmark into the SPA error log (so it surfaces in 'zpool status -v' as a permanent error against the affected file), and emit a zfs_dbgmsg with the exact bookmark coordinates for offline correlation with zdb output. The check covers BOTH leaf blocks (level 0, file data) AND indirect blocks (level > 0). Both must have BP_USES_CRYPT set on an encrypted dataset — leaf BPs store the data MAC in blk_cksum, while indirect BPs store a MAC-of-MAC computed over their children's MACs (see zio_crypt_do_indirect_mac_checksum_abd in zio_crypt.c). Both forms are corrupted equally when the flag is lost during write. Embedded BPs are explicitly skipped — they inline data and never carry the CRYPT flag. This gives users an actionable per-file list for the bug class discussed in openzfs#14330, openzfs#15275, openzfs#16065, openzfs#14709, openzfs#18186 — historically the only visible symptoms were recurring kernel panics in zfs_mknode and intermittent EIOs from dbuf_read. PR openzfs#15677 converted that read-path panic to a logged error; this commit adds the symmetric visibility during scrub so users do not have to wait for a panic to identify affected files. Detection only — no repair attempted. Users who hit it should delete the affected file and restore from backup, or recreate the pool without ZFS native encryption (LUKS underneath still provides data-at-rest). Refs: openzfs#14330 openzfs#15275 openzfs#16065 openzfs#14709 openzfs#18186 ZFS-CI-Type: full Signed-off-by: Jerzy Kołosowski <jerzy@kolosowscy.pl>
835d520 to
6f6dfa3
Compare
|
Updated to extend detection to indirect blocks (level > 0), not just leaf blocks. The same "unencrypted block in encrypted object set" failure class applies equally to indirect BPs (which store a MAC-of-MAC over children's MACs) — the encryption flag is required at every level on an encrypted dataset. Embedded BPs remain explicitly skipped (they inline data and don't carry the flag). Commit was amended in place rather than added as a separate commit since the original detection logic was only one filter line. |
Add ZINJECT_LOSE_CRYPT_FLAG injection that strips BP_USES_CRYPT from a BP after the encryption pipeline has finished encoding salt/IV/MAC. The on-disk block is genuinely encrypted but its BP reports as unencrypted - the corruption shape detected by the scrub check added in openzfs#18587 and targeted by the upcoming Mode 1.5 / Mode 3 repair commits. Without an in-tree way to reach this state there is no deterministic ZTS reproducer; without a reproducer the repair code cannot be verified. This commit unblocks both. Modeled on the no-op injection added in openzfs#16085. Mechanics: - new zinject_type_t enum value ZINJECT_LOSE_CRYPT_FLAG - kernel handler zio_handle_lose_crypt_flag_injection() reuses the existing zio_match_handler bookmark matching, so injections target a specific (objset, object, level, blkid) - both leaf (level 0) and indirect (level > 0) BPs are reachable - zio_encrypt() calls a static helper before each successful return so all encryption paths (raw, indirect, objset, MAC-only, normal encrypt) honour the injection - userspace exposes 'zinject -e lose-crypt-flag' via a private sentinel in errstrtable[] (not a real errno, mapped to the new zi_cmd) The feature flag added by the previous scaffolding commit is renamed: SPA_FEATURE_ENCRYPTED_BP_REPAIR -> SPA_FEATURE_CRYPT_REPAIR org.openzfs:encrypted_bp_repair -> org.openzfs:crypt_repair Shorter, matches the naming style of block_cloning / head_errlog / redaction_bookmarks. No on-disk state exists yet for the feature, so the rename is internal. Refs: openzfs#14330 openzfs#15275 openzfs#15464 openzfs#16065 openzfs#18587 ZFS-CI-Type: full Signed-off-by: Jerzy Kołosowski <jerzy@kolosowscy.pl>
Extend the scrub detection in dsl_scan_visitbp() so that when the new
--repair-crypt-mismatches flag is set, every detected mismatch is
classified as either a candidate for in-kernel repair (Mode 1.5
smart-flip or Mode 3 free-fallback) or skipped with a logged reason.
The classification is the entry point for the repair pipeline added
in follow-up commits. By landing it as report-only first, the guard
logic can be observed (via zfs_dbgmsg) against real and zinjected
corruption before any block mutation goes in.
Eligibility guards encoded in dsl_scan_crypt_repair_eligible():
- ds == NULL - MOS object, no dataset wrapper
- ds->ds_is_snapshot - parent BP is immutable
- DMU_OBJECT_IS_SPECIAL - dn_dbuf can be NULL
- zb_level == ZB_ZIL_LEVEL - ZIL semantics, replay-once
- BP_GET_TYPE == DMU_OT_OBJSET - uses spa_do_crypt_objset_mac_abd,
separate path
- BP_IS_GANG - multi-BP, complex
- BP_GET_DEDUP - DDT key includes CRYPT bit
- dsl_dataset_get_keystatus - need key loaded for MAC verify
Rationale for each guard is documented in
plans/mode15-23-research.md (sections 1.4 and 7).
No behaviour change unless DSF_REPAIR_CRYPT_MISMATCHES is set.
Refs: openzfs#14330 openzfs#15275 openzfs#15464 openzfs#16065 openzfs#18587
ZFS-CI-Type: full
Signed-off-by: Jerzy Kołosowski <jerzy@kolosowscy.pl>
When the new --repair-crypt-mismatches flag is set and a leaf BP passes the eligibility guards added in the previous commit, decrypt the on-disk bytes with the dataset key and check whether the MAC stored in blk_cksum HIGH words verifies. The result classifies the BP as either a Mode 1.5 candidate (MAC matches, lossless smart-flip will work) or a Mode 3 candidate (MAC fails, only destructive free can recover). This commit only LOGS the classification via zfs_dbgmsg. The atomic parent BP restore (Mode 1.5 dispatch) and bptree-based free (Mode 3 dispatch) land in follow-up commits so the read-only probe can be exercised against zinjected and real corruption before any block metadata is mutated. Implementation choice (read-only via arc_read with a stack-local BP that has BP_USES_CRYPT forced TRUE): The original BP has BP_USES_CRYPT clear, which would put arc_read on the plaintext path and trigger a checksum verify against blk_cksum - but blk_cksum holds the MAC, not a checksum, so that verify would fail. We pass arc_read a stack copy of the BP with the CRYPT bit forced TRUE instead. The encrypted-BP path skips the normal checksum verify, ZIO_FLAG_RAW skips the decrypt pipeline, and ZIO_FLAG_SPECULATIVE keeps IO/cksum errors as return values instead of ereports. An earlier sketch tried zio_read_phys per DVA, but that primitive asserts vd->vdev_children == 0 and so cannot route through a top-level RAIDZ/mirror vdev without leaf-level iteration that also has to re-implement the vdev's internal redundancy. The arc_read-with-modified-BP approach works on any topology. The CRYPT bit is only flipped on a stack copy; the on-disk BP is never modified by this probe. spa_do_crypt_abd operates on PSIZE (the post-compression on-disk ciphertext size), not LSIZE; the transform stack does decrypt-then- decompress in zio_read_bp_init, so the MAC is computed over PSIZE bytes. See plans/mode15-23-research.md section 1.1 for details. Indirect (level > 0) verification uses a different MAC-of-MAC path and is not yet wired up; the dbgmsg notes "Mode 1.5 MAC-of-MAC verify pending" for indirect candidates. No behaviour change unless DSF_REPAIR_CRYPT_MISMATCHES is set. Refs: openzfs#14330 openzfs#15275 openzfs#15464 openzfs#16065 openzfs#18587 ZFS-CI-Type: full Signed-off-by: Jerzy Kołosowski <jerzy@kolosowscy.pl>
Extend the read-only repair-classification probe added in the
previous commit to indirect (level > 0) block pointers.
Indirect BPs on encrypted datasets carry a HMAC-SHA512 in
blk_cksum HIGH words computed over their children's MACs (the
"MAC of MACs"). Verifying that this MAC-of-MAC matches confirms
the on-disk indirect block is the genuine encrypted-dataset
indirect that lost only its CRYPT flag.
Differences from the leaf path (dsl_scan_crypt_repair_verify_leaf):
- The MAC is computed over the indirect's plaintext layout
(the BP array). We must decompress before passing it to
zio_crypt_do_indirect_mac_checksum_abd(). The compressed
case mirrors the pattern in zio_decrypt() at zio.c:599-622.
- No dataset key is required for MAC-of-MAC verification (the
MAC is a HMAC over the children's MACs, not a GCM tag), so
we still use the arc_read-with-modified-BP trick to bypass
the normal checksum verify but do not call spa_do_crypt_abd.
- Mode 3 (free fallback) is intentionally disabled for indirect
blocks - freeing an indirect would orphan its entire subtree.
A failed MAC-of-MAC verify is logged as unrepairable for
operator follow-up rather than queued for destructive repair.
The classification + verify dispatch grew to a depth that hit
cstyle line-length warnings, so the detection block now calls a
new helper dsl_scan_classify_crypt_mismatch() with the eligibility
check, leaf/indirect verify, and zfs_dbgmsg classification all
contained at function scope.
No behaviour change unless DSF_REPAIR_CRYPT_MISMATCHES is set.
Refs: openzfs#14330 openzfs#15275 openzfs#15464 openzfs#16065 openzfs#18587
ZFS-CI-Type: full
Signed-off-by: Jerzy Kołosowski <jerzy@kolosowscy.pl>
After MAC (leaf) or MAC-of-MAC (indirect) verification confirms the
on-disk block is genuinely encrypted with the dataset key, walk back
to the parent dnode or indirect block, take the correct lock, mark
the parent dirty in the current scrub txg, and set BP_USES_CRYPT on
the in-memory parent BP slot. The change persists at the next
spa_sync() pass; subsequent reads go through the normal decrypt
pipeline and succeed.
Three parent-location cases handled per
plans/mode15-23-research.md sec. 1.3:
A. spill block (zb_blkid == DMU_SPILL_BLKID)
- slot is DN_SPILL_BLKPTR(dn->dn_phys)
B. top of the object tree (zb_level + 1 >= dn_nlevels)
- slot is dn->dn_phys->dn_blkptr[zb_blkid]
C. deeper levels
- parent indirect fetched via dbuf_hold_impl()
- slot is (blkptr_t *)db->db.db_data + (blkid & (epb - 1))
Lock ordering mirrors dnode_increase_indirection() in dnode_sync.c:
dn->dn_struct_rwlock READER outside, parent's db_rwlock WRITER
inside, dmu_buf_will_dirty() under the writer, releases in reverse
order. DN_SPILL_BLKID and Case B share dn->dn_dbuf as the parent
buffer; Case C uses the indirect dbuf returned by dbuf_hold_impl.
Defensive dn->dn_dbuf NULL check on top of the eligibility guard
that already rejects DMU_OBJECT_IS_SPECIAL objects (where dn_dbuf
can be NULL). The eligibility guards also keep snapshot datasets,
ZIL blocks, objset blocks, gang/dedup BPs, and key-unavailable
datasets out of this path.
Behaviour:
- Detection-only mode (no --repair-crypt-mismatches): unchanged.
- --repair-crypt-mismatches set, MAC verifies: parent BP CRYPT
bit restored, persisted at next sync. File becomes readable.
- --repair-crypt-mismatches set, MAC fails on leaf: logged as
Mode 3 candidate (dispatch in follow-up commit).
- --repair-crypt-mismatches set, MAC fails on indirect: logged
as unrepairable (auto-free would orphan the subtree).
Refs: openzfs#14330 openzfs#15275 openzfs#15464 openzfs#16065 openzfs#18587
ZFS-CI-Type: full
Signed-off-by: Jerzy Kołosowski <jerzy@kolosowscy.pl>
When MAC verification fails for a leaf BP - the on-disk bytes are
not legitimate ciphertext under the dataset key, so neither Mode 1.5
smart-flip nor a future in-kernel re-encrypt can recover the data -
and the user opts in with --free-crypt-fallback, punch a hole in
the parent BP slot and free the old block via
dsl_dataset_block_kill().
The file accumulates a hole (zeros) at the affected offset;
everything else remains accessible. This is the destructive
alternative to permanent EIO on every read - opt-in only.
Restrictions:
- Leaf only. Indirect blocks are never auto-freed; freeing one
would orphan the entire subtree below.
- Requires --free-crypt-fallback explicitly. With only
--repair-crypt-mismatches set, an unrepairable leaf is logged
and left in zpool status -v for the operator to handle.
Implementation:
- Same atomic parent-slot lookup as Mode 1.5 (three cases:
spill block, top of object tree, deeper indirect via
dbuf_hold_impl).
- BP_SET_HOLE() on the in-memory parent slot.
- dsl_dataset_dirty(ds, tx) before dsl_dataset_block_kill() so
block_kill's "ASSERT(dmu_buf_is_dirty(ds_dbuf, tx))" holds in
scrub context, where the dataset may not have been touched by
user writes this txg.
- dsl_dataset_block_kill(..., B_TRUE) async path handles
accounting plus livelist / BRT / DDT defer automatically.
Refs: openzfs#14330 openzfs#15275 openzfs#15464 openzfs#16065 openzfs#18587
ZFS-CI-Type: full
Signed-off-by: Jerzy Kołosowski <jerzy@kolosowscy.pl>
zpool-scrub(8) gets two new option entries describing the new flags
introduced by the BP_USES_CRYPT repair pipeline. zpoolconcepts(7)
gets a new "Encryption Flag Mismatches Repair" subsection summarising
the bug class, the three repair dispositions (Mode 1.5 smart flip,
Mode 3 hole+free, log-only for unrepairable indirects), the
eligibility skip list, and the existing user-space alternative
('zfs send -U | zfs recv' onto a fresh pool).
Both pages validate cleanly via 'man --warnings -Z'.
No code changes.
Refs: openzfs#14330 openzfs#15275 openzfs#15464 openzfs#16065 openzfs#18587
Signed-off-by: Jerzy Kołosowski <jerzy@kolosowscy.pl>
…path
The Faza 1 scaffolding only added the CLI options to zpool scrub; the
flags never reached the kernel, so the repair logic in dsl_scan
never executed and the warning message in zpool_main.c told the user
"the repair logic is not yet wired through the ioctl path; this
invocation will run a normal scrub."
Wire them through:
cmd/zpool/zpool_main.c
- scrub_callback() translates the cb_repair_crypt_mismatches and
cb_free_crypt_fallback booleans into DSF_REPAIR_CRYPT_MISMATCHES
/ DSF_FREE_CRYPT_FALLBACK bits and calls the new
zpool_scan_range_with_flags() instead of zpool_scan_range().
- Drop the "not yet wired" warning; the path is wired now.
- #include <sys/dsl_scan.h> for the DSF_* bits.
lib/libzfs/libzfs_pool.c, include/libzfs.h
- Add zpool_scan_range_with_flags() which appends an optional
"scan_repair_flags" uint64 to the ZFS_IOC_POOL_SCRUB nvlist.
The existing zpool_scan_range() becomes a thin wrapper for ABI
stability.
module/zfs/zfs_ioctl.c
- zfs_keys_pool_scrub[] gains an optional "scan_repair_flags"
key; zfs_ioc_pool_scrub() reads it and passes it through to
spa_scan_with_flags() on both the SCRUB_FROM_LAST_TXG path and
the normal start path.
module/zfs/spa.c, include/sys/spa.h
- Add spa_scan_with_flags() that forwards the flags down; the
existing spa_scan_range() is a thin wrapper. EXPORT_SYMBOL for
the new entry point.
module/zfs/dsl_scan.c, include/sys/dsl_scan.h
- Add dsl_scan_with_flags(); the existing dsl_scan() is a thin
wrapper.
- setup_sync_arg_t gains a scan_flags field, masked by
DSL_SCAN_FLAGS_MASK on entry so only the repair-related bits
can be set from userspace.
- dsl_scan_setup_sync() ORs setup_sync_arg->scan_flags into
scn_phys.scn_flags right after the memset(), so the repair
bits survive the rest of the setup and dsl_scan_visitbp() sees
them.
End-to-end now: 'zpool scrub --repair-crypt-mismatches POOL' sets
the bit in scn_phys.scn_flags and the existing Mode 1.5 / Mode 3
dispatch logic in dsl_scan.c runs for every detected mismatch.
Refs: openzfs#14330 openzfs#15275 openzfs#15464 openzfs#16065 openzfs#18587
ZFS-CI-Type: full
Signed-off-by: Jerzy Kołosowski <jerzy@kolosowscy.pl>
|
@jkolo thanks for investigating this issue and opening the PR. Even if it's still a bit of mystery exact how this could happen we definitely want to be able to detect and fix it.
We will need test coverage for this change and extending zinject as you've described I think the right approach. If you can include that functionality in this PR along with a new test case that'll make it easier to verify everything is working as intended.
Given how rare we believe this issue to be I don't think we want the additional baggage of a dedicated error counter. I agree, using the existing
I'm fine with leaving the repair process for a follow up PR. The main thing is making sure we're able to detect and report this. |
| /* | ||
| * Detect "unencrypted block in encrypted object set" corruption. | ||
| * On an encrypted dataset every BP — leaf data block or indirect — | ||
| * must have BP_USES_CRYPT set. The read path in dbuf.c converts |
There was a problem hiding this comment.
Let's mention the read path will also add these blocks to the error log.
Motivation and Context
The "unencrypted block in encrypted object set" failure class on
natively-encrypted datasets has been reported continuously from ZFS
2.0.x through 2.4.x across several open issues:
(closed by OpenZFS 2.2.0 copy_file_range(2) from unencrypted to encrypted panic #15464/Fix block cloning between unencrypted and encrypted datasets #15465 backport for 2.2.1, but new reports keep
arriving on 2.3.x — see e.g. the 2024-11 comment about Debian 12 +
rsync that re-opened the discussion)
(Ubuntu 24.04 / ZFS 2.2; still reproducing on 2.3.1 per the
2025-05-30 comment from @darkpixel)
fix that closed the original
zfs_clone_range()vector but didnot eliminate every path that produces this corruption shape)
PR #15677 (Dec 2023) converted the read-path panic in
dbuf.cto alogged error. Users hitting the bug today still cannot easily identify
which file is affected: the symptom is either a kernel panic in
zfs_mknode(CREATE path, stillVERIFY0(sa_buf_hold)) or anintermittent EIO from
dbuf_readwith no per-file attribution.Description
Add a BP check in
dsl_scan_visitbp(): when a dataset is encryptedyet a block pointer lacks
BP_USES_CRYPT, count it as a scan error,push the offending bookmark into the SPA error log (so it surfaces
in
zpool status -vas a permanent error against the affected file),and emit a
zfs_dbgmsgwith exact bookmark coordinates (includingthe BP level) for offline
zdbcorrelation.The check covers both leaf blocks (level 0, file data) and indirect
blocks (level > 0, MAC-of-MAC over children). Embedded BPs are
explicitly skipped — they inline data into the BP itself and never
carry the CRYPT flag by design. Hole BPs are skipped earlier in
dsl_scan_visitbp().Detection only — no repair. Users who hit it should delete the
affected file and restore from backup, recreate the pool without
ZFS native encryption (LUKS underneath still provides data-at-rest
protection on most affected deployments), or use the existing
zfs send -U+zfs recvrecovery flow (PR #18240).Relationship to PR #18474 (thorough scrub)
PR #18474 (alek-p, in review) adds a "thorough scrub" mode that
decrypts and decompresses every block during a scrub, catching
corruption where the on-disk checksum is valid but decryption or
decompression fails on blocks that carry
BP_IS_ENCRYPTED.The two efforts are complementary, not overlapping:
BP_IS_ENCRYPTEDset, data fails to decrypt/decompressBP_USES_CRYPTmissing on encrypted dataset (decrypt path never enters)#18474 walks past lost-CRYPT-flag blocks because they never enter
the decrypt path; #18587 walks past decrypt-failures because the BP
looks intact. Running both gives full coverage of the
encrypted-block-corruption space surfaced in the issues above.
How Has This Been Tested?
current branch (after rebase onto master).
zfs_dbgmsgper visited BP,removed in final commit) verified that the new check filter
correctly classifies BPs across both leaf and indirect levels:
os_encrypted=1objsets: every visited non-embedded BPwith
BP_USES_CRYPTclear was captured by the new branchos_encrypted=0objsets: zero false matchescheck targets. The state is reported by users on long-running
pools (often after
zfs receiveof encrypted streams onto afresh pool, or hardware-induced corruption per PANIC: unencrypted block in encrypted object set #16065). The
detection logic was verified by examining every BP visited
during scrubs of synthetic encrypted pools — the conditional
filter behaves as designed.
make checkstylepasses for the modified file.(no regression on synthetic pools that do not contain the
target corruption).
Types of changes
Checklist:
Signed-off-by.Open questions for reviewers
Test coverage. No existing reproducer in-tree creates the
exact "encrypted BP without
BP_USES_CRYPT" state. Usersreport it in production but I was unable to construct a
minimal reproducer in a VM. A follow-up zinject extension
(e.g.
ZINJECT_LOSE_CRYPT_FLAG, modeled on the no-op injectiontype added in zinject: "no-op" error injection #16085) would unblock a deterministic ZTS test;
happy to submit that as a prerequisite if reviewers prefer.
Should
zpool statussurface a count ofencryption flag mismatchesseparately from the standard "permanent errors"counter? Current change reuses the existing
spa_log_errorinfrastructure so the error appears in the standard list; a
dedicated counter would require a new event subtype which
feels out of scope for a detection-only patch.
Repair path. I prototyped three repair modes (smart flip
when MAC validates, re-encrypt rewrite, free fallback) but
pulled them out of this PR because none were verified against
a real reproducer. Happy to follow up with them in a separate
PR once detection is in place and we have observability into
how often this fires in practice.