Persist z_seq across znode eviction #18573
Open
ixhamza wants to merge 1 commit into
Force-pushed from 107a3df to eeb4661
This was referenced May 26, 2026
14 tasks
amotin
reviewed
May 28, 2026
Force-pushed from eeb4661 to 56245e3
Force-pushed from 56245e3 to 3b204af
behlendorf
reviewed
Jun 2, 2026
 */
#define ZFS_PERSIST_SEQ(zp, bulk, count) \
{ \
	if ((zp)->z_is_sa) { \
Contributor
I don't feel strongly about it, but my feeling is this would probably be more readable without the macro. I also wonder whether the if ((zp)->z_is_sa) check is really needed. I didn't audit every call site, but in general it looks like it would be straightforward to add this only to the SA code paths.
Member
Author
Without the guard, sa_bulk_update panics on non-SA znodes. SA_ZPL_PROJID has the same constraint but is gated via if (projid != ZFS_INVALID_PROJID). z_seq is always set, so we check z_is_sa directly.
    &ctime, 16);
SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_SIZE(outzfsvfs), NULL,
    &outzp->z_size, 8);
if (outzp->z_is_sa)
Contributor
This must be true, you shouldn't need the conditional here.
Commit 312bdab advertises STATX_ATTR_CHANGE_MONOTONIC and builds the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq. zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any event that drops the znode from cache (memory pressure, remount, reboot) regresses the lower bits of the cookie, a backward step within the same second. NFSv4 clients that trust this contract treat a regressed cookie as evidence that the file's metadata cannot be relied on. VMware ESXi over NFSv4.1 surfaces this as "The file specified is not a virtual disk", and a VM stored on the affected NFS-exported ZFS dataset fails to power on.

Widen z_seq to 64 bits and present it directly as the change_cookie, dropping the ctime packing, so the cookie is a single monotonic counter that no longer depends on the clock. FreeBSD's va_filerev consumer also takes the wider value.

Persist z_seq via a new SA attribute SA_ZPL_SEQ. An in-core marker zp->z_has_seq records whether the file already carries SA_ZPL_SEQ in its layout; it is derived at load time and never stored on disk, so no global pflag bit is consumed. ZFS_SEQ_MAY_GROW() keys off the marker to grow the SA layout only on the first add per file; ZFS_PERSIST_SEQ() then sets the marker and adds SEQ to the caller's bulk alongside the file's other SA attributes. zfs_znode_alloc() restores z_seq from SA_ZPL_SEQ when present and sets the marker; zfs_rezget() recomputes the marker in place on rollback/recv without disturbing the in-core z_seq, keeping the cookie monotonic.

A file written before this change carries no SA_ZPL_SEQ; on Linux it is seeded with (ctime.tv_sec + 1) << 32 so the counter starts above any pre-change cookie and stays monotonic across the upgrade. A missing attribute is simply treated as not-yet-migrated, not an error. FreeBSD never folded ctime into va_filerev, so it needs no seed.

No feature flag or on-disk format change is needed: the new SA attribute is keyed by name, so an implementation that does not know it preserves it opaquely, and the first modify lazily migrates each file. Covers both the Linux and FreeBSD ZPL.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
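The seeding arithmetic in the commit message can be checked in isolation. The sketch below is not code from the patch: the `toy_` helper names are made up for illustration, and it assumes the pre-change Linux cookie was exactly `(ctime.tv_sec << 32) | z_seq` with a 32-bit `z_seq`, as the message describes. Under that assumption the seed is one more than the largest cookie the old packing could produce for the same ctime.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-change Linux cookie as described: ctime seconds above, z_seq below. */
static inline uint64_t
toy_legacy_cookie(uint64_t ctime_sec, uint32_t z_seq)
{
	return ((ctime_sec << 32) | z_seq);
}

/* Seed for a file that carries no SA_ZPL_SEQ yet (hypothetical helper). */
static inline uint64_t
toy_seed_seq(uint64_t ctime_sec)
{
	return ((ctime_sec + 1) << 32);
}

/* Self-check: the seed sits just above every legacy cookie for that ctime. */
static inline void
toy_seed_demo(void)
{
	uint64_t sec = 1700000000;
	/* Largest cookie the old packing could have produced at this ctime. */
	uint64_t worst = toy_legacy_cookie(sec, UINT32_MAX);

	assert(toy_seed_seq(sec) == worst + 1);
	assert(toy_seed_seq(sec) > toy_legacy_cookie(sec, 0));
}
```

Calling `toy_seed_demo()` from a test harness exercises both bounds; the comparison only covers cookies issued for the same `ctime.tv_sec`, which is the case the commit message is concerned with.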
Force-pushed from 3b204af to 95dc629
Motivation and Context
Commit 312bdab advertises STATX_ATTR_CHANGE_MONOTONIC to knfsd and builds the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq. zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any event that drops the znode from cache (memory pressure, remount, reboot) brings the file back with the same ctime.tv_sec upper bits but a smaller z_seq in the lower bits, regressing the cookie within the same second.

NFSv4 clients that trust the monotonicity contract treat this as metadata they cannot rely on. VMware ESXi over NFSv4.1 reliably reproduces it with "The file specified is not a virtual disk", causing a VM stored on the affected ZFS dataset to fail to power on.
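The regression described above can be reproduced with a few lines of standalone C. This is a toy model, not ZFS code: `TOY_SEQ_RESET` is a placeholder for the magic constant (its real value is not stated here), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: placeholder for the "magic constant" that z_seq is reset to;
 * the description does not give its value.
 */
#define	TOY_SEQ_RESET	0x1000u

/* Pre-change packing: ctime seconds in the upper 32 bits, z_seq below. */
static inline uint64_t
toy_packed_cookie(uint64_t ctime_sec, uint32_t z_seq)
{
	return ((ctime_sec << 32) | z_seq);
}

/*
 * Advance z_seq by `bumps`, then simulate eviction and reload within the
 * same second. Returns 1 if the reloaded cookie is smaller than the last
 * one a client could have observed.
 */
static inline int
toy_cookie_regresses(uint64_t ctime_sec, uint32_t bumps)
{
	uint32_t z_seq = TOY_SEQ_RESET + bumps;
	uint64_t before = toy_packed_cookie(ctime_sec, z_seq);

	z_seq = TOY_SEQ_RESET;		/* znode evicted and reallocated */
	uint64_t after = toy_packed_cookie(ctime_sec, z_seq);

	return (after < before);
}

/* Self-check: any bump before eviction makes the cookie step backward. */
static inline void
toy_regression_demo(void)
{
	assert(toy_cookie_regresses(1700000000, 3) == 1);
	assert(toy_cookie_regresses(1700000000, 0) == 0);
}
```

In this model a single bump before eviction is enough for the reloaded cookie to compare lower, which matches the "same second" failure described above.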
Description
Persist zp->z_seq via a new SA attribute SA_ZPL_SEQ so it survives znode eviction. A new pflag bit ZFS_HAS_SEQ marks the file as carrying SA_ZPL_SEQ in its layout, mirroring the existing ZPL_PROJID/ZFS_PROJID pattern. The bit gates may_grow at SA tx-hold sites, choosing B_TRUE on the first add per file and B_FALSE thereafter, so steady-state operations pay no extra reservation.

A ZFS_PERSIST_SEQ() macro captures z_seq and sets the bit into the caller's bulk in one step, persisting both atomically alongside the file's other SA attributes. Every site that bumps z_seq uses it. zfs_znode_alloc() restores z_seq from SA_ZPL_SEQ when the bit is set.

No on-disk format change requiring a feature flag is needed. Older binaries preserve the new attribute and bit opaquely. The first modify by a patched binary lazily migrates each file.
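The gating and restore behaviour described here can be modelled with a small standalone struct. This is a sketch only: `struct toy_znode`, the `TOY_` constants, and the helper functions are hypothetical stand-ins, and none of it uses the real SA or DMU interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_HAS_SEQ	(1ULL << 0)	/* assumption: real bit value not given */
#define	TOY_SEQ_RESET	0x1000u		/* assumption: placeholder magic constant */

struct toy_znode {
	uint64_t pflags;	/* in-core copy of the persistent flags */
	uint64_t seq;		/* in-core z_seq */
	uint64_t disk_flags;	/* stands in for the persisted pflags */
	uint64_t disk_seq;	/* stands in for the persisted SA_ZPL_SEQ */
};

/* may_grow is needed only until the file first carries the attribute. */
static inline bool
toy_seq_may_grow(const struct toy_znode *zp)
{
	return ((zp->pflags & TOY_HAS_SEQ) == 0);
}

/* Bump the counter, then persist the counter and the bit together. */
static inline void
toy_bump_and_persist(struct toy_znode *zp)
{
	zp->seq++;
	zp->pflags |= TOY_HAS_SEQ;
	zp->disk_flags = zp->pflags;
	zp->disk_seq = zp->seq;
}

/* Eviction plus reallocation: restore from "disk" only when the bit is set. */
static inline void
toy_evict_and_reload(struct toy_znode *zp)
{
	zp->pflags = zp->disk_flags;
	zp->seq = (zp->pflags & TOY_HAS_SEQ) ? zp->disk_seq : TOY_SEQ_RESET;
}

/* Self-check: growth is requested once, and the counter survives eviction. */
static inline void
toy_gating_demo(void)
{
	struct toy_znode zp = { 0, TOY_SEQ_RESET, 0, 0 };

	assert(toy_seq_may_grow(&zp));		/* first add: layout must grow */
	toy_bump_and_persist(&zp);
	assert(!toy_seq_may_grow(&zp));		/* steady state: no growth */

	uint64_t before = zp.seq;
	toy_evict_and_reload(&zp);
	assert(zp.seq == before);		/* counter survives eviction */
}
```

The model keeps the counter and the bit in the same update step, mirroring the "persisting both atomically" wording; a file that was never modified reloads with the reset value, which corresponds to the lazily migrated case.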
How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by.