Persist z_seq across znode eviction #18573

Open
ixhamza wants to merge 1 commit into openzfs:master from truenas:persist_znode_across_eviction

Conversation

@ixhamza
Member

@ixhamza ixhamza commented May 21, 2026

Motivation and Context

Commit 312bdab advertises STATX_ATTR_CHANGE_MONOTONIC to knfsd and builds the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq. zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any event that drops the znode from cache (memory pressure, remount, reboot) brings the file back with the same ctime.tv_sec upper bits but a smaller z_seq in the lower bits, regressing the cookie within the same second.

NFSv4 clients that trust the monotonicity contract treat a regressed cookie as evidence that the file's metadata cannot be relied on. VMware ESXi over NFSv4.1 reliably reproduces this: it reports "The file specified is not a virtual disk", and a VM stored on the affected ZFS dataset fails to power on.
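
The regression described above can be sketched with a small standalone model. This is only an illustration of the arithmetic, not the OpenZFS code: `model_change_cookie()`, `model_eviction_regresses()` and `MODEL_SEQ_INITIAL` are hypothetical names, and the constant's value is an arbitrary stand-in for whatever `zfs_znode_alloc()` assigns.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the constant z_seq is reset to on allocation. */
#define	MODEL_SEQ_INITIAL	0x7A4653ULL

/* Pack the cookie as described: seconds in the upper 32 bits, z_seq below. */
static uint64_t
model_change_cookie(uint64_t ctime_sec, uint64_t z_seq)
{
	return ((ctime_sec << 32) | (z_seq & 0xFFFFFFFFULL));
}

/*
 * Return 1 if evicting and reloading the znode within the same second
 * yields a smaller cookie than the one handed out before eviction.
 * 'bumps' is the number of z_seq increments that happened before eviction.
 */
static int
model_eviction_regresses(uint64_t ctime_sec, uint64_t bumps)
{
	uint64_t before = model_change_cookie(ctime_sec,
	    MODEL_SEQ_INITIAL + bumps);
	/* Reload: ctime is read back from disk, z_seq is reset. */
	uint64_t after = model_change_cookie(ctime_sec, MODEL_SEQ_INITIAL);

	return (after < before);
}

/* Small self-check of the model. */
static void
model_demo(void)
{
	/* Any modification before eviction makes the reloaded cookie smaller. */
	assert(model_eviction_regresses(1700000000ULL, 3));
	/* With no modification the cookie is unchanged, so no regression. */
	assert(!model_eviction_regresses(1700000000ULL, 0));
}
```

Because the upper 32 bits come from the on-disk ctime and are identical before and after the reload, the comparison is decided entirely by the lower bits, which is why resetting `z_seq` is enough to break monotonicity.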

Description

Persist zp->z_seq via a new SA attribute SA_ZPL_SEQ so it survives znode eviction. A new pflag bit ZFS_HAS_SEQ marks the file as carrying SA_ZPL_SEQ in its layout, mirroring the existing ZPL_PROJID/ZFS_PROJID pattern. The bit gates may_grow at SA tx-hold sites, choosing B_TRUE on the first add per file and B_FALSE thereafter, so steady-state operations pay no extra reservation.

A ZFS_PERSIST_SEQ() macro captures z_seq and sets the bit into the caller's bulk in one step, persisting both atomically alongside the file's other SA attributes. Every site that bumps z_seq uses it. zfs_znode_alloc() restores z_seq from SA_ZPL_SEQ when the bit is set.

No on-disk format change requiring a feature flag is needed. Older binaries preserve the new attribute and bit opaquely. The first modify by a patched binary lazily migrates each file.
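
As a rough illustration of the flow described above (flag-gated `may_grow`, persisting on every bump, restoring on reload), here is a toy model in plain C. Everything in it is an assumption made for the sketch: `struct model_znode`, `MODEL_HAS_SEQ`, and the `model_*` functions are hypothetical and do not reflect the real SA API, the actual `ZFS_HAS_SEQ` bit value, or the real macro body.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit; the real ZFS_HAS_SEQ value is not given in the PR text. */
#define	MODEL_HAS_SEQ	(1ULL << 0)

/* Toy znode: in-core state plus a stand-in for what is stored on disk. */
struct model_znode {
	uint64_t pflags;	/* in-core flags */
	uint64_t seq;		/* in-core z_seq */
	bool is_sa;		/* file uses system attributes */
	uint64_t disk_pflags;	/* persisted flags */
	uint64_t disk_seq;	/* persisted sequence attribute */
};

/* The SA layout only needs to grow on the first add for a given file. */
static bool
model_seq_may_grow(const struct model_znode *zp)
{
	return (zp->is_sa && !(zp->pflags & MODEL_HAS_SEQ));
}

/* Capture seq and set the flag together, as one update. */
static void
model_persist_seq(struct model_znode *zp)
{
	if (!zp->is_sa)
		return;		/* non-SA files never carry the attribute */
	zp->pflags |= MODEL_HAS_SEQ;
	zp->disk_pflags = zp->pflags;
	zp->disk_seq = zp->seq;
}

/* Simulate eviction followed by reload from disk. */
static void
model_reload(struct model_znode *zp, uint64_t initial_seq)
{
	zp->pflags = zp->disk_pflags;
	zp->seq = (zp->pflags & MODEL_HAS_SEQ) ? zp->disk_seq : initial_seq;
}
```

In this model a file that has never been modified by a patched binary reloads with the initial value, while a migrated file reloads with the persisted one, and `model_seq_may_grow()` returns true only before the first persist.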

How Has This Been Tested?

  • Before: ESXi VM fails to power on over NFSv4.1.
  • After: VM powers on successfully.
  • CI Testing

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Comment thread include/sys/zfs_znode.h Outdated
@ixhamza ixhamza force-pushed the persist_znode_across_eviction branch from eeb4661 to 56245e3 Compare June 1, 2026 21:14
@behlendorf behlendorf self-requested a review June 1, 2026 22:03
@ixhamza ixhamza force-pushed the persist_znode_across_eviction branch from 56245e3 to 3b204af Compare June 2, 2026 21:28
Comment thread include/sys/zfs_znode.h
*/
#define	ZFS_PERSIST_SEQ(zp, bulk, count) \
{ \
	if ((zp)->z_is_sa) { \
Contributor

I don't feel strongly about it, but my feeling is this would probably be more readable without the macro. I also wonder if the if ((zp)->z_is_sa) check is really needed. While I didn't audit every call site, in general it looks like it would be straightforward to add this only to the SA code paths.

Member Author

Without the guard, sa_bulk_update panics on non-SA znodes. SA_ZPL_PROJID has the same constraint but is gated via if (projid != ZFS_INVALID_PROJID). z_seq is always set, so we check z_is_sa directly.

Comment thread module/zfs/zfs_vnops.c
    &ctime, 16);
SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_SIZE(outzfsvfs), NULL,
    &outzp->z_size, 8);
if (outzp->z_is_sa)
Contributor

This must be true, so you shouldn't need the conditional here.

Member Author

Same reason as above.

Comment thread module/zfs/zfs_vnops.c Outdated
Commit 312bdab advertises STATX_ATTR_CHANGE_MONOTONIC and builds
the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq.
zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any
event that drops the znode from cache (memory pressure, remount,
reboot) regresses the lower bits of the cookie, a backward step
within the same second.

NFSv4 clients that trust this contract treat a regressed cookie as
evidence that the file's metadata cannot be relied on. VMware ESXi
over NFSv4.1 surfaces this as "The file specified is not a virtual
disk", and a VM stored on the affected NFS-exported ZFS dataset
fails to power on.

Widen z_seq to 64 bits and present it directly as the change_cookie,
dropping the ctime packing, so the cookie is a single monotonic
counter that no longer depends on the clock. FreeBSD's va_filerev
consumer also takes the wider value.

Persist z_seq via a new SA attribute SA_ZPL_SEQ. An in-core marker
zp->z_has_seq records whether the file already carries SA_ZPL_SEQ in
its layout; it is derived at load time and never stored on disk, so
no global pflag bit is consumed. ZFS_SEQ_MAY_GROW() keys off the
marker to grow the SA layout only on the first add per file;
ZFS_PERSIST_SEQ() then sets the marker and adds SEQ to the caller's
bulk alongside the file's other SA attributes. zfs_znode_alloc()
restores z_seq from SA_ZPL_SEQ when present and sets the marker;
zfs_rezget() recomputes the marker in place on rollback/recv without
disturbing the in-core z_seq, keeping the cookie monotonic.

A file written before this change carries no SA_ZPL_SEQ; on Linux it
is seeded with (ctime.tv_sec + 1) << 32 so the counter starts above
any pre-change cookie and stays monotonic across the upgrade. A
missing attribute is simply treated as not-yet-migrated, not an
error. FreeBSD never folded ctime into va_filerev, so it needs no
seed.

No feature flag or on-disk format change is needed: the new SA
attribute is keyed by name, so an implementation that does not know
it preserves it opaquely, and the first modify lazily migrates each
file. Covers both the Linux and FreeBSD ZPL.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
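
The seeding rule in the commit message above, (ctime.tv_sec + 1) << 32, can be checked with a short standalone sketch. The function names `model_legacy_cookie()` and `model_seed()` are hypothetical and exist only to illustrate the arithmetic; the sketch assumes the pre-change cookie used a 32-bit z_seq in its lower half, as the packing formula implies.

```c
#include <stdint.h>

/* Pre-change cookie: seconds in the upper 32 bits, 32-bit z_seq below. */
static uint64_t
model_legacy_cookie(uint64_t ctime_sec, uint32_t z_seq)
{
	return ((ctime_sec << 32) | (uint64_t)z_seq);
}

/* Seed for a file that has no persisted sequence attribute yet. */
static uint64_t
model_seed(uint64_t ctime_sec)
{
	return ((ctime_sec + 1) << 32);
}

/*
 * Return 1 if the seed is strictly above the largest cookie the old
 * scheme could have produced for the same ctime second.
 */
static int
model_seed_is_above_legacy(uint64_t ctime_sec)
{
	uint64_t worst_case = model_legacy_cookie(ctime_sec, UINT32_MAX);

	return (model_seed(ctime_sec) > worst_case);
}
```

The largest legacy cookie for a given second is (sec << 32) + 0xFFFFFFFF, which is exactly one less than ((sec + 1) << 32), so the seed sits just above every value the old packing could have handed out for that ctime.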
@ixhamza ixhamza force-pushed the persist_znode_across_eviction branch from 3b204af to 95dc629 Compare June 3, 2026 10:17

Labels

Status: Code Review Needed Ready for review and testing

3 participants