Skip to content

zap: add zap_cursor_init_by_dnode; cursor unit tests; mock dnode refcounts#18603

Closed
robn wants to merge 6 commits into
openzfs:masterfrom
robn:zap-unit-cursor
Closed

zap: add zap_cursor_init_by_dnode; cursor unit tests; mock dnode refcounts#18603
robn wants to merge 6 commits into
openzfs:masterfrom
robn:zap-unit-cursor

Conversation

@robn
Copy link
Copy Markdown
Member

@robn robn commented May 29, 2026

Motivation and Context

Continuing quest towards full unit test coverage for ZAPs

Description

The main thing is to add zap_cursor_init_by_dnode() (and its close friend, zap_cursor_init_serialized_by_dnode()).

This is not as simple as just renaming and adjusting the old code and making zap_cursor_init() wrap it, because cursors do not actually call zap_lock() and do stuff until the first call to zap_cursor_retrieve(), which may not be called at all. So instead we have to hold the dnode if provided until that call (if it ever comes), and for the objset+object version, take our own dnode hold there. And make sure we handle all the places where we don't end up using. Not that complicated in the end, but was a bit squirly along the way.

Edit: following review, we ended up doing away with all this, and just taking the ZAP hold (zap_lock() etc) from the init funcs. See the comments for more details. I kept the mock dnode refcounting though; it's not wrong and will undoubtedly be useful.

To help with this, I've added light refcounting to mock dnodes. Nothing fancy, but enough that in test cleanup, we can now test if the refcount has returned to 1 and fail the test if so. This has been wired through to all tests (actually into mock_zap_destroy(), which they all call). Then I used them to add some tests to help make sure I got it correct in zap_cursor_init_by_dnode(). Which I didn't, the first time! Very useful indeed!

Finally, the actual thing - cursor tests. These are just adding a bunch of things, walking over them and making sure the right things come back via a few different scenarios.

(All this is described in the commit messages and comments).

I've been holding back a couple of "cursor-related" test sets (zap_join and zap_value_search) until I had coverage for cursors proper. I was going to include them in this PR, but they have just enough of their own stuff going on that I think it would muddle review too much, so I'll get them in a future one.

I also note that zap_cursor_t and its setup is a little ugly now. I think its straightforward to tidy (mostly, split the "init" parts from the "cursor" parts in some way) but for now I wanted to keep the code changes as small as I could, and not change the memory footprint at all. I'll get to that once all this has settled.

How Has This Been Tested?

It's all tests! But I've run a handful of sanity checks from ZTS as well, just to make sure that normal zap_cursor_init() path isn't damaged, and those are fine too. I'll let CI alert to anything else.

Coverage before:

File                               Lines                Functions
-----------------------------------------------------------------
zap_micro.c       65.1%  (   155/   238)   78.6%  (    11/    14)
zap_leaf.c        53.4%  (   249/   466)   69.6%  (    16/    23)
zap_impl.c        50.4%  (   117/   232)   60.0%  (    15/    25)
zap_fat.c         41.5%  (   276/   665)   59.5%  (    22/    37)
zap.c             22.9%  (   139/   606)   14.1%  (    10/    71)
u8_textprep.c      0.0%  (     0/   802)    0.0%  (     0/    12)

after:

File                               Lines                Functions
-----------------------------------------------------------------
zap_micro.c       68.9%  (   164/   238)   92.9%  (    13/    14)
zap_leaf.c        60.9%  (   284/   466)   78.3%  (    18/    23)
zap_impl.c        57.8%  (   134/   232)   72.0%  (    18/    25)
zap_fat.c         47.1%  (   313/   665)   62.2%  (    23/    37)
zap.c             33.9%  (   207/   610)   23.0%  (    17/    74)
u8_textprep.c      0.0%  (     0/   802)    0.0%  (     0/    12)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Copy Markdown
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make some real benefit to postpone dnode hold in case of zap_cursor_init()? Is there some code that calls it, but then don't traverse, that would rationalize the complexity?

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label May 29, 2026
@behlendorf
Copy link
Copy Markdown
Contributor

Is there some code that calls it, but then don't traverse

The only case I can think of is when the ZAP is empty then there's nothing to traverse and no hold. But this doesn't seem like an optimization worth the complexity to me. Always taking the dhold in _init() and _init_by_node() would have the advantage of letting us assert in the zap_cursor_* functions there is always valid hold. While looking at the Lustre ZAP consumers I see one case when zap_cursor_retrieve() is called after zap_cursor_fini().

@robn
Copy link
Copy Markdown
Member Author

robn commented May 31, 2026

I would rather do the dnode and the zap_t hold in _init(), for sure. The tricky bit of doing it in zap_cursor_init() is that it makes it fallible, and we have to handle that:

  • changing the return to int and auditing all call sites is the "right" way, and not even very hard (they all have handling for zap_cursor_retrieve() failing, but it touches a lot of code right now.
  • using VERIFY0 for dnode_hold() and zap_lock_by_dnode(). Just no.
  • deferring the error response until the first call to zap_cursor_retrieve(). Keeps the interface the same, but its a bit of a cop-out when the point is to clean it all up.

I'll have a go of the first, see how it looks.

@robn robn force-pushed the zap-unit-cursor branch from 63675c1 to 6a75f75 Compare May 31, 2026 02:02
@robn
Copy link
Copy Markdown
Member Author

robn commented May 31, 2026

Ok, last push does all the setup in zap_cursor_init() and friends, changing them to int returns.

I started converting callers, but most ZAP cursor loops look this (from get_clones_stat_impl(), arbitrary choice):

	for (zap_cursor_init(&zc, mos,
	    dsl_dataset_phys(ds)->ds_next_clones_obj);
	    zap_cursor_retrieve(&zc, za) == 0;
	    zap_cursor_advance(&zc)) {

Error checking the init seperately ends up looking like:

	if (zap_cursor_init(&zc, mos,
	    dsl_dataset_phys(ds)->ds_next_clones_obj) == 0) {
		for(; zap_cursor_retrieve(&zc, za) == 0;
		    zap_cursor_advance(&zc)) {
			...
		}
		zap_cursor_fini(&zc);
	}

This is a lot harder to read, especially in the cases where the additional indent pushes the inner block beyond 80 columns. It's also kind of pointless because the loop already didn't care about cursor errors; treating (say) EIO and ENOENT exactly the same. There's probably a case to be made that these all need closer study to determine what they actually should do in case of an error, and I don't want to make that worse.

So instead I've gone closer to the "defer error report" idea, but since most callers don't actually care about the specific error, instead I've made it so a failed init will zero enough of the zap_cursor_t to let zap_cursor_retrieve() detect this case and just return EIO, and for zap_cursor_fini() to just do nothing. We can get away with that, because nothing looks inside zap_cursor_t directly (and rightly so, it says "this is opaque" in the comment). And if you are a more advanced caller and you do care about the distinction, then you can do the work.

I think I'm happy with this. Let me know if its cool, and I'll fold it back into the original commit and reorganise it all (no point cluttering up the commit history adding the dnode hold defer only to immediately remove it.

@robn robn force-pushed the zap-unit-cursor branch from 6a75f75 to 122cb2d Compare June 1, 2026 01:47
Copy link
Copy Markdown
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no objections, looks cleaner to me. Just zc_flags change is not needed any more, unless you have other plans for it, and CI seems not happy about something.

Comment thread module/zfs/zap.c Outdated
@behlendorf
Copy link
Copy Markdown
Contributor

Yeah, I'm pretty happy with how this turned out too. This does seem to have uncovered some issue though which the zpool_error_scrub_* tests are able to reliably hit.

    KILLED /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_error_scrub_002_pos
    KILLED /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_error_scrub_003_pos
    KILLED /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_error_scrub_004_pos

robn added 4 commits June 2, 2026 18:21
If the cursor were ever to actively hold resources, not finalising it
would mean leaking those resources whenever the scrub is paused.

The cursor is already reinitialized from the stored serialized form
if/when it is resumed, so there's nothing we need from the old one, just
to release it.

Sponsored-by: TrueNAS
Signed-off-by: Rob Norris <rob.norris@truenas.com>
This commit adds zap_cursor_init_by_dnode() (and
zap_cursor_init_serialized_by_dnode()), which allow the target ZAP to
provided via an existing dnode rather than the traditional objset+object
pair.

This requires some reorganisation of the way that zap_cursor_t is
initialised. Up until now, zap_cursor_init() has merely stored the
objset, object, serialized form and prefetch flag, and left it until
zap_cursor_retrieve() to actually call zap_lock(). This makes a
_by_dnode() form complicated, because it is a held resource that needs
to be released, but might not be used if zap_cursor_retrieve() is not
called. So there's a bunch of state tracking required.

However, all cursor users immediately follow zap_cursor_init() with
zap_cursor_retrieve(), so there's nothing gained by delaying holds. This
allows us to simplify things, by calling zap_lock() directly in
zap_cursor_init() and retaining it until zap_cursor_fini().

This does however means the _init() functions are now fallible, and can
return an error. This adds complexity to most of the call sites, which
are typically in a for loop of the form:

    for (zap_cursor_init(...);
      zap_cursor_retrieve(...) == 0;
      zap_cursor_advance(...))

To avoid needing to make significant changes at every call site, a
failed _init() call will also zero the cursor struct. If the caller
doesn't check the return and continues to zap_cursor_retrieve(), they
will get an EIO return, and zap_cursor_fini() will just return.

The existing zc_objset and zc_zapobj fields are retained to support
source backcompat for Lustre, which inspects them directly.

Sponsored-by: TrueNAS
Signed-off-by: Rob Norris <rob.norris@truenas.com>
The thing under test will be taking and releasing dnode refs/holds. By
counting them and exposing the current count, we can assert in test
cleanup that we haven't missed releasing any, especially in cases where
the hold is held across multiple test steps.

Sponsored-by: TrueNAS
Signed-off-by: Rob Norris <rob.norris@truenas.com>
It should be back at 1, where it started.

Sponsored-by: TrueNAS
Signed-off-by: Rob Norris <rob.norris@truenas.com>
@robn robn force-pushed the zap-unit-cursor branch from 122cb2d to 96c38f9 Compare June 2, 2026 08:29
@robn
Copy link
Copy Markdown
Member Author

robn commented Jun 2, 2026

Last push rebases to master and takes care of the outstanding comments.

  • the failing test was actually zpool_error_scrub_001_pos, which hung. It turns out that error scrub carries a live cursor around in dsl_scan_t, and it is reinitialised on resume, but not finalised on cancel. I wonder if that's ever been a source of leaks in the past? Anyway, a new first commit sorts that out, and the zpool_scrub tag runs to completion.
  • zc_objset and zc_zapobj have been restored, with some "do not use" commentary. This is source compatible, but not ABI compatible (layout change). I can do that if necessary but I'd rather not. Let me know.
  • reverted zc_flags to zc_prefetch. I prefer flags in general, especially because boolean_t is typically a whole int wide. But its a smaller diff, which is nice here. Easy enough to remake it as flags in the future.

Commit stack reordered, everything related to the ZAP changes proper now in a single commit. It's pretty nice I think, good feedback everyone!

Comment thread tests/unit/test_zap.c Outdated
robn added 2 commits June 2, 2026 23:29
These add a bunch of entries to the ZAP, and then ensure that a cursor
walk over the ZAP sees them all once and once only, and no others.

The serialization test takes it a bit further, by serializing and
recreating the cursor half way through and confirming it correctly picks
up from the same spot, and then recreating the cursor from serialized
again and confirming that it also see only the second set of entries.
This ensures that the serialized cursor state is fully self contained
and not reliant on anything left over in the ZAP itself at serialization
time.

Sponsored-by: TrueNAS
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Cursors defer taking holds until they're needed, so if a cursor is
created but not used, it may still hold resources that it would have
cleaned up along the way, but never got chance to.

(this really happened in the first version of
zap_cursor_init_by_dnode(), so not a contrived case!)

Sponsored-by: TrueNAS
Signed-off-by: Rob Norris <rob.norris@truenas.com>
@robn robn force-pushed the zap-unit-cursor branch from 96c38f9 to f8ea7e1 Compare June 2, 2026 13:29
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jun 2, 2026
@behlendorf behlendorf closed this in 68980eb Jun 2, 2026
behlendorf pushed a commit that referenced this pull request Jun 2, 2026
This commit adds zap_cursor_init_by_dnode() (and
zap_cursor_init_serialized_by_dnode()), which allow the target ZAP to
provided via an existing dnode rather than the traditional objset+object
pair.

This requires some reorganisation of the way that zap_cursor_t is
initialised. Up until now, zap_cursor_init() has merely stored the
objset, object, serialized form and prefetch flag, and left it until
zap_cursor_retrieve() to actually call zap_lock(). This makes a
_by_dnode() form complicated, because it is a held resource that needs
to be released, but might not be used if zap_cursor_retrieve() is not
called. So there's a bunch of state tracking required.

However, all cursor users immediately follow zap_cursor_init() with
zap_cursor_retrieve(), so there's nothing gained by delaying holds. This
allows us to simplify things, by calling zap_lock() directly in
zap_cursor_init() and retaining it until zap_cursor_fini().

This does however means the _init() functions are now fallible, and can
return an error. This adds complexity to most of the call sites, which
are typically in a for loop of the form:

    for (zap_cursor_init(...);
      zap_cursor_retrieve(...) == 0;
      zap_cursor_advance(...))

To avoid needing to make significant changes at every call site, a
failed _init() call will also zero the cursor struct. If the caller
doesn't check the return and continues to zap_cursor_retrieve(), they
will get an EIO return, and zap_cursor_fini() will just return.

The existing zc_objset and zc_zapobj fields are retained to support
source backcompat for Lustre, which inspects them directly.

Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18603
behlendorf pushed a commit that referenced this pull request Jun 2, 2026
The thing under test will be taking and releasing dnode refs/holds. By
counting them and exposing the current count, we can assert in test
cleanup that we haven't missed releasing any, especially in cases where
the hold is held across multiple test steps.

Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18603
behlendorf pushed a commit that referenced this pull request Jun 2, 2026
It should be back at 1, where it started.

Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18603
behlendorf pushed a commit that referenced this pull request Jun 2, 2026
These add a bunch of entries to the ZAP, and then ensure that a cursor
walk over the ZAP sees them all once and once only, and no others.

The serialization test takes it a bit further, by serializing and
recreating the cursor half way through and confirming it correctly picks
up from the same spot, and then recreating the cursor from serialized
again and confirming that it also see only the second set of entries.
This ensures that the serialized cursor state is fully self contained
and not reliant on anything left over in the ZAP itself at serialization
time.

Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18603
behlendorf pushed a commit that referenced this pull request Jun 2, 2026
Cursors defer taking holds until they're needed, so if a cursor is
created but not used, it may still hold resources that it would have
cleaned up along the way, but never got chance to.

(this really happened in the first version of
zap_cursor_init_by_dnode(), so not a contrived case!)

Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18603
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Accepted Ready to integrate (reviewed, tested)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants