zap: add `zap_cursor_init_by_dnode`; cursor unit tests; mock dnode refcounts by robn · Pull Request #18603 · openzfs/zfs

robn · 2026-05-29T06:42:08Z

Motivation and Context

Continuing quest towards full unit test coverage for ZAPs

Description

The main thing is to add zap_cursor_init_by_dnode() (and its close friend, zap_cursor_init_serialized_by_dnode()).

This is not as simple as just renaming and adjusting the old code and making zap_cursor_init() wrap it, because cursors do not actually call zap_lock() and do stuff until the first call to zap_cursor_retrieve(), which may not be called at all. So instead we have to hold the dnode if provided until that call (if it ever comes), and for the objset+object version, take our own dnode hold there. And make sure we handle all the places where we don't end up using. Not that complicated in the end, but was a bit squirly along the way.

Edit: following review, we ended up doing away with all this, and just taking the ZAP hold (zap_lock() etc) from the init funcs. See the comments for more details. I kept the mock dnode refcounting though; it's not wrong and will undoubtedly be useful.

To help with this, I've added light refcounting to mock dnodes. Nothing fancy, but enough that in test cleanup, we can now test if the refcount has returned to 1 and fail the test if so. This has been wired through to all tests (actually into mock_zap_destroy(), which they all call). Then I used them to add some tests to help make sure I got it correct in zap_cursor_init_by_dnode(). Which I didn't, the first time! Very useful indeed!

Finally, the actual thing - cursor tests. These are just adding a bunch of things, walking over them and making sure the right things come back via a few different scenarios.

(All this is described in the commit messages and comments).

I've been holding back a couple of "cursor-related" test sets (zap_join and zap_value_search) until I had coverage for cursors proper. I was going to include them in this PR, but they have just enough of their own stuff going on that I think it would muddle review too much, so I'll get them in a future one.

I also note that zap_cursor_t and its setup is a little ugly now. I think its straightforward to tidy (mostly, split the "init" parts from the "cursor" parts in some way) but for now I wanted to keep the code changes as small as I could, and not change the memory footprint at all. I'll get to that once all this has settled.

How Has This Been Tested?

It's all tests! But I've run a handful of sanity checks from ZTS as well, just to make sure that normal zap_cursor_init() path isn't damaged, and those are fine too. I'll let CI alert to anything else.

Coverage before:

File                               Lines                Functions
-----------------------------------------------------------------
zap_micro.c       65.1%  (   155/   238)   78.6%  (    11/    14)
zap_leaf.c        53.4%  (   249/   466)   69.6%  (    16/    23)
zap_impl.c        50.4%  (   117/   232)   60.0%  (    15/    25)
zap_fat.c         41.5%  (   276/   665)   59.5%  (    22/    37)
zap.c             22.9%  (   139/   606)   14.1%  (    10/    71)
u8_textprep.c      0.0%  (     0/   802)    0.0%  (     0/    12)

after:

File                               Lines                Functions
-----------------------------------------------------------------
zap_micro.c       68.9%  (   164/   238)   92.9%  (    13/    14)
zap_leaf.c        60.9%  (   284/   466)   78.3%  (    18/    23)
zap_impl.c        57.8%  (   134/   232)   72.0%  (    18/    25)
zap_fat.c         47.1%  (   313/   665)   62.2%  (    23/    37)
zap.c             33.9%  (   207/   610)   23.0%  (    17/    74)
u8_textprep.c      0.0%  (     0/   802)    0.0%  (     0/    12)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

amotin

Does it make some real benefit to postpone dnode hold in case of zap_cursor_init()? Is there some code that calls it, but then don't traverse, that would rationalize the complexity?

behlendorf · 2026-05-29T21:15:01Z

Is there some code that calls it, but then don't traverse

The only case I can think of is when the ZAP is empty then there's nothing to traverse and no hold. But this doesn't seem like an optimization worth the complexity to me. Always taking the dhold in _init() and _init_by_node() would have the advantage of letting us assert in the zap_cursor_* functions there is always valid hold. While looking at the Lustre ZAP consumers I see one case when zap_cursor_retrieve() is called after zap_cursor_fini().

robn · 2026-05-31T00:13:04Z

I would rather do the dnode and the zap_t hold in _init(), for sure. The tricky bit of doing it in zap_cursor_init() is that it makes it fallible, and we have to handle that:

changing the return to int and auditing all call sites is the "right" way, and not even very hard (they all have handling for zap_cursor_retrieve() failing, but it touches a lot of code right now.
using VERIFY0 for dnode_hold() and zap_lock_by_dnode(). Just no.
deferring the error response until the first call to zap_cursor_retrieve(). Keeps the interface the same, but its a bit of a cop-out when the point is to clean it all up.

I'll have a go of the first, see how it looks.

robn · 2026-05-31T02:16:45Z

Ok, last push does all the setup in zap_cursor_init() and friends, changing them to int returns.

I started converting callers, but most ZAP cursor loops look this (from get_clones_stat_impl(), arbitrary choice):

	for (zap_cursor_init(&zc, mos,
	    dsl_dataset_phys(ds)->ds_next_clones_obj);
	    zap_cursor_retrieve(&zc, za) == 0;
	    zap_cursor_advance(&zc)) {

Error checking the init seperately ends up looking like:

	if (zap_cursor_init(&zc, mos,
	    dsl_dataset_phys(ds)->ds_next_clones_obj) == 0) {
		for(; zap_cursor_retrieve(&zc, za) == 0;
		    zap_cursor_advance(&zc)) {
			...
		}
		zap_cursor_fini(&zc);
	}

This is a lot harder to read, especially in the cases where the additional indent pushes the inner block beyond 80 columns. It's also kind of pointless because the loop already didn't care about cursor errors; treating (say) EIO and ENOENT exactly the same. There's probably a case to be made that these all need closer study to determine what they actually should do in case of an error, and I don't want to make that worse.

So instead I've gone closer to the "defer error report" idea, but since most callers don't actually care about the specific error, instead I've made it so a failed init will zero enough of the zap_cursor_t to let zap_cursor_retrieve() detect this case and just return EIO, and for zap_cursor_fini() to just do nothing. We can get away with that, because nothing looks inside zap_cursor_t directly (and rightly so, it says "this is opaque" in the comment). And if you are a more advanced caller and you do care about the distinction, then you can do the work.

I think I'm happy with this. Let me know if its cool, and I'll fold it back into the original commit and reorganise it all (no point cluttering up the commit history adding the dnode hold defer only to immediately remove it.

amotin

I have no objections, looks cleaner to me. Just zc_flags change is not needed any more, unless you have other plans for it, and CI seems not happy about something.

behlendorf · 2026-06-01T18:02:19Z

Yeah, I'm pretty happy with how this turned out too. This does seem to have uncovered some issue though which the zpool_error_scrub_* tests are able to reliably hit.

    KILLED /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_error_scrub_002_pos
    KILLED /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_error_scrub_003_pos
    KILLED /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_error_scrub_004_pos

If the cursor were ever to actively hold resources, not finalising it would mean leaking those resources whenever the scrub is paused. The cursor is already reinitialized from the stored serialized form if/when it is resumed, so there's nothing we need from the old one, just to release it. Sponsored-by: TrueNAS Signed-off-by: Rob Norris <rob.norris@truenas.com>

This commit adds zap_cursor_init_by_dnode() (and zap_cursor_init_serialized_by_dnode()), which allow the target ZAP to provided via an existing dnode rather than the traditional objset+object pair. This requires some reorganisation of the way that zap_cursor_t is initialised. Up until now, zap_cursor_init() has merely stored the objset, object, serialized form and prefetch flag, and left it until zap_cursor_retrieve() to actually call zap_lock(). This makes a _by_dnode() form complicated, because it is a held resource that needs to be released, but might not be used if zap_cursor_retrieve() is not called. So there's a bunch of state tracking required. However, all cursor users immediately follow zap_cursor_init() with zap_cursor_retrieve(), so there's nothing gained by delaying holds. This allows us to simplify things, by calling zap_lock() directly in zap_cursor_init() and retaining it until zap_cursor_fini(). This does however means the _init() functions are now fallible, and can return an error. This adds complexity to most of the call sites, which are typically in a for loop of the form: for (zap_cursor_init(...); zap_cursor_retrieve(...) == 0; zap_cursor_advance(...)) To avoid needing to make significant changes at every call site, a failed _init() call will also zero the cursor struct. If the caller doesn't check the return and continues to zap_cursor_retrieve(), they will get an EIO return, and zap_cursor_fini() will just return. The existing zc_objset and zc_zapobj fields are retained to support source backcompat for Lustre, which inspects them directly. Sponsored-by: TrueNAS Signed-off-by: Rob Norris <rob.norris@truenas.com>

The thing under test will be taking and releasing dnode refs/holds. By counting them and exposing the current count, we can assert in test cleanup that we haven't missed releasing any, especially in cases where the hold is held across multiple test steps. Sponsored-by: TrueNAS Signed-off-by: Rob Norris <rob.norris@truenas.com>

It should be back at 1, where it started. Sponsored-by: TrueNAS Signed-off-by: Rob Norris <rob.norris@truenas.com>

robn · 2026-06-02T08:43:51Z

Last push rebases to master and takes care of the outstanding comments.

the failing test was actually zpool_error_scrub_001_pos, which hung. It turns out that error scrub carries a live cursor around in dsl_scan_t, and it is reinitialised on resume, but not finalised on cancel. I wonder if that's ever been a source of leaks in the past? Anyway, a new first commit sorts that out, and the zpool_scrub tag runs to completion.
zc_objset and zc_zapobj have been restored, with some "do not use" commentary. This is source compatible, but not ABI compatible (layout change). I can do that if necessary but I'd rather not. Let me know.
reverted zc_flags to zc_prefetch. I prefer flags in general, especially because boolean_t is typically a whole int wide. But its a smaller diff, which is nice here. Easy enough to remake it as flags in the future.

Commit stack reordered, everything related to the ZAP changes proper now in a single commit. It's pretty nice I think, good feedback everyone!

These add a bunch of entries to the ZAP, and then ensure that a cursor walk over the ZAP sees them all once and once only, and no others. The serialization test takes it a bit further, by serializing and recreating the cursor half way through and confirming it correctly picks up from the same spot, and then recreating the cursor from serialized again and confirming that it also see only the second set of entries. This ensures that the serialized cursor state is fully self contained and not reliant on anything left over in the ZAP itself at serialization time. Sponsored-by: TrueNAS Signed-off-by: Rob Norris <rob.norris@truenas.com>

Cursors defer taking holds until they're needed, so if a cursor is created but not used, it may still hold resources that it would have cleaned up along the way, but never got chance to. (this really happened in the first version of zap_cursor_init_by_dnode(), so not a contrived case!) Sponsored-by: TrueNAS Signed-off-by: Rob Norris <rob.norris@truenas.com>

This commit adds zap_cursor_init_by_dnode() (and zap_cursor_init_serialized_by_dnode()), which allow the target ZAP to provided via an existing dnode rather than the traditional objset+object pair. This requires some reorganisation of the way that zap_cursor_t is initialised. Up until now, zap_cursor_init() has merely stored the objset, object, serialized form and prefetch flag, and left it until zap_cursor_retrieve() to actually call zap_lock(). This makes a _by_dnode() form complicated, because it is a held resource that needs to be released, but might not be used if zap_cursor_retrieve() is not called. So there's a bunch of state tracking required. However, all cursor users immediately follow zap_cursor_init() with zap_cursor_retrieve(), so there's nothing gained by delaying holds. This allows us to simplify things, by calling zap_lock() directly in zap_cursor_init() and retaining it until zap_cursor_fini(). This does however means the _init() functions are now fallible, and can return an error. This adds complexity to most of the call sites, which are typically in a for loop of the form: for (zap_cursor_init(...); zap_cursor_retrieve(...) == 0; zap_cursor_advance(...)) To avoid needing to make significant changes at every call site, a failed _init() call will also zero the cursor struct. If the caller doesn't check the return and continues to zap_cursor_retrieve(), they will get an EIO return, and zap_cursor_fini() will just return. The existing zc_objset and zc_zapobj fields are retained to support source backcompat for Lustre, which inspects them directly. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18603

The thing under test will be taking and releasing dnode refs/holds. By counting them and exposing the current count, we can assert in test cleanup that we haven't missed releasing any, especially in cases where the hold is held across multiple test steps. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18603

It should be back at 1, where it started. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18603

These add a bunch of entries to the ZAP, and then ensure that a cursor walk over the ZAP sees them all once and once only, and no others. The serialization test takes it a bit further, by serializing and recreating the cursor half way through and confirming it correctly picks up from the same spot, and then recreating the cursor from serialized again and confirming that it also see only the second set of entries. This ensures that the serialized cursor state is fully self contained and not reliant on anything left over in the ZAP itself at serialization time. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18603

Cursors defer taking holds until they're needed, so if a cursor is created but not used, it may still hold resources that it would have cleaned up along the way, but never got chance to. (this really happened in the first version of zap_cursor_init_by_dnode(), so not a contrived case!) Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18603

amotin reviewed May 29, 2026

View reviewed changes

behlendorf added the Status: Code Review Needed Ready for review and testing label May 29, 2026

robn force-pushed the zap-unit-cursor branch from 63675c1 to 6a75f75 Compare May 31, 2026 02:02

robn force-pushed the zap-unit-cursor branch from 6a75f75 to 122cb2d Compare June 1, 2026 01:47

amotin approved these changes Jun 1, 2026

View reviewed changes

behlendorf reviewed Jun 1, 2026

View reviewed changes

Comment thread module/zfs/zap.c Outdated

robn added 4 commits June 2, 2026 18:21

unit/zap: check mock dnode refcount before destruction

1fb0b06

It should be back at 1, where it started. Sponsored-by: TrueNAS Signed-off-by: Rob Norris <rob.norris@truenas.com>

robn force-pushed the zap-unit-cursor branch from 122cb2d to 96c38f9 Compare June 2, 2026 08:29

amotin reviewed Jun 2, 2026

View reviewed changes

Comment thread tests/unit/test_zap.c Outdated

robn added 2 commits June 2, 2026 23:29

robn force-pushed the zap-unit-cursor branch from 96c38f9 to f8ea7e1 Compare June 2, 2026 13:29

amotin approved these changes Jun 2, 2026

View reviewed changes

behlendorf approved these changes Jun 2, 2026

View reviewed changes

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jun 2, 2026

behlendorf closed this in 68980eb Jun 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zap: add `zap_cursor_init_by_dnode`; cursor unit tests; mock dnode refcounts#18603

zap: add `zap_cursor_init_by_dnode`; cursor unit tests; mock dnode refcounts#18603
robn wants to merge 6 commits into
openzfs:masterfrom
robn:zap-unit-cursor

robn commented May 29, 2026 •

edited

Loading

Uh oh!

amotin left a comment

Uh oh!

behlendorf commented May 29, 2026

Uh oh!

robn commented May 31, 2026

Uh oh!

robn commented May 31, 2026 •

edited

Loading

Uh oh!

amotin left a comment

Uh oh!

Uh oh!

behlendorf commented Jun 1, 2026

Uh oh!

robn commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

robn commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

amotin left a comment

Choose a reason for hiding this comment

Uh oh!

behlendorf commented May 29, 2026

Uh oh!

robn commented May 31, 2026

Uh oh!

robn commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amotin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

behlendorf commented Jun 1, 2026

Uh oh!

robn commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

robn commented May 29, 2026 •

edited

Loading

robn commented May 31, 2026 •

edited

Loading