zap: TinyZAP for multi-uint64 entries. #18568

Open
akashb-22 wants to merge 1 commit into openzfs:master from akashb-22:tinyzap_blob2
Contributor

@akashb-22 akashb-22 commented May 20, 2026

Introduce TinyZAP, a new on-disk ZAP format between MicroZAP and FatZAP. MicroZAP is limited to 1×uint64 values and 49-char keys; any wider entry forces a full FatZAP upgrade. TinyZAP avoids this for the common case of multi-integer values (e.g., Lustre FIDs) and long key names.

Signed-off-by: Akash B <akash-b@hpe.com>

Motivation and Context

This PR introduces TinyZAP, a new on-disk ZAP format that sits between MicroZAP and FatZAP in the ZAP format hierarchy. TinyZAP extends MicroZAP to efficiently handle multi-word values and long key names without the overhead of a full FatZAP upgrade.
The primary motivation is workloads like Lustre that store multi-integer values (e.g., FIDs: 2-3 x uint64_t) or long filenames in ZAP objects. Previously, these always created a FatZAP, consuming significantly more on-disk space and memory than necessary.

ZAP Format Hierarchy (After This Change)
MicroZAP -> TinyZAP -> FatZAP

Description

TinyZAP reuses the existing mzap_phys_t block format. The key change is repurposing one of the five padding words (mz_pad[5] -> mz_pad[4]) as mz_flags:

mzap_phys_t {
...
mz_flags        [8B]   TinyZAP: magic | chunk_log2 | stride
...
}

mz_flags Bit Layout:

[63:32]  0x54494E59 ("TINY") - TinyZAP magic marker
[31:24]  chunk_log2 - log₂ of chunk size (6=64B, 7=128B, 8=256B)
[23:16]  stride - value width in bytes (multiple of 8)
[15: 0]  reserved - zero

When mz_flags == 0, the block is a plain MicroZAP. When bits [63:32] equal 0x54494E59, the TinyZAP layout applies to all chunk slots.
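
As a reading aid, the bit layout above could be packed and unpacked roughly as follows. This is a minimal sketch, not the code in this PR: the helper names (`tzap_flags_encode`, `tzap_flags_is_tiny`, `tzap_flags_chunk_size`, `tzap_flags_stride`) are hypothetical, and the extra sanity checks (chunk_log2 in 6..8, stride a non-zero multiple of 8, reserved bits zero) are assumptions drawn from the layout description.

```c
#include <stdbool.h>
#include <stdint.h>

#define	TZAP_MAGIC	0x54494E59ULL	/* "TINY", bits [63:32] */

/* Hypothetical helper: pack magic, chunk_log2 and stride into mz_flags. */
static inline uint64_t
tzap_flags_encode(uint8_t chunk_log2, uint8_t stride)
{
	return ((TZAP_MAGIC << 32) | ((uint64_t)chunk_log2 << 24) |
	    ((uint64_t)stride << 16));
}

/*
 * Hypothetical helper: true only if flags carry the magic and a
 * plausible geometry. flags == 0 (plain MicroZAP) returns false.
 */
static inline bool
tzap_flags_is_tiny(uint64_t flags)
{
	uint64_t chunk_log2 = (flags >> 24) & 0xff;
	uint64_t stride = (flags >> 16) & 0xff;

	if ((flags >> 32) != TZAP_MAGIC)
		return (false);
	if ((flags & 0xffff) != 0)	/* reserved bits must be zero */
		return (false);
	if (chunk_log2 < 6 || chunk_log2 > 8)
		return (false);
	return (stride != 0 && stride % 8 == 0);
}

/* Chunk size in bytes, or 0 if flags do not describe a TinyZAP. */
static inline uint32_t
tzap_flags_chunk_size(uint64_t flags)
{
	if (!tzap_flags_is_tiny(flags))
		return (0);
	return (1U << ((flags >> 24) & 0xff));
}

/* Value width in bytes, or 0 if flags do not describe a TinyZAP. */
static inline uint32_t
tzap_flags_stride(uint64_t flags)
{
	if (!tzap_flags_is_tiny(flags))
		return (0);
	return ((uint32_t)((flags >> 16) & 0xff));
}
```

With this encoding, `tzap_flags_encode(6, 16)` (64-byte chunks, 2×uint64 values) is `0x54494E5906100000`.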

Other details on ZAP upgrade conditions:

Supported geometries and their use-cases:

  chunk | stride | name_len | integers | use-case
  ------+--------+----------+----------+------------------------------
    64  |   16   |    44    |    2     | 2×uint64 (Lustre FID)
    64  |   24   |    36    |    3     | 3×uint64
    64  |   32   |    28    |    4     | 4×uint64
   128  |    8   |   116    |    1     | 1×uint64 + long name
   128  |   16   |   108    |    2     | 2×uint64 + long name
   128  |   48   |    76    |    6     | 3×FID (Lustre)
   256  |    8   |   244    |    1     | 1×uint64 + very long name
   256  |   16   |   236    |    2     | 2×uint64 + very long name

stride=8 with chunk=64 is intentionally skipped. It provides only 52 bytes for the name, barely 2 bytes more than MicroZAP's 50, so TinyZAP starts at chunk=128 for all stride=8 (single-integer, long-name) cases.
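
Every row of the table is consistent with name_len = chunk - stride - 4 and integers = stride / 8. A small sketch of that arithmetic follows; the function names and the `TZAP_ENT_OVERHEAD` macro are hypothetical, and the formula is inferred from the table and from the `stride + 4 + ...` expression in the promotion conditions rather than copied from the patch.

```c
#include <stdint.h>

/*
 * Bytes per chunk used by neither the value nor the name. The value 4
 * comes from the "stride + 4 + ..." expression in this description.
 */
#define	TZAP_ENT_OVERHEAD	4

/*
 * Name bytes available in one chunk. Assumed to include the
 * terminating NUL, since the conditions require strlen(key) to be
 * strictly less than this.
 */
static inline uint32_t
tzap_name_len(uint32_t chunk, uint32_t stride)
{
	return (chunk - stride - TZAP_ENT_OVERHEAD);
}

/* Number of uint64_t values stored per entry. */
static inline uint32_t
tzap_num_integers(uint32_t stride)
{
	return (stride / 8);
}
```

For the skipped geometry, `tzap_name_len(64, 8)` gives the 52 bytes mentioned above, against MicroZAP's MZAP_NAME_LEN of 64 - 8 - 4 - 2 = 50.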

MicroZAP -> TinyZAP Conditions:
Promotion to TinyZAP is attempted automatically on the first zap_add() when the entry fails the MicroZAP test. All of the following must hold:

  • integer_size == 8 (only uint64_t values are supported).
  • stride = num_integers × 8 >= 8 bytes.
  • At least one chunk size (64/128/256) can accommodate the entry: stride + 4 + TZAP_MIN_NAME_LEN <= chunk and strlen(key) < TZAP_NAME_LEN(chunk, stride).
  • The ZAP object currently has zero entries; the stride is stamped once on the very first add and cannot be changed.

The smallest fitting chunk is selected (chunk=64 first, then 128, then 256), except that stride=8 always starts at chunk=128.
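
The conditions above can be sketched as a single chunk-selection function. This is only an illustration of the rules as listed in this description: `tzap_pick_chunk` is a hypothetical name, `TZAP_ENT_OVERHEAD` is a made-up macro for the constant 4, and the value used for `TZAP_MIN_NAME_LEN` is an assumption (the description does not state it; 28 is picked because it is the smallest name_len in the geometry table).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	TZAP_ENT_OVERHEAD	4	/* the "+ 4" in the conditions */
#define	TZAP_MIN_NAME_LEN	28	/* ASSUMED value, not from the PR */

/*
 * Hypothetical helper: returns the chunk size (64, 128 or 256) to stamp
 * on the first zap_add(), or 0 if the entry cannot be a TinyZAP entry.
 * Assumes the entry has already failed the MicroZAP test.
 */
static inline uint32_t
tzap_pick_chunk(int integer_size, uint64_t num_integers, const char *key,
    uint64_t num_entries)
{
	static const uint32_t chunks[] = { 64, 128, 256 };
	size_t keylen = strlen(key);
	uint32_t stride;

	/* Only uint64_t values, and only on an empty ZAP. */
	if (integer_size != 8 || num_entries != 0)
		return (0);
	/* Bound num_integers so the stride cannot overflow. */
	if (num_integers == 0 || num_integers > 256 / 8)
		return (0);
	stride = (uint32_t)num_integers * 8;

	for (size_t i = 0; i < sizeof (chunks) / sizeof (chunks[0]); i++) {
		uint32_t chunk = chunks[i];

		/* stride=8 never uses 64-byte chunks. */
		if (stride == 8 && chunk == 64)
			continue;
		if (stride + TZAP_ENT_OVERHEAD + TZAP_MIN_NAME_LEN > chunk)
			continue;
		/* The key plus its NUL must fit in the name area. */
		if (keylen < chunk - stride - TZAP_ENT_OVERHEAD)
			return (chunk);
	}
	return (0);
}
```

Under these assumptions a 2×uint64 entry with a short key lands in a 64-byte chunk, while the same value with a 44-character key spills to 128 bytes, because 44 is not strictly less than the 44-byte name area.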

TinyZAP -> FatZAP Conditions:
Once a ZAP is promoted to TinyZAP, it stays TinyZAP as long as all subsequent entries match the stamped geometry. A FatZAP upgrade is forced when any of the following occur:

  • integer_size != 8
  • num_integers != stride / 8 (different value width than stamped)
  • strlen(key) >= TZAP_NAME_LEN(chunk, stride) (name too long for the stamped chunk)
  • All chunk slots are full (zap_num_entries >= zap_num_chunks) and the block cannot grow further
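
Expressed as a predicate, the four upgrade triggers might look like the sketch below. The function name, the parameter list and the `can_grow` flag are hypothetical; the name-length formula chunk - stride - 4 is inferred from the geometry table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if adding an entry with the given
 * shape to a TinyZAP stamped with (chunk, stride) must force a FatZAP
 * upgrade, following the four conditions listed above.
 */
static inline bool
tzap_needs_fat_upgrade(uint32_t chunk, uint32_t stride, int integer_size,
    uint64_t num_integers, size_t keylen, uint64_t num_entries,
    uint64_t num_chunks, bool can_grow)
{
	if (integer_size != 8)
		return (true);
	if (num_integers != stride / 8)
		return (true);	/* value width differs from stamped */
	if (keylen >= chunk - stride - 4)
		return (true);	/* name too long for stamped chunk */
	if (num_entries >= num_chunks && !can_grow)
		return (true);	/* block is full and cannot grow */
	return (false);
}
```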

Plain MicroZAP -> FatZAP (Unchanged)
If TinyZAP promotion fails (no fitting chunk, integer_size != 8, ZAP already has entries with a different geometry), the existing MicroZAP -> FatZAP path is taken (unchanged).

others:
SPA Feature Flag: org.hpe:tinyzap
A new pool feature SPA_FEATURE_TINYZAP is introduced (org.hpe:tinyzap):
Not read-only compatible: pools with TinyZAP objects cannot be imported read-only by older ZFS software.

How Has This Been Tested?

Before the patch using Lustre (FatZap):

# mkdir testdir1 && touch testfile1
# du --si test*
100k    testdir1
1.1k    testfile1

Performance:

416 tasks, 1248000 files/directories
SUMMARY rate: (of 3 iterations) (op/sec)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   Directory creation          40183.030      33505.043      37718.225       3666.264
   Directory stat             150247.149     140618.623     145442.903       4814.295
   Directory removal           69091.318      55062.863      60385.987       7601.242

Total space taken by 1.25 million directories: Total Size: 117.7G (98.86 KB/Inode)

After this patch using Lustre (TinyZap):

# mkdir testdir1 && touch testfile1
# du --si test*
1.1k    testdir1
1.1k    testfile1

Total space taken by 1.25 million directories: Total Size: 3.7G (3.09 KB/Inode)

Performance:

416 tasks, 1248000 files/directories
SUMMARY rate: (of 3 iterations) (op/sec)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   Directory creation         106757.610     100377.356     104069.288       3306.408
   Directory stat             296205.362     225769.238     262180.060      35278.604
   Directory removal          155819.816     134755.537     145367.644      10533.050

Overall summary of the results:
For draid2:9d:12c:1s-0 (flash MDT and 4 OSTs):
Directory creation improved by +176% - over 2.75x faster, exceeding 100K ops/sec
Directory removal improved by +141% - over 2.4x faster, exceeding 145K ops/sec
Directory stat improved by +97% at peak - nearly 2× faster, approaching 300K ops/sec

Space Efficiency:
Almost 99% reduction in the on-disk size of an empty directory (100k -> 1.1k).
TinyZAP (1-2 KB) vs. FatZAP (100-130 KB).
For 1.25 million directories (TinyZAP: 3.7G (3.09 KB/Inode) vs. FatZAP: 117.7G (98.86 KB/Inode)), ~32x reduction
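
For anyone re-deriving the summary figures from the tables above, the arithmetic is sketched here (helper names are hypothetical). Using the reported means, creation works out to roughly +176% and removal to roughly +141%; using the reported maxima, stat works out to roughly +97%; 117.7G / 3.7G is close to 32x.

```c
/* Percentage improvement of "after" over "before". */
static inline double
percent_gain(double before, double after)
{
	return ((after / before - 1.0) * 100.0);
}

/* How many times smaller "after" is than "before". */
static inline double
reduction_factor(double before, double after)
{
	return (before / after);
}
```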

TODO:

  1. Fix the checkstyle and other rebase-related issues.
  2. I have some local bash tests, some of which I will probably add to zfs-tests in the next push.
  3. I'm running a few tests with Lustre, which may cause a small delay here.

@akashb-22
Contributor Author

@behlendorf @robn ^^ Please let me know your thoughts on this.

@robn robn self-requested a review May 20, 2026 22:26
@behlendorf behlendorf added the Status: Design Review Needed Architecture or design is under discussion label May 20, 2026
@akashb-22 akashb-22 force-pushed the tinyzap_blob2 branch 2 times, most recently from e87779e to ccf54f6 Compare May 21, 2026 14:36
Contributor

@behlendorf behlendorf left a comment

This is shaping up nicely!

Comment thread module/zcommon/zfeature_common.c Outdated
"Support for variable-stride, variable-chunk ZAP for "
"multi-integer and long-name directory entries without "
"FatZAP overhead.",
0, ZFEATURE_TYPE_BOOLEAN, NULL, sfeatures);
Contributor

ZFEATURE_FLAG_MOS may need to be set here. If not, then we need to make sure the MOS will not be upgraded to a TinyZAP. It shouldn't be, but really there's nothing preventing it except the upgrade policy.

Member

Agree on at least ZFEATURE_FLAG_MOS.

The longest single string that could plausibly be a MOS ZAP key is a tie at 39 bytes: com.delphix:obsolete_counts_are_precise and com.delphix:vdev_initialize_action_time.

Of course, we could be constructing keys somewhere (I don't think so). Longest on my pool is 36 bytes, org.zfsonlinux:vdev_trim_action_time.

So there's wiggle room, so long as no mystery private ZFS downstream has some extra-long keys in play.

Should we disable TinyZAP on the MOS entirely? Probably no, since I don't have an argument for it beyond ZAPs being very old and I'm nervous, but I should at least say it.

Member

(It's slightly fiddly to disable it for only objset 0 without a "never upgrade to TinyZAP" flag or a special case in the upgrade policy code, which both suck, so that's probably a reason against, but idk).

Contributor Author

Yeah, I will include this in the next push.
I think MOS ZAP objects (e.g., the config ZAP and the features ZAP itself) are read before spa_feature_is_active() is reliable? I think adding ZFEATURE_FLAG_MOS would be the right thing, probably, or stay defensive, and I don't have any case to support it being helpful?

Comment thread module/zfs/zap_tiny.c
#
# Copyright (c) 2013, 2014 by Delphix. All rights reserved.
# Copyright 2016 Nexenta Systems, Inc. All rights reserved.
# Copyright (c) 2026, Hewlett Packard Enterprise Development LP.
Contributor

Rob's proposed unit test framework in #18564 would be an ideal way to exercise the new TinyZAP code. In addition to basic unit tests (add/remove/lookup) we can verify the various promotion paths behave as intended (MicroZAP -> TinyZAP -> FatZAP, etc).

Member

Since the full suite isn't upstreamed yet, I'm intending to wire what I have up to this PR and see what falls out. I'll share the results soon.

But also, if this PR lands before the test suite does, I'll be sure to include coverage in the test suite. You get grandfathered in 👴

Contributor Author

Cool, I'll check it out.
I think I'll gradually get the reviews and the required changes for this patch, and probably let's see how things go later.

Member

@robn robn left a comment

I'm totally on-board with the idea, but this all seems very convoluted to me.

If I'm understanding all this correctly, a MicroZAP is effectively the same as a TinyZAP with chunk=6 (64B) and stride=1 (8B)?

If so, I'd suggest the code would be a lot nicer by actually making the entire implementation about TinyZAPs (by structure if not by name), and just special-casing MicroZAPs: if we don't see MZAP_FLAG_TINY, then use chunk=6, stride=1 and do the extra MZAP_NAME_LEN check in the add and upgrade paths.

If you fold all those checks and math into a small number of macros or inline functions (which you basically already have), then it seems like this PR should be almost entirely a mechanical conversion, plus the feature flag handling code.

Because of this, my review comments are either small style nits, or design queries that I think would apply regardless of the structure. Whichever way it goes, I'll need another review round.

Comment thread include/sys/zap_impl.h Outdated
* Otherwise the ZAP upgrades directly to FatZAP as before.
*
* mz_flags layout:
* [63:32] 0x54494E59 ("TINY") - TinyZAP magic
Member

This is overkill. We already have mz_block_type == ZBT_MICRO, so only one additional bit is needed to indicate that additional config is available.

Also, none of these are "flags". There's a few odd names in this PR, but this is the most egregious.

I would do something more like:

#define	MZAP_FLAG_TINY	(1 << 0)

typedef struct mzap_phys {
	uint64_t mz_block_type;	/* ZBT_MICRO */
	uint64_t mz_salt;
	uint64_t mz_normflags;
	uint8_t mz_flags;
	uint8_t mz_chunk_shift;
	uint8_t mz_value_max;
	uint8_t mz_pad1;
	uint64_t mz_pad2[4];
	/* actually variable size depending on block size */
	mzap_ent_phys_t mz_chunk[];
} mzap_phys_t;

(and adjust byteswap and everything else to taste, but you get what I mean).
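
One property worth checking with any such relayout is that the header stays 64 bytes, so the first chunk still starts at the same offset as in a plain MicroZAP. A sketch with compile-time checks is below. The entry struct and the `MZAP_ENT_LEN` / `MZAP_NAME_LEN` definitions mirror the existing MicroZAP ones; `mz_pad1[5]` is an assumption made here to spell out the padding that the compiler would otherwise insert implicitly before `mz_pad2`, and `MZAP_FLAG_TINY` is taken as bit 0.

```c
#include <stddef.h>
#include <stdint.h>

#define	MZAP_ENT_LEN	64
#define	MZAP_NAME_LEN	(MZAP_ENT_LEN - 8 - 4 - 2)
#define	MZAP_FLAG_TINY	(1 << 0)

typedef struct mzap_ent_phys {
	uint64_t mze_value;
	uint32_t mze_cd;
	uint16_t mze_pad;
	char mze_name[MZAP_NAME_LEN];
} mzap_ent_phys_t;

typedef struct mzap_phys {
	uint64_t mz_block_type;	/* ZBT_MICRO */
	uint64_t mz_salt;
	uint64_t mz_normflags;
	uint8_t mz_flags;
	uint8_t mz_chunk_shift;
	uint8_t mz_value_max;
	uint8_t mz_pad1[5];	/* explicit padding: an assumption */
	uint64_t mz_pad2[4];
	/* actually variable size depending on block size */
	mzap_ent_phys_t mz_chunk[];
} mzap_phys_t;

/* The new byte fields occupy what used to be the first padding word. */
_Static_assert(sizeof (mzap_ent_phys_t) == MZAP_ENT_LEN,
    "MicroZAP entry must stay 64 bytes");
_Static_assert(offsetof(mzap_phys_t, mz_flags) == 24,
    "mz_flags must directly follow mz_normflags");
_Static_assert(offsetof(mzap_phys_t, mz_pad2) == 32,
    "remaining padding words must stay 8-byte aligned");
_Static_assert(offsetof(mzap_phys_t, mz_chunk) == 64,
    "first chunk must stay at offset 64");
```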

Contributor Author

I was initially working on the single-bit approach (1 << 0), but I ran into a lot of issues, which I think were due to reading a corrupt or partially written block, where mz_flags could hold any random/garbage value. That was my reason for going with a 32-bit magic "TINY". Other than that, the geometry config fields in mzap_phys look good to me. What do you think? My only concern is that the detection strength of the flag or magic needs to be strong.

Comment thread include/sys/zap_impl.h Outdated
Comment on lines +65 to +67
* first zap_add() and stamped into mz_flags. No create-time hint is
* required, the ZAP layer autodetects the optimal format based on
* the observed entry geometry (num_integers, strlen(key)).
Member

@robn robn May 22, 2026

I'm torn on the automatic sizing. I actually really really like it, I just wonder if there's places we might make a suboptimal choice?

[thinking it through]

It seems like if that can happen it's going to be on a real filesystem directory, and it's only potentially a problem if you have a mix of filename lengths, and the first one was long enough to immediately promote it to a TinyZAP of some sort, but not enough for a FatZAP. Which means fewer chunks in the directory before upgrade to FatZAP.

But, in a directory of any significant size, FatZAP conversion is already on the table. Anyone who has done tuning for this has likely set zfs_zap_micro_max_size larger to avoid an early FatZAP upgrade. Also, I would suggest (without data) that users with enormous numbers of files in a directory are more likely to have longer filenames anyway, because you tend to be encoding some kind of naming or versioning convention in the filenames in order to actually find things. Or, for the gold standard, all your filenames are actually the same length because it's backing an object store and they're SHA1 hex hashes (40 chars, already MicroZAP territory) or SHA256/512 (64/128), and you're happy for all your filenames to fit into single chunks... until you get that FatZAP upgrade anyway!

Hmm, what about xattr=dir? That I have no intuition about, though we recommend xattr=sa anyway.

It's all single block. As long as it stays there, I don't think anyone much cares. Once it goes to FatZAP, it's all terrible anyway. So I think probably not a problem, and we can revisit it later. And if we do manage to get a conversion path and not just upgrade & downgrade, then it probably won't matter at all.

Contributor Author

Yeah, the upgrade path (micro to tiny to fat) is still evolving a bit. As mentioned in a previous reply, we'd want to keep it more helpful and flexible.
For xattr=sa, I'd say at least 70-80% of our fatzap directories are now tinyzap, which is just really helpful. The fatzap part can't just be avoided.

@akashb-22
Contributor Author

Changes in my latest push:

  1. Support populated microzap to tinyzap upgrade and chunk upgrade (64->128->256) for tinyzap.
  2. Change the implementation of TinyZAPs (by structure). Three independent uint8_t fields (mz_flags(MZAP_FLAG_TINY), mz_chunk_shift (log2(chunk): 6=64B, 7=128B, 8=256B), mz_value_ints (stride / 8))
  3. Fix mzap_normalization_conflict to handle TinyZAP entries.
  4. Added ZFEATURE_FLAG_MOS to disable TinyZAP on the MOS.
  5. Removed tzap_should_promote entirely and all handled in tzap_try_promote.
  6. A flexible array member gives a compiler error ("flexible array member in a struct with no named members"). Changed to tze_data[0] /* zero length array */
  7. Added basic zap testcases covering a few scenarios. (TinyZAP upgrade paths and entries, etc.)
  8. Updated the supported chunk sizes and resulting geometry comments.
  9. Added TZAP_VERIFY_PHYS(__FUNCTION__) for debug on-disk validation
  10. Other review comments and fixes.

Things to be discussed:

Flexible array member error. Changed it to [0]. (With -fsanitize=bounds in debug builds, [0] seems to be the correct choice.)
 zfs/include/sys/zap_impl.h:200:17: error: flexible array member in a struct with no named members
  200 |         uint8_t tze_data[]; /* variable size */
      |                 ^~~~~~~~

@akashb-22 akashb-22 force-pushed the tinyzap_blob2 branch 2 times, most recently from 8147b1a to 5ef112e Compare May 26, 2026 17:41
@amotin
Member

amotin commented May 26, 2026

Sorry if this was already mentioned, but I suppose this feature will be not only read-incompatible but also send/receive-incompatible with older receivers. While I have also been thinking about more efficient ZAP formats for BRT/DDT purposes, a read-incompatible feature means we need to update boot loaders for all OSes and add more feature flags to replication streams.

MicroZAP is limited to 1×uint64 values and 49-char keys; any wider
entry forces a full FatZAP upgrade.  TinyZAP avoids this for the
common case of multi-integer values (e.g. Lustre FIDs) and long keys.

Introduce TinyZAP, a MicroZAP variant that reuses mzap_phys_t, repurposing
the padding bytes after mz_normflags as three independent
uint8_t fields:

  mz_flags        bit 0 = MZAP_FLAG_TINY
  mz_chunk_shift  log2(chunk): 6=64B, 7=128B, 8=256B
  mz_value_ints   stride / 8  (number of uint64 values per entry)

Geometry is stamped automatically on the first zap_add() based on
observed entry shape; no create-time hint is required.  Subsequent
adds must match the stamped geometry or a FatZAP upgrade is triggered.

All ZAP operations (add, update, remove, lookup, cursor, byteswap,
upgrade to FatZAP) dispatch to TinyZAP paths when zap_stride != 0.

Signed-off-by: Akash B <akash-b@hpe.com>