zap: TinyZAP for multi-uint64 entries.#18568
Conversation
|
@behlendorf @robn ^^ Please let me know your thoughts on this. |
e87779e to
ccf54f6
Compare
behlendorf
left a comment
There was a problem hiding this comment.
This is shaping up nicely!
| "Support for variable-stride, variable-chunk ZAP for " | ||
| "multi-integer and long-name directory entries without " | ||
| "FatZAP overhead.", | ||
| 0, ZFEATURE_TYPE_BOOLEAN, NULL, sfeatures); |
There was a problem hiding this comment.
ZFEATURE_FLAG_MOS may need to be set here. If not, then we need to make sure the MOS will not be upgraded to a TinyZAP. It shouldn't be, but really there's nothing preventing it except the upgrade policy.
There was a problem hiding this comment.
Agree on at least ZFEATURE_FLAG_MOS.
Longest single string that could plausibly be a MOS ZAP key is a tie for 39 bytes, 'com.delphix:obsolete_counts_are_precise' and com.delphix:vdev_initialize_action_time.
Of course, we could be constructing keys somewhere (I don't think so). Longest on my pool is 36 bytes, org.zfsonlinux:vdev_trim_action_time
So there's wiggle room, so long as there's no other mystery private ZFS downstream that have some extra long keys in plays.
Should we disable TinyZAP on the MOS entirely? Probably no, since I don't have an argument for it beyond ZAPs being very old and I'm nervous, but I should at least say it.
There was a problem hiding this comment.
(It's slightly fiddly to disable it for only objset 0 without a "never upgrade to TinyZAP" flag or a special case in the upgrade policy code, which both suck, so that's probably a reason against, but idk).
There was a problem hiding this comment.
Yeah, I will include this in the next push.
I think MOS ZAP objects (e.g., the config ZAP and the features ZAP itself) are read before spa_feature_is_active() is reliable? I think adding ZFEATURE_FLAG_MOS would be the right thing, probably, or stay defensive, and I don't have any case to support it being helpful?
| # | ||
| # Copyright (c) 2013, 2014 by Delphix. All rights reserved. | ||
| # Copyright 2016 Nexenta Systems, Inc. All rights reserved. | ||
| # Copyright (c) 2026, Hewlett Packard Enterprise Development LP. |
There was a problem hiding this comment.
Rob's proposed unit test framework in #18564 would be an ideal way to exercise the new TinyZAP code. In addition to basic unit tests (add/remove/lookup) we can verify the various promotion paths behave as intended (MicroZAP -> TinyZAP -> FatZAP, etc).
There was a problem hiding this comment.
Since the full suite isn't upstreamed yet, I'm intending to wire what I have up to this PR and see what falls out. I'll share the results soon.
But also, if this PR lands before the test suite does, I'll be sure to include coverage in the test suite. You get grandfathered in 👴
There was a problem hiding this comment.
Cool, I'll check it out.
I think I'll gradually get the reviews and the required changes for this patch, and probably let's see how things go later.
robn
left a comment
There was a problem hiding this comment.
I'm totally on-board with the idea, but this all seems very convoluted to me.
If I'm understanding all this correctly, is effectively the same as MicroZAP the same as a TinyZAP with chunk=6 (64B) and stride=1 (8B)?
If so, I'd suggest the code would be a lot nicer by actually making the entire implementation by about TinyZAPs (by structure if not by name), and just special case for MicroZAPs: if we don't see MZAP_FLAG_TINY, then use chunk=6, stride=1 and do the extra MZAP_NAME_LEN check in the add and upgrade paths.
If you fold all those checks and math into a small number of macros or inline functions (which you basically already) have, then it seems like this PR should be almost entirely a mechanical conversion, plus the feature flag handling code.
Because of this, my review comments are either small style nits, or design queries that I think would apply regardless of the structure. Whichever way it goes, I'll need another review round.
| "Support for variable-stride, variable-chunk ZAP for " | ||
| "multi-integer and long-name directory entries without " | ||
| "FatZAP overhead.", | ||
| 0, ZFEATURE_TYPE_BOOLEAN, NULL, sfeatures); |
There was a problem hiding this comment.
Agree on at least ZFEATURE_FLAG_MOS.
Longest single string that could plausibly be a MOS ZAP key is a tie for 39 bytes, 'com.delphix:obsolete_counts_are_precise' and com.delphix:vdev_initialize_action_time.
Of course, we could be constructing keys somewhere (I don't think so). Longest on my pool is 36 bytes, org.zfsonlinux:vdev_trim_action_time
So there's wiggle room, so long as there's no other mystery private ZFS downstream that have some extra long keys in plays.
Should we disable TinyZAP on the MOS entirely? Probably no, since I don't have an argument for it beyond ZAPs being very old and I'm nervous, but I should at least say it.
| "Support for variable-stride, variable-chunk ZAP for " | ||
| "multi-integer and long-name directory entries without " | ||
| "FatZAP overhead.", | ||
| 0, ZFEATURE_TYPE_BOOLEAN, NULL, sfeatures); |
There was a problem hiding this comment.
(It's slightly fiddly to disable it for only objset 0 without a "never upgrade to TinyZAP" flag or a special case in the upgrade policy code, which both suck, so that's probably a reason against, but idk).
| * Otherwise the ZAP upgrades directly to FatZAP as before. | ||
| * | ||
| * mz_flags layout: | ||
| * [63:32] 0x54494E59 ("TINY") - TinyZAP magic |
There was a problem hiding this comment.
This is an overkill. We already have mz_block_type == ZBT_MICRO, only one additional bit is needed to indicate additional config is available.
Also, none of these are "flags". There's a few odd names in this PR, but this is the most egregious.
I would do something more like:
#define MZAP_FLAG_TINY (0 << 1)
typedef struct mzap_phys {
uint64_t mz_block_type; /* ZBT_MICRO */
uint64_t mz_salt;
uint64_t mz_normflags;
uint8_t mz_flags;
uint8_t mz_chunk_shift;
uint8_t mz_value_max;
uint8_t mz_pad1;
uint64_t mz_pad2[4];
/* actually variable size depending on block size */
mzap_ent_phys_t mz_chunk[];
} mzap_phys_t;(and adjust byteswap and everything to else to taste, but you get what I mean).
There was a problem hiding this comment.
I was initially working on the single bit approach (1 << 0), but then I had a lot of issues, which I think was due to reading up corrupt block or a partially written block, mz_flags could be any random/garbage value. That was my intent of going with a 32-bit magic "TINY", other than that, the geometry config fields in mzap_phys look good to me. What do you think? My only concern is about the detection strength of the flag or magic needs to be strong.
| * first zap_add() and stamped into mz_flags. No create-time hint is | ||
| * required, the ZAP layer autodetects the optimal format based on | ||
| * the observed entry geometry (num_integers, strlen(key)). |
There was a problem hiding this comment.
I'm torn on the automatic sizing. I actually really really like it, I just wonder if there's places we might make a suboptimal choice?
[thinking it through]
It seems like if that can happen its going to be on a real filesystem directory, and its only potentially a problem if you have a mix of filename lengths, and the first one was long enough to immediately promote it to a TinyZAP of some sort, but not enough for a FatZAP. Which means fewer chunks in the directory before upgrade to FatZAP.
But, in a directory of any significant size, FatZAP conversion is already on the table. Anyone who has done tuning for this has likely set zfs_zap_micro_max_size larger to avoid an early FatZAP upgrade. Also, I would suggest (without data) that users with enormous numbers of files in a directory are more likely to have longer filenames anyway, because you tend to be encoding some kind of naming or versioning convention in the filenames in order to actually find things. Or, for the gold standard, all your filenames are actually the same length because its backing an object store and they're SHA1 hex hashes (40 chars, already MicroZAP territory) or SHA256/512 (64/128), and you're happy for all your filenames to fit into single chunks .. until you get that FatZAP upgrade anyway!
Hmm, what about xattr=dir? That I have no intuition about, though we recommend xattr=sa anyway.
It's all single block. As long as it stays there, I don't think anyone much cares. Once it goes to FatZAP, its all terrible anyway. So I think probably not a problem, and we can revisit it later. And if we do manage to get a conversion path and not just upgrade & downgrade, then it probably won't matter at all.
There was a problem hiding this comment.
Yeah, the upgrade path (micro to tiny to fat) is still a bit evolving. As mentioned in a previous reply, we'd want to keep it more helpful and flexible.
For xattr=sa, I'd say at least 70-80% of our fatzap directories are now tinyzap, which is just really helpful. The fatzap part can't just be avoided.
|
Changes in my latest push:
Things to be discussed:? |
8147b1a to
5ef112e
Compare
|
Sorry if already mentioned, but I suppose this feature will not only be a read-incompatible, but also a send/receive incompatible with older receivers. While I was also thinking about some more efficient ZAP formats for purposes for BRT/DDT, read-incompatible feature means we need to update boot loaders for all OS'es, and add some more feature flags into replication streams. |
MicroZAP is limited to 1×uint64 values and 49-char keys, any wider entry forces a full FatZAP upgrade. TinyZAP avoids this for the common case of multi-integer values (e.g. Lustre FIDs) and long keys. Introduce TinyZAP, a MicroZAP variant reuses mzap_phys_t, repurposing the padding bytes after mz_normflags as three independent uint8_t fields: mz_flags bit 0 = MZAP_FLAG_TINY mz_chunk_shift log2(chunk): 6=64B, 7=128B, 8=256B mz_value_ints stride / 8 (number of uint64 values per entry) Geometry is stamped automatically on the first zap_add() based on observed entry shape. no create-time hint is required. Subsequent adds must match the stamped geometry or a FatZAP upgrade is triggered. All ZAP operations (add, update, remove, lookup, cursor, byteswap, upgrade to FatZAP) dispatch to TinyZAP paths when zap_stride != 0. Signed-off-by: Akash B <akash-b@hpe.com>
Introduce TinyZAP, a new on-disk ZAP format between MicroZAP and FatZAP. MicroZAP is limited to 1xuint64 values and 49-char keys, any wider entry forces a full FatZAP upgrade. TinyZAP avoids this for the common case of multi-integer values (e.g., Lustre FIDs) and long key names.
Signed-off-by: Akash B akash-b@hpe.com
Motivation and Context
This PR introduces TinyZAP, a new on-disk ZAP format that sits between MicroZAP and FatZAP in the ZAP format. TinyZAP extends MicroZAP to efficiently handle multi-word values and long key names without the overhead of a full FatZAP upgrade.
The primary motivation is workloads like Lustre that store multi-integer values (e.g., FIDs:
2-3 x uint64_t) or long filenames in ZAP objects. Previously, these always created a FatZAP, consuming significantly more on-disk space and memory than necessary.ZAP Format Hierarchy (After This Change)
MicroZAP -> TinyZAP -> FatZAP
Description
TinyZAP reuses the existing mzap_phys_t block format. The key change is repurposing one of the five padding words (mz_pad[5] -> mz_pad[4]) as mz_flags:
mz_flags Bit Layout:
When
mz_flags == 0, the block is a plain MicroZAP. When bits[63:32]equal0x54494E59, the TinyZAP layout applies to all chunk slots.Other details on ZAP upgrade conditions:
stride=8withchunk=64is intentionally skipped. It provides only 52 bytes for the name, barely 2 bytes more than MicroZAP's 50, so TinyZAP starts atchunk=128for allstride=8(single-integer, long-name) cases.MicroZAP -> TinyZAP Conditions:
Promotion to TinyZAP is attempted automatically on the first zap_add() when the entry fails the MicroZAP test. All of the following must hold:
TinyZAP -> FatZAP Conditions:
Once a ZAP is promoted to TinyZAP, it stays TinyZAP as long as all subsequent entries match the stamped geometry. A FatZAP upgrade is forced when any of the following occur:
Plain MicroZAP -> FatZAP (Unchanged)
If TinyZAP promotion fails (no fitting chunk, integer_size != 8, ZAP already has entries with a different geometry), the existing MicroZAP -> FatZAP path is taken (unchanged).
others:
SPA Feature Flag:
org.hpe:tinyzapA new pool feature SPA_FEATURE_TINYZAP is introduced (org.hpe:tinyzap):
Not read-only compatible: pools with TinyZAP objects cannot be imported read-only by older ZFS software.
How Has This Been Tested?
Before the patch using Lustre (FatZap):
Performance:
Total space taken by 1.25 million directories: Total Size: 117.7G (98.86 KB/Inode)
After this patch using Lustre (TinyZap):
Total space taken by 1.25 million directories: Total Size: 3.7G (3.09 KB/Inode)
Performance:
These were the summary of the results overall:
For
draid2:9d:12c:1s-0(flash MDT and 4 OSTs):Directory creation improved by +176% - over 2.75x faster, exceeding 100K ops/sec
Directory removal improved by +141% - over 2.4x faster, exceeding 145K ops/sec
Directory stat improved by +97% at peak - nearly 2× faster, approaching 300K ops/sec
Space Efficiency:
Almost 99% reduction in empty directories.
TinyZap (1-2 KB) vs. FatZAP (100-130 KB).
For 1.25 million directories (TinyZap: 3.7G (3.09 KB/Inode) vs. FATZap: 117.7G (98.86 KB/Inode)), ~32x reduction
TODO:
Types of changes
Checklist:
Signed-off-by.