Test misc-next (regular, SELF)#1629

Open
kdave wants to merge 10000 commits into
btrfs:ci-kvmfrom
kdave:misc-next

@kdave kdave commented Apr 17, 2026

Member

No description provided.

@kdave kdave force-pushed the misc-next branch 8 times, most recently from a464848 to 8f84140 Compare April 27, 2026 14:35
@kdave kdave force-pushed the misc-next branch 2 times, most recently from d93f97d to 7d5a51d Compare April 29, 2026 14:08
@kdave kdave force-pushed the misc-next branch 2 times, most recently from 14fb724 to d39211d Compare May 12, 2026 15:26
@kdave kdave force-pushed the misc-next branch 2 times, most recently from 8c55fe0 to 2fddc74 Compare May 16, 2026 01:02
@kdave kdave force-pushed the misc-next branch 10 times, most recently from 014f22d to 6b43c97 Compare May 29, 2026 00:01
@kdave kdave force-pushed the misc-next branch 4 times, most recently from 7263914 to 5342ffb Compare June 8, 2026 13:54
fdmanana and others added 26 commits June 30, 2026 00:11
…nded

The loop intends to copy the data in chunks up to 1M but we allocate the
pages array for the entire length and don't cap it to 1M. Fix this by
computing 'nr_pages' using 'copy_len' instead of 'length'.

While at it, also make 'nr_pages' and 'copy_len' const, as they never
change, to make the code more clear.
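
As a minimal sketch of the sizing described above (not the actual patch),
the page count can be derived from the capped chunk length rather than the
full length. The helper names, the 4K page size and the standalone form
are assumptions made for illustration only:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL            /* assumed page size */
#define SKETCH_MAX_COPY  (1024UL * 1024UL) /* the 1M chunk cap from the text */

static_assert(SKETCH_MAX_COPY % SKETCH_PAGE_SIZE == 0,
	      "chunk cap is a whole number of pages");

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Pages needed for one chunk: computed from copy_len, so capped at 1M. */
static size_t chunk_nr_pages(size_t length)
{
	const size_t copy_len = min_size(length, SKETCH_MAX_COPY);
	const size_t nr_pages = (copy_len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

	return nr_pages;
}

/* The sizing described as the problem: computed from the whole length. */
static size_t uncapped_nr_pages(size_t length)
{
	return (length + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

Under these assumptions a 16M request needs 256 page slots per chunk with
the capped computation, versus 4096 when sized from the full length.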

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In Meta production, we have observed a large number of hosts running
kernels newer than 6.13 which hit hung tasks on
btrfs_read_folio()->lock_extents_for_read(). The history of this codepath
explains why.

In 6.12, we merged
commit ac325fc ("btrfs: do not hold the extent lock for entire read")
which holds the extent lock very narrowly while looking up the
extent_map. However, this proved to introduce a serious race with DIO
writes which was fixed in 6.14 with
commit acc18e1 ("btrfs: fix stale page cache after race between readahead and direct IO write")

That latter fix subtly changed the extent unlock point from the pre-6.12
regime. In 6.11, each read endio unlocked the extent it finished
reading, but in 6.14, the extent is locked/unlocked as a unit around the
entire readahead loop, while the individual folios are still unlocked as
the endios finish. This is mostly the same behavior, as all successful
reads will populate the page cache, so subsequent reads won't enter
btrfs and hit the extent lock. But in the case where the readahead
fails, perhaps because of a memory allocation failure doing compressed
reads, the page will not be brought up to date and a later read of an
overlapping range *will* block on the extent lock.

Why is this a problem?

On sufficiently large loaded systems, I have observed that direct
reclaim can run for minutes. Given that, consider two tasks on such a
system reading an overlapping range of a compressed file:

  Task 1 locks the whole range and starts to read. Some allocation for
  the compressed read for folio F fails and we carry on while holding the
  extent lock for the full range.

  Task 2 wants to read F, which is not uptodate and in page cache, so it
  blocks on the extent lock held by Task 1.

  Task 1 keeps getting stuck in direct reclaim (likely, we already
  supposed an allocation failure above)

  Task 2 stays blocked on the extent lock the whole time.

If you consider the effects of readahead_expand and imagine a file with
a 128k compressed extent followed by many smaller compressed extents,
you can imagine that the expanded window will result in subsequent reads
hitting many extents (128k/4k = 32) per lock window in the worst case.

The system likely wouldn't be all that healthy anyway, so this is
probably not a critical improvement, but it does alleviate this one
source of stress and keeps one thread's slowdown from escalating to others.

To bring this behavior back to the old model, we should unlock the
extent at each loop of the readahead loop rather than in one shot at the
end. This allows such overlapping reads to proceed as they should.
Writes are fine because either the page has already been read and has an
appropriate state in the page cache to be invalidated (or not uptodate)
or it is still-to-be-read and the extent lock is still held protecting
it.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
…ck groups

A swap file on btrfs will pin down block groups that cover the swap file
extent.

Pinned down block groups will be skipped for scrub and relocation.

This degradation of critical btrfs maintenance operations has never been
properly communicated to end users, and has already caused problems
including:

- Scrub finishing too quickly
  The enabled swap file has pinned down most of the block groups, so any
  file extents in those block groups, even ones not used by the swap
  file, are skipped by scrub.

- Unbalanced data and metadata usage that relocation cannot fix
  For the same reason, pinned down block groups are not considered as
  relocation targets, so data extents that are not used by the swap
  file can still be skipped by relocation.

Although we already have kernel messages for both scrub and balance, the
balance one is still info level.

To better communicate those potential long term problems, add the
following output into dmesg:

- Change the message level to warn for __btrfs_balance()

- Total pinned down block group number and size during swapfile activation
- Total released block group number and size during swapfile deactivation
  The above messages have info level.

- The fact that pinned down block groups will not be scrubbed nor
  balanced
  The above message has warning level.

The example output would look like the following, for enabling a 1.2G
swapfile, which pinned down 2G block groups:

 BTRFS info (device dm-3): swapfile activated on root 5 ino 257, pinned down 2147483648 bytes from 2 block group(s)
 BTRFS warning (device dm-3): block groups with swapfile extents will not be scrubbed or balanced
 Adding 1257468k swap on /mnt/btrfs/foobar.  Priority:-1 extents:1 across:1257468k
 BTRFS info (device dm-3): swapfile deactivated on root 5 ino 257, released 2147483648 bytes from 2 block group(s)

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable-sized buffer buf in struct btrfs_ioctl_search_args_v2 is
declared as __u64[], but it holds a packed byte stream of search results,
where all offsets into the buffer are in bytes.

Declaring buf as __u64[] makes it easy for user space to write incorrect
pointer arithmetic: adding a byte offset directly to a __u64 pointer
scales the offset by 8, landing at byte position offset*8 instead of
offset.

This recently caused an infinite loop in btrfs-progs: the accessor read
all-zero data from misaddressed items, which fed zeroed search keys back
into the ioctl loop and spun forever. The issue was worked around at the
time by disabling TREE_SEARCH_V2 entirely in btrfs-progs (d73e69824854:
"btrfs-progs: temporarily disable usage of v2 of search tree ioctl").

The kernel side already treats buf as a byte buffer, so change the
declaration to __u8[] to match the actual semantics and prevent similar
misuse in user space. The change is ABI compatible: both the structure size
and alignment are unchanged.
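
The pointer arithmetic pitfall and the ABI claim can be illustrated with a
small standalone sketch. The two structs below are simplified stand-ins
(the real structure carries a search key and other fields), and the helper
names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the old and new declarations of buf. */
struct sketch_args_u64 {
	uint64_t buf_size;
	uint64_t buf[];		/* old: element type is 8 bytes wide */
};

struct sketch_args_u8 {
	uint64_t buf_size;
	uint8_t buf[];		/* new: element type is 1 byte wide */
};

/* The flexible array member type changes neither size nor alignment here. */
static_assert(sizeof(struct sketch_args_u64) == sizeof(struct sketch_args_u8),
	      "structure size unchanged");
static_assert(_Alignof(struct sketch_args_u64) == _Alignof(struct sketch_args_u8),
	      "structure alignment unchanged");

/* Byte position reached when a byte offset is added to a u64 pointer. */
static size_t byte_pos_u64(const uint64_t *buf, size_t off)
{
	return (size_t)((const unsigned char *)(buf + off) -
			(const unsigned char *)buf);
}

/* Byte position reached when the same offset is added to a u8 pointer. */
static size_t byte_pos_u8(const uint8_t *buf, size_t off)
{
	return (size_t)((buf + off) - buf);
}
```

With a backing buffer of at least 32 bytes, an offset of 4 lands at byte 32
through the u64 pointer and at byte 4 through the u8 pointer, which is the
scaling by 8 described above.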

Fixes: cc68a8a ("btrfs: new ioctl TREE_SEARCH_V2")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: You-Kai Zheng <ykzheng@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside btrfs we always pair -EUCLEAN error with an error message to
indicate which data is corrupted.

However there are 3 cases inside lzo decompression where there is no
error message for corrupted headers.

Add those missing error messages to show exactly where the corruption
is.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
A crafted btrfs image can trigger the following crash:

  BUG: unable to handle page fault for address: ffffd1dc42884000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  CPU: 9 UID: 0 PID: 1034 Comm: poc Not tainted 7.1.0-rc4-custom+ #383 PREEMPT(full)  46af0a92938a63be7132e0dfd71e62327c51d5c2
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
  RIP: 0010:memcpy+0xc/0x10
  Call Trace:
   <TASK>
   read_extent_buffer+0xe4/0x100 [btrfs 3cf0785dd58fec8c5ff84633b772f17ce1f92a8f]
   btrfs_get_name+0x15e/0x1e0 [btrfs 3cf0785dd58fec8c5ff84633b772f17ce1f92a8f]
   reconnect_path+0x165/0x390
   exportfs_decode_fh_raw+0x337/0x400
   ? drop_caches_sysctl_handler+0xb0/0xb0
   </TASK>
  ---[ end trace 0000000000000000 ]---
  RIP: 0010:memcpy+0xc/0x10
  Kernel panic - not syncing: Fatal exception

[CAUSE]
The crafted image has the following corrupted INODE_REF item:

         item 9 key (258 INODE_REF 257) itemoff 11544 itemsize 4106
         	index 2 namelen 4096 name: d\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000

The itemsize matches the namelen, but the namelen is 4096, far larger
than the normal name length limit (BTRFS_NAME_LEN, 255).

Meanwhile the buffer for @name is only 255 bytes, so this causes an
out-of-bounds access and the above crash.

[FIX]
Add extra namelen verification for INODE_REF, just like what we have
done in ROOT_REF checks.

Now the crafted image can be rejected gracefully:

 BTRFS critical (device dm-2): corrupt leaf: root=5 block=30572544 slot=14 ino=259, invalid inode ref name length, has 4096 expect [1, 255]
 BTRFS error (device dm-2): read time tree block corruption detected on logical 30572544 mirror 2
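
The range check implied by the "expect [1, 255]" message can be sketched as
below. This is only an illustration, not the tree-checker code: the function
name is hypothetical, it validates a single name length in isolation, and
EUCLEAN is given a fallback definition for non-Linux builds:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_NAME_LEN 255	/* BTRFS_NAME_LEN, per the text above */

#ifndef EUCLEAN
#define EUCLEAN 117		/* Linux value, fallback for other systems */
#endif

static_assert(SKETCH_NAME_LEN <= UINT16_MAX, "limit fits a 16-bit name length");

/* Return 0 if the name length is within [1, 255], -EUCLEAN otherwise. */
static int check_inode_ref_namelen(uint16_t namelen)
{
	if (namelen < 1 || namelen > SKETCH_NAME_LEN)
		return -EUCLEAN;
	return 0;
}
```

With this check the namelen of 4096 from the crafted item is rejected before
any copy into a 255 byte name buffer can happen.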

Reported-by: Xiang Mei <xmei5@asu.edu>
Link: https://lore.kernel.org/linux-btrfs/aik0hEV6ehKx6Ldv@Air.local/
Acked-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
[ Rebase, add a Link: tag, add a simple cause analysis ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
V2 space cache has been the default mkfs option since btrfs-progs v5.15,
and commit 1e7bec1 ("btrfs: emit a warning about space cache v1
being deprecated") has already added a warning to show v1 space cache
has been deprecated.

It has been long enough that we should remove v1 space cache completely.

As the first step, disable v1 space cache by:

- Make "space_cache" mount option fall back to "nospace_cache"

- Make "space_cache=v1" fall back to "nospace_cache"

  This is safer than forcing "space_cache=v2", as forcing v2 cache
  requires removal of v1 cache and regenerating v2 cache.
  Such operation can be slow, and takes extra metadata space, thus
  it is not always safe for existing filesystems.

With this done, a v1 cache mount will always fall back to no space cache,
and mount options will not be able to force v1 space cache usage.
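
The resulting option mapping can be summarized with a small sketch. This is
not how the kernel parses mount options; the enum, the function name and the
plain string comparison are assumptions used only to show which requested
option ends up as which effective mode:

```c
#include <assert.h>
#include <string.h>

enum sketch_cache_mode {
	SKETCH_NOSPACE_CACHE,
	SKETCH_SPACE_CACHE_V2,
	SKETCH_CACHE_UNKNOWN,	/* not a space cache option at all */
};

/* Map a requested space cache option to the mode that takes effect. */
static enum sketch_cache_mode resolve_space_cache(const char *opt)
{
	assert(opt != NULL);

	if (strcmp(opt, "space_cache=v2") == 0)
		return SKETCH_SPACE_CACHE_V2;
	if (strcmp(opt, "nospace_cache") == 0)
		return SKETCH_NOSPACE_CACHE;
	/* Both spellings that used to select v1 now fall back. */
	if (strcmp(opt, "space_cache") == 0 ||
	    strcmp(opt, "space_cache=v1") == 0)
		return SKETCH_NOSPACE_CACHE;
	return SKETCH_CACHE_UNKNOWN;
}
```

Under this mapping neither "space_cache" nor "space_cache=v1" can select the
v1 cache, while "space_cache=v2" keeps working.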

For example, even for a fs with v1 cache:

  # btrfs ins dump-super test.img
  superblock: bytenr=65536, device=test.img
  ---------------------------------------------------------
  csum_type		0 (crc32c)
  csum_size		4
  csum			0xdce44b2c [match]
  bytenr		65536
  flags			0x1
  			( WRITTEN )
  magic			_BHRfS_M [match]
  fsid			7d7c3bba-8211-4206-868d-10eedd5703f8
  metadata_uuid		00000000-0000-0000-0000-000000000000
  label
  generation		9
  root			30605312
  [...]
  compat_ro_flags	0x0                     <<< No FST feature
  incompat_flags	0x361
  			( MIXED_BACKREF |
  			  BIG_METADATA |
  			  EXTENDED_IREF |
  			  SKINNY_METADATA |
  			  NO_HOLES )
  cache_generation	9                       <<< Matches generation
  uuid_tree_generation	9

Attempting to mount it results in no space cache rather than the v1 space cache:

  # mount test.img /mnt/btrfs
  # dmesg -t | tail -n 5
  BTRFS: device fsid 7d7c3bba-8211-4206-868d-10eedd5703f8 devid 1 transid 9 /dev/loop0 (7:0) scanned by mount (1264)
  BTRFS info (device loop0): first mount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8
  BTRFS info (device loop0): using crc32c checksum algorithm
  BTRFS info (device loop0): turning on async discard
  BTRFS info (device loop0): last unmount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8

Even forcing v1 cache will not work, it falls back to the usual
nospace_cache:

  # mount test.img -o space_cache=v1 /mnt/btrfs
  # dmesg -t | tail -n 6
  BTRFS warning: v1 space cache is deprecated, fallback to no space cache
  BTRFS: device fsid 7d7c3bba-8211-4206-868d-10eedd5703f8 devid 1 transid 9 /dev/loop0 (7:0) scanned by mount (1264)
  BTRFS info (device loop0): first mount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8
  BTRFS info (device loop0): using crc32c checksum algorithm
  BTRFS info (device loop0): turning on async discard
  BTRFS info (device loop0): last unmount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8

And there is no way to force converting a v2 cache back to v1, such an
attempt will only clear the free space tree and fall back to no space cache.

  # mkfs.btrfs -f -O fst,^bgt test.img
  # mount -o clear_cache,space_cache=v1 test.img /mnt/btrfs
  # dmesg -t | tail -n 11
  BTRFS warning: v1 space cache is deprecated, fallback to no space cache
  BTRFS: device fsid f59daad2-3ab5-4f33-b752-a36cfb09b674 devid 1 transid 8 /dev/loop0 (7:0) scanned by mount (1419)
  BTRFS info (device loop0): first mount of filesystem f59daad2-3ab5-4f33-b752-a36cfb09b674
  BTRFS info (device loop0): using crc32c checksum algorithm
  BTRFS info (device loop0): rebuilding free space tree
  BTRFS info (device loop0): disabling free space tree
  BTRFS info (device loop0): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
  BTRFS info (device loop0): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
  BTRFS info (device loop0): checking UUID tree
  BTRFS info (device loop0): turning on async discard
  BTRFS info (device loop0): force clearing of disk cache
  # mount | grep /mnt/btrfs
  /mnt/test.img on /mnt/btrfs type btrfs (rw,relatime,discard=async,nospace_cache,subvolid=5,subvol=/)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit bac3c29 ("btrfs: remove 2K block size support") there
is no 2K block size support inside btrfs anymore.

Remove the stale comments of btrfs_supported_blocksize().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since v5.15 btrfs has support for block size < page size, but we still
only support 4K block size, while there is no special reason that we
cannot support 8K/16K/32K block sizes for 64K page size.

That 4K limit is completely arbitrary, and mostly to reduce test runtime
so we do not need to test all the extra block size combinations.

However that also limits user choice: some users may understand what
they are doing and want larger block sizes.  In that case the fixed
4K block size for the subpage routine is in the way.

Just remove that fixed 4K requirement for block size < page size.

This should not affect regular end users, since mkfs is already using 4K
block size as default for quite a while, and the existing bs == ps support is
always there.

But for power users, this allows extra block size support, and may
provide extra test coverage.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Decentralize transaction aborts in create_reloc_root(), so that it is
obvious which call failed and what caused the transaction abort.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When dumping a tree block, btrfs_header::owner is printed as
unsigned, which can result in numbers that are hard to read, e.g.:

  BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607

For the above output, 18446744073709551607 is (s64)-9, the root id of data
reloc tree.

Despite those predefined root ids that are already negative, existing
subvolume trees will not have any negative values, as subvolume trees can
only utilize the lower 48 bits, so there will be no output change for
existing subvolumes, thus no extra confusion.
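
The conversion can be checked with a short standalone sketch. The helper
names are hypothetical, and the cast relies on two's complement
representation, which holds on the platforms Linux supports:

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Reinterpret a 64-bit owner (root id) as a signed value. */
static int64_t owner_as_signed(uint64_t owner)
{
	return (int64_t)owner;
}

/* Format the owner as a signed decimal, returning the string length. */
static int format_owner(char *out, size_t len, uint64_t owner)
{
	assert(out != NULL && len > 0);
	return snprintf(out, len, "%" PRId64, owner_as_signed(owner));
}
```

Passing 18446744073709551607 yields "-9", while a small id such as 256 is
formatted the same way as before.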

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…erge

On a zoned FS, btrfs_delayed_refs_rsv_refill() returns -EAGAIN whenever
the over-committed metadata plus the zone_unusable bytes exceeds the
usable size in a metadata block-group to avoid heavy over-commit of
metadata and early ENOSPC in one transaction.

If this happens while doing reclaim, the transaction gets aborted.

Treat -EAGAIN as a soft, retryable condition in case of block-group
reclaim.

Reported-by: Damien Le Moal <dlemoal@kernel.org>
Fixes: 7bcb04d ("btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment is wrong: it's not about storing the ID of new directories
that were already created, it's about storing utimes values for
directories (both new and existing). It was copy pasted from
SEND_MAX_DIR_CREATED_CACHE_SIZE and never updated afterwards.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…mon prefixes

In case the current inode's path is a prefix of the given path, the helper
is_current_inode_path() will return true, which causes the single caller
to reset the current inode's path. While this is not a functional issue,
it makes the caller recompute the current inode's path later. It could
also become a problem in the future in case we get new callers for
is_current_inode_path() in more sensitive contexts.

Example: the current inode path is "/foo/bar" and the path we compare
against is "/foo/bar_xyz".

Fix this by returning true only if we have exact matches.
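
The difference between the two comparisons can be sketched as follows. The
function names are hypothetical and the paths are plain C strings here,
which is an assumption about representation made only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Prefix-only comparison: the behavior described as the problem. */
static bool path_matches_prefix(const char *cur, const char *path)
{
	assert(cur != NULL && path != NULL);
	return strncmp(cur, path, strlen(cur)) == 0;
}

/* Exact comparison: lengths must match before the bytes are compared. */
static bool path_matches_exact(const char *cur, const char *path)
{
	assert(cur != NULL && path != NULL);

	const size_t cur_len = strlen(cur);

	return cur_len == strlen(path) && memcmp(cur, path, cur_len) == 0;
}
```

For "/foo/bar" against "/foo/bar_xyz" the prefix variant reports a match and
the exact variant does not.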

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a syzbot report that the check inside get_new_location()
triggered:

  BTRFS info (device loop0): found 31 extents, stage: move data extents
  BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607
         item 0 key (256 INODE_ITEM 0) itemoff 3835 itemsize 160
                 inode generation 5 transid 0 size 0 nbytes 0
                 block group 0 mode 40755 links 1 uid 0 gid 0
                 rdev 0 sequence 0 flags 0x0
                 atime 1669132761.0
                 ctime 1669132761.0
                 mtime 1669132761.0
                 otime 0.0
         item 1 key (256 INODE_REF 256) itemoff 3823 itemsize 12
                 index 0 name_len 2
         item 2 key (258 INODE_ITEM 0) itemoff 3663 itemsize 160
                 inode generation 1 transid 16 size 733184 nbytes 106496
                 block group 0 mode 100600 links 0 uid 0 gid 0
                 rdev 0 sequence 24 flags 0x18
         item 3 key (258 EXTENT_DATA 0) itemoff 3595 itemsize 68
                 generation 16 type 0
                 inline extent data size 47 ram_bytes 4096 compression 1
  [...]
         item 27 key (18446744073709551611 ORPHAN_ITEM 258) itemoff 2376 itemsize 0
  BTRFS error (device loop0): unexpected non-zero offset in file extent item for data reloc inode 258 key offset 0 offset 9277520992061368337
  ------------[ cut here ]------------
  btrfs_abort_should_print_stack(__error)

[CAUSE]
The above dump tree shows the first file extent item is inlined, which
should make no sense for data reloc inodes, as such inodes just
represent where the data extents are in the relocation destination chunk.

However the relocation path preallocates space for each block,
then dirties them, cluster by cluster.
It's possible to have a single block at the beginning of the block
group, and no other block in the same cluster.

So relocation will preallocate a file extent for that block and dirty
the first block.  Then memory pressure forces the data reloc inode to be
written back, before any other blocks are dirtied/allocated.

Finally commit 3eaf5f0 ("btrfs: extract inlined creation into a dedicated
delalloc helper") changed the sequence of delalloc. Before that commit we
always tried NOCOW first, so that dirtied block would be written back into
the preallocated space, and appear as a regular extent.

But with that commit, we always try inline first, and since compression
is forced, we try compressing the first block, and then inline the
compressed data, resulting in the above inlined file extent in the data
reloc tree.

Then the check in get_new_location() will check the file offset, without
checking if the file extent is inlined or not, resulting in the above
failure.

[FIX]
Do not allow compression for data reloc inodes.

Since data reloc inode sizes are always block aligned, as long as we do
not compress, @data_len will always be at least one block, and
that will cause can_cow_file_range_inline() to return false, thus no
inlined extent will be created.

Reported-by: syzbot+d950c6ba09b79f6e1864@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a373dc5.764cf64f.168fbe.0001.GAE@google.com/
Fixes: 3eaf5f0 ("btrfs: extract inlined creation into a dedicated delalloc helper")
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit a6908f8 ("btrfs: validate data reloc tree file extent item
members") introduced extra checks on file extent items for data reloc
inodes, but it checked the file extent offset without checking if the file
extent is inlined.

This can lead to either false alerts (as the offset member is inside the
inlined data) or even reading beyond the item range.

This has already triggered a warning in a syzbot report.
Although the root fix is to avoid compression for data reloc inodes, for
the sake of consistency, reject inlined file extents first.
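
The ordering of the checks can be sketched as below. The struct is a
simplified stand-in and the function name is hypothetical. The point is only
that the type is tested before the offset member is interpreted, since for
an inline extent that position holds inline data:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, fallback for other systems */
#endif

enum {
	SKETCH_EXTENT_INLINE = 0,	/* matches "type 0" in the dump above */
	SKETCH_EXTENT_REG = 1,
};

/* Simplified stand-in: only the members this sketch looks at. */
struct sketch_file_extent {
	uint8_t type;
	uint64_t offset;	/* meaningful for non-inline extents only */
};

static int check_data_reloc_extent(const struct sketch_file_extent *fi)
{
	assert(fi != NULL);

	/* Reject inline extents first, before touching the offset. */
	if (fi->type == SKETCH_EXTENT_INLINE)
		return -EUCLEAN;
	/* Only now is it valid to look at the offset member. */
	if (fi->offset != 0)
		return -EUCLEAN;
	return 0;
}
```

Both failure cases return the same error in this sketch; a real check would
also print a message saying which condition was hit.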

Fixes: a6908f8 ("btrfs: validate data reloc tree file extent item members")
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The nodesize and sectorsize are all u32 values, there is no need to use
u64 for local usage.

Furthermore some call sites also use "blocksize" or "bs" for sectorsize,
also change them to use the minimal type u32 instead.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs does not support variable stripe length yet, all RAID0/5/6/10
chunks have the fixed stripe length 64K for now.

Furthermore, btrfs_fs_info::stripesize is not the real chunk stripe
length, it's always the same value as sectorsize.

Remove btrfs_fs_info::stripesize, and for the only callsite utilizing
that member, replace it with fs_info->sectorsize instead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…etattr()

btrfs_getattr() unconditionally reads BTRFS_I(inode)->new_delalloc_bytes
and adds it (sector-aligned) to stat->blocks for every inode type.
However, new_delalloc_bytes lives in a union with last_dir_index_offset:

    union {
        u64 new_delalloc_bytes;     /* files only */
        u64 last_dir_index_offset;  /* directories only */
    };

For a directory inode this memory holds last_dir_index_offset, which is
set during directory logging (e.g. flush_dir_items_batch()) to the
offset of the last logged BTRFS_DIR_INDEX_KEY.  That offset grows with
the number of entries ever created in the directory (dir indexes are
monotonic and never reused), so it can be arbitrarily large.

As a result, after a directory has been logged (e.g. via an fsync that
triggers directory logging), btrfs_getattr() reports inflated st_blocks
for that directory.  The inflation is purely in-core and disappears
after the inode is evicted and reloaded (btrfs_alloc_inode() zeroes the
union), e.g. after a remount.

Reproducer (on a btrfs filesystem):

    D=/mnt/btrfs/d
    mkdir -p $D
    for i in $(seq 1 20000); do touch $D/f$i; done
    sync                      # commit, push dir index high
    touch $D/trigger          # dirty the dir in a new transaction
    xfs_io -c fsync $D        # log the directory -> sets last_dir_index_offset
    stat -c '%b' $D           # st_blocks is now inflated (e.g. 40)
    # umount + mount -> st_blocks drops back to the correct value

The evict path already knows this union is type-dependent and guards the
corresponding WARN_ON with !S_ISDIR() in btrfs_destroy_inode(); only
btrfs_getattr() was missing the equivalent check.

Only read new_delalloc_bytes for regular files, which are the only
inodes that ever set it.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…tion

While running fsstress with autodefrag and flushoncommit, hit a deadlock
due to the fact that defrag reserves delalloc space while it's holding
dirty and locked folios, besides the extent range lock. The stack traces
are the following:

   [958.624] task:kworker/u50:3   state:D stack:0     pid:20365 tgid:20365 ppid:2      task_flags:0x4208060 flags:0x00080000
   [958.626] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
   [958.627] Call Trace:
   [958.628]  <TASK>
   [958.628]  __schedule+0x4be/0x10f0
   [958.629]  ? preempt_count_add+0x69/0xa0
   [958.630]  schedule+0x26/0xd0
   [958.631]  wait_current_trans+0x102/0x160 [btrfs]
   [958.632]  ? __pfx_autoremove_wake_function+0x10/0x10
   [958.633]  start_transaction+0x374/0x900 [btrfs]
   [958.634]  btrfs_commit_current_transaction+0x1d/0x70 [btrfs]
   [958.635]  flush_space+0xca/0x5e0 [btrfs]
   [958.636]  ? _raw_spin_unlock+0x15/0x30
   [958.637]  ? btrfs_reduce_alloc_profile+0x8c/0x190 [btrfs]
   [958.639]  ? _raw_spin_unlock+0x15/0x30
   [958.640]  ? calc_available_free_space.isra.0+0x6f/0x110 [btrfs]
   [958.641]  do_async_reclaim_metadata_space+0x84/0x190 [btrfs]
   [958.642]  btrfs_async_reclaim_metadata_space+0x64/0x80 [btrfs]
   [958.644]  process_one_work+0x19d/0x3a0
   [958.644]  worker_thread+0x1c4/0x330
   [958.645]  ? __pfx_worker_thread+0x10/0x10
   [958.646]  kthread+0xfc/0x130
   [958.647]  ? __pfx_kthread+0x10/0x10
   [958.648]  ret_from_fork+0x1f7/0x2c0
   [958.648]  ? __pfx_kthread+0x10/0x10
   [958.649]  ret_from_fork_asm+0x1a/0x30
   [958.650]  </TASK>
   [958.651] task:kworker/u49:7   state:D stack:0     pid:52990 tgid:52990 ppid:2      task_flags:0x4208060 flags:0x00080000
   [958.653] Workqueue: writeback wb_workfn (flush-btrfs-334)
   [958.655] Call Trace:
   [958.655]  <TASK>
   [958.656]  __schedule+0x4be/0x10f0
   [958.657]  ? __blk_flush_plug+0xe9/0x140
   [958.658]  schedule+0x26/0xd0
   [958.658]  io_schedule+0x42/0x70
   [958.659]  folio_wait_bit_common+0x12b/0x330
   [958.660]  ? folio_wait_bit_common+0x100/0x330
   [958.662]  ? __pfx_wake_page_function+0x10/0x10
   [958.663]  extent_write_cache_pages+0x599/0x830 [btrfs]
   [958.664]  ? acpi_fwnode_get_reference_args+0x1fa/0x270
   [958.665]  btrfs_writepages+0x77/0x130 [btrfs]
   [958.666]  ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   [958.667]  do_writepages+0xc6/0x160
   [958.668]  __writeback_single_inode+0x42/0x310
   [958.669]  writeback_sb_inodes+0x231/0x570
   [958.670]  wb_writeback+0x8a/0x340
   [958.671]  wb_workfn+0xbf/0x450
   [958.672]  ? finish_task_switch.isra.0+0xc1/0x350
   [958.673]  process_one_work+0x19d/0x3a0
   [958.673]  worker_thread+0x1c4/0x330
   [958.674]  ? __pfx_worker_thread+0x10/0x10
   [958.675]  kthread+0xfc/0x130
   [958.676]  ? __pfx_kthread+0x10/0x10
   [958.676]  ret_from_fork+0x1f7/0x2c0
   [958.677]  ? __pfx_kthread+0x10/0x10
   [958.678]  ret_from_fork_asm+0x1a/0x30
   [958.679]  </TASK>
   [958.679] task:btrfs-cleaner   state:D stack:0     pid:296750 tgid:296750 ppid:2      task_flags:0x208040 flags:0x00080000
   [958.681] Call Trace:
   [958.682]  <TASK>
   [958.682]  __schedule+0x4be/0x10f0
   [958.683]  schedule+0x26/0xd0
   [958.684]  handle_reserve_ticket+0x1b9/0x2c0 [btrfs]
   [958.685]  ? __pfx_autoremove_wake_function+0x10/0x10
   [958.686]  reserve_bytes+0x283/0x4c0 [btrfs]
   [958.687]  btrfs_reserve_metadata_bytes+0x18/0xb0 [btrfs]
   [958.688]  btrfs_delalloc_reserve_metadata+0x121/0x320 [btrfs]
   [958.690]  btrfs_delalloc_reserve_space+0x46/0xb0 [btrfs]
   [958.691]  btrfs_defrag_file+0x903/0x1110 [btrfs]
   [958.692]  btrfs_run_defrag_inodes+0x334/0x430 [btrfs]
   [958.694]  cleaner_kthread+0x97/0x1c0 [btrfs]
   [958.694]  ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
   [958.696]  kthread+0xfc/0x130
   [958.696]  ? __pfx_kthread+0x10/0x10
   [958.697]  ret_from_fork+0x1f7/0x2c0
   [958.698]  ? __pfx_kthread+0x10/0x10
   [958.699]  ret_from_fork_asm+0x1a/0x30
   [958.700]  </TASK>
   [958.716] task:fsstress        state:D stack:0     pid:296769 tgid:296769 ppid:296768 task_flags:0x400140 flags:0x00080000
   [958.718] Call Trace:
   [958.719]  <TASK>
   [958.719]  __schedule+0x4be/0x10f0
   [958.720]  ? preempt_count_add+0x69/0xa0
   [958.721]  schedule+0x26/0xd0
   [958.722]  wb_wait_for_completion+0x79/0xc0
   [958.723]  ? __pfx_autoremove_wake_function+0x10/0x10
   [958.724]  __writeback_inodes_sb_nr+0xc5/0xf0
   [958.725]  try_to_writeback_inodes_sb+0x55/0x70
   [958.726]  btrfs_commit_transaction+0x19d/0xeb0 [btrfs]
   [958.727]  ? start_transaction+0x343/0x900 [btrfs]
   [958.728]  btrfs_mksubvol+0x28b/0x4e0 [btrfs]
   [958.729]  btrfs_mksnapshot+0x74/0xa0 [btrfs]
   [958.730]  __btrfs_ioctl_snap_create+0x194/0x210 [btrfs]
   [958.732]  btrfs_ioctl_snap_create_v2+0xef/0x150 [btrfs]
   [958.733]  btrfs_ioctl+0x7ec/0x2a70 [btrfs]
   [958.734]  ? __virt_addr_valid+0xe4/0x180
   [958.735]  ? __check_object_size+0x1cd/0x1f0
   [958.736]  ? kmem_cache_free+0x146/0x380
   [958.737]  ? _raw_spin_unlock+0x15/0x30
   [958.738]  ? do_sys_openat2+0x83/0xd0
   [958.739]  __x64_sys_ioctl+0x92/0xe0
   [958.740]  do_syscall_64+0x60/0x590
   [958.741]  ? clear_bhb_loop+0x60/0xb0
   [958.742]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [958.743] RIP: 0033:0x7f4431e108db
   [958.744] RSP: 002b:00007ffcd147db20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   [958.746] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f4431e108db
   [958.747] RDX: 00007ffcd147eb90 RSI: 0000000050009417 RDI: 0000000000000005
   [958.749] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
   [958.751] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffcd147fbf0
   [958.752] R13: 00007ffcd147eb90 R14: 0000000000000005 R15: 0000000000000003
   [958.754]  </TASK>

What happens is the following:

1) The cleaner kthread is running autodefrag, and in defrag_one_range()
   it acquired all the folios for the range and locked them.

   Then it locked the extent range in the inode's iotree.

   It got two subranges from defrag_collect_targets(), the first one
   with folio A and the second one with folio B.

   After it defragged the first subrange, folio A remains locked and
   dirty - it's only unlocked when defrag_one_range() returns.

   When it attempts to defrag the second subrange (containing folio B),
   btrfs_delalloc_reserve_space() creates a space reservation ticket
   due to lack of free metadata space, and blocks waiting for the async
   metadata reclaim task to free space and wake it up;

2) The async reclaim metadata task attempts to commit the current
   transaction, but it blocks because there is another task that
   started the commit first;

3) A task creating a snapshot is committing the transaction and
   because the fs was mounted with flushoncommit, it calls
   try_to_writeback_inodes_sb(), which spawns a task to flush
   delalloc and waits for it to complete;

4) The task flushing delalloc (kworker/u49:7) finds that folio A for
   the inode being defragged is dirty, so it tries to lock it...

   But it blocks because folio A is locked by the defrag task (the
   cleaner kthread) which is blocked waiting for the reservation
   ticket to be served, but the async reclaim metadata task is
   blocked waiting for the transaction commit, which in turn is
   blocked waiting for the delalloc flush task, which is trying to
   lock folio A, resulting in a deadlock.
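
The chain of waits in steps 1) to 4) can be viewed as a wait-for graph
with a cycle. The following is a minimal userspace sketch of that view,
not btrfs code: the task labels, the arrays and has_deadlock() are all
made up for illustration, and each task is assumed to wait on at most
one other task.

```c
#include <stdbool.h>

/* Hypothetical labels for the four tasks in the scenario above. */
enum { CLEANER, ASYNC_RECLAIM, SNAPSHOT_COMMIT, DELALLOC_FLUSH, NR_TASKS };
#define NOBODY (-1)

/*
 * waits_for[t] is the task that t is blocked on, or NOBODY.
 * Follow the chain from 'start'; if we come back to it, there is a cycle.
 */
static bool has_deadlock(const int waits_for[NR_TASKS], int start)
{
	int cur = start;

	for (int steps = 0; steps < NR_TASKS; steps++) {
		cur = waits_for[cur];
		if (cur == NOBODY)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}

/* The situation described above. */
static const int buggy_waits[NR_TASKS] = {
	[CLEANER]         = ASYNC_RECLAIM,   /* holds folio A, waits for its ticket */
	[ASYNC_RECLAIM]   = SNAPSHOT_COMMIT, /* waits for the running commit */
	[SNAPSHOT_COMMIT] = DELALLOC_FLUSH,  /* flushoncommit: waits for the flush */
	[DELALLOC_FLUSH]  = CLEANER,         /* wants the lock on folio A */
};

/* Same chain, but the flusher does not find folio A locked. */
static const int unblocked_waits[NR_TASKS] = {
	[CLEANER]         = ASYNC_RECLAIM,
	[ASYNC_RECLAIM]   = SNAPSHOT_COMMIT,
	[SNAPSHOT_COMMIT] = DELALLOC_FLUSH,
	[DELALLOC_FLUSH]  = NOBODY,
};
```

In this model has_deadlock(buggy_waits, CLEANER) is true, while the same
call on unblocked_waits is false: removing the single edge from the
flusher back to the cleaner is enough to break the cycle.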

The same type of problem can happen if the async reclaim task starts to
flush delalloc, as that requires both locking the folio and the extent
range in the inode's io tree, and in this case we don't need the fs to
be mounted with flushoncommit. This type of problem has occurred several
times in the past with reflinks for example, where we had a dirty folio
while holding the extent range locked and then starting a transaction
blocked waiting for the async reclaim task due to lack of free metadata
space.

So fix this by reserving delalloc space before locking folios and locking
the extent range in the inode's iotree. We cannot simply unlock the
folios for each subrange given by defrag_collect_targets() after we defrag
it, because the same folio may also be present in the next subrange (due
to large folios).
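
The ordering described above (reserve first, lock afterwards) can be
sketched in userspace as follows. This is only a model under stated
assumptions: struct range_state, reserve_space() and
defrag_range_sketch() are hypothetical stand-ins and do not correspond
to the actual btrfs functions or their signatures.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical state for one defrag range; not a btrfs structure. */
struct range_state {
	bool reserved;
	bool folios_locked;
	bool extent_locked;
	bool reserved_under_locks; /* true if we reserved while holding locks */
};

/* Stand-in for a reservation that may sleep waiting for space. */
static bool reserve_space(struct range_state *s, bool space_available)
{
	if (s->folios_locked || s->extent_locked)
		s->reserved_under_locks = true;
	s->reserved = space_available;
	return s->reserved;
}

static int defrag_range_sketch(struct range_state *s, bool space_available)
{
	/* 1) Reserve up front, while holding no folio or extent locks. */
	if (!reserve_space(s, space_available))
		return -ENOSPC; /* nothing was locked, so nothing to undo */

	/* 2) Only now lock the folios and the extent range. */
	s->folios_locked = true;
	s->extent_locked = true;

	/* 3) ... defrag every subrange using the reservation ... */

	/* 4) Unlock in reverse order; the model drops the reservation here. */
	s->extent_locked = false;
	s->folios_locked = false;
	s->reserved = false;
	return 0;
}
```

The point of the sketch is that the only step that can block on the
async reclaim task runs before any folio is locked, so a delalloc flush
never finds a folio held by a task that is waiting for a ticket.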

Fixes: 22b398e ("btrfs: defrag: introduce helper to defrag a contiguous prepared range")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Syzbot reported the following warning recently:

   [157.672][ T6611] BTRFS info (device loop0): turning on flush-on-commit
   [157.672][ T6611] BTRFS info (device loop0): enabling free space tree
   [157.672][ T6611] BTRFS info (device loop0): enabling auto defrag
   [157.672][ T6611] BTRFS info (device loop0): use lzo compression, level 1
   [157.672][ T6611] BTRFS info (device loop0): max_inline set to 4096
   [158.094][ T5608] BTRFS info (device loop2): last unmount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
   [160.073][ T6656] BTRFS info (device loop0 state M): max_inline set to 4096
   [160.418][ T5611] BTRFS info (device loop0): last unmount of filesystem ab8108e1-bea5-4a9f-94c9-a3ff208d732a
   [160.432][ T6662] loop2: detected capacity change from 0 to 32768
   [160.438][ T6662] BTRFS: device fsid c9fe44da-de57-406a-8241-57ec7d4412cf devid 1 transid 8 /dev/loop2 (7:2) scanned by syz.2.74 (6662)
   [160.459][ T6662] BTRFS info (device loop2): first mount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
   [160.459][ T6662] BTRFS info (device loop2): using crc32c checksum algorithm
   [160.634][ T1187] ------------[ cut here ]------------
   [160.634][ T1187] test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state)
   [160.634][ T1187] WARNING: fs/btrfs/inode.c:3596 at btrfs_add_delayed_iput+0x2e3/0x340, CPU#0: kworker/u8:10/1187
   [160.634][ T1187] Modules linked in:
   [160.634][ T1187] CPU: 0 UID: 0 PID: 1187 Comm: kworker/u8:10 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
   [160.634][ T1187] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
   [160.634][ T1187] Workqueue: btrfs-endio-write btrfs_work_helper
   [160.634][ T1187] RIP: 0010:btrfs_add_delayed_iput+0x2e3/0x340
   [160.634][ T1187] Code: 53 a3 45 (...)
   [160.634][ T1187] RSP: 0018:ffffc900065d77c8 EFLAGS: 00010293
   [160.634][ T1187] RAX: ffffffff83e5f502 RBX: ffff88805aba0000 RCX: ffff888029768000
   [160.634][ T1187] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
   [160.634][ T1187] RBP: dffffc0000000000 R08: 0000000000000000 R09: 0000000000000000
   [160.634][ T1187] R10: dffffc0000000000 R11: ffffed100b574497 R12: 0000000000000001
   [160.634][ T1187] R13: dffffc0000000000 R14: ffff888061194788 R15: 0000000000000200
   [160.634][ T1187] FS:  0000000000000000(0000) GS:ffff888126186000(0000) knlGS:0000000000000000
   [160.634][ T1187] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [160.634][ T1187] CR2: 00007fe553a3f000 CR3: 00000000596c2000 CR4: 00000000003526f0
   [160.634][ T1187] Call Trace:
   [160.634][ T1187]  <TASK>
   [160.634][ T1187]  btrfs_put_ordered_extent+0x18f/0x430
   [160.634][ T1187]  btrfs_finish_one_ordered+0xf63/0x2680
   [160.634][ T1187]  ? __pfx_btrfs_finish_one_ordered+0x10/0x10
   [160.634][ T1187]  ? do_raw_spin_lock+0x12b/0x2f0
   [160.634][ T1187]  ? lock_acquire+0x106/0x350
   [160.634][ T1187]  ? __pfx_do_raw_spin_lock+0x10/0x10
   [160.634][ T1187]  btrfs_work_helper+0x38b/0xc20
   [160.634][ T1187]  ? process_scheduled_works+0xa70/0x1860
   [160.634][ T1187]  process_scheduled_works+0xb5d/0x1860
   [160.634][ T1187]  ? __pfx_process_scheduled_works+0x10/0x10
   [160.634][ T1187]  ? assign_work+0x3d5/0x5e0
   [160.634][ T1187]  worker_thread+0xa53/0xfc0
   [160.634][ T1187]  kthread+0x388/0x470
   [160.634][ T1187]  ? __pfx_worker_thread+0x10/0x10
   [160.635][ T1187]  ? __pfx_kthread+0x10/0x10
   [160.635][ T1187]  ret_from_fork+0x514/0xb70
   [160.635][ T1187]  ? __pfx_ret_from_fork+0x10/0x10
   [160.635][ T1187]  ? __switch_to+0xc79/0x1410
   [160.635][ T1187]  ? __pfx_kthread+0x10/0x10
   [160.635][ T1187]  ret_from_fork_asm+0x1a/0x30
   [160.635][ T1187]  </TASK>
   [160.635][ T1187] Kernel panic - not syncing: kernel: panic_on_warn set ...

It means a delayed iput was added after we last ran delayed iputs in
close_ctree() and set the flag BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info.

This happens when using autodefrag and is more likely to happen if we use
flushoncommit too. The steps are the following:

1) Unmount starts, all delalloc is flushed and we enter close_ctree();

2) In close_ctree() we park the cleaner kthread, but while we wait for it
   to park, it's in:

     btrfs_run_defrag_inodes()
        btrfs_run_defrag_inode()
           btrfs_defrag_file()
              defrag_one_cluster()
                 defrag_one_range()
                    defrag_one_locked_target()

   And dirties some folios from an inode;

3) The cleaner kthread parks and we proceed in close_ctree(), waiting
   for all ordered extents, running delayed iputs and setting the flag
   BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info;

4) Later in close_ctree() we call btrfs_commit_super(), which commits the
   current transaction. Because we are mounted with flushoncommit, the
   transaction commit flushes delalloc and waits for the resulting ordered
   extents to complete;

5) The ordered extents from the flushed delalloc created by autodefrag
   complete and create delayed iputs, triggering the warning:

     WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state));

   in btrfs_add_delayed_iput();

6) Further below in close_ctree() we will hit the following assertion:

     ASSERT(list_empty(&fs_info->delayed_iputs));

   Since we don't expect any more delayed iputs.

Fix this by flushing delalloc and waiting for the resulting ordered
extents in close_ctree(), right after we park the cleaner kthread and
wait for autodefrag.
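
The sequence can be illustrated with a toy userspace model. The struct,
its fields and the function names below are illustrative only (they are
not the btrfs ones), and the model assumes every dirty folio left by
autodefrag ends up as one ordered extent that queues one delayed iput.

```c
#include <stdbool.h>

/* Toy model of the unmount state; not a btrfs structure. */
struct fs_model {
	int dirty_folios;     /* delalloc left behind by autodefrag */
	int delayed_iputs;    /* currently queued delayed iputs */
	bool no_delayed_iput; /* models BTRFS_FS_STATE_NO_DELAYED_IPUT */
	int late_iputs;       /* iputs queued after the flag was set */
};

/* Flushing delalloc completes ordered extents, each queuing an iput. */
static void flush_delalloc_and_wait(struct fs_model *fs)
{
	while (fs->dirty_folios > 0) {
		fs->dirty_folios--;
		if (fs->no_delayed_iput)
			fs->late_iputs++; /* where the warning would fire */
		fs->delayed_iputs++;
	}
}

static void run_delayed_iputs(struct fs_model *fs)
{
	fs->delayed_iputs = 0;
}

static void close_ctree_model(struct fs_model *fs, bool flush_after_park)
{
	/* The cleaner has parked, after autodefrag dirtied some folios. */
	if (flush_after_park)
		flush_delalloc_and_wait(fs); /* the extra flush from the fix */

	run_delayed_iputs(fs);
	fs->no_delayed_iput = true;

	/* Final commit: flushoncommit flushes whatever delalloc remains. */
	flush_delalloc_and_wait(fs);
}
```

With flush_after_park set to false and two dirty folios, the model ends
with two late iputs still queued, matching the warning and the later
assertion. With it set to true, both counters end at zero.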

Reported-by: syzbot+6a843bf8604711c8fab0@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1ee507.b4221f80.1326c5.0004.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to have one loop to defrag each subrange and then another
one to free each subrange (struct defrag_target_range). We can do it in a
single loop, freeing each subrange after defragging it, and there is no
need to delete each subrange from the list since we immediately free it.
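
As a rough userspace illustration of the single-loop pattern: the sketch
below uses a plain singly linked list rather than the kernel's
struct list_head, and struct target_range, push_range() and
process_and_free() are hypothetical names, not the btrfs code.

```c
#include <stdlib.h>

/* Stand-in for struct defrag_target_range; fields are illustrative. */
struct target_range {
	struct target_range *next;
	unsigned long start;
	unsigned long len;
};

/* Prepend a new range; returns the new head (or the old one on ENOMEM). */
static struct target_range *push_range(struct target_range *head,
				       unsigned long start, unsigned long len)
{
	struct target_range *r = malloc(sizeof(*r));

	if (!r)
		return head;
	r->next = head;
	r->start = start;
	r->len = len;
	return r;
}

/*
 * Process and free every entry in one pass. There is no separate unlink
 * step: the list is local and each entry is freed right after use.
 * Returns the total number of bytes "processed".
 */
static unsigned long process_and_free(struct target_range *head)
{
	unsigned long total = 0;

	while (head) {
		struct target_range *next = head->next; /* save before free */

		total += head->len; /* stand-in for defragging the subrange */
		free(head);
		head = next;
	}
	return total;
}
```

The only requirement is to read the next pointer before freeing the
current entry, which is what the "safe" list iterators in the kernel do
as well.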

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use AUTO_KFREE() for the folios array, avoiding two kfree() calls, one of
them in a very specific error path.
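
For readers unfamiliar with scope-based cleanup, here is a userspace
sketch of the idea using the GCC/Clang cleanup attribute. It is an
analogy only: free_ptr(), prepare_folios() and the nr_freed counter are
made up for this example, and it does not show how AUTO_KFREE() itself
is defined.

```c
#include <errno.h>
#include <stdlib.h>

static int nr_freed; /* counts cleanup calls, for demonstration only */

/* Called automatically with a pointer to the annotated variable. */
static void free_ptr(void *p)
{
	free(*(void **)p);
	nr_freed++;
}

static int prepare_folios(size_t nr, int fail_midway)
{
	/* Freed on every return path below, without explicit free() calls. */
	void **folios __attribute__((cleanup(free_ptr))) =
		calloc(nr, sizeof(*folios));

	if (!folios)
		return -ENOMEM;
	if (fail_midway)
		return -EINVAL; /* the error path needs no free() of its own */

	/* ... use the array ... */
	return 0;
}
```

Both the error return and the normal return release the array exactly
once, which is the property that removes the two explicit kfree() calls.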

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When freeing the entries from the list there is no need to initialize
the list member in an entry, since we are immediately freeing it. So use
simple list_del() instead of list_del_init().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to call list_del_init() against each entry when freeing
the list, as the list is local and we are freeing the entry.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a bug that there can be conflicting OEs for the same
range:

  BTRFS critical (device loop4): panic in insert_ordered_extent:264: overlapping ordered extents, existing oe file_offset 16384 num_bytes 430080 flags 0x1089, new oe file_offset 16384 num_bytes 430080 flags 0x80 (errno=-17 Object already exists)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ordered-data.c:264!
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
  RIP: 0010:btrfs_alloc_ordered_extent+0x943/0xad0
  Call Trace:
   <TASK>
   cow_file_range+0x744/0x12a0
   fallback_to_cow+0x5ea/0xa00
   run_delalloc_nocow+0x110c/0x17a0
   btrfs_run_delalloc_range+0xbe4/0x1c20
   writepage_delalloc+0x104d/0x1ba0
   btrfs_writepages+0x1667/0x28b0
   do_writepages+0x338/0x560
   filemap_fdatawrite_range+0x1f2/0x300
   btrfs_fdatawrite_range+0x54/0xf0
   btrfs_direct_write+0x6a0/0xc30
   btrfs_do_write_iter+0x329/0x790
   do_iter_readv_writev+0x624/0x8d0
   vfs_writev+0x34c/0x990
   __se_sys_pwritev2+0x17a/0x2a0
   do_syscall_64+0x174/0x580
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>
  ---[ end trace 0000000000000000 ]---

[CAUSE]
Since commit ff66fe6 ("btrfs: fix incorrect buffered IO fallback
for append direct writes"), if the direct IO finished short, we will
revert the isize back to the original one, so that append writes can be
respected during the buffered fallback.

Normally we rely on lock_and_cleanup_extent_if_need() function during
buffered writeback to wait for any existing ordered extents.

But that ordered extent waiting only happens if the start_pos is inside
the isize.
Since we have reverted the isize during failed direct IO, we will not
wait for any ordered extents.

This means we can have a race where the direct IO OE is still in the
tree, finished but not yet removed, then we're inserting the OE for the
buffered write, causing the above crash.

[FIX]
Make the OE wait unconditional, to handle the reverted isize
situation.

And since lock_and_cleanup_extent_if_need() now either locks the
extents or returns -EAGAIN, also remove the branches that handle the
no-extent-locked cases, and rename it to remove the "_if_need" suffix.
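
The effect of the old condition can be shown with a small userspace
model. Everything here is an assumption made for illustration: the
struct, the two predicates and the choice of an isize of 16384 after
the revert are not taken from the btrfs code. Only the offset 16384 and
length 430080 come from the report above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the inode at buffered-write time. */
struct write_ctx {
	uint64_t isize;    /* i_size, possibly reverted after a short DIO */
	uint64_t oe_start; /* ordered extent still in the tree */
	uint64_t oe_len;
	bool oe_pending;
};

static bool overlaps(const struct write_ctx *c, uint64_t pos, uint64_t len)
{
	return c->oe_pending &&
	       pos < c->oe_start + c->oe_len && c->oe_start < pos + len;
}

/* Old behaviour per the text: only wait when the write starts inside isize. */
static bool must_wait_old(const struct write_ctx *c, uint64_t pos, uint64_t len)
{
	if (pos >= c->isize)
		return false;
	return overlaps(c, pos, len);
}

/* New behaviour: always look for overlapping ordered extents. */
static bool must_wait_new(const struct write_ctx *c, uint64_t pos, uint64_t len)
{
	return overlaps(c, pos, len);
}
```

With isize reverted to 16384 and a pending OE covering
[16384, 16384 + 430080), a buffered write at offset 16384 is not inside
isize, so the old predicate skips the wait while the new one does not.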

The following micro benchmark shows the runtime difference for
btrfs_buffered_write(), running the `xfs_io -f -c "pwrite 0 1m"` workload.
All values are the average runtime in nanoseconds.

      function runtime              |   before    |     after
 -----------------------------------+-------------+---------------
 lock_and_cleanup_extent_if_need()  |     58.2    |    183.0
 btrfs_buffered_write()             |   2115.6    |   2973.3

The overall runtime of btrfs_buffered_write() is still pretty
tiny (still less than 3 microseconds), so I'd say the extra cost is
acceptable.

An alternative fix is to wait for ordered extents during
iomap_end(), where the isize revert is done.

But that solution would break the nowait requirement: if a nowait direct
IO finished short, we would have to wait for the OEs unconditionally, or
the next append buffered IO could still hit the same problem.

So here we have to move the wait cost to the buffered write, but at least
the code is slightly more streamlined.

Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
Fixes: ff66fe6 ("btrfs: fix incorrect buffered IO fallback for append direct writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
kdave and others added 2 commits June 30, 2026 01:13
Any commits after this one are for testing and evaluation only.

Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans() has a lockless fast path for shareable
roots. It skips reloc_mutex when root->last_trans matches the current
transaction and BTRFS_ROOT_IN_TRANS_SETUP is clear.

The writer side publishes that state in two phases: it sets
IN_TRANS_SETUP before updating root->last_trans, then clears the bit
after btrfs_init_reloc_root() finishes. However, the reader-side
smp_rmb() is before both loads, so it does not order the last_trans load
against the later bit test. A reader can observe the new last_trans value
while missing the setup bit and return before the relocation-root setup
is complete.

Read root->last_trans first, then issue the read barrier before testing
IN_TRANS_SETUP. Also use clear_bit_unlock() for the writer's final clear
and test_bit_acquire() for the successful fast path, so the lockless
return observes the setup done before the bit was cleared.
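
The intended ordering can be approximated in userspace with C11
atomics. This is a sketch under assumptions: C11 atomics are not the
kernel memory model, the mapping to smp_rmb(), clear_bit_unlock() and
test_bit_acquire() is only approximate, and struct root_model with its
helper functions is invented for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct root_model {
	_Atomic uint64_t last_trans;
	atomic_bool in_trans_setup; /* models BTRFS_ROOT_IN_TRANS_SETUP */
	bool reloc_ready;           /* plain data published by the writer */
};

/* Phase 1: set the setup bit, then publish the new transaction id. */
static void writer_begin(struct root_model *r, uint64_t transid)
{
	atomic_store_explicit(&r->in_trans_setup, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_release); /* bit before last_trans */
	atomic_store_explicit(&r->last_trans, transid, memory_order_relaxed);
}

/* Phase 2: finish the setup, then clear the bit with release semantics. */
static void writer_finish(struct root_model *r)
{
	r->reloc_ready = true;
	atomic_store_explicit(&r->in_trans_setup, false, memory_order_release);
}

/* Lockless fast path: load last_trans FIRST, then the barrier, then the bit. */
static bool fast_path_ok(struct root_model *r, uint64_t transid)
{
	uint64_t last = atomic_load_explicit(&r->last_trans,
					     memory_order_relaxed);

	if (last != transid)
		return false;
	atomic_thread_fence(memory_order_acquire); /* roughly smp_rmb() */
	if (atomic_load_explicit(&r->in_trans_setup, memory_order_acquire))
		return false; /* setup still running, take the slow path */
	return true;          /* reloc_ready is guaranteed visible here */
}
```

A reader that sees the new last_trans value must then see either the
setup bit set or the writer's final clear, and in the latter case the
acquire load makes the completed setup visible as well.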

Fixes: 7585717 ("Btrfs: fix relocation races")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>