Test for-next ARM64 64K (regular, SELF)#1630

Open
kdave wants to merge 10000 commits into ci-arm-kvm from for-next

Conversation

@kdave

@kdave kdave commented Apr 17, 2026

Member

No description provided.

@adam900710 adam900710 force-pushed the for-next branch 2 times, most recently from ad252c6 to af81080 Compare April 18, 2026 04:42
@kdave kdave force-pushed the for-next branch 2 times, most recently from 30c6cb0 to 73d4bbd Compare April 22, 2026 19:47
@kdave kdave force-pushed the for-next branch 2 times, most recently from 26f5cfa to 2189fe7 Compare April 24, 2026 11:09
@kdave kdave force-pushed the for-next branch 2 times, most recently from 5280eae to 52d1b61 Compare April 27, 2026 14:33
@adam900710 adam900710 force-pushed the for-next branch 3 times, most recently from 40c2283 to 09752d4 Compare April 28, 2026 00:45
@kdave kdave force-pushed the for-next branch 2 times, most recently from 29451dd to dc188da Compare April 28, 2026 06:01
@adam900710 adam900710 force-pushed the for-next branch 2 times, most recently from 4a55cf6 to 436ac81 Compare May 3, 2026 08:53
@fdmanana fdmanana force-pushed the for-next branch 2 times, most recently from e32c6db to 49a0b34 Compare May 4, 2026 15:50
@kdave kdave force-pushed the for-next branch 4 times, most recently from 4137f02 to f2ac86e Compare May 12, 2026 15:03
@kdave kdave force-pushed the for-next branch 2 times, most recently from db2485b to 0c78978 Compare May 16, 2026 00:59
adam900710 and others added 4 commits June 30, 2026 00:11
Since commit bac3c29 ("btrfs: remove 2K block size support") there
is no 2K block size support inside btrfs anymore.

Remove the stale comments of btrfs_supported_blocksize().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since v5.15 btrfs has support for block size < page size, but we still
only support 4K block size, while there is no special reason that we
cannot support 8K/16K/32K block sizes for 64K page size.

That 4K limit is completely arbitrary, and mostly to reduce test runtime
so we do not need to test all the extra block size combinations.

However, that also limits user choice: some users may understand what
they are doing and want larger block sizes.  In that case, the fixed
4K block size for the subpage routine is blocking our way.

Just remove that fixed 4K requirement for block size < page size.

This should not affect regular end users, since mkfs has been using 4K
block size as the default for quite a while, and the existing bs == ps
support is still there.

But for power users, this allows extra block size support, and may
provide extra test coverage.
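
A minimal sketch of the relaxed rule, covering only the block size <= page
size case. The helper name blocksize_supported(), the 4K minimum constant and
the power-of-two requirement are assumptions made for illustration; this is
not the kernel's btrfs_supported_blocksize().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIN_BLOCKSIZE 4096u /* assumed minimum block size */

/* Hypothetical helper, not the kernel function. */
static bool blocksize_supported(uint32_t blocksize, uint32_t page_size)
{
	assert(page_size >= MIN_BLOCKSIZE);

	/* Block sizes are assumed to be powers of two. */
	if (blocksize == 0 || (blocksize & (blocksize - 1)) != 0)
		return false;

	/* Any size from 4K up to the page size, not only 4K or bs == ps. */
	return blocksize >= MIN_BLOCKSIZE && blocksize <= page_size;
}
```

With a 64K page size this accepts 4K, 8K, 16K, 32K and 64K, and still
rejects 2K.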

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Decentralize transaction aborts in create_reloc_root(), so that it is
obvious which call failed and what caused the transaction abort.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When dumping a tree block, btrfs_header::owner is printed as
unsigned, which can result in numbers that are hard to read, e.g.:

  BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607

For the above output, 18446744073709551607 is (s64)-9, the root id of data
reloc tree.

Despite those predefined root ids that are already negative, existing
subvolume trees will not have any negative values, as subvolume trees can
only utilize the lower 48 bits, so there will be no output change for
existing subvolumes, thus no extra confusion.
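
The difference between the two interpretations can be sketched in userspace
C. The helper format_owner() is hypothetical and only illustrates printing
the same 64-bit value as signed:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a root id as a signed 64-bit number.
 * The cast relies on two's complement conversion, which mainstream
 * compilers provide.
 */
static int format_owner(char *buf, size_t len, uint64_t owner)
{
	assert(buf != NULL && len >= 21); /* room for any s64 plus NUL */
	return snprintf(buf, len, "%" PRId64, (int64_t)owner);
}
```

Formatting 18446744073709551607 this way yields "-9", while a subvolume id
such as 256 is printed unchanged.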

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
morbidrsa and others added 5 commits June 30, 2026 00:12
…erge

On a zoned FS, btrfs_delayed_refs_rsv_refill() returns -EAGAIN whenever
the over-committed metadata plus the zone_unusable bytes exceeds the
usable size in a metadata block-group to avoid heavy over-commit of
metadata and early ENOSPC in one transaction.

If this happens while doing reclaim, the transaction is getting aborted.

Treat -EAGAIN as a soft, retryable condition in case of block-group
reclaim.
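
As a rough sketch of the idea only: the enum and the function
classify_reclaim_result() below are hypothetical and do not mirror the
actual reclaim code, they just show -EAGAIN being separated from hard
errors.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes for one block-group reclaim attempt. */
enum reclaim_action {
	RECLAIM_DONE,        /* reclaim succeeded */
	RECLAIM_RETRY_LATER, /* soft failure, try this block group again */
	RECLAIM_ABORT,       /* hard failure */
};

static enum reclaim_action classify_reclaim_result(int ret)
{
	assert(ret <= 0); /* 0 or a negative errno */

	if (ret == 0)
		return RECLAIM_DONE;
	if (ret == -EAGAIN)
		return RECLAIM_RETRY_LATER;
	return RECLAIM_ABORT;
}
```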

Reported-by: Damien Le Moal <dlemoal@kernel.org>
Fixes: 7bcb04d ("btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment is wrong: it is not about storing the ID of new directories
that were already created, but about storing utimes values for
directories (both new and existing). It was copy-pasted from
SEND_MAX_DIR_CREATED_CACHE_SIZE and never updated afterwards.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…mon prefixes

In case the current inode's path is a prefix of the given path, the helper
is_current_inode_path() will return true, which causes the single caller
to reset the current inode's path. While this is not a functional issue,
it makes the caller recompute the current inode's path later. It could
also become a problem in the future in case we get new callers for
is_current_inode_path() in more sensitive contexts.

Example: the current inode path is "/foo/bar" and the path we compare
against is "/foo/bar_xyz".

Fix this by returning true only if we have exact matches.
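
The difference can be sketched with plain C strings. The helpers
path_is_prefix() and path_is_exact() are hypothetical and are not how the
send code represents paths:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Prefix comparison: only looks at the first strlen(cur) bytes. */
static bool path_is_prefix(const char *cur, const char *path)
{
	assert(cur != NULL && path != NULL);
	return strncmp(cur, path, strlen(cur)) == 0;
}

/* Exact comparison: lengths and contents must both match. */
static bool path_is_exact(const char *cur, const char *path)
{
	assert(cur != NULL && path != NULL);
	return strcmp(cur, path) == 0;
}
```

For "/foo/bar" against "/foo/bar_xyz" the prefix comparison reports a match,
while the exact comparison does not.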

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a syzbot report that the check inside get_new_location()
triggered:

  BTRFS info (device loop0): found 31 extents, stage: move data extents
  BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607
         item 0 key (256 INODE_ITEM 0) itemoff 3835 itemsize 160
                 inode generation 5 transid 0 size 0 nbytes 0
                 block group 0 mode 40755 links 1 uid 0 gid 0
                 rdev 0 sequence 0 flags 0x0
                 atime 1669132761.0
                 ctime 1669132761.0
                 mtime 1669132761.0
                 otime 0.0
         item 1 key (256 INODE_REF 256) itemoff 3823 itemsize 12
                 index 0 name_len 2
         item 2 key (258 INODE_ITEM 0) itemoff 3663 itemsize 160
                 inode generation 1 transid 16 size 733184 nbytes 106496
                 block group 0 mode 100600 links 0 uid 0 gid 0
                 rdev 0 sequence 24 flags 0x18
         item 3 key (258 EXTENT_DATA 0) itemoff 3595 itemsize 68
                 generation 16 type 0
                 inline extent data size 47 ram_bytes 4096 compression 1
  [...]
         item 27 key (18446744073709551611 ORPHAN_ITEM 258) itemoff 2376 itemsize 0
  BTRFS error (device loop0): unexpected non-zero offset in file extent item for data reloc inode 258 key offset 0 offset 9277520992061368337
  ------------[ cut here ]------------
  btrfs_abort_should_print_stack(__error)

[CAUSE]
The above dump tree shows the first file extent item is inlined, which
should make no sense for data reloc inodes, as such inodes just
represent where the data extents are in the relocation destination chunk.

However the relocation path preallocates space for each block,
then dirties them, cluster by cluster.
It's possible to have a single block at the beginning of the block
group, and no other block in the same cluster.

So relocation will preallocate a file extent for that block and dirty
the first block.  Then memory pressure forces the data reloc inode to be
written back, before any other blocks are dirtied/allocated.

Finally commit 3eaf5f0 ("btrfs: extract inlined creation into a dedicated
delalloc helper") changed the sequence of delalloc. Before that commit we
always tried NOCOW first, so that dirtied block would be written back into
the preallocated space, and appear as a regular extent.

But with that commit, we always try inline first, and since compression
is forced, we try compressing the first block, and then inline the
compressed data, resulting in the above inlined file extent in the data
reloc tree.

Then the check in get_new_location() will check the file offset, without
checking if the file extent is inlined or not, resulting in the above
failure.

[FIX]
Do not allow compression for data reloc inodes.

Since data reloc inode sizes are always block aligned, as long as we do
not compress, @data_len will always be at least one block, and
that will cause can_cow_file_range_inline() to return false, thus no
inlined extent will be created.
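
A size-only sketch of that argument. The helper may_inline() is hypothetical
and models just the length condition; the real
can_cow_file_range_inline() checks more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, size-only model of the inline decision. */
static bool may_inline(uint64_t data_len, uint32_t blocksize)
{
	assert(blocksize > 0);

	/* Data of one full block or more is never inlined. */
	return data_len < blocksize;
}
```

With a 4K block size, a block compressed down to 47 bytes passes the size
check, while the same block left uncompressed (4096 bytes) does not.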

Reported-by: syzbot+d950c6ba09b79f6e1864@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a373dc5.764cf64f.168fbe.0001.GAE@google.com/
Fixes: 3eaf5f0 ("btrfs: extract inlined creation into a dedicated delalloc helper")
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit a6908f8 ("btrfs: validate data reloc tree file extent item
members") introduced extra checks on file extent items for data reloc
inodes, but it checked the file extent offset without checking if the file
extent is inlined.

This can lead to either false alerts (as the offset member is inside the
inlined data) or even reading beyond the item range.

This has already triggered a warning in a syzbot report.
Although the root fix is to avoid compression for data reloc inodes, for
the sake of consistency, reject inlined file extents first.

Fixes: a6908f8 ("btrfs: validate data reloc tree file extent item members")
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
adam900710 and others added 6 commits June 30, 2026 00:32
The nodesize and sectorsize are all u32 values, there is no need to use
u64 for local usage.

Furthermore some call sites also use "blocksize" or "bs" for sectorsize,
also change them to use the minimal type u32 instead.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs does not support variable stripe length yet, all RAID0/5/6/10
chunks have the fixed stripe length 64K for now.

Furthermore, btrfs_fs_info::stripesize is not the real chunk stripe
length, it's always the same value as sectorsize.

Remove btrfs_fs_info::stripesize, and for the only callsite utilizing
that member, replace it with fs_info->sectorsize instead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…etattr()

btrfs_getattr() unconditionally reads BTRFS_I(inode)->new_delalloc_bytes
and adds it (sector-aligned) to stat->blocks for every inode type.
However, new_delalloc_bytes lives in a union with last_dir_index_offset:

    union {
        u64 new_delalloc_bytes;     /* files only */
        u64 last_dir_index_offset;  /* directories only */
    };

For a directory inode this memory holds last_dir_index_offset, which is
set during directory logging (e.g. flush_dir_items_batch()) to the
offset of the last logged BTRFS_DIR_INDEX_KEY.  That offset grows with
the number of entries ever created in the directory (dir indexes are
monotonic and never reused), so it can be arbitrarily large.

As a result, after a directory has been logged (e.g. via an fsync that
triggers directory logging), btrfs_getattr() reports inflated st_blocks
for that directory.  The inflation is purely in-core and disappears
after the inode is evicted and reloaded (btrfs_alloc_inode() zeroes the
union), e.g. after a remount.

Reproducer (on a btrfs filesystem):

    D=/mnt/btrfs/d
    mkdir -p $D
    for i in $(seq 1 20000); do touch $D/f$i; done
    sync                      # commit, push dir index high
    touch $D/trigger          # dirty the dir in a new transaction
    xfs_io -c fsync $D        # log the directory -> sets last_dir_index_offset
    stat -c '%b' $D           # st_blocks is now inflated (e.g. 40)
    # umount + mount -> st_blocks drops back to the correct value

The evict path already knows this union is type-dependent and guards the
corresponding WARN_ON with !S_ISDIR() in btrfs_destroy_inode(); only
btrfs_getattr() was missing the equivalent check.

Only read new_delalloc_bytes for regular files, which are the only
inodes that ever set it.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…tion

While running fsstress with autodefrag and flushoncommit, hit a deadlock
due to the fact that defrag reserves delalloc space while it's holding
dirty and locked folios, besides the extent range lock. The stack traces
are the following:

   [958.624] task:kworker/u50:3   state:D stack:0     pid:20365 tgid:20365 ppid:2      task_flags:0x4208060 flags:0x00080000
   [958.626] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
   [958.627] Call Trace:
   [958.628]  <TASK>
   [958.628]  __schedule+0x4be/0x10f0
   [958.629]  ? preempt_count_add+0x69/0xa0
   [958.630]  schedule+0x26/0xd0
   [958.631]  wait_current_trans+0x102/0x160 [btrfs]
   [958.632]  ? __pfx_autoremove_wake_function+0x10/0x10
   [958.633]  start_transaction+0x374/0x900 [btrfs]
   [958.634]  btrfs_commit_current_transaction+0x1d/0x70 [btrfs]
   [958.635]  flush_space+0xca/0x5e0 [btrfs]
   [958.636]  ? _raw_spin_unlock+0x15/0x30
   [958.637]  ? btrfs_reduce_alloc_profile+0x8c/0x190 [btrfs]
   [958.639]  ? _raw_spin_unlock+0x15/0x30
   [958.640]  ? calc_available_free_space.isra.0+0x6f/0x110 [btrfs]
   [958.641]  do_async_reclaim_metadata_space+0x84/0x190 [btrfs]
   [958.642]  btrfs_async_reclaim_metadata_space+0x64/0x80 [btrfs]
   [958.644]  process_one_work+0x19d/0x3a0
   [958.644]  worker_thread+0x1c4/0x330
   [958.645]  ? __pfx_worker_thread+0x10/0x10
   [958.646]  kthread+0xfc/0x130
   [958.647]  ? __pfx_kthread+0x10/0x10
   [958.648]  ret_from_fork+0x1f7/0x2c0
   [958.648]  ? __pfx_kthread+0x10/0x10
   [958.649]  ret_from_fork_asm+0x1a/0x30
   [958.650]  </TASK>
   [958.651] task:kworker/u49:7   state:D stack:0     pid:52990 tgid:52990 ppid:2      task_flags:0x4208060 flags:0x00080000
   [958.653] Workqueue: writeback wb_workfn (flush-btrfs-334)
   [958.655] Call Trace:
   [958.655]  <TASK>
   [958.656]  __schedule+0x4be/0x10f0
   [958.657]  ? __blk_flush_plug+0xe9/0x140
   [958.658]  schedule+0x26/0xd0
   [958.658]  io_schedule+0x42/0x70
   [958.659]  folio_wait_bit_common+0x12b/0x330
   [958.660]  ? folio_wait_bit_common+0x100/0x330
   [958.662]  ? __pfx_wake_page_function+0x10/0x10
   [958.663]  extent_write_cache_pages+0x599/0x830 [btrfs]
   [958.664]  ? acpi_fwnode_get_reference_args+0x1fa/0x270
   [958.665]  btrfs_writepages+0x77/0x130 [btrfs]
   [958.666]  ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   [958.667]  do_writepages+0xc6/0x160
   [958.668]  __writeback_single_inode+0x42/0x310
   [958.669]  writeback_sb_inodes+0x231/0x570
   [958.670]  wb_writeback+0x8a/0x340
   [958.671]  wb_workfn+0xbf/0x450
   [958.672]  ? finish_task_switch.isra.0+0xc1/0x350
   [958.673]  process_one_work+0x19d/0x3a0
   [958.673]  worker_thread+0x1c4/0x330
   [958.674]  ? __pfx_worker_thread+0x10/0x10
   [958.675]  kthread+0xfc/0x130
   [958.676]  ? __pfx_kthread+0x10/0x10
   [958.676]  ret_from_fork+0x1f7/0x2c0
   [958.677]  ? __pfx_kthread+0x10/0x10
   [958.678]  ret_from_fork_asm+0x1a/0x30
   [958.679]  </TASK>
   [958.679] task:btrfs-cleaner   state:D stack:0     pid:296750 tgid:296750 ppid:2      task_flags:0x208040 flags:0x00080000
   [958.681] Call Trace:
   [958.682]  <TASK>
   [958.682]  __schedule+0x4be/0x10f0
   [958.683]  schedule+0x26/0xd0
   [958.684]  handle_reserve_ticket+0x1b9/0x2c0 [btrfs]
   [958.685]  ? __pfx_autoremove_wake_function+0x10/0x10
   [958.686]  reserve_bytes+0x283/0x4c0 [btrfs]
   [958.687]  btrfs_reserve_metadata_bytes+0x18/0xb0 [btrfs]
   [958.688]  btrfs_delalloc_reserve_metadata+0x121/0x320 [btrfs]
   [958.690]  btrfs_delalloc_reserve_space+0x46/0xb0 [btrfs]
   [958.691]  btrfs_defrag_file+0x903/0x1110 [btrfs]
   [958.692]  btrfs_run_defrag_inodes+0x334/0x430 [btrfs]
   [958.694]  cleaner_kthread+0x97/0x1c0 [btrfs]
   [958.694]  ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
   [958.696]  kthread+0xfc/0x130
   [958.696]  ? __pfx_kthread+0x10/0x10
   [958.697]  ret_from_fork+0x1f7/0x2c0
   [958.698]  ? __pfx_kthread+0x10/0x10
   [958.699]  ret_from_fork_asm+0x1a/0x30
   [958.700]  </TASK>
   [958.716] task:fsstress        state:D stack:0     pid:296769 tgid:296769 ppid:296768 task_flags:0x400140 flags:0x00080000
   [958.718] Call Trace:
   [958.719]  <TASK>
   [958.719]  __schedule+0x4be/0x10f0
   [958.720]  ? preempt_count_add+0x69/0xa0
   [958.721]  schedule+0x26/0xd0
   [958.722]  wb_wait_for_completion+0x79/0xc0
   [958.723]  ? __pfx_autoremove_wake_function+0x10/0x10
   [958.724]  __writeback_inodes_sb_nr+0xc5/0xf0
   [958.725]  try_to_writeback_inodes_sb+0x55/0x70
   [958.726]  btrfs_commit_transaction+0x19d/0xeb0 [btrfs]
   [958.727]  ? start_transaction+0x343/0x900 [btrfs]
   [958.728]  btrfs_mksubvol+0x28b/0x4e0 [btrfs]
   [958.729]  btrfs_mksnapshot+0x74/0xa0 [btrfs]
   [958.730]  __btrfs_ioctl_snap_create+0x194/0x210 [btrfs]
   [958.732]  btrfs_ioctl_snap_create_v2+0xef/0x150 [btrfs]
   [958.733]  btrfs_ioctl+0x7ec/0x2a70 [btrfs]
   [958.734]  ? __virt_addr_valid+0xe4/0x180
   [958.735]  ? __check_object_size+0x1cd/0x1f0
   [958.736]  ? kmem_cache_free+0x146/0x380
   [958.737]  ? _raw_spin_unlock+0x15/0x30
   [958.738]  ? do_sys_openat2+0x83/0xd0
   [958.739]  __x64_sys_ioctl+0x92/0xe0
   [958.740]  do_syscall_64+0x60/0x590
   [958.741]  ? clear_bhb_loop+0x60/0xb0
   [958.742]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [958.743] RIP: 0033:0x7f4431e108db
   [958.744] RSP: 002b:00007ffcd147db20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   [958.746] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f4431e108db
   [958.747] RDX: 00007ffcd147eb90 RSI: 0000000050009417 RDI: 0000000000000005
   [958.749] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
   [958.751] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffcd147fbf0
   [958.752] R13: 00007ffcd147eb90 R14: 0000000000000005 R15: 0000000000000003
   [958.754]  </TASK>

What happens is the following:

1) The cleaner kthread is running autodefrag, and in defrag_one_range()
   it acquired all the folios for the range and locked them.

   Then it locked the extent range in the inode's iotree.

   It got two subranges from defrag_collect_targets(), the first one
   with folio A and the second one with folio B.

   After it defragged the first subrange, folio A remains locked and
   dirty - it's only unlocked when defrag_one_range() returns.

   When it attempts to defrag the second subrange (containing folio B),
   btrfs_delalloc_reserve_space() creates a space reservation ticket,
   due to lack of free metadata space, and blocks waiting for the async
   metadata reclaim task to free space and wake it up;

2) The async reclaim metadata task attempts to commit the current
   transaction, but it blocks because there is another task that
   started the commit first;

3) A task creating a snapshot is committing the transaction and
   because the fs was mounted with flushoncommit, it calls
   try_to_writeback_inodes_sb(), which spawns a task to flush
   delalloc and waits for it to complete;

4) The task flushing delalloc (kworker/u49:7), finds that folio A for
   the inode being defragged is dirty, so it tries to lock it...

   But it blocks because folio A is locked by the defrag task (the
   cleaner kthread) which is blocked waiting for the reservation
   ticket to be served, but the async reclaim metadata task is
   blocked waiting for the transaction commit, which in turn is
   blocked waiting for the delalloc flush task, which is trying to
   lock folio A, resulting in a deadlock.
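
The four steps above form a cycle in the wait-for graph. A toy model, where
each task waits for at most one other task (the task names and
waits_in_cycle() are illustrative only):

```c
#include <assert.h>
#include <stdbool.h>

#define NO_WAIT (-1)

enum { CLEANER, RECLAIM, COMMITTER, FLUSHER, NR_TASKS };

/*
 * Follow the "waits for" edges from @start. Returns true if we come
 * back to @start, meaning @start is part of a wait cycle.
 */
static bool waits_in_cycle(const int waits_for[], int nr_tasks, int start)
{
	int cur = start;

	assert(start >= 0 && start < nr_tasks);

	for (int step = 0; step < nr_tasks; step++) {
		cur = waits_for[cur];
		if (cur == NO_WAIT)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}
```

With the edges cleaner -> reclaim -> committer -> flusher -> cleaner the
function reports a cycle. If the flusher never has to wait for the cleaner,
the chain ends and no cycle is reported.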

The same type of problem can happen if the async reclaim task starts to
flush delalloc, as that requires both locking the folio and the extent
range in the inode's io tree, and in this case we don't need the fs to
be mounted with flushoncommit. This type of problem has occurred several
times in the past with reflinks for example, where we had a dirty folio
while holding the extent range locked and then starting a transaction
blocked waiting for the async reclaim task due to lack of free metadata
space.

So fix this by reserving delalloc space before locking folios and locking
the extent range in the inode's iotree. We can not simply unlock the
folios for each subrange given by defrag_collect_targets() after we defrag
it because the same folio may be present too in the next subrange (due to
large folios).

Fixes: 22b398e ("btrfs: defrag: introduce helper to defrag a contiguous prepared range")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Syzbot reported the following warning recently:

   [157.672][ T6611] BTRFS info (device loop0): turning on flush-on-commit
   [157.672][ T6611] BTRFS info (device loop0): enabling free space tree
   [157.672][ T6611] BTRFS info (device loop0): enabling auto defrag
   [157.672][ T6611] BTRFS info (device loop0): use lzo compression, level 1
   [157.672][ T6611] BTRFS info (device loop0): max_inline set to 4096
   [158.094][ T5608] BTRFS info (device loop2): last unmount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
   [160.073][ T6656] BTRFS info (device loop0 state M): max_inline set to 4096
   [160.418][ T5611] BTRFS info (device loop0): last unmount of filesystem ab8108e1-bea5-4a9f-94c9-a3ff208d732a
   [160.432][ T6662] loop2: detected capacity change from 0 to 32768
   [160.438][ T6662] BTRFS: device fsid c9fe44da-de57-406a-8241-57ec7d4412cf devid 1 transid 8 /dev/loop2 (7:2) scanned by syz.2.74 (6662)
   [160.459][ T6662] BTRFS info (device loop2): first mount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
   [160.459][ T6662] BTRFS info (device loop2): using crc32c checksum algorithm
   [160.634][ T1187] ------------[ cut here ]------------
   [160.634][ T1187] test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state)
   [160.634][ T1187] WARNING: fs/btrfs/inode.c:3596 at btrfs_add_delayed_iput+0x2e3/0x340, CPU#0: kworker/u8:10/1187
   [160.634][ T1187] Modules linked in:
   [160.634][ T1187] CPU: 0 UID: 0 PID: 1187 Comm: kworker/u8:10 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
   [160.634][ T1187] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
   [160.634][ T1187] Workqueue: btrfs-endio-write btrfs_work_helper
   [160.634][ T1187] RIP: 0010:btrfs_add_delayed_iput+0x2e3/0x340
   [160.634][ T1187] Code: 53 a3 45 (...)
   [160.634][ T1187] RSP: 0018:ffffc900065d77c8 EFLAGS: 00010293
   [160.634][ T1187] RAX: ffffffff83e5f502 RBX: ffff88805aba0000 RCX: ffff888029768000
   [160.634][ T1187] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
   [160.634][ T1187] RBP: dffffc0000000000 R08: 0000000000000000 R09: 0000000000000000
   [160.634][ T1187] R10: dffffc0000000000 R11: ffffed100b574497 R12: 0000000000000001
   [160.634][ T1187] R13: dffffc0000000000 R14: ffff888061194788 R15: 0000000000000200
   [160.634][ T1187] FS:  0000000000000000(0000) GS:ffff888126186000(0000) knlGS:0000000000000000
   [160.634][ T1187] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [160.634][ T1187] CR2: 00007fe553a3f000 CR3: 00000000596c2000 CR4: 00000000003526f0
   [160.634][ T1187] Call Trace:
   [160.634][ T1187]  <TASK>
   [160.634][ T1187]  btrfs_put_ordered_extent+0x18f/0x430
   [160.634][ T1187]  btrfs_finish_one_ordered+0xf63/0x2680
   [160.634][ T1187]  ? __pfx_btrfs_finish_one_ordered+0x10/0x10
   [160.634][ T1187]  ? do_raw_spin_lock+0x12b/0x2f0
   [160.634][ T1187]  ? lock_acquire+0x106/0x350
   [160.634][ T1187]  ? __pfx_do_raw_spin_lock+0x10/0x10
   [160.634][ T1187]  btrfs_work_helper+0x38b/0xc20
   [160.634][ T1187]  ? process_scheduled_works+0xa70/0x1860
   [160.634][ T1187]  process_scheduled_works+0xb5d/0x1860
   [160.634][ T1187]  ? __pfx_process_scheduled_works+0x10/0x10
   [160.634][ T1187]  ? assign_work+0x3d5/0x5e0
   [160.634][ T1187]  worker_thread+0xa53/0xfc0
   [160.634][ T1187]  kthread+0x388/0x470
   [160.634][ T1187]  ? __pfx_worker_thread+0x10/0x10
   [160.635][ T1187]  ? __pfx_kthread+0x10/0x10
   [160.635][ T1187]  ret_from_fork+0x514/0xb70
   [160.635][ T1187]  ? __pfx_ret_from_fork+0x10/0x10
   [160.635][ T1187]  ? __switch_to+0xc79/0x1410
   [160.635][ T1187]  ? __pfx_kthread+0x10/0x10
   [160.635][ T1187]  ret_from_fork_asm+0x1a/0x30
   [160.635][ T1187]  </TASK>
   [160.635][ T1187] Kernel panic - not syncing: kernel: panic_on_warn set ...

It means we add a delayed iput created after we last ran delayed iputs in
close_ctree() and set the flag BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info.

This happens when using autodefrag and is more likely to happen if we use
flushoncommit too. The steps are the following:

1) Unmount starts, all delalloc is flushed and we enter close_ctree();

2) In close_ctree() we park the cleaner kthread, but while we wait for it
   to park, it's in:

     btrfs_run_defrag_inodes()
        btrfs_run_defrag_inode()
           btrfs_defrag_file()
              defrag_one_cluster()
                 defrag_one_range()
                    defrag_one_locked_target()

   And dirties some folios from an inode;

3) The cleaner kthread parks and we proceed in close_ctree(), waiting
   for all ordered extents, running delayed iputs and setting the flag
   BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info;

4) Later in close_ctree() we call btrfs_commit_super(), which commits the
   current transaction. Because we are mounted with flushoncommit, the
   transaction commit flushes delalloc and waits for the resulting ordered
   extent to complete;

5) The ordered extents from the flushed delalloc created by autodefrag
   complete and create delayed iputs, triggering the warning:

     WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state));

   in btrfs_add_delayed_iput()

6) Further below in close_ctree() we will hit the following assertion:

     ASSERT(list_empty(&fs_info->delayed_iputs));

   Since we don't expect any more delayed iputs.

Fix this by flushing delalloc and waiting for the ordered extents in
close_ctree(), right after we park the cleaner kthread and wait for
autodefrag.

Reported-by: syzbot+6a843bf8604711c8fab0@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1ee507.b4221f80.1326c5.0004.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to have one loop to defrag each subrange and then
another one to free each subrange (struct defrag_target_range).
We can do it in a single loop, freeing each subrange after defragging
it, and there is no need to delete each subrange from the list since we
immediately free it.
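
A userspace sketch of the single-loop pattern, using a plain singly linked
list instead of the kernel's list_head. The struct target_range and
process_and_free() are hypothetical, and summing lengths stands in for the
actual defrag work:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct defrag_target_range. */
struct target_range {
	struct target_range *next;
	uint64_t start;
	uint64_t len;
};

/*
 * Process and free every entry in one pass. The list is local and
 * every entry is freed, so entries are never unlinked individually.
 */
static unsigned int process_and_free(struct target_range *head,
				     uint64_t *total_len)
{
	unsigned int count = 0;

	assert(total_len != NULL);

	while (head) {
		/* Save the next pointer before freeing the entry. */
		struct target_range *next = head->next;

		*total_len += head->len; /* stand-in for the defrag work */
		free(head);
		head = next;
		count++;
	}
	return count;
}
```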

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fdmanana and others added 4 commits June 30, 2026 01:13
Use AUTO_KFREE() for the folios array, avoiding two kfree() calls, one of
them in a very specific error path.
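
A userspace analogue of scope-based freeing, using the GCC/Clang cleanup
attribute. The names free_ptr() and sum_squares() are made up for this
sketch and say nothing about how AUTO_KFREE() itself is defined:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Cleanup handler: receives a pointer to the annotated variable. */
static void free_ptr(void *p)
{
	free(*(void **)p);
}

/* Sum of i*i for i in [0, n), computed through a temporary array. */
static int sum_squares(size_t n, long *out)
{
	long *buf __attribute__((cleanup(free_ptr))) = NULL;

	assert(n > 0 && out != NULL);

	buf = malloc(n * sizeof(*buf));
	if (!buf)
		return -ENOMEM; /* no explicit free() needed */

	*out = 0;
	for (size_t i = 0; i < n; i++) {
		buf[i] = (long)(i * i);
		*out += buf[i];
	}
	return 0; /* buf is freed automatically on every return path */
}
```

The point is that no return path, including the error path, needs its own
free() call.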

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When freeing the entries from the list there is no need to initialize
the list member in an entry, since we are immediately freeing it. So use
simple list_del() instead of list_del_init().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to call list_del_init() against each entry when freeing
the list, as the list is local and we are freeing the entry.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a bug that there can be conflicting OEs for the same
range:

  BTRFS critical (device loop4): panic in insert_ordered_extent:264: overlapping ordered extents, existing oe file_offset 16384 num_bytes 430080 flags 0x1089, new oe file_offset 16384 num_bytes 430080 flags 0x80 (errno=-17 Object already exists)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ordered-data.c:264!
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
  RIP: 0010:btrfs_alloc_ordered_extent+0x943/0xad0
  Call Trace:
   <TASK>
   cow_file_range+0x744/0x12a0
   fallback_to_cow+0x5ea/0xa00
   run_delalloc_nocow+0x110c/0x17a0
   btrfs_run_delalloc_range+0xbe4/0x1c20
   writepage_delalloc+0x104d/0x1ba0
   btrfs_writepages+0x1667/0x28b0
   do_writepages+0x338/0x560
   filemap_fdatawrite_range+0x1f2/0x300
   btrfs_fdatawrite_range+0x54/0xf0
   btrfs_direct_write+0x6a0/0xc30
   btrfs_do_write_iter+0x329/0x790
   do_iter_readv_writev+0x624/0x8d0
   vfs_writev+0x34c/0x990
   __se_sys_pwritev2+0x17a/0x2a0
   do_syscall_64+0x174/0x580
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>
  ---[ end trace 0000000000000000 ]---

[CAUSE]
Since commit ff66fe6 ("btrfs: fix incorrect buffered IO fallback
for append direct writes"), if the direct IO finished short, we will
revert the isize back to the original one, so that append writes can be
respected during the buffered fallback.

Normally we rely on lock_and_cleanup_extent_if_need() function during
buffered writeback to wait for any existing ordered extents.

But that ordered extent waiting only happens if the start_pos is inside
the isize.
Since we have reverted the isize during failed direct IO, we will not
wait for any ordered extents.

This means we can have a race where the direct IO OE is still in the
tree, finished but not yet removed, then we're inserting the OE for the
buffered write, causing the above crash.

[FIX]
Make the OE wait unconditional, to handle the reverted isize
situation.

And since lock_and_cleanup_extent_if_need() now either locks the
extents or returns -EAGAIN, also remove the branches that handle the
no-extent-locked cases, and rename it to remove the "_if_need" suffix.

The following microbenchmark shows the runtime difference for
btrfs_buffered_write(), running the `xfs_io -f -c "pwrite 0 1m"` workload.
All values are the average runtime in nanoseconds.

      function runtime              |   before    |     after
 -----------------------------------+-------------+---------------
 lock_and_cleanup_extent_if_need()  |     58.2    |    183.0
 btrfs_buffered_write()             |   2115.6    |   2973.3

The overall runtime of btrfs_buffered_write() is still pretty
tiny (less than 3 microseconds), so I'd say the extra cost is
acceptable.

An alternative fix for this problem is to wait for ordered extents during
iomap_end(), where the isize revert is done.

But that solution would break the nowait requirement: if a nowait direct
IO finished short, we would have to wait for the OEs unconditionally, or the
next append buffered IO could still hit the same problem.

So here we have to move the wait cost to buffered write, but at least
the code is slightly more streamlined.

Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
Fixes: ff66fe6 ("btrfs: fix incorrect buffered IO fallback for append direct writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…node()

In btrfs_backref_free_node() we have the following assertion:

  ASSERT(node->eb == NULL, "node->eb->start=%llu", node->eb->start);

and a user reported the following crash:

  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  CPU: 0 UID: 0 PID: 10422 Comm: syz.0.17 Not tainted 7.1.0-02765-g6b5a2b7d9bc1-dirty #44 PREEMPT(full)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:btrfs_backref_free_node fs/btrfs/backref.c:3057 [inline]
  RIP: 0010:btrfs_backref_free_node+0xb9/0x200 fs/btrfs/backref.c:3051
  Code: 00 fc ff (...)
  RSP: 0018:ffa0000006b0f3c0 EFLAGS: 00010246
  RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff840eb78b
  RDX: 0000000000000000 RSI: ffffffff840eafa5 RDI: ff110000742ab768
  RBP: ff110000742ab700 R08: 0000000000000000 R09: 0000000000000000
  R10: ff110000742ab700 R11: 00000000000a81f9 R12: ff11000107a92020
  R13: ff1100005c182ea8 R14: 0000000000000000 R15: dffffc0000000000
  FS:  0000555575536500(0000) GS:ff11000183985000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fa3d0e9d580 CR3: 000000002232a000 CR4: 0000000000753ef0
  PKRU: 00000000
  Call Trace:
   <TASK>
   btrfs_backref_cleanup_node+0x27/0x30 fs/btrfs/backref.c:3133
   relocate_tree_block fs/btrfs/relocation.c:2604 [inline]
   relocate_tree_blocks+0x11b0/0x1a20 fs/btrfs/relocation.c:2707
   relocate_block_group+0x499/0xf30 fs/btrfs/relocation.c:3635
   do_nonremap_reloc fs/btrfs/relocation.c:5323 [inline]
   btrfs_relocate_block_group+0x1749/0x5fb0 fs/btrfs/relocation.c:5490
   btrfs_relocate_chunk+0x12b/0x950 fs/btrfs/volumes.c:3647
   __btrfs_balance fs/btrfs/volumes.c:4586 [inline]
   btrfs_balance+0x1c7f/0x55c0 fs/btrfs/volumes.c:4973
   btrfs_ioctl_balance fs/btrfs/ioctl.c:3474 [inline]
   btrfs_ioctl+0x38a4/0x5d20 fs/btrfs/ioctl.c:5570
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:597 [inline]
   __se_sys_ioctl fs/ioctl.c:583 [inline]
   __x64_sys_ioctl+0x18f/0x210 fs/ioctl.c:583
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0x11f/0x860 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7fb38e3b56dd
   Code: 02 b8 ff (...)
   RSP: 002b:00007fff04115788 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007fb38f6b0020 RCX: 00007fb38e3b56dd
   RDX: 00002000000003c0 RSI: 00000000c4009420 RDI: 0000000000000004
   RBP: 00007fb38e451b48 R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 0000000000000000 R14: 00007fb38f6b0020 R15: 00007fb38f6b002c
   </TASK>

It seems that this happens on some systems for some reason, when the
ASSERT() macro calls the inline function verify_assert_printk_format()
to evaluate the format string and arguments, causing the NULL pointer
dereference on node->eb.

So change the assertion to check for a NULL node->eb before dereferencing
it. Also, while at it, make the assertion more useful by printing the
owner of the extent buffer as well as its level.
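
The failure mode can be sketched in plain userspace C. This is only an illustration, not the btrfs code: CHECK_EAGER(), CHECK_GUARDED(), check_format() and eb_start() are hypothetical names, and the sketch assumes the format-checking helper is an ordinary function, so its arguments are evaluated by the caller before the condition is tested. A counter stands in for the real dereference so the example does not crash.

```c
#include <stddef.h>

struct eb { unsigned long long start; };
struct node { struct eb *eb; };

/* Counts how often the argument expression was evaluated. */
static int arg_evaluations;

/*
 * Stand-in for reading node->eb->start. The real expression would fault
 * on a NULL eb; here we only record that it was evaluated.
 */
static unsigned long long eb_start(const struct node *n)
{
	arg_evaluations++;
	return n->eb ? n->eb->start : 0;
}

/* Hypothetical format checker: an ordinary function call, so all of its
 * arguments are evaluated before it runs. */
static inline void check_format(const char *fmt, ...)
{
	(void)fmt;
}

/* Eager: the arguments are evaluated even when cond holds. */
#define CHECK_EAGER(cond, fmt, ...) \
	do { check_format(fmt, __VA_ARGS__); (void)(cond); } while (0)

/* Guarded: the arguments are evaluated only when cond fails. */
#define CHECK_GUARDED(cond, fmt, ...) \
	do { if (!(cond)) check_format(fmt, __VA_ARGS__); } while (0)

/* Returns how many times the argument was evaluated for a NULL eb. */
static int evaluations_eager(void)
{
	struct node n = { NULL };

	arg_evaluations = 0;
	CHECK_EAGER(n.eb == NULL, "start=%llu", eb_start(&n));
	return arg_evaluations;
}

static int evaluations_guarded(void)
{
	struct node n = { NULL };

	arg_evaluations = 0;
	CHECK_GUARDED(n.eb == NULL, "start=%llu", eb_start(&n));
	return arg_evaluations;
}
```

With the eager form the argument is evaluated once even though the condition is true, which is where a real node->eb->start would fault; the guarded form never touches it.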

Reported-by: Yue Sun <samsun1006219@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/20260626065542.38413-1-samsun1006219@gmail.com/
Fixes: c4e7778 ("btrfs: use verbose assertions in backref.c")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
adam900710 and others added 4 commits July 2, 2026 03:47
Previously btrfs forced direct writes to fall back to buffered ones if the
inode has data checksums or the profile has duplication.

That fallback prevents the content from being modified while the write is
in flight, which could leave the final content mismatching the checksum or
the other mirrors.

That brings a pretty huge performance cost, which already caused some
concern at that time.

But the later upstream commit c9d1148 ("iomap: add a flag to bounce
buffer direct I/O") introduced a new method that copies the content into
new pages and does all the operations on the newly allocated pages.

So let btrfs utilize the new flag for direct writes if we require
stable folios.

Here is a quick benchmark, using the following fio setup:

 fio --name=randwrite --filename $mnt/foobar --ioengine=libaio --size=4G \
     --rw=randwrite --iodepth=64 --runtime=60 --time_based --direct=1 \
     --bs=$blocksize

Unit is MiB/s.

 Blocksize | Zero-copy (*) | Buffered |   Bounce
-----------+---------------+----------+-----------
        4K |          35.1 |     17.1 |      33.8
       64K |           522 |      251 |       492

*: This is done by reverting the commit 968f19c ("btrfs: always
   fallback to buffered write if the inode requires checksum")

Although with page bouncing the performance is only around 95% of
true zero-copy, it's still almost double the performance of the buffered
fallback.

There will be a small change in behavior: since we're using the
IOMAP_DIO_BOUNCE flag to allocate new folios, direct writes with the
NOWAIT flag will immediately fail.

So for true NOWAIT direct IOs, NODATASUM and RAID0/SINGLE profiles are
still required.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The percpu_counter dirty_metadata_bytes is updated by negating eb->len
and passing it to percpu_counter_add_batch(), whose amount parameter is
s64.  Since commit 84cda1a ("btrfs: cache folio size and shift in
extent_buffer"), eb->len is u32.  The u32 result of -eb->len, when
widened to the s64 parameter, becomes a large positive value instead of
the intended negative value.  For eb->len == 16384 the counter adds
+4294950912 instead of subtracting 16384.

The counter therefore grows on every metadata writeback instead of
shrinking by the extent buffer size, permanently exceeding
BTRFS_DIRTY_METADATA_THRESH and causing __btrfs_btree_balance_dirty()
to trigger balance_dirty_pages_ratelimited() unconditionally, adding
unnecessary writeback pressure.

Cast eb->len to s64 before negation at both call sites so the
subtraction is performed in signed 64-bit arithmetic.
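
The arithmetic can be reproduced in userspace C. This is a minimal sketch, assuming a platform where int is 32 bits wide; the function names are hypothetical, and uint32_t/int64_t stand in for the kernel's u32/s64.

```c
#include <stdint.h>

/*
 * Buggy pattern: the negation is done in 32-bit unsigned arithmetic and
 * wraps modulo 2^32, then the result is widened to a signed 64-bit value.
 */
static int64_t delta_unsigned_negation(uint32_t len)
{
	return -len;
}

/*
 * Fixed pattern: widen to signed 64-bit first, then negate, so the
 * result is the intended negative value.
 */
static int64_t delta_signed_negation(uint32_t len)
{
	return -(int64_t)len;
}
```

For len == 16384 the first function yields 4294950912 (2^32 - 16384) and the second yields -16384, matching the numbers above.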

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Fixes: 84cda1a ("btrfs: cache folio size and shift in extent_buffer")
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs_drop_extent_map_range() splits an extent map, the new split
maps inherit the original map's flags through a local 'flags' variable.
Commit f86f7a7 ("btrfs: use the flags of an extent map to identify
the compression type") changed the EXTENT_FLAG_LOGGING clearing to
operate on em->flags instead of that local 'flags' copy, so a split of
an extent map that is currently being logged wrongly inherits
EXTENT_FLAG_LOGGING.

The flag is then never cleared on the split, and when it is freed while
still on the inode's modified_extents list (for example by the extent
map shrinker) it trips the WARN_ON(!list_empty(&em->list)) in
btrfs_free_extent_map() and leads to a use-after-free.

Clear EXTENT_FLAG_LOGGING from the local 'flags' copy used for the
splits and only clear EXTENT_FLAG_PINNED from em->flags, restoring the
behaviour prior to f86f7a7.
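
The difference between the two clearing patterns can be sketched as follows. This is an illustration under assumptions, not the btrfs code: the struct, the flag bit values and the function names are hypothetical, and the sketch assumes the local copy is taken before any flag is cleared.

```c
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define FLAG_PINNED  (1u << 0)
#define FLAG_LOGGING (1u << 1)

struct em { uint32_t flags; };

/*
 * Regressed pattern: both flags are cleared on the original map, so the
 * local copy handed to the splits still carries FLAG_LOGGING.
 */
static uint32_t split_flags_regressed(struct em *em)
{
	uint32_t flags = em->flags;	/* copy inherited by the splits */

	em->flags &= ~(FLAG_PINNED | FLAG_LOGGING);
	return flags;
}

/*
 * Fixed pattern: FLAG_LOGGING is cleared from the local copy used for
 * the splits, and only FLAG_PINNED is cleared from the original map.
 */
static uint32_t split_flags_fixed(struct em *em)
{
	uint32_t flags = em->flags;	/* copy inherited by the splits */

	flags &= ~FLAG_LOGGING;
	em->flags &= ~FLAG_PINNED;
	return flags;
}
```

In the fixed pattern a split never starts out with the logging flag set, while the original map keeps it until logging finishes.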

Fixes: f86f7a7 ("btrfs: use the flags of an extent map to identify the compression type")
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Boris Burkov <boris@bur.io>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'tree_id' parameter in btrfs_search_path_in_tree() was only being
used in order to fetch the root tree to be considered for the
search. For this same reason this function was also requiring a 'struct
btrfs_fs_info' parameter. This commit replaces these two parameters with
a single 'struct btrfs_root' one, which identifies from which root tree
the search should happen.

This function only has one caller, the inode lookup ioctl, which knows
how to provide the root tree for each case. In fact, if args->treeid ==
0, then we don't even have to allocate a new root tree object, and we
can reuse the one provided by the ioctl system call, thus avoiding an
extra allocation.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>