Test for-next ARM64 64K (regular, SELF)#1630
Open
kdave wants to merge 10000 commits into
Since commit bac3c29 ("btrfs: remove 2K block size support") there is no 2K block size support inside btrfs anymore. Remove the stale comments of btrfs_supported_blocksize(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Since v5.15 btrfs has supported block size < page size, but only with a 4K block size, even though there is no special reason that we cannot support 8K/16K/32K block sizes for a 64K page size.

That 4K limit is completely arbitrary, and exists mostly to reduce test runtime, so that we do not need to test all the extra block size combinations.

However, it also limits user choice: some users understand what they are doing and want larger block sizes. In that case the fixed 4K block size for the subpage routine is blocking the way.

Just remove the fixed 4K requirement for block size < page size.

This should not affect regular end users, since mkfs has been using a 4K block size as the default for quite a while, and the existing bs == ps support is always there.

But for power users, this allows extra block sizes, and may provide extra test coverage.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
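The change in accepted block sizes described above can be sketched in userspace C. This is only an illustration under stated assumptions: both function names are hypothetical, the 4K minimum and the power-of-two requirement are assumptions, and any support for block sizes larger than the page size is deliberately ignored. It is not the kernel's btrfs_supported_blocksize().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if v is a non-zero power of two. */
static bool demo_is_power_of_2(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Assumed behavior before the change: a block size is accepted only if it
 * equals the page size or is exactly 4K.
 */
bool demo_blocksize_supported_old(uint32_t blocksize, uint32_t page_size)
{
	assert(demo_is_power_of_2(page_size));
	if (!demo_is_power_of_2(blocksize) || blocksize < 4096)
		return false;
	return blocksize == page_size || blocksize == 4096;
}

/*
 * Assumed behavior after the change: any power-of-two block size from 4K
 * up to the page size is accepted (bs > ps is out of scope for this sketch).
 */
bool demo_blocksize_supported_new(uint32_t blocksize, uint32_t page_size)
{
	assert(demo_is_power_of_2(page_size));
	if (!demo_is_power_of_2(blocksize) || blocksize < 4096)
		return false;
	return blocksize <= page_size;
}
```

With a 64K page size, the old predicate rejects 8K, 16K and 32K while the new one accepts them; 4K and 64K are accepted by both.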
Decentralize transaction aborts in create_reloc_root(), so that it is obvious which call failed and what caused the transaction abort. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
When dumping a tree block, btrfs_header::owner is printed as unsigned, which can result in numbers that are hard to read, e.g.: BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607 For the above output, 18446744073709551607 is (s64)-9, the root id of data reloc tree. Despite those predefined root ids that are already negative, existing subvolume trees will not have any negative values, as subvolume trees can only utilize the lower 48 bits, so there will be no output change for existing subvolumes, thus no extra confusion. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
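To make the readability problem concrete, here is a minimal userspace sketch that formats the same 64-bit owner value both ways. The function names are hypothetical and the kernel uses its own printk formats; the conversion to a signed type assumes a two's complement platform.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format the owner as unsigned: reserved root ids become huge numbers. */
int demo_format_owner_unsigned(char *buf, size_t len, uint64_t owner)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%" PRIu64, owner);
}

/*
 * Format the owner as signed: reserved root ids such as the data reloc
 * tree show up as small negative numbers, while ordinary subvolume ids
 * (which fit in the lower 48 bits) print exactly as before.
 */
int demo_format_owner_signed(char *buf, size_t len, uint64_t owner)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%" PRId64, (int64_t)owner);
}
```

Passing 18446744073709551607 to the signed variant yields "-9", and passing a regular subvolume id such as 256 yields "256" from both variants.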
…erge On a zoned FS, btrfs_delayed_refs_rsv_refill() returns -EAGAIN whenever the over-committed metadata plus the zone_unusable bytes exceeds the usable size in a metadata block-group to avoid heavy over-commit of metadata and early ENOSPC in one transaction. If this happens while doing reclaim, the transaction is getting aborted. Treat -EAGAIN as a soft, retryable condition in case of block-group reclaim. Reported-by: Damien Le Moal <dlemoal@kernel.org> Fixes: 7bcb04d ("btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
The comment is wrong, because it's not about storing the ID of new directories that were already created, instead it's about storing utimes values for directories (both new and existing). It was copy-pasted from SEND_MAX_DIR_CREATED_CACHE_SIZE and never updated afterwards.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…mon prefixes

In case the current inode's path is a prefix of the given path, the helper is_current_inode_path() will return true, which causes the single caller to reset the current inode's path. While this is not a functional issue, it makes the caller recompute the current inode's path later. It could also become a problem in the future in case we get new callers for is_current_inode_path() in more sensitive contexts.

Example: the current inode path is "/foo/bar" and the path we compare against is "/foo/bar_xyz".

Fix this by returning true only if we have exact matches.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
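The difference between a prefix match and an exact match can be sketched with plain C strings. This is an illustration only: the real helper works on the send code's own path structure, and both function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Prefix-only comparison: compares just the first strlen(cur) bytes, so
 * "/foo/bar" is reported as matching "/foo/bar_xyz".
 */
bool demo_path_matches_prefix_only(const char *cur, const char *path)
{
	assert(cur != NULL && path != NULL);
	return strncmp(cur, path, strlen(cur)) == 0;
}

/*
 * Exact comparison: the lengths must agree before the bytes are compared,
 * so a longer path sharing a common prefix no longer matches.
 */
bool demo_path_matches_exact(const char *cur, const char *path)
{
	size_t cur_len, path_len;

	assert(cur != NULL && path != NULL);
	cur_len = strlen(cur);
	path_len = strlen(path);
	return cur_len == path_len && memcmp(cur, path, cur_len) == 0;
}
```

For the example paths from the commit message, the prefix-only check returns true and the exact check returns false; for two identical paths both return true.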
[BUG]
There is a syzbot report that the check inside get_new_location() triggered:

  BTRFS info (device loop0): found 31 extents, stage: move data extents
  BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607
    item 0 key (256 INODE_ITEM 0) itemoff 3835 itemsize 160
      inode generation 5 transid 0 size 0 nbytes 0
      block group 0 mode 40755 links 1 uid 0 gid 0
      rdev 0 sequence 0 flags 0x0
      atime 1669132761.0
      ctime 1669132761.0
      mtime 1669132761.0
      otime 0.0
    item 1 key (256 INODE_REF 256) itemoff 3823 itemsize 12
      index 0 name_len 2
    item 2 key (258 INODE_ITEM 0) itemoff 3663 itemsize 160
      inode generation 1 transid 16 size 733184 nbytes 106496
      block group 0 mode 100600 links 0 uid 0 gid 0
      rdev 0 sequence 24 flags 0x18
    item 3 key (258 EXTENT_DATA 0) itemoff 3595 itemsize 68
      generation 16 type 0
      inline extent data size 47 ram_bytes 4096 compression 1
    [...]
    item 27 key (18446744073709551611 ORPHAN_ITEM 258) itemoff 2376 itemsize 0
  BTRFS error (device loop0): unexpected non-zero offset in file extent item for data reloc inode 258 key offset 0 offset 9277520992061368337
  ------------[ cut here ]------------
  btrfs_abort_should_print_stack(__error)

[CAUSE]
The above dump tree shows the first file extent item is inlined, which should make no sense for data reloc inodes, as such inodes just represent where the data extents are in the relocation destination chunk.

However the relocation path preallocates space for each block, then dirties them, cluster by cluster.

It's possible to have a single block at the beginning of the block group, and no other block in the same cluster. So relocation will preallocate a file extent for that block and dirty the first block.

Then memory pressure forces the data reloc inode to be written back, before any other blocks are dirtied/allocated.

Finally commit 3eaf5f0 ("btrfs: extract inlined creation into a dedicated delalloc helper") changed the sequence of delalloc.

Before that commit we always tried NOCOW first, so that dirtied block would be written back into the preallocated space, and appear as a regular extent.

But with that commit, we always try inline first, and since compression is forced, we try compressing the first block, and then inline the compressed data, resulting in the above inlined file extent in the data reloc tree.

Then the check in get_new_location() will check the file offset, without checking if the file extent is inlined or not, resulting in the above failure.

[FIX]
Do not allow compression for data reloc inodes.

Since data reloc inode sizes are always block aligned, as long as we do not compress, @data_len will always be at least one block, and that will cause can_cow_file_range_inline() to return false, thus no inlined extent will be created.

Reported-by: syzbot+d950c6ba09b79f6e1864@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a373dc5.764cf64f.168fbe.0001.GAE@google.com/
Fixes: 3eaf5f0 ("btrfs: extract inlined creation into a dedicated delalloc helper")
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit a6908f8 ("btrfs: validate data reloc tree file extent item members") introduced extra checks on file extent items for data reloc inodes, but it checked the file extent offset without checking if the file extent is inlined. This can lead to either false alerts (as the offset member is inside the inlined data) or even reading beyond the item range. This has already triggered a warning in a syzbot report. Although the root fix is to avoid compression for data reloc inodes, for the sake of consistency, reject inlined file extents first. Fixes: a6908f8 ("btrfs: validate data reloc tree file extent item members") CC: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The nodesize and sectorsize are all u32 values, there is no need to use u64 for local usage. Furthermore some call sites also use "blocksize" or "bs" for sectorsize, also change them to use the minimal type u32 instead. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs does not support variable stripe length yet, all RAID0/5/6/10 chunks have the fixed stripe length 64K for now. Furthermore, btrfs_fs_info::stripesize is not the real chunk stripe length, it's always the same value as sectorsize. Remove btrfs_fs_info::stripesize, and for the only callsite utilizing that member, replace it with fs_info->sectorsize instead. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…etattr()
btrfs_getattr() unconditionally reads BTRFS_I(inode)->new_delalloc_bytes
and adds it (sector-aligned) to stat->blocks for every inode type.
However, new_delalloc_bytes lives in a union with last_dir_index_offset:
union {
u64 new_delalloc_bytes; /* files only */
u64 last_dir_index_offset; /* directories only */
};
For a directory inode this memory holds last_dir_index_offset, which is
set during directory logging (e.g. flush_dir_items_batch()) to the
offset of the last logged BTRFS_DIR_INDEX_KEY. That offset grows with
the number of entries ever created in the directory (dir indexes are
monotonic and never reused), so it can be arbitrarily large.
As a result, after a directory has been logged (e.g. via an fsync that
triggers directory logging), btrfs_getattr() reports inflated st_blocks
for that directory. The inflation is purely in-core and disappears
after the inode is evicted and reloaded (btrfs_alloc_inode() zeroes the
union), e.g. after a remount.
Reproducer (on a btrfs filesystem):
D=/mnt/btrfs/d
mkdir -p $D
for i in $(seq 1 20000); do touch $D/f$i; done
sync # commit, push dir index high
touch $D/trigger # dirty the dir in a new transaction
xfs_io -c fsync $D # log the directory -> sets last_dir_index_offset
stat -c '%b' $D # st_blocks is now inflated (e.g. 40)
# umount + mount -> st_blocks drops back to the correct value
The evict path already knows this union is type-dependent and guards the
corresponding WARN_ON with !S_ISDIR() in btrfs_destroy_inode(); only
btrfs_getattr() was missing the equivalent check.
Only read new_delalloc_bytes for regular files, which are the only
inodes that ever set it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
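The effect of reading the wrong union member can be reproduced in a small userspace sketch. Everything here is a hypothetical stand-in: struct demo_inode only mirrors the union from the commit message, and the block count formula (both values rounded up to the sector size, then divided by 512) is an assumption about how st_blocks is derived.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-core inode. */
struct demo_inode {
	bool is_dir;
	union {
		uint64_t new_delalloc_bytes;    /* files only */
		uint64_t last_dir_index_offset; /* directories only */
	};
};

/* Round v up to a multiple of align, which must be a power of two. */
static uint64_t demo_align_up(uint64_t v, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (v + align - 1) & ~(align - 1);
}

/* Unguarded: reads the union as delalloc bytes for every inode type. */
uint64_t demo_blocks_unguarded(const struct demo_inode *inode,
			       uint64_t inode_bytes, uint32_t sectorsize)
{
	uint64_t delalloc = inode->new_delalloc_bytes;

	return (demo_align_up(inode_bytes, sectorsize) +
		demo_align_up(delalloc, sectorsize)) >> 9;
}

/* Guarded: only non-directories contribute delalloc bytes. */
uint64_t demo_blocks_guarded(const struct demo_inode *inode,
			     uint64_t inode_bytes, uint32_t sectorsize)
{
	uint64_t delalloc = inode->is_dir ? 0 : inode->new_delalloc_bytes;

	return (demo_align_up(inode_bytes, sectorsize) +
		demo_align_up(delalloc, sectorsize)) >> 9;
}
```

For a directory whose last_dir_index_offset is 20002 and a 4096-byte sector size, the unguarded variant adds 20480 bytes, which is 40 blocks of 512 bytes, while the guarded variant adds nothing.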
…tion

While running fsstress with autodefrag and flushoncommit, I hit a deadlock due to the fact that defrag reserves delalloc space while it's holding dirty and locked folios, besides the extent range lock. The stack traces are the following:

  [958.624] task:kworker/u50:3 state:D stack:0 pid:20365 tgid:20365 ppid:2 task_flags:0x4208060 flags:0x00080000
  [958.626] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
  [958.627] Call Trace:
  [958.628] <TASK>
  [958.628] __schedule+0x4be/0x10f0
  [958.629] ? preempt_count_add+0x69/0xa0
  [958.630] schedule+0x26/0xd0
  [958.631] wait_current_trans+0x102/0x160 [btrfs]
  [958.632] ? __pfx_autoremove_wake_function+0x10/0x10
  [958.633] start_transaction+0x374/0x900 [btrfs]
  [958.634] btrfs_commit_current_transaction+0x1d/0x70 [btrfs]
  [958.635] flush_space+0xca/0x5e0 [btrfs]
  [958.636] ? _raw_spin_unlock+0x15/0x30
  [958.637] ? btrfs_reduce_alloc_profile+0x8c/0x190 [btrfs]
  [958.639] ? _raw_spin_unlock+0x15/0x30
  [958.640] ? calc_available_free_space.isra.0+0x6f/0x110 [btrfs]
  [958.641] do_async_reclaim_metadata_space+0x84/0x190 [btrfs]
  [958.642] btrfs_async_reclaim_metadata_space+0x64/0x80 [btrfs]
  [958.644] process_one_work+0x19d/0x3a0
  [958.644] worker_thread+0x1c4/0x330
  [958.645] ? __pfx_worker_thread+0x10/0x10
  [958.646] kthread+0xfc/0x130
  [958.647] ? __pfx_kthread+0x10/0x10
  [958.648] ret_from_fork+0x1f7/0x2c0
  [958.648] ? __pfx_kthread+0x10/0x10
  [958.649] ret_from_fork_asm+0x1a/0x30
  [958.650] </TASK>
  [958.651] task:kworker/u49:7 state:D stack:0 pid:52990 tgid:52990 ppid:2 task_flags:0x4208060 flags:0x00080000
  [958.653] Workqueue: writeback wb_workfn (flush-btrfs-334)
  [958.655] Call Trace:
  [958.655] <TASK>
  [958.656] __schedule+0x4be/0x10f0
  [958.657] ? __blk_flush_plug+0xe9/0x140
  [958.658] schedule+0x26/0xd0
  [958.658] io_schedule+0x42/0x70
  [958.659] folio_wait_bit_common+0x12b/0x330
  [958.660] ? folio_wait_bit_common+0x100/0x330
  [958.662] ? __pfx_wake_page_function+0x10/0x10
  [958.663] extent_write_cache_pages+0x599/0x830 [btrfs]
  [958.664] ? acpi_fwnode_get_reference_args+0x1fa/0x270
  [958.665] btrfs_writepages+0x77/0x130 [btrfs]
  [958.666] ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
  [958.667] do_writepages+0xc6/0x160
  [958.668] __writeback_single_inode+0x42/0x310
  [958.669] writeback_sb_inodes+0x231/0x570
  [958.670] wb_writeback+0x8a/0x340
  [958.671] wb_workfn+0xbf/0x450
  [958.672] ? finish_task_switch.isra.0+0xc1/0x350
  [958.673] process_one_work+0x19d/0x3a0
  [958.673] worker_thread+0x1c4/0x330
  [958.674] ? __pfx_worker_thread+0x10/0x10
  [958.675] kthread+0xfc/0x130
  [958.676] ? __pfx_kthread+0x10/0x10
  [958.676] ret_from_fork+0x1f7/0x2c0
  [958.677] ? __pfx_kthread+0x10/0x10
  [958.678] ret_from_fork_asm+0x1a/0x30
  [958.679] </TASK>
  [958.679] task:btrfs-cleaner state:D stack:0 pid:296750 tgid:296750 ppid:2 task_flags:0x208040 flags:0x00080000
  [958.681] Call Trace:
  [958.682] <TASK>
  [958.682] __schedule+0x4be/0x10f0
  [958.683] schedule+0x26/0xd0
  [958.684] handle_reserve_ticket+0x1b9/0x2c0 [btrfs]
  [958.685] ? __pfx_autoremove_wake_function+0x10/0x10
  [958.686] reserve_bytes+0x283/0x4c0 [btrfs]
  [958.687] btrfs_reserve_metadata_bytes+0x18/0xb0 [btrfs]
  [958.688] btrfs_delalloc_reserve_metadata+0x121/0x320 [btrfs]
  [958.690] btrfs_delalloc_reserve_space+0x46/0xb0 [btrfs]
  [958.691] btrfs_defrag_file+0x903/0x1110 [btrfs]
  [958.692] btrfs_run_defrag_inodes+0x334/0x430 [btrfs]
  [958.694] cleaner_kthread+0x97/0x1c0 [btrfs]
  [958.694] ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
  [958.696] kthread+0xfc/0x130
  [958.696] ? __pfx_kthread+0x10/0x10
  [958.697] ret_from_fork+0x1f7/0x2c0
  [958.698] ? __pfx_kthread+0x10/0x10
  [958.699] ret_from_fork_asm+0x1a/0x30
  [958.700] </TASK>
  [958.716] task:fsstress state:D stack:0 pid:296769 tgid:296769 ppid:296768 task_flags:0x400140 flags:0x00080000
  [958.718] Call Trace:
  [958.719] <TASK>
  [958.719] __schedule+0x4be/0x10f0
  [958.720] ? preempt_count_add+0x69/0xa0
  [958.721] schedule+0x26/0xd0
  [958.722] wb_wait_for_completion+0x79/0xc0
  [958.723] ? __pfx_autoremove_wake_function+0x10/0x10
  [958.724] __writeback_inodes_sb_nr+0xc5/0xf0
  [958.725] try_to_writeback_inodes_sb+0x55/0x70
  [958.726] btrfs_commit_transaction+0x19d/0xeb0 [btrfs]
  [958.727] ? start_transaction+0x343/0x900 [btrfs]
  [958.728] btrfs_mksubvol+0x28b/0x4e0 [btrfs]
  [958.729] btrfs_mksnapshot+0x74/0xa0 [btrfs]
  [958.730] __btrfs_ioctl_snap_create+0x194/0x210 [btrfs]
  [958.732] btrfs_ioctl_snap_create_v2+0xef/0x150 [btrfs]
  [958.733] btrfs_ioctl+0x7ec/0x2a70 [btrfs]
  [958.734] ? __virt_addr_valid+0xe4/0x180
  [958.735] ? __check_object_size+0x1cd/0x1f0
  [958.736] ? kmem_cache_free+0x146/0x380
  [958.737] ? _raw_spin_unlock+0x15/0x30
  [958.738] ? do_sys_openat2+0x83/0xd0
  [958.739] __x64_sys_ioctl+0x92/0xe0
  [958.740] do_syscall_64+0x60/0x590
  [958.741] ? clear_bhb_loop+0x60/0xb0
  [958.742] entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [958.743] RIP: 0033:0x7f4431e108db
  [958.744] RSP: 002b:00007ffcd147db20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [958.746] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f4431e108db
  [958.747] RDX: 00007ffcd147eb90 RSI: 0000000050009417 RDI: 0000000000000005
  [958.749] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [958.751] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffcd147fbf0
  [958.752] R13: 00007ffcd147eb90 R14: 0000000000000005 R15: 0000000000000003
  [958.754] </TASK>

What happens is the following:

1) The cleaner kthread is running autodefrag, and in defrag_one_range() it acquired all the folios for the range and locked them. Then it locked the extent range in the inode's iotree. It got two subranges from defrag_collect_targets(), the first one with folio A and the second one with folio B. After it defragged the first subrange, folio A remains locked and dirty - it's only unlocked when defrag_one_range() returns. When it attempts to defrag the second subrange (containing folio B), btrfs_delalloc_reserve_space() creates a space reservation ticket, due to lack of free metadata space, and blocks waiting for the async metadata reclaim task to free space and wake it up;

2) The async reclaim metadata task attempts to commit the current transaction, but it blocks because there is another task that started the commit first;

3) A task creating a snapshot is committing the transaction and because the fs was mounted with flushoncommit, it calls try_to_writeback_inodes_sb(), which spawns a task to flush delalloc and waits for it to complete;

4) The task flushing delalloc (kworker/u49:7) finds that folio A for the inode being defragged is dirty, so it tries to lock it... But it blocks because folio A is locked by the defrag task (the cleaner kthread) which is blocked waiting for the reservation ticket to be served, but the async reclaim metadata task is blocked waiting for the transaction commit, which in turn is blocked waiting for the delalloc flush task, which is trying to lock folio A, resulting in a deadlock.

The same type of problem can happen if the async reclaim task starts to flush delalloc, as that requires locking both the folio and the extent range in the inode's io tree, and in this case we don't need the fs to be mounted with flushoncommit.

This type of problem has occurred several times in the past, with reflinks for example, where we had a dirty folio while holding the extent range locked and then starting a transaction blocked waiting for the async reclaim task due to lack of free metadata space.

So fix this by reserving delalloc space before locking folios and locking the extent range in the inode's iotree. We cannot simply unlock the folios for each subrange given by defrag_collect_targets() after we defrag it, because the same folio may also be present in the next subrange (due to large folios).
Fixes: 22b398e ("btrfs: defrag: introduce helper to defrag a contiguous prepared range") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Syzbot reported the following warning recently:
[157.672][ T6611] BTRFS info (device loop0): turning on flush-on-commit
[157.672][ T6611] BTRFS info (device loop0): enabling free space tree
[157.672][ T6611] BTRFS info (device loop0): enabling auto defrag
[157.672][ T6611] BTRFS info (device loop0): use lzo compression, level 1
[157.672][ T6611] BTRFS info (device loop0): max_inline set to 4096
[158.094][ T5608] BTRFS info (device loop2): last unmount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
[160.073][ T6656] BTRFS info (device loop0 state M): max_inline set to 4096
[160.418][ T5611] BTRFS info (device loop0): last unmount of filesystem ab8108e1-bea5-4a9f-94c9-a3ff208d732a
[160.432][ T6662] loop2: detected capacity change from 0 to 32768
[160.438][ T6662] BTRFS: device fsid c9fe44da-de57-406a-8241-57ec7d4412cf devid 1 transid 8 /dev/loop2 (7:2) scanned by syz.2.74 (6662)
[160.459][ T6662] BTRFS info (device loop2): first mount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
[160.459][ T6662] BTRFS info (device loop2): using crc32c checksum algorithm
[160.634][ T1187] ------------[ cut here ]------------
[160.634][ T1187] test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state)
[160.634][ T1187] WARNING: fs/btrfs/inode.c:3596 at btrfs_add_delayed_iput+0x2e3/0x340, CPU#0: kworker/u8:10/1187
[160.634][ T1187] Modules linked in:
[160.634][ T1187] CPU: 0 UID: 0 PID: 1187 Comm: kworker/u8:10 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
[160.634][ T1187] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
[160.634][ T1187] Workqueue: btrfs-endio-write btrfs_work_helper
[160.634][ T1187] RIP: 0010:btrfs_add_delayed_iput+0x2e3/0x340
[160.634][ T1187] Code: 53 a3 45 (...)
[160.634][ T1187] RSP: 0018:ffffc900065d77c8 EFLAGS: 00010293
[160.634][ T1187] RAX: ffffffff83e5f502 RBX: ffff88805aba0000 RCX: ffff888029768000
[160.634][ T1187] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[160.634][ T1187] RBP: dffffc0000000000 R08: 0000000000000000 R09: 0000000000000000
[160.634][ T1187] R10: dffffc0000000000 R11: ffffed100b574497 R12: 0000000000000001
[160.634][ T1187] R13: dffffc0000000000 R14: ffff888061194788 R15: 0000000000000200
[160.634][ T1187] FS: 0000000000000000(0000) GS:ffff888126186000(0000) knlGS:0000000000000000
[160.634][ T1187] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[160.634][ T1187] CR2: 00007fe553a3f000 CR3: 00000000596c2000 CR4: 00000000003526f0
[160.634][ T1187] Call Trace:
[160.634][ T1187] <TASK>
[160.634][ T1187] btrfs_put_ordered_extent+0x18f/0x430
[160.634][ T1187] btrfs_finish_one_ordered+0xf63/0x2680
[160.634][ T1187] ? __pfx_btrfs_finish_one_ordered+0x10/0x10
[160.634][ T1187] ? do_raw_spin_lock+0x12b/0x2f0
[160.634][ T1187] ? lock_acquire+0x106/0x350
[160.634][ T1187] ? __pfx_do_raw_spin_lock+0x10/0x10
[160.634][ T1187] btrfs_work_helper+0x38b/0xc20
[160.634][ T1187] ? process_scheduled_works+0xa70/0x1860
[160.634][ T1187] process_scheduled_works+0xb5d/0x1860
[160.634][ T1187] ? __pfx_process_scheduled_works+0x10/0x10
[160.634][ T1187] ? assign_work+0x3d5/0x5e0
[160.634][ T1187] worker_thread+0xa53/0xfc0
[160.634][ T1187] kthread+0x388/0x470
[160.634][ T1187] ? __pfx_worker_thread+0x10/0x10
[160.635][ T1187] ? __pfx_kthread+0x10/0x10
[160.635][ T1187] ret_from_fork+0x514/0xb70
[160.635][ T1187] ? __pfx_ret_from_fork+0x10/0x10
[160.635][ T1187] ? __switch_to+0xc79/0x1410
[160.635][ T1187] ? __pfx_kthread+0x10/0x10
[160.635][ T1187] ret_from_fork_asm+0x1a/0x30
[160.635][ T1187] </TASK>
[160.635][ T1187] Kernel panic - not syncing: kernel: panic_on_warn set ...
It means we add a delayed iput created after we last ran delayed iputs in
close_ctree() and set the flag BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info.
This happens when using autodefrag and is more likely to happen if we use
flushoncommit too. The steps are the following:
1) Unmount starts, all delalloc is flushed and we enter close_ctree();
2) In close_ctree() we park the cleaner kthread, but while we wait for it
to park, it's in:
btrfs_run_defrag_inodes()
btrfs_run_defrag_inode()
btrfs_defrag_file()
defrag_one_cluster()
defrag_one_range()
defrag_one_locked_target()
And dirties some folios from an inode;
3) The cleaner kthread parks and we proceed in close_ctree(), waiting
for all ordered extents, running delayed iputs and setting the flag
BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info;
4) Later in close_ctree() we call btrfs_commit_super(), which commits the
current transaction. Because we are mounted with flushoncommit, the
transaction commit flushes delalloc and waits for the resulting ordered
extent to complete;
5) The ordered extents from the flushed delalloc created by autodefrag
complete and create delayed iputs, triggering the warning:
WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state));
in btrfs_add_delayed_iput()
6) Further below in close_ctree() we will hit the following assertion:
ASSERT(list_empty(&fs_info->delayed_iputs));
Since we don't expect any more delayed iputs.
Fix this by flushing delalloc and waiting for the ordered extents right
after we parked the cleaner kthread and waiting for autodefrag in
close_ctree().
Reported-by: syzbot+6a843bf8604711c8fab0@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1ee507.b4221f80.1326c5.0004.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to have one list for each loop to defrag each subrange and then another one to free each subrange (struct defrag_target_range). We can do it in a single loop, freeing each subrange after defragging, plus no need to delete each subrange from the list since we immediately free it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Use AUTO_KFREE() for the folios array, avoiding two kfree() calls, one of them in a very specific error path. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
When freeing the entries from the list there is no need to initialize the list member in an entry, since we are immediately freeing it. So use simple list_del() instead of list_del_init(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to call list_del_init() against each entry when freeing the list, as the list is local and we are freeing the entry. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a bug that there can be conflicting OEs for the same range:

  BTRFS critical (device loop4): panic in insert_ordered_extent:264: overlapping ordered extents, existing oe file_offset 16384 num_bytes 430080 flags 0x1089, new oe file_offset 16384 num_bytes 430080 flags 0x80 (errno=-17 Object alrea
  [ 179.162726][ T6897] BTRFS critical (device loop4): panic in insert_ordered_extent:264: overlapping ordered extents, existing oe file_offset 16384 num_bytes 430080 flags 0x1089, new oe file_offset 16384 num_bytes 430080 flags 0x80 (errno=-17 Object already exists)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ordered-data.c:264!
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
  RIP: 0010:btrfs_alloc_ordered_extent+0x943/0xad0
  Call Trace:
  <TASK>
  cow_file_range+0x744/0x12a0
  fallback_to_cow+0x5ea/0xa00
  run_delalloc_nocow+0x110c/0x17a0
  btrfs_run_delalloc_range+0xbe4/0x1c20
  writepage_delalloc+0x104d/0x1ba0
  btrfs_writepages+0x1667/0x28b0
  do_writepages+0x338/0x560
  filemap_fdatawrite_range+0x1f2/0x300
  btrfs_fdatawrite_range+0x54/0xf0
  btrfs_direct_write+0x6a0/0xc30
  btrfs_do_write_iter+0x329/0x790
  do_iter_readv_writev+0x624/0x8d0
  vfs_writev+0x34c/0x990
  __se_sys_pwritev2+0x17a/0x2a0
  do_syscall_64+0x174/0x580
  entry_SYSCALL_64_after_hwframe+0x77/0x7f
  </TASK>
  ---[ end trace 0000000000000000 ]---

[CAUSE]
Since commit ff66fe6 ("btrfs: fix incorrect buffered IO fallback for append direct writes"), if the direct IO finished short, we will revert the isize back to the original one, so that append writes can be respected during the buffered fallback.

Normally we rely on the lock_and_cleanup_extent_if_need() function during buffered writes to wait for any existing ordered extents. But that ordered extent waiting only happens if the start_pos is inside the isize.

Since we have reverted the isize during failed direct IO, we will not wait for any ordered extents.

This means we can have a race where the direct IO OE is still in the tree, finished but not yet removed, and then we insert the OE for the buffered write, causing the above crash.

[FIX]
Make the OE wait unconditional, to handle the reverted isize situation.

And since lock_and_cleanup_extent_if_need() now either locks the extents or returns -EAGAIN, also remove the branches that handle the no-extent-locked cases, and rename it to remove the "_if_need" suffix.

The following micro benchmark shows the runtime difference for btrfs_buffered_write(), doing an `xfs_io -f -c "pwrite 0 1m"` workload. All values are the average runtime in nanoseconds.

  function runtime                   | before      | after
  -----------------------------------+-------------+---------------
  lock_and_cleanup_extent_if_need()  | 58.2        | 183.0
  btrfs_buffered_write()             | 2115.6      | 2973.3

The overall runtime of btrfs_buffered_write() is still pretty tiny (still less than 3 microseconds), I'd say the extra cost is still acceptable.

An alternative fix is to wait for ordered extents during iomap_end(), where the isize revert is done. But that solution would break the nowait requirement: if a nowait direct IO finished short, we would have to wait for the OEs unconditionally, or the next append buffered IO could still hit the same problem.

So here we have to move the wait cost to the buffered write, but at least the code is slightly more streamlined.

Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
Fixes: ff66fe6 ("btrfs: fix incorrect buffered IO fallback for append direct writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…node()

In btrfs_backref_free_node() we have the following assertion:

  ASSERT(node->eb == NULL, "node->eb->start=%llu", node->eb->start);

and a user reported the following crash:

  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  CPU: 0 UID: 0 PID: 10422 Comm: syz.0.17 Not tainted 7.1.0-02765-g6b5a2b7d9bc1-dirty #44 PREEMPT(full)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:btrfs_backref_free_node fs/btrfs/backref.c:3057 [inline]
  RIP: 0010:btrfs_backref_free_node+0xb9/0x200 fs/btrfs/backref.c:3051
  Code: 00 fc ff (...)
  RSP: 0018:ffa0000006b0f3c0 EFLAGS: 00010246
  RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff840eb78b
  RDX: 0000000000000000 RSI: ffffffff840eafa5 RDI: ff110000742ab768
  RBP: ff110000742ab700 R08: 0000000000000000 R09: 0000000000000000
  R10: ff110000742ab700 R11: 00000000000a81f9 R12: ff11000107a92020
  R13: ff1100005c182ea8 R14: 0000000000000000 R15: dffffc0000000000
  FS: 0000555575536500(0000) GS:ff11000183985000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fa3d0e9d580 CR3: 000000002232a000 CR4: 0000000000753ef0
  PKRU: 00000000
  Call Trace:
  <TASK>
  btrfs_backref_cleanup_node+0x27/0x30 fs/btrfs/backref.c:3133
  relocate_tree_block fs/btrfs/relocation.c:2604 [inline]
  relocate_tree_blocks+0x11b0/0x1a20 fs/btrfs/relocation.c:2707
  relocate_block_group+0x499/0xf30 fs/btrfs/relocation.c:3635
  do_nonremap_reloc fs/btrfs/relocation.c:5323 [inline]
  btrfs_relocate_block_group+0x1749/0x5fb0 fs/btrfs/relocation.c:5490
  btrfs_relocate_chunk+0x12b/0x950 fs/btrfs/volumes.c:3647
  __btrfs_balance fs/btrfs/volumes.c:4586 [inline]
  btrfs_balance+0x1c7f/0x55c0 fs/btrfs/volumes.c:4973
  btrfs_ioctl_balance fs/btrfs/ioctl.c:3474 [inline]
  btrfs_ioctl+0x38a4/0x5d20 fs/btrfs/ioctl.c:5570
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:597 [inline]
  __se_sys_ioctl fs/ioctl.c:583 [inline]
  __x64_sys_ioctl+0x18f/0x210 fs/ioctl.c:583
  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
  do_syscall_64+0x11f/0x860 arch/x86/entry/syscall_64.c:94
  entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7fb38e3b56dd
  Code: 02 b8 ff (...)
  RSP: 002b:00007fff04115788 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fb38f6b0020 RCX: 00007fb38e3b56dd
  RDX: 00002000000003c0 RSI: 00000000c4009420 RDI: 0000000000000004
  RBP: 00007fb38e451b48 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  R13: 0000000000000000 R14: 00007fb38f6b0020 R15: 00007fb38f6b002c
  </TASK>

It seems that this happens on some systems for some reason, when the ASSERT() macro calls the inline function verify_assert_printk_format() to evaluate the format string and arguments, causing the NULL pointer dereference on node->eb.

So change the assertion to check for a NULL node->eb before dereferencing it. Also, while at it, make the assertion more useful by printing the owner of the extent buffer as well as its level.

Reported-by: Yue Sun <samsun1006219@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/20260626065542.38413-1-samsun1006219@gmail.com/
Fixes: c4e7778 ("btrfs: use verbose assertions in backref.c")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
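The hazard described above, where assertion arguments are evaluated even though the condition holds, can be shown with a userspace sketch. All names here are hypothetical: DEMO_ASSERT and demo_verify_format() only imitate the shape the commit message describes (a format-checking function called unconditionally), and the `format` attribute is a GCC/Clang extension. The real btrfs macro may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical format checker: it exists only so the compiler can validate
 * the format string against the arguments. Calling it still evaluates
 * every argument expression at the call site.
 */
__attribute__((format(printf, 1, 2)))
static inline void demo_verify_format(const char *fmt, ...)
{
	assert(fmt != NULL);
}

/* Hypothetical verbose assertion: the checker runs before cond is tested. */
#define DEMO_ASSERT(cond, fmt, ...)                                      \
	do {                                                             \
		demo_verify_format(fmt, __VA_ARGS__);                    \
		if (!(cond)) {                                           \
			fprintf(stderr, "assertion failed: " fmt "\n",   \
				__VA_ARGS__);                            \
			abort();                                         \
		}                                                        \
	} while (0)

struct demo_eb {
	unsigned long long start;
};

/* Counts how often the argument expression is evaluated. */
int demo_evaluations;

/* NULL-safe accessor: never dereferences a NULL extent buffer. */
unsigned long long demo_start(const struct demo_eb *eb)
{
	demo_evaluations++;
	return eb ? eb->start : 0;
}
```

Calling `DEMO_ASSERT(eb == NULL, "eb->start=%llu", demo_start(eb))` with a NULL `eb` passes the assertion, yet `demo_evaluations` ends up at 1: the argument was evaluated anyway. Writing `eb->start` directly in that position would dereference NULL in exactly the case the assertion expects, which is why the argument must be NULL-safe.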
Previously btrfs forced direct writes to fall back to buffered ones if the inode has data checksums or the profile has duplication.

That fallback prevents the content from being modified while the write is in flight, which could leave the final content mismatched with the checksum or the other mirrors.

It comes at a substantial performance cost, which already caused some concern at that time.

Later, upstream commit c9d1148 ("iomap: add a flag to bounce buffer direct I/O") introduced a new method that copies the content into new pages and performs all operations on the newly allocated pages.

So let btrfs utilize the new flag for direct writes if we require stable folios.

Here is a quick benchmark, using the following fio setup:

  fio --name=randwrite --filename $mnt/foobar --ioengine=libaio --size=4G \
      --rw=randwrite --iodepth=64 --runtime=60 --time_based --direct=1 \
      --bs=$blocksize

Unit is MiB/s.

  Blocksize  | Zero-copy (*) | Buffered | Bounce
  -----------+---------------+----------+-----------
  4K         | 35.1          | 17.1     | 33.8
  64K        | 522           | 251      | 492

*: This is done by reverting the commit 968f19c ("btrfs: always fallback to buffered write if the inode requires checksum")

Although with page bouncing the performance is only around 95% of true zero-copy, it's still almost double the performance of the buffered fallback.

There is a small change in behavior: since the IOMAP_DIO_BOUNCE flag allocates new folios, direct writes with the NOWAIT flag will fail immediately. So for true NOWAIT direct IOs, NODATASUM and RAID0/SINGLE profiles are still required.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The percpu_counter dirty_metadata_bytes is updated by negating eb->len and passing it to percpu_counter_add_batch(), whose amount parameter is s64. Since commit 84cda1a ("btrfs: cache folio size and shift in extent_buffer"), eb->len is u32. The u32 result of -eb->len, when widened to the s64 parameter, becomes a large positive value instead of the intended negative value. For eb->len == 16384 the counter adds +4294950912 instead of subtracting 16384. The counter therefore grows on every metadata writeback instead of shrinking by the extent buffer size, permanently exceeding BTRFS_DIRTY_METADATA_THRESH and causing __btrfs_btree_balance_dirty() to trigger balance_dirty_pages_ratelimited() unconditionally, adding unnecessary writeback pressure. Cast eb->len to s64 before negation at both call sites so the subtraction is performed in signed 64-bit arithmetic. Reviewed-by: Filipe Manana <fdmanana@suse.com> Fixes: 84cda1a ("btrfs: cache folio size and shift in extent_buffer") Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
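The arithmetic in the message above can be reproduced in user space. This is a minimal sketch under stated assumptions: counter_add() is a hypothetical plain accumulator standing in for percpu_counter_add_batch(), and uint32_t / int64_t stand in for the kernel's u32 / s64; only the integer conversion behaviour is being shown.

```c
#include <stdint.h>

/* Stand-in for the percpu counter update: the amount parameter is signed 64-bit. */
static int64_t counter_add(int64_t counter, int64_t amount)
{
	return counter + amount;
}

/*
 * Buggy pattern: the negation is done in unsigned 32-bit arithmetic, so the
 * result wraps to 2^32 - len before it is widened to the signed parameter.
 */
static int64_t sub_len_buggy(int64_t counter, uint32_t len)
{
	return counter_add(counter, -len);
}

/* Fixed pattern: widen to signed 64-bit first, then negate. */
static int64_t sub_len_fixed(int64_t counter, uint32_t len)
{
	return counter_add(counter, -(int64_t)len);
}
```

With len == 16384 and a counter starting at 0, sub_len_buggy() yields 4294950912 (that is, 2^32 - 16384), the value quoted in the commit message, while sub_len_fixed() yields -16384.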
When btrfs_drop_extent_map_range() splits an extent map, the new split maps inherit the original map's flags through a local 'flags' variable. Commit f86f7a7 ("btrfs: use the flags of an extent map to identify the compression type") changed the EXTENT_FLAG_LOGGING clearing to operate on em->flags instead of that local 'flags' copy, so a split of an extent map that is currently being logged wrongly inherits EXTENT_FLAG_LOGGING. The flag is then never cleared on the split, and when it is freed while still on the inode's modified_extents list (for example by the extent map shrinker) it trips the WARN_ON(!list_empty(&em->list)) in btrfs_free_extent_map() and leads to a use-after-free. Clear EXTENT_FLAG_LOGGING from the local 'flags' copy used for the splits and only clear EXTENT_FLAG_PINNED from em->flags, restoring the behaviour prior to f86f7a7. Fixes: f86f7a7 ("btrfs: use the flags of an extent map to identify the compression type") Cc: Jeff Layton <jlayton@kernel.org> Cc: Boris Burkov <boris@bur.io> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
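The difference between clearing the flag on the original map and on the local copy can be sketched as follows. This is a simplified model based only on the description above, not the code of btrfs_drop_extent_map_range(): the struct, the flag values and both helper functions are hypothetical.

```c
/* Hypothetical flag bits; the kernel's EXTENT_FLAG_* values differ. */
enum {
	FLAG_PINNED  = 1u << 0,
	FLAG_LOGGING = 1u << 1,
};

struct extent_map_sketch {
	unsigned int flags;
};

/*
 * Fixed behaviour as described: returns the flags a split map inherits.
 * LOGGING is cleared on the local copy, only PINNED on the original.
 */
static unsigned int split_flags_fixed(struct extent_map_sketch *em)
{
	unsigned int flags = em->flags;	/* local copy used for the splits */

	em->flags &= ~FLAG_PINNED;
	flags &= ~FLAG_LOGGING;
	return flags;
}

/*
 * Regressed behaviour as described: LOGGING is cleared on the original map,
 * so the local copy, and therefore every split, still carries it.
 */
static unsigned int split_flags_buggy(struct extent_map_sketch *em)
{
	unsigned int flags = em->flags;

	em->flags &= ~(FLAG_PINNED | FLAG_LOGGING);
	return flags;
}
```

In this model, a map that is being logged keeps FLAG_LOGGING after split_flags_fixed() while its splits do not, whereas split_flags_buggy() hands FLAG_LOGGING to the splits, where nothing would clear it later.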
The 'tree_id' parameter in btrfs_search_path_in_tree() was only being used in order to fetch the root tree to be considered for the search. For this same reason this function was also requiring a 'struct btrfs_fs_info' parameter. This commit replaces these two parameters with a single 'struct btrfs_root' one, which identifies from which root tree the search should happen. This function only has one caller, the inode lookup ioctl, which knows how to provide the root tree for each case. In fact, if args->treeid == 0, then we don't even have to allocate a new root tree object, and we can reuse the one provided by the ioctl system call, thus avoiding an extra allocation. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
No description provided.