Skip to content

vdev_disk: avoid vd_lock lockdep cycle during reopen#18504

Open
Gality369 wants to merge 1 commit into
openzfs:masterfrom
Gality369:fix-vdev-disk-lockdep
Open

vdev_disk: avoid vd_lock lockdep cycle during reopen#18504
Gality369 wants to merge 1 commit into
openzfs:masterfrom
Gality369:fix-vdev-disk-lockdep

Conversation

@Gality369
Copy link
Copy Markdown
Contributor

Motivation and Context

This fixes a Linux lockdep cycle in the vdev disk backend.

vdev_disk_io_start() can acquire vd_lock from the page-fault read
path while filemap_fault() holds mapping.invalidate_lock.
Meanwhile vdev_disk_open() held vd_lock across
bdev_file_open_by_path() / bdev_open_by_path(), which ties
vd_lock into the block-open locking chain and back into
mmap_lock / mapping.invalidate_lock.

That makes this a real lock ordering issue rather than a lockdep-only
false positive.

WARNING: possible circular locking dependency detected
------------------------------------------------------
syz.0.54/547 is trying to acquire lock:
ffff88800fccfb78 (&vd->vd_lock){++++}-{4:4}, at: vdev_disk_io_start+0x137/0x26c0 fs/zfs/os/linux/zfs/vdev_disk.c:1135

but task is already holding lock:
ffff8880183612a8 (mapping.invalidate_lock#3){.+.+}-{4:4}, at: filemap_invalidate_lock_shared include/linux/fs.h:1045 [inline]
ffff8880183612a8 (mapping.invalidate_lock#3){.+.+}-{4:4}, at: filemap_fault+0x49b/0x30c0 mm/filemap.c:3496

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #4 (mapping.invalidate_lock#3){.+.+}-{4:4}:
       down_read+0x9c/0x4a0 kernel/locking/rwsem.c:1537
       filemap_invalidate_lock_shared include/linux/fs.h:1045 [inline]
       filemap_fault+0x49b/0x30c0 mm/filemap.c:3496
       __do_fault+0x10a/0x6e0 mm/memory.c:5281
       do_read_fault mm/memory.c:5716 [inline]
       do_fault+0x9b8/0x13e0 mm/memory.c:5850
       do_pte_missing mm/memory.c:4362 [inline]
       handle_pte_fault mm/memory.c:6195 [inline]
       __handle_mm_fault+0x138c/0x22f0 mm/memory.c:6336
       handle_mm_fault+0x42f/0x820 mm/memory.c:6505
       do_user_addr_fault+0x347/0xc30 arch/x86/mm/fault.c:1387
       handle_page_fault arch/x86/mm/fault.c:1476 [inline]
       exc_page_fault+0x73/0x110 arch/x86/mm/fault.c:1532
       asm_exc_page_fault+0x27/0x30 arch/x86/include/asm/idtentry.h:569
       __get_user_4+0x14/0x20 arch/x86/lib/getuser.S:86
       do_vfs_ioctl+0x68c/0x1320 fs/ioctl.c:560
       __do_sys_ioctl fs/ioctl.c:595 [inline]
       __se_sys_ioctl fs/ioctl.c:583 [inline]
       __x64_sys_ioctl+0x108/0x1e0 fs/ioctl.c:583
       x64_sys_call+0x1144/0x26a0 arch/x86/include/generated/asm/syscalls_64.h:17
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x93/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #3 (&mm->mmap_lock){++++}-{4:4}:
       __might_fault+0xeb/0x140 mm/memory.c:7099
       _inline_copy_from_user include/linux/uaccess.h:162 [inline]
       _copy_from_user+0x2a/0xa0 lib/usercopy.c:18
       copy_from_user include/linux/uaccess.h:212 [inline]
       get_sg_io_hdr+0xde/0x940 drivers/scsi/scsi_ioctl.c:714
       scsi_ioctl_sg_io drivers/scsi/scsi_ioctl.c:857 [inline]
       scsi_ioctl+0x75b/0x16b0 drivers/scsi/scsi_ioctl.c:917
       sr_block_ioctl+0x1bd/0x210 drivers/scsi/sr.c:557
       blkdev_ioctl+0x22a/0x700 block/ioctl.c:705
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:597 [inline]
       __se_sys_ioctl fs/ioctl.c:583 [inline]
       __x64_sys_ioctl+0x197/0x1e0 fs/ioctl.c:583
       x64_sys_call+0x1144/0x26a0 arch/x86/include/generated/asm/syscalls_64.h:17
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x93/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #2 (&cd->lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:598 [inline]
       __mutex_lock+0x1a6/0x1d80 kernel/locking/mutex.c:760
       mutex_lock_nested+0x1b/0x30 kernel/locking/mutex.c:812
       sr_block_open+0xaf/0x170 drivers/scsi/sr.c:511
       blkdev_get_whole+0xa2/0x250 block/bdev.c:730
       bdev_open+0x262/0xcf0 block/bdev.c:957
       blkdev_open+0x330/0x4d0 block/fops.c:701
       do_dentry_open+0x883/0x11e0 fs/open.c:965
       vfs_open+0x83/0x3a0 fs/open.c:1097
       do_open fs/namei.c:3975 [inline]
       path_openat+0x1d32/0x2ce0 fs/namei.c:4134
       do_filp_open+0x1f6/0x430 fs/namei.c:4161
       do_sys_openat2+0x117/0x1c0 fs/open.c:1437
       do_sys_open fs/open.c:1452 [inline]
       __do_sys_openat fs/open.c:1468 [inline]
       __se_sys_openat fs/open.c:1463 [inline]
       __x64_sys_openat+0x15b/0x220 fs/open.c:1463
       x64_sys_call+0x161b/0x26a0 arch/x86/include/generated/asm/syscalls_64.h:258
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x93/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&disk->open_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:598 [inline]
       __mutex_lock+0x1a6/0x1d80 kernel/locking/mutex.c:760
       mutex_lock_nested+0x1b/0x30 kernel/locking/mutex.c:812
       bdev_open+0x9a/0xcf0 block/bdev.c:945
       bdev_file_open_by_dev block/bdev.c:1059 [inline]
       bdev_file_open_by_dev+0x13a/0x1c0 block/bdev.c:1034
       bdev_file_open_by_path+0x104/0x300 block/bdev.c:1082
       vdev_blkdev_get_by_path fs/zfs/os/linux/zfs/vdev_disk.c:259 [inline]
       vdev_disk_open+0x6e5/0x1750 fs/zfs/os/linux/zfs/vdev_disk.c:393
       vdev_open+0x3d2/0x1a20 fs/zfs/zfs/vdev.c:2182
       vdev_open_child+0x51/0xe0 fs/zfs/zfs/vdev.c:1979
       taskq_thread+0x9e4/0x19c0 fs/zfs/os/linux/spl/spl-taskq.c:1110
       kthread+0x3f0/0x850 kernel/kthread.c:463
       ret_from_fork+0x50f/0x610 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #0 (&vd->vd_lock){++++}-{4:4}:
       check_prev_add kernel/locking/lockdep.c:3165 [inline]
       check_prevs_add kernel/locking/lockdep.c:3284 [inline]
       validate_chain kernel/locking/lockdep.c:3908 [inline]
       __lock_acquire+0x14ae/0x21e0 kernel/locking/lockdep.c:5237
       lock_acquire kernel/locking/lockdep.c:5868 [inline]
       lock_acquire+0x169/0x2f0 kernel/locking/lockdep.c:5825
       down_read+0x9c/0x4a0 kernel/locking/rwsem.c:1537
       vdev_disk_io_start+0x137/0x26c0 fs/zfs/os/linux/zfs/vdev_disk.c:1135
       zio_vdev_io_start+0x4c6/0xc40 fs/zfs/zfs/zio.c:4797
       __zio_execute fs/zfs/zfs/zio.c:2483 [inline]
       zio_nowait+0x297/0x5e0 fs/zfs/zfs/zio.c:2576
       vdev_mirror_io_start+0x27c/0xad0 fs/zfs/zfs/vdev_mirror.c:688
       zio_vdev_io_start+0x879/0xc40 fs/zfs/zfs/zio.c:4663
       __zio_execute fs/zfs/zfs/zio.c:2483 [inline]
       zio_nowait+0x297/0x5e0 fs/zfs/zfs/zio.c:2576
       arc_read+0x29a1/0x5a10 fs/zfs/zfs/arc.c:6433
       dbuf_read_impl.constprop.0+0x8a2/0x1d90 fs/zfs/zfs/dbuf.c:1660
       dbuf_read+0xbf3/0x1940 fs/zfs/zfs/dbuf.c:1854
       dmu_buf_hold_array_by_dnode+0x24b/0x1710 fs/zfs/zfs/dmu.c:581
       dmu_read_impl+0x230/0x570 fs/zfs/zfs/dmu.c:1205
       dmu_read+0xca/0x130 fs/zfs/zfs/dmu.c:1243
       zfs_fillpage+0x1ef/0x610 fs/zfs/os/linux/zfs/zfs_vnops_os.c:4136
       zfs_getpage+0x1fe/0x960 fs/zfs/os/linux/zfs/zfs_vnops_os.c:4214
       zpl_readpage_common fs/zfs/os/linux/zfs/zpl_file.c:407 [inline]
       zpl_read_folio+0xb4/0x140 fs/zfs/os/linux/zfs/zpl_file.c:419
       filemap_read_folio+0xb8/0x250 mm/filemap.c:2444
       filemap_fault+0x17f8/0x30c0 mm/filemap.c:3581
       __do_fault+0x10a/0x6e0 mm/memory.c:5281
       do_read_fault mm/memory.c:5716 [inline]
       do_fault+0x9b8/0x13e0 mm/memory.c:5850
       do_pte_missing mm/memory.c:4362 [inline]
       handle_pte_fault mm/memory.c:6195 [inline]
       __handle_mm_fault+0x138c/0x22f0 mm/memory.c:6336
       handle_mm_fault+0x42f/0x820 mm/memory.c:6505
       do_user_addr_fault+0x347/0xc30 arch/x86/mm/fault.c:1387
       handle_page_fault arch/x86/mm/fault.c:1476 [inline]
       exc_page_fault+0x73/0x110 arch/x86/mm/fault.c:1532
       asm_exc_page_fault+0x27/0x30 arch/x86/include/asm/idtentry.h:569
       __get_user_4+0x14/0x20 arch/x86/lib/getuser.S:86
       do_vfs_ioctl+0x68c/0x1320 fs/ioctl.c:560
       __do_sys_ioctl fs/ioctl.c:595 [inline]
       __se_sys_ioctl fs/ioctl.c:583 [inline]
       __x64_sys_ioctl+0x108/0x1e0 fs/ioctl.c:583
       x64_sys_call+0x1144/0x26a0 arch/x86/include/generated/asm/syscalls_64.h:17
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x93/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

other info that might help us debug this:

Chain exists of:
  &vd->vd_lock --> &mm->mmap_lock --> mapping.invalidate_lock#3

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rlock(mapping.invalidate_lock#3);
                               lock(&mm->mmap_lock);
                               lock(mapping.invalidate_lock#3);
  rlock(&vd->vd_lock);

 *** DEADLOCK ***

1 lock held by syz.0.54/547:
 #0: ffff8880183612a8 (mapping.invalidate_lock#3){.+.+}-{4:4}, at: filemap_invalidate_lock_shared include/linux/fs.h:1045 [inline]
 #0: ffff8880183612a8 (mapping.invalidate_lock#3){.+.+}-{4:4}, at: filemap_fault+0x49b/0x30c0 mm/filemap.c:3496

Call Trace:
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0xbe/0x130 lib/dump_stack.c:120
 dump_stack+0x15/0x20 lib/dump_stack.c:129
 print_circular_bug+0x285/0x360 kernel/locking/lockdep.c:2043
 check_noncircular+0x14e/0x170 kernel/locking/lockdep.c:2175
 check_prev_add kernel/locking/lockdep.c:3165 [inline]
 check_prevs_add kernel/locking/lockdep.c:3284 [inline]
 validate_chain kernel/locking/lockdep.c:3908 [inline]
 __lock_acquire+0x14ae/0x21e0 kernel/locking/lockdep.c:5237
 lock_acquire kernel/locking/lockdep.c:5868 [inline]
 lock_acquire+0x169/0x2f0 kernel/locking/lockdep.c:5825
 down_read+0x9c/0x4a0 kernel/locking/rwsem.c:1537
 vdev_disk_io_start+0x137/0x26c0 fs/zfs/os/linux/zfs/vdev_disk.c:1135
 zio_vdev_io_start+0x4c6/0xc40 fs/zfs/zfs/zio.c:4797
 __zio_execute fs/zfs/zfs/zio.c:2483 [inline]
 zio_nowait+0x297/0x5e0 fs/zfs/zfs/zio.c:2576
 vdev_mirror_io_start+0x27c/0xad0 fs/zfs/zfs/vdev_mirror.c:688
 zio_vdev_io_start+0x879/0xc40 fs/zfs/zfs/zio.c:4663
 __zio_execute fs/zfs/zfs/zio.c:2483 [inline]
 zio_nowait+0x297/0x5e0 fs/zfs/zfs/zio.c:2576
 arc_read+0x29a1/0x5a10 fs/zfs/zfs/arc.c:6433
 dbuf_read_impl.constprop.0+0x8a2/0x1d90 fs/zfs/zfs/dbuf.c:1660
 dbuf_read+0xbf3/0x1940 fs/zfs/zfs/dbuf.c:1854
 dmu_buf_hold_array_by_dnode+0x24b/0x1710 fs/zfs/zfs/dmu.c:581
 dmu_read_impl+0x230/0x570 fs/zfs/zfs/dmu.c:1205
 dmu_read+0xca/0x130 fs/zfs/zfs/dmu.c:1243
 zfs_fillpage+0x1ef/0x610 fs/zfs/os/linux/zfs/zfs_vnops_os.c:4136
 zfs_getpage+0x1fe/0x960 fs/zfs/os/linux/zfs/zfs_vnops_os.c:4214
 zpl_readpage_common fs/zfs/os/linux/zfs/zpl_file.c:407 [inline]
 zpl_read_folio+0xb4/0x140 fs/zfs/os/linux/zfs/zpl_file.c:419
 filemap_read_folio+0xb8/0x250 mm/filemap.c:2444
 filemap_fault+0x17f8/0x30c0 mm/filemap.c:3581
 __do_fault+0x10a/0x6e0 mm/memory.c:5281
 do_read_fault mm/memory.c:5716 [inline]
 do_fault+0x9b8/0x13e0 mm/memory.c:5850
 do_pte_missing mm/memory.c:4362 [inline]
 handle_pte_fault mm/memory.c:6195 [inline]
 __handle_mm_fault+0x138c/0x22f0 mm/memory.c:6336
 handle_mm_fault+0x42f/0x820 mm/memory.c:6505
 do_user_addr_fault+0x347/0xc30 arch/x86/mm/fault.c:1387
 handle_page_fault arch/x86/mm/fault.c:1476 [inline]
 exc_page_fault+0x73/0x110 arch/x86/mm/fault.c:1532
 asm_exc_page_fault+0x27/0x30 arch/x86/include/asm/idtentry.h:569

Description

This change narrows vd_lock so it only protects vd_bdh state
transitions and no longer covers the slow block-device open/release
operations.

A small reopen wait state was added so I/O waits for a transient
reopen window instead of observing a temporary vd_bdh == NULL and
failing immediately.

The patch also keeps EIO media revalidation tied to the
submission-time block device, so reopen does not race with media
checks.

How Has This Been Tested?

Tested on Linux.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)

Checklist:

Copilot AI review requested due to automatic review settings May 7, 2026 07:13
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a real Linux lock ordering problem in the vdev_disk backend by ensuring vd_lock is no longer held across slow block-device open/close paths, avoiding a lockdep-reported cycle involving mapping.invalidate_lock during page-fault reads.

Changes:

  • Narrow vd_lock usage to protect only vd_bdh state transitions, and move block-device open/release operations outside of vd_lock.
  • Add a reopen “in-progress” wait state (vd_open_lock/vd_open_cv/vd_opening) so I/O waits through transient reopen windows instead of immediately failing on vd_bdh == NULL.
  • Tie EIO media revalidation/removal decisions to the submission-time struct block_device * (captured via vbio_bdev for RW and io_vsd for flush/trim) to avoid races with reopen.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label May 7, 2026
Copy link
Copy Markdown
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix. I can't say a love the for loop which was added to vdev_disk_io_start() but it makes sense and looks solid.

Comment thread module/os/linux/zfs/vdev_disk.c Outdated
@Gality369 Gality369 force-pushed the fix-vdev-disk-lockdep branch from a96ca68 to 250ba07 Compare May 14, 2026 03:08
Comment thread module/os/linux/zfs/vdev_disk.c Outdated
@behlendorf behlendorf self-requested a review May 14, 2026 18:32
Copilot AI review requested due to automatic review settings May 15, 2026 01:30
@Gality369 Gality369 force-pushed the fix-vdev-disk-lockdep branch from 250ba07 to 7bf3a65 Compare May 15, 2026 01:30
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.

Comment thread module/os/linux/zfs/vdev_disk.c Outdated
Comment thread module/os/linux/zfs/vdev_disk.c
Comment thread module/os/linux/zfs/vdev_disk.c Outdated
Comment thread module/os/linux/zfs/vdev_disk.c
Comment thread module/os/linux/zfs/vdev_disk.c
Comment thread module/os/linux/zfs/vdev_disk.c Outdated
Comment thread module/os/linux/zfs/vdev_disk.c
On Linux, fault-driven reads can reach vdev_disk_io_start() while
holding mapping.invalidate_lock and then acquire vd_lock.

At the same time, vdev_disk_open() held vd_lock across
bdev_file_open_by_path()/bdev_open_by_path(). That pulls vd_lock
into the block open path and into the mmap_lock ->
mapping.invalidate_lock dependency chain.

This is a real lock ordering problem, not just a lockdep false
positive. A reopen can hold vd_lock while the page fault path
already holds mapping.invalidate_lock and waits for vd_lock.

Fix this by limiting vd_lock to short vd_bdh state transitions and
moving the slow block-device open/release work outside the lock.
Add a small reopen wait state so I/O waits for a transient reopen
window instead of treating vd_bdh == NULL as a permanent failure.
Also use the submission-time block device for EIO revalidation so
reopen does not race with media checks.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
@Gality369 Gality369 force-pushed the fix-vdev-disk-lockdep branch from 7bf3a65 to 884df7a Compare May 15, 2026 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Code Review Needed Ready for review and testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants