Calling thread IO by MigeljanImeri · Pull Request #18562 · openzfs/zfs

MigeljanImeri · 2026-05-19T20:07:12Z

Adds a module parameter that will allow waiting for bio's to complete, along with a flag that tracks whether a zio has bypassed the queue.

The motivation behind this change was performance based. The intention was to reduce overhead caused by swapping between threads from when bio's are submitted, and the callback executes.

Currently, only zio's who have bypassed the queue are allowed to wait for bio completion, this is mainly done because any performance uplift from staying in the same thread is overshadowed by the vdev queue lock.

Motivation and Context

The motivation behind this change was performance based. The intention was to reduce overhead caused by swapping between threads from when bio's are submitted, and the callback executes.

Description

The normal ZIO pipeline for IO stops when IO is submitted to the disk, and resumes after a completion event is called signifying IO is done. Instead of stopping, we wait for the IO to complete after submission and then continue through the rest of the pipeline.

How Has This Been Tested?

Performance testing has been done with fio measuring mainly read performance, with randread iops showing ~20% improvement and streaming read bandwidth showing ~30% improvement.

results:

randreads before:

read: IOPS=467k, BW=1823MiB/s (1911MB/s)(534GiB/300007msec)

randreads after:

read: IOPS=561k, BW=2193MiB/s (2299MB/s)(642GiB/300007msec)

sequential reads before:

read: IOPS=62.1k, BW=60.7GiB/s (65.2GB/s)(17.8TiB/300012msec)

sequential reads after:

read: IOPS=81.7k, BW=79.8GiB/s (85.6GB/s)(23.4TiB/300006msec)

full fio output files:

randread_multiple_256jobs_calling_io_0.txt
randread_multiple_256jobs_calling_io_1.txt
read_multiple_256jobs_calling_io_0.txt
read_multiple_256jobs_calling_io_1.txt

node configuration:

Intel Xeon Gold 6438Y+
Rocky Linux 9.6

fio test configuration:

numjobs=256
blocksize=4k
direct=1
runtime=300
filesize= 1G | 16G (iops | bw)

zpool config:

for iops:

1x gen 5 nvme ssd (KIOXIA KCMYXRUG7T68)
recordsize=4k
compression=off

for streaming bw:

raidz2 ( 6 + 2)
8x gen 5 nvme ssd (KIOXIA KCMYXRUG7T68)
recordsize=1M
compression=off

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

behlendorf

That's a nice performance win! I'm curious if you've tried this change on other systems with lower performing devices? Did it help there as well, or at least perform as well?

behlendorf · 2026-05-20T23:39:12Z

Adding support for FreeBSD in vdev_geom.c via a call to biowait() looks like it may be straight forward. @markjdb @amotin what do you think? For today's very fast devices I'd expect it would benefit from this as well.

MigeljanImeri · 2026-05-21T19:09:48Z

That's a nice performance win! I'm curious if you've tried this change on other systems with lower performing devices? Did it help there as well, or at least perform as well?

I repeated the same tests on another node with gen 3 nvme ssds and saw significant improvements with randread iops: 439k -> 713k (~60% increase), although interestingly enough it seemed sequential writes in a raidz2 pool with 1M blocksize took a bit of a hit (~5-10% decrease).

node configuration:

AMD EPYC 7502 32-Core Processor
CentOS Stream 8
Dell Express Flash PM1725a 1.6TB SFF

Do you think it might be worth adding more granularity to the module parameter so we can specify whether to apply this for only reads / writes?

behlendorf · 2026-05-21T20:54:57Z

Do you think it might be worth adding more granularity to the module parameter so we can specify whether to apply this for only reads / writes?

Good question. In your testing have you seen cases where you'd want to tune it differently for reads vs writes? If so, is it highly device dependent? I was actually wondering if we should drop the module parameter entirely and rely solely on the vdev scheduler property.

amotin · 2026-05-26T15:10:53Z

 	}

 	vd->vdev_ops->vdev_op_io_start(zio);
+	if (vd->vdev_ops->vdev_op_leaf && zio_get_bio_wait(zio))


Both here and in vdev_disk_io_rw() I am not sure how legal is it to access zio/vbio after submitting them for execution. How do we know that they haven't completed and haven't been freed yet?

Yeah, this is why I think we want these changes to be isolated in the platform vdev_disk_ops. At least on Linux the vbio won't be cleaned up until .vdev_op_io_done() is called, but the underlying bio is almost certainly long gone.

Can't .vdev_op_io_done() get called by interrupt thread before this line has its chance if the wait was not used?

Yes it could be, I reworked it to get around that issue.

amotin · 2026-05-26T17:12:09Z

Adding support for FreeBSD in vdev_geom.c via a call to biowait() looks like it may be straight forward. @markjdb @amotin what do you think? For today's very fast devices I'd expect it would benefit from this as well.

I am not sure what exactly we are saving here and how. I don't know if Linux is different, but IIRC FreeBSD's block layer is not really tuned to handle I/O completions directly in the submission path without completion interrupts. So the interrupt will still be there, and the waiting thread will be woken up by it after completion passing through GEOM. The question is where would we like the ZIO completion be processed in ZFS: in the original submission thread, or in interrupt ZIO taskqueue, which probably depends where that ZIO was issued from and whether it is still waited there.

I suppose return to the original thread may matter primarily in case of Direct I/O reads, because for cached I/O we may not care or even prefer not to, since many reads will be prefetches, and writes due to ZIO_STAGE_ISSUE_ASYNC go via write ZIO taskqs. Blocking taskqs in any of those cases may be not a great idea. I wonder how the sequential tests with the 8-wide RAIDZ2 did not notice the additional latency of processing 8 vdevs sequentially? Got it covered by numjobs=256, or sequential processing allowed to avoid some otherwise bad lock contention between child and parent ZIOs completions?

I wonder if this could be conditional not only on ZIO_FLAG_BYPASSED_QUEUE, but on a thread we are coming from? I.e. we may not want to block limited ZIO taskqs, especially considering this tunable is system-wide, and there may be devices with different performance.

Adds a module parameter that will allow waiting for bio's to complete, along with a flag that tracks whether a zio has bypassed the queue. The motivation behind this change was performance based. The intention was to reduce overhead caused by swapping between threads from when bio's are submitted, and the callback executes. Currently, only zio's who have bypassed the queue are allowed to wait for bio completion, this is mainly done because any performance uplift from staying in the same thread is overshadowed by the vdev queue lock. Signed-off-by: Migel Imeri <mimeri@lanl.gov>

behlendorf · 2026-05-27T20:17:14Z

I am not sure what exactly we are saving here and how.

@MigeljanImeri please jump in if I'm off base. But my understanding is the performance win here comes from avoiding the zio_taskq_dispatch() and this latency (taskq/zio lock contention, thread switching, etc?). At least on Linux vbio_completion() runs in interrupt context and can't sleep so we're forced to dispatch the remaining pipeline stages to a taskq thread. I see the FreeBSD code is structured similarly, but perhaps GEOM offers us more options there.

The question is where would we like the ZIO completion be processed in ZFS: in the original submission thread, or in interrupt ZIO taskqueue, which probably depends where that ZIO was issued from and whether it is still waited there.

Yeah, "it depends" I think is the answer. For devices where the dispatch overhead is significant compared the IO time handling the completion in the submission thread seems best. It is still somewhat surprising that sequentially handling even wide-raidz vdevs is worth it. Ideally, I suppose you'd want to issue all the child IO in parallel then somehow sequentially process the completions in the submission thread after they all complete. This probably would always make sense for direct IO.

amotin · 2026-05-27T20:43:22Z

win here comes from avoiding the zio_taskq_dispatch() and this latency (taskq/zio lock contention, thread switching, etc?)

At least on FreeBSD context switch from the interrupt thread to the waiting thread and to a interrupt taskq should be similar. The win I suppose is can be if after switching to interrupt taskq we then need another switch to the waiting thread. For Direct I/O case it may be true. For cached I think it is not a fact, if there is read-ahead or write-back.

MigeljanImeri · 2026-05-27T21:06:07Z

I am not sure what exactly we are saving here and how.

@MigeljanImeri please jump in if I'm off base. But my understanding is the performance win here comes from avoiding the zio_taskq_dispatch() and this latency (taskq/zio lock contention, thread switching, etc?). At least on Linux vbio_completion() runs in interrupt context and can't sleep so we're forced to dispatch the remaining pipeline stages to a taskq thread. I see the FreeBSD code is structured similarly, but perhaps GEOM offers us more options there.

The question is where would we like the ZIO completion be processed in ZFS: in the original submission thread, or in interrupt ZIO taskqueue, which probably depends where that ZIO was issued from and whether it is still waited there.

Yeah, "it depends" I think is the answer. For devices where the dispatch overhead is significant compared the IO time handling the completion in the submission thread seems best. It is still somewhat surprising that sequentially handling even wide-raidz vdevs is worth it. Ideally, I suppose you'd want to issue all the child IO in parallel then somehow sequentially process the completions in the submission thread after they all complete. This probably would always make sense for direct IO.

Yep, that's where the main performance win is coming from, avoiding the thread switching and overhead that comes with that. The intention behind this was to use it in conjunction with direct IO. The numbers don't paint the full picture here, there is a decent hit we take performance wise when the IO workload isn't high enough. Dropping the numjobs on the raidz2 results in calling thread negatively impacting performance because of the sequential handing. I think ideally we would have some tunable that would be able to detect when the workload is big enough that it is worth turning calling thread on, but I am not sure exactly on how to go about that.

behlendorf added Type: Performance Performance improvement or performance problem Status: Code Review Needed Ready for review and testing labels May 19, 2026

MigeljanImeri force-pushed the calling_thread_io branch from 24e880c to c5179f5 Compare May 20, 2026 21:08

behlendorf reviewed May 20, 2026

View reviewed changes

Comment thread include/sys/zio.h

Comment thread module/os/linux/zfs/vdev_disk.c Outdated

Comment thread include/sys/zio.h Outdated

Comment thread module/os/linux/zfs/vdev_disk.c

MigeljanImeri force-pushed the calling_thread_io branch from c5179f5 to 271d6b9 Compare May 21, 2026 19:00

behlendorf reviewed May 21, 2026

View reviewed changes

Comment thread module/os/linux/zfs/vdev_disk.c

amotin reviewed May 26, 2026

View reviewed changes

MigeljanImeri force-pushed the calling_thread_io branch from 271d6b9 to 73b973b Compare May 27, 2026 20:53

Conversation

MigeljanImeri commented May 19, 2026

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

behlendorf left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

behlendorf commented May 20, 2026

Uh oh!

MigeljanImeri commented May 21, 2026

Uh oh!

behlendorf commented May 21, 2026

Uh oh!

Uh oh!

amotin May 26, 2026

Choose a reason for hiding this comment

Uh oh!

behlendorf May 27, 2026

Choose a reason for hiding this comment

Uh oh!

amotin May 27, 2026

Choose a reason for hiding this comment

Uh oh!

MigeljanImeri May 27, 2026

Choose a reason for hiding this comment

Uh oh!

amotin commented May 26, 2026

Uh oh!

behlendorf commented May 27, 2026

Uh oh!

amotin commented May 27, 2026

Uh oh!

MigeljanImeri commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants