Skip to content

Calling thread IO#18562

Open
MigeljanImeri wants to merge 1 commit into
openzfs:masterfrom
MigeljanImeri:calling_thread_io
Open

Calling thread IO#18562
MigeljanImeri wants to merge 1 commit into
openzfs:masterfrom
MigeljanImeri:calling_thread_io

Conversation

@MigeljanImeri
Copy link
Copy Markdown
Contributor

Adds a module parameter that will allow waiting for bio's to complete, along with a flag that tracks whether a zio has bypassed the queue.

The motivation behind this change was performance based. The intention was to reduce overhead caused by swapping between threads from when bio's are submitted, and the callback executes.

Currently, only zio's who have bypassed the queue are allowed to wait for bio completion, this is mainly done because any performance uplift from staying in the same thread is overshadowed by the vdev queue lock.

Motivation and Context

The motivation behind this change was performance based. The intention was to reduce overhead caused by swapping between threads from when bio's are submitted, and the callback executes.

Description

The normal ZIO pipeline for IO stops when IO is submitted to the disk, and resumes after a completion event is called signifying IO is done. Instead of stopping, we wait for the IO to complete after submission and then continue through the rest of the pipeline.

How Has This Been Tested?

Performance testing has been done with fio measuring mainly read performance, with randread iops showing ~20% improvement and streaming read bandwidth showing ~30% improvement.

results:

randreads before:

read: IOPS=467k, BW=1823MiB/s (1911MB/s)(534GiB/300007msec)

randreads after:

read: IOPS=561k, BW=2193MiB/s (2299MB/s)(642GiB/300007msec)

sequential reads before:

read: IOPS=62.1k, BW=60.7GiB/s (65.2GB/s)(17.8TiB/300012msec)

sequential reads after:

read: IOPS=81.7k, BW=79.8GiB/s (85.6GB/s)(23.4TiB/300006msec)

full fio output files:

randread_multiple_256jobs_calling_io_0.txt
randread_multiple_256jobs_calling_io_1.txt
read_multiple_256jobs_calling_io_0.txt
read_multiple_256jobs_calling_io_1.txt

node configuration:

Intel Xeon Gold 6438Y+
Rocky Linux 9.6

fio test configuration:

numjobs=256
blocksize=4k
direct=1
runtime=300
filesize= 1G | 16G (iops | bw)

zpool config:

for iops:

1x gen 5 nvme ssd (KIOXIA KCMYXRUG7T68)
recordsize=4k
compression=off

for streaming bw:

raidz2 ( 6 + 2)
8x gen 5 nvme ssd (KIOXIA KCMYXRUG7T68)
recordsize=1M
compression=off

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added Type: Performance Performance improvement or performance problem Status: Code Review Needed Ready for review and testing labels May 19, 2026
Copy link
Copy Markdown
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a nice performance win! I'm curious if you've tried this change on other systems with lower performing devices? Did it help there as well, or at least perform as well?

Comment thread include/sys/zio.h
Comment thread module/os/linux/zfs/vdev_disk.c Outdated
Comment thread include/sys/zio.h Outdated
Comment thread module/os/linux/zfs/vdev_disk.c
@behlendorf
Copy link
Copy Markdown
Contributor

Adding support for FreeBSD in vdev_geom.c via a call to biowait() looks like it may be straight forward. @markjdb @amotin what do you think? For today's very fast devices I'd expect it would benefit from this as well.

@MigeljanImeri
Copy link
Copy Markdown
Contributor Author

That's a nice performance win! I'm curious if you've tried this change on other systems with lower performing devices? Did it help there as well, or at least perform as well?

I repeated the same tests on another node with gen 3 nvme ssds and saw significant improvements with randread iops: 439k -> 713k (~60% increase), although interestingly enough it seemed sequential writes in a raidz2 pool with 1M blocksize took a bit of a hit (~5-10% decrease).

node configuration:

AMD EPYC 7502 32-Core Processor
CentOS Stream 8
Dell Express Flash PM1725a 1.6TB SFF

Do you think it might be worth adding more granularity to the module parameter so we can specify whether to apply this for only reads / writes?

@behlendorf
Copy link
Copy Markdown
Contributor

Do you think it might be worth adding more granularity to the module parameter so we can specify whether to apply this for only reads / writes?

Good question. In your testing have you seen cases where you'd want to tune it differently for reads vs writes? If so, is it highly device dependent? I was actually wondering if we should drop the module parameter entirely and rely solely on the vdev scheduler property.

Comment thread module/os/linux/zfs/vdev_disk.c
Comment thread module/zfs/zio.c Outdated
}

vd->vdev_ops->vdev_op_io_start(zio);
if (vd->vdev_ops->vdev_op_leaf && zio_get_bio_wait(zio))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both here and in vdev_disk_io_rw() I am not sure how legal is it to access zio/vbio after submitting them for execution. How do we know that they haven't completed and haven't been freed yet?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is why I think we want these changes to be isolated in the platform vdev_disk_ops. At least on Linux the vbio won't be cleaned up until .vdev_op_io_done() is called, but the underlying bio is almost certainly long gone.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't .vdev_op_io_done() get called by interrupt thread before this line has its chance if the wait was not used?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it could be, I reworked it to get around that issue.

@amotin
Copy link
Copy Markdown
Member

amotin commented May 26, 2026

Adding support for FreeBSD in vdev_geom.c via a call to biowait() looks like it may be straight forward. @markjdb @amotin what do you think? For today's very fast devices I'd expect it would benefit from this as well.

I am not sure what exactly we are saving here and how. I don't know if Linux is different, but IIRC FreeBSD's block layer is not really tuned to handle I/O completions directly in the submission path without completion interrupts. So the interrupt will still be there, and the waiting thread will be woken up by it after completion passing through GEOM. The question is where would we like the ZIO completion be processed in ZFS: in the original submission thread, or in interrupt ZIO taskqueue, which probably depends where that ZIO was issued from and whether it is still waited there.

I suppose return to the original thread may matter primarily in case of Direct I/O reads, because for cached I/O we may not care or even prefer not to, since many reads will be prefetches, and writes due to ZIO_STAGE_ISSUE_ASYNC go via write ZIO taskqs. Blocking taskqs in any of those cases may be not a great idea. I wonder how the sequential tests with the 8-wide RAIDZ2 did not notice the additional latency of processing 8 vdevs sequentially? Got it covered by numjobs=256, or sequential processing allowed to avoid some otherwise bad lock contention between child and parent ZIOs completions?

I wonder if this could be conditional not only on ZIO_FLAG_BYPASSED_QUEUE, but on a thread we are coming from? I.e. we may not want to block limited ZIO taskqs, especially considering this tunable is system-wide, and there may be devices with different performance.

Adds a module parameter that will allow waiting for bio's
to complete, along with a flag that tracks whether a zio
has bypassed the queue.

The motivation behind this change was performance based. The
intention was to reduce overhead caused by swapping between
threads from when bio's are submitted, and the callback executes.

Currently, only zio's who have bypassed the queue are allowed
to wait for bio completion, this is mainly done because any performance
uplift from staying in the same thread is overshadowed by the vdev
queue lock.

Signed-off-by: Migel Imeri <mimeri@lanl.gov>
@behlendorf
Copy link
Copy Markdown
Contributor

I am not sure what exactly we are saving here and how.

@MigeljanImeri please jump in if I'm off base. But my understanding is the performance win here comes from avoiding the zio_taskq_dispatch() and this latency (taskq/zio lock contention, thread switching, etc?). At least on Linux vbio_completion() runs in interrupt context and can't sleep so we're forced to dispatch the remaining pipeline stages to a taskq thread. I see the FreeBSD code is structured similarly, but perhaps GEOM offers us more options there.

The question is where would we like the ZIO completion be processed in ZFS: in the original submission thread, or in interrupt ZIO taskqueue, which probably depends where that ZIO was issued from and whether it is still waited there.

Yeah, "it depends" I think is the answer. For devices where the dispatch overhead is significant compared the IO time handling the completion in the submission thread seems best. It is still somewhat surprising that sequentially handling even wide-raidz vdevs is worth it. Ideally, I suppose you'd want to issue all the child IO in parallel then somehow sequentially process the completions in the submission thread after they all complete. This probably would always make sense for direct IO.

@amotin
Copy link
Copy Markdown
Member

amotin commented May 27, 2026

win here comes from avoiding the zio_taskq_dispatch() and this latency (taskq/zio lock contention, thread switching, etc?)

At least on FreeBSD context switch from the interrupt thread to the waiting thread and to a interrupt taskq should be similar. The win I suppose is can be if after switching to interrupt taskq we then need another switch to the waiting thread. For Direct I/O case it may be true. For cached I think it is not a fact, if there is read-ahead or write-back.

@MigeljanImeri
Copy link
Copy Markdown
Contributor Author

I am not sure what exactly we are saving here and how.

@MigeljanImeri please jump in if I'm off base. But my understanding is the performance win here comes from avoiding the zio_taskq_dispatch() and this latency (taskq/zio lock contention, thread switching, etc?). At least on Linux vbio_completion() runs in interrupt context and can't sleep so we're forced to dispatch the remaining pipeline stages to a taskq thread. I see the FreeBSD code is structured similarly, but perhaps GEOM offers us more options there.

The question is where would we like the ZIO completion be processed in ZFS: in the original submission thread, or in interrupt ZIO taskqueue, which probably depends where that ZIO was issued from and whether it is still waited there.

Yeah, "it depends" I think is the answer. For devices where the dispatch overhead is significant compared the IO time handling the completion in the submission thread seems best. It is still somewhat surprising that sequentially handling even wide-raidz vdevs is worth it. Ideally, I suppose you'd want to issue all the child IO in parallel then somehow sequentially process the completions in the submission thread after they all complete. This probably would always make sense for direct IO.

Yep, that's where the main performance win is coming from, avoiding the thread switching and overhead that comes with that. The intention behind this was to use it in conjunction with direct IO. The numbers don't paint the full picture here, there is a decent hit we take performance wise when the IO workload isn't high enough. Dropping the numjobs on the raidz2 results in calling thread negatively impacting performance because of the sequential handing. I think ideally we would have some tunable that would be able to detect when the workload is big enough that it is worth turning calling thread on, but I am not sure exactly on how to go about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Code Review Needed Ready for review and testing Type: Performance Performance improvement or performance problem

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants