Calling thread IO #18562
behlendorf
left a comment
That's a nice performance win! I'm curious whether you've tried this change on other systems with lower-performing devices. Did it help there as well, or at least perform as well?
I repeated the same tests on another node with gen 3 nvme ssds and saw significant improvements with randread iops: 439k -> 713k (~60% increase), although interestingly enough it seemed sequential writes in a raidz2 pool with 1M blocksize took a bit of a hit (~5-10% decrease).
node configuration: AMD EPYC 7502 32-Core Processor
Do you think it might be worth adding more granularity to the module parameter so we can specify whether to apply this for only reads / writes?
Good question. In your testing have you seen cases where you'd want to tune it differently for reads vs writes? If so, is it highly device dependent? I was actually wondering if we should drop the module parameter entirely and rely solely on the vdev scheduler property.
}

vd->vdev_ops->vdev_op_io_start(zio);
if (vd->vdev_ops->vdev_op_leaf && zio_get_bio_wait(zio))
Both here and in vdev_disk_io_rw() I am not sure how legal it is to access zio/vbio after submitting them for execution. How do we know that they haven't completed and been freed already?
Yeah, this is why I think we want these changes to be isolated in the platform vdev_disk_ops. At least on Linux the vbio won't be cleaned up until .vdev_op_io_done() is called, but the underlying bio is almost certainly long gone.
Can't .vdev_op_io_done() get called by the interrupt thread before this line has its chance, if the wait was not used?
Yes, it could be. I reworked it to get around that issue.
I am not sure what exactly we are saving here and how. I don't know if Linux is different, but IIRC FreeBSD's block layer is not really tuned to handle I/O completions directly in the submission path without completion interrupts. So the interrupt will still be there, and the waiting thread will be woken up by it after the completion passes through GEOM. The question is where we would like the ZIO completion to be processed in ZFS: in the original submission thread, or in the interrupt ZIO taskq, which probably depends on where that ZIO was issued from and whether it is still waited on there. I suppose returning to the original thread may matter primarily in the case of Direct I/O reads, because for cached I/O we may not care or even prefer not to, since many reads will be prefetches, and writes due to
I wonder if this could be conditional not only on
Adds a module parameter that allows waiting for bios to complete, along with a flag that tracks whether a zio has bypassed the queue.
The motivation behind this change is performance. The intention is to reduce the overhead of switching between threads between the time bios are submitted and the time the completion callback executes.
Currently, only zios that have bypassed the queue are allowed to wait for bio completion. This is mainly because any performance uplift from staying in the same thread is overshadowed by the vdev queue lock.
Signed-off-by: Migel Imeri <mimeri@lanl.gov>
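The gating described in the commit message can be sketched roughly as below. This is only an illustration, not the code from the patch: `toy_calling_thread_io`, `struct toy_io`, and `toy_should_wait()` are hypothetical names, and the leaf-vdev check is assumed from the diff context quoted in the review.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the module parameter (0 = off, nonzero = on). */
static int toy_calling_thread_io = 1;

/* Hypothetical, simplified view of the state the decision depends on. */
struct toy_io {
	bool is_leaf_vdev;	/* IO is being issued to a leaf vdev */
	bool bypassed_queue;	/* IO skipped the vdev queue */
};

/*
 * Decide whether the submitting thread should wait for the bio.
 * Only IOs that bypassed the vdev queue qualify, since the commit
 * message says the queue lock overshadows any gain otherwise.
 */
static bool
toy_should_wait(const struct toy_io *io)
{
	return (toy_calling_thread_io != 0 &&
	    io->is_leaf_vdev && io->bypassed_queue);
}
```

With the parameter enabled, an IO that went through the vdev queue still takes the normal asynchronous completion path in this sketch.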
@MigeljanImeri please jump in if I'm off base. But my understanding is the performance win here comes from avoiding the
Yeah, "it depends" I think is the answer. For devices where the dispatch overhead is significant compared to the IO time, handling the completion in the submission thread seems best. It is still somewhat surprising that sequentially handling even wide-raidz vdevs is worth it. Ideally, I suppose you'd want to issue all the child IO in parallel, then somehow sequentially process the completions in the submission thread after they all complete. This probably would always make sense for direct IO.
At least on FreeBSD, a context switch from the interrupt thread to the waiting thread and to an interrupt taskq should be similar. The win, I suppose, can come if after switching to the interrupt taskq we then need another switch to the waiting thread. For the Direct I/O case that may be true. For cached I/O I think it is not a given, if there is read-ahead or write-back.
Yep, that's where the main performance win is coming from: avoiding the thread switching and the overhead that comes with it. The intention behind this was to use it in conjunction with direct IO. The numbers don't paint the full picture here; there is a decent performance hit when the IO workload isn't high enough. Dropping the numjobs on the raidz2 results in calling thread negatively impacting performance because of the sequential handling. I think ideally we would have some tunable that could detect when the workload is big enough that turning calling thread on is worthwhile, but I am not sure exactly how to go about that.
Motivation and Context
The motivation behind this change is performance. The intention is to reduce the overhead of switching between threads between the time bios are submitted and the time the completion callback executes.
Description
The normal ZIO pipeline for IO stops when the IO is submitted to the disk, and resumes after a completion event signals that the IO is done. Instead of stopping, we wait for the IO to complete after submission and then continue through the rest of the pipeline in the same thread.
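As a rough illustration of the control-flow difference, here is a single-threaded toy model. It is a sketch under stated assumptions, not the actual ZIO pipeline: the stage names, `struct toy_zio`, and every `toy_*` function are hypothetical, and the blocking wait is modelled by completing the IO inline.

```c
#include <stdbool.h>

/* Hypothetical pipeline stages; the real pipeline has many more. */
enum toy_stage { TOY_READY, TOY_VDEV_IO_START, TOY_VDEV_IO_DONE, TOY_DONE };

struct toy_zio {
	enum toy_stage stage;	/* next stage to execute */
	bool io_pending;	/* submitted, completion not yet observed */
	int handoffs;		/* resumes performed by a completion context */
};

/* Stand-in for the device finishing the IO. */
static void
toy_device_complete(struct toy_zio *zio)
{
	zio->io_pending = false;
}

/* Run stages until the pipeline finishes (true) or must stop (false). */
static bool
toy_pipeline_run(struct toy_zio *zio, bool wait_in_caller)
{
	while (zio->stage != TOY_DONE) {
		switch (zio->stage) {
		case TOY_READY:
			zio->stage = TOY_VDEV_IO_START;
			break;
		case TOY_VDEV_IO_START:
			zio->io_pending = true;		/* submit the IO */
			zio->stage = TOY_VDEV_IO_DONE;
			if (!wait_in_caller)
				return (false);	/* normal path: stop here */
			/* Calling-thread path: models blocking until done. */
			toy_device_complete(zio);
			break;
		case TOY_VDEV_IO_DONE:
			if (zio->io_pending)
				return (false);
			zio->stage = TOY_DONE;
			break;
		case TOY_DONE:
			break;
		}
	}
	return (true);
}

/* Normal path: a separate completion context resumes the pipeline. */
static void
toy_completion_callback(struct toy_zio *zio)
{
	toy_device_complete(zio);
	zio->handoffs++;
	(void) toy_pipeline_run(zio, false);
}
```

In this model, running with `wait_in_caller` set finishes the pipeline in one call with zero handoffs, while the normal path returns early and needs `toy_completion_callback()` to finish, which stands in for the thread switch the change tries to avoid.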
How Has This Been Tested?
Performance testing has been done with fio measuring mainly read performance, with randread iops showing ~20% improvement and streaming read bandwidth showing ~30% improvement.
results:
randreads before:
read: IOPS=467k, BW=1823MiB/s (1911MB/s)(534GiB/300007msec)
randreads after:
read: IOPS=561k, BW=2193MiB/s (2299MB/s)(642GiB/300007msec)
sequential reads before:
read: IOPS=62.1k, BW=60.7GiB/s (65.2GB/s)(17.8TiB/300012msec)
sequential reads after:
read: IOPS=81.7k, BW=79.8GiB/s (85.6GB/s)(23.4TiB/300006msec)
full fio output files:
randread_multiple_256jobs_calling_io_0.txt
randread_multiple_256jobs_calling_io_1.txt
read_multiple_256jobs_calling_io_0.txt
read_multiple_256jobs_calling_io_1.txt
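The quoted improvements can be re-derived from the fio summary lines above. The helper below is just a convenience for that arithmetic; `percent_change()` is a hypothetical name and is not part of the patch or of fio.

```c
/*
 * Relative change from `before` to `after`, in percent.
 * Inputs must use the same unit (e.g. both in kIOPS or both in GiB/s).
 */
static double
percent_change(double before, double after)
{
	return ((after - before) / before * 100.0);
}
```

For example, `percent_change(467.0, 561.0)` covers the randread IOPS (467k to 561k) and comes out at about 20%, and `percent_change(60.7, 79.8)` covers the sequential read bandwidth in GiB/s and comes out a little over 31%, consistent with the ~20% and ~30% figures quoted.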
node configuration:
Intel Xeon Gold 6438Y+
Rocky Linux 9.6
fio test configuration:
numjobs=256
blocksize=4k
direct=1
runtime=300
filesize= 1G | 16G (iops | bw)
zpool config:
for iops:
1x gen 5 nvme ssd (KIOXIA KCMYXRUG7T68)
recordsize=4k
compression=off
for streaming bw:
raidz2 (6 + 2)
8x gen 5 nvme ssd (KIOXIA KCMYXRUG7T68)
recordsize=1M
compression=off