Validate /host-root mount at VfioPciManager startup #1077
/easycla
62e3acd to a7d5a15
/assign @varunrsekar
		return nil, fmt.Errorf("IOMMU is not enabled in the kernel")
	}

	if _, err := os.Stat(hostRoot); os.IsNotExist(err) {
When do you hit this case? If PassthroughSupport feature gate is enabled, then this host mount should be added through Helm templates. This function itself is only invoked when that feature gate is enabled here:
I forked the repo (added some standard attributes to the driver) and was using a custom-built image, which I deployed via Helm with some custom values. I agree that the stock Helm chart has the mount present. This was just something I hit while doing some dev work.
… messages

NewVfioPciManager now checks that the /host-root volume mount exists at startup. Without this mount, WaitForGPUFree silently fails with "exit status 125" on every fuser invocation, making VFIO passthrough appear to hang with no actionable diagnostics.

Also improves error messages in WaitForGPUFree to:
- Include the mount path and suggest checking the helm chart featureGates.PassthroughSupport setting when fuser fails
- Reference issue kubernetes-sigs#1076 when external processes (e.g. dcgm-exporter) are detected holding GPU device handles

Ref: kubernetes-sigs#1076
a7d5a15 to 8c67a5d
@johnahull Can you confirm you don't hit this issue if you deploy through the helm chart? This change is unnecessary if that's the case.
Summary
- Adds a startup check in NewVfioPciManager that fails fast if /host-root is not mounted, instead of silently failing on every fuser call during VFIO bind
- Improves the error messages in WaitForGPUFree to include actionable diagnostics when fuser fails (suggests checking featureGates.PassthroughSupport) and when external processes are detected holding GPU device handles (references #1076, "WaitForGPUFree blocks indefinitely when external processes hold GPU device handles during VFIO bind")

Context
When deploying without the helm chart (or with PassthroughSupport not set), the /host-root hostPath volume is not mounted. WaitForGPUFree calls chroot /host-root fuser ... which fails with exit code 125 on every attempt, logging only Unexpected error checking if gpu device "0000:xx:00.0" is free: exit status 125, with no indication that the mount is missing. This loops for 60 seconds until the gRPC deadline, making VFIO passthrough appear broken with no actionable error.

Tested
Verified on the following configuration:

Without /host-root mount: NewVfioPciManager fails at startup with a clear error message referencing the missing mount and featureGates.PassthroughSupport. Previously, this produced only exit status 125 errors during VFIO bind with no indication of root cause.

With /host-root mount + dcgm-exporter disabled: VFIO GPU claim created via gpu-vfio.nvidia.com DeviceClass. GPU-0 successfully unbound from nvidia and bound to vfio-pci. Pod received /dev/vfio/10 (IOMMU group) and /dev/vfio/vfio. End-to-end time ~9 seconds.

With /host-root mount + dcgm-exporter running: WaitForGPUFree now logs the holder PIDs with an actionable message referencing #1076, instead of the previous generic error.

Test plan
- Without the PassthroughSupport feature gate → NewVfioPciManager fails immediately with a clear message about the missing mount
- With PassthroughSupport=true → startup succeeds, VFIO bind works

Ref: #1076