Fm partition #1190

Draft
shengnuo wants to merge 4 commits into kubernetes-sigs:main from shengnuo:fm-partition

@shengnuo shengnuo commented Jun 11, 2026

Contributor

What type of PR is this?

/kind feature

Which issue(s) this PR is related to:

Fixes #1070

Special notes for your reviewer:

Open Questions

  • What should happen when Fabric Manager restarts? Should the DRA driver replay the activated partitions?
  • What about GetNvlinkFailedDevices? Should we still publish the failed devices?
  • Should FM attributes be persisted in the checkpoint file?
  • When the GPU kubelet plugin starts, some devices might be in passthrough mode and bound to the vfio-pci driver. These devices are invisible to NVML, so their FM attributes are unobtainable. Can FM attributes for vfio-pci-bound devices be omitted until the devices are bound to the nvidia driver again?
  • Which Go binding to use: fmpm or nvfm?
  • Feature gates?

Does this PR introduce a user-facing change?

Support activation of FM partitioning for VFIO devices.

Additional documentation (design docs, usage docs, etc.):


Checklist

  • make check test passes locally
  • make check-generate passes if api/ changed (CRDs, deepcopy, informers, listers, clientset)
  • make check-modules passes if go.mod / go.sum changed
  • Tests added or updated for the change
  • Helm chart (deployments/helm) updated if flags, RBAC, or defaults changed

@k8s-ci-robot

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added labels on Jun 11, 2026:

  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • kind/feature: Categorizes issue or PR as related to a new feature.

netlify Bot commented Jun 11, 2026


Deploy Preview for dra-driver-nvidia-gpu ready!

🔨 Latest commit: 3af0731
🔍 Latest deploy log: https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/6a2ae70e7561d600089a190a
😎 Deploy Preview: https://deploy-preview-1190--dra-driver-nvidia-gpu.netlify.app

@k8s-ci-robot k8s-ci-robot requested review from jgehrcke and klueska June 11, 2026 01:02
@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shengnuo
Once this PR has been reviewed and has the lgtm label, please assign dims for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added labels on Jun 11, 2026:

  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

# dir (e.g. x86_64-linux-gnu or aarch64-linux-gnu), so we normalize the location
# here and copy with -a to preserve the symlink chain.
RUN apt-get update \
&& apt-get install -y --no-install-recommends nvidia-fabricmanager-dev-580 \

Contributor Author

How can we ensure that the libnvfm version matches the driver installed by the GPU Operator?

return nil
}
defer shutdown()
klog.Infof("!!!!!!!!!!!ensureNVML done")

Contributor Author

Ignore these debugging logs for now.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Feature]: Support for FabricManager GPU partitions

2 participants