Stop injecting implicit TimeSlicing and reset on unprepare#1183

Open
guptaNswati wants to merge 1 commit into kubernetes-sigs:main from guptaNswati:fix-timeslicing

@guptaNswati

Contributor

What type of PR is this?

/kind bug

Which issue(s) this PR is related to:

fixes #81
#363

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot

Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 9, 2026
@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: guptaNswati
Once this PR has been reviewed and has the lgtm label, please assign shivamerla for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from jgehrcke and shengnuo June 9, 2026 22:46
@netlify

netlify Bot commented Jun 9, 2026


👷 Deploy Preview for dra-driver-nvidia-gpu processing.

Name Link
🔨 Latest commit 6a7635c
🔍 Latest deploy log https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/6a2897b559931a000864d800

@netlify

netlify Bot commented Jun 9, 2026


Deploy Preview for dra-driver-nvidia-gpu canceled.

Name Link
🔨 Latest commit 22e7ff7
🔍 Latest deploy log https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/6a2b4d84cad59600076d0c03

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 9, 2026
@guptaNswati

guptaNswati commented Jun 11, 2026

Contributor Author

AI summary, drawn from internal and external docs, of how time-slicing is supported:

The `--set-timeslice` knob is tied to compute preemption/CILP support, and Pascal is split: GP100 has instruction-level compute preemption, but not all 6.x parts do. So the safest check is a runtime capability probe, not a raw cc >= X test.

Don’t gate this on a single CudaComputeCapability threshold.

`nvidia-smi compute-policy --set-timeslice` was added as a configurable CUDA time-slice control in CUDA 11.1 / R455+ drivers and is a per-GPU, admin-level policy knob. Internally, that knob is specifically for the CILP time-slice.
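
A runtime probe of the kind suggested above could be sketched roughly as follows. This is only an illustration: `classifySetTimeSlice` and the `probeResult` type are hypothetical names, and matching on the "Not Supported" text is an assumption taken from the driver logs in this thread, not a documented nvidia-smi contract.

```go
package main

import (
	"fmt"
	"strings"
)

// probeResult describes the outcome of one attempt to set a time-slice.
type probeResult int

const (
	timeSliceSupported probeResult = iota
	timeSliceUnsupported
	timeSliceProbeFailed
)

// classifySetTimeSlice is a hypothetical helper. It interprets the exit code
// and combined output of an `nvidia-smi compute-policy --set-timeslice`
// attempt. Matching on "Not Supported" is an assumption based on the logs
// quoted in this PR.
func classifySetTimeSlice(exitCode int, output string) probeResult {
	if exitCode == 0 {
		return timeSliceSupported
	}
	if strings.Contains(output, "Not Supported") {
		return timeSliceUnsupported
	}
	// Any other failure is treated as a real error, not a capability gap.
	return timeSliceProbeFailed
}

func main() {
	out := "Failed to set timeslice policy with value Long for GPU 0 : Not Supported"
	fmt.Println(classifySetTimeSlice(3, out) == timeSliceUnsupported)
	fmt.Println(classifySetTimeSlice(0, "") == timeSliceSupported)
}
```

The point of the three-way result is that an unsupported GPU can be told apart from a genuine failure, without hard-coding a compute capability threshold.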

@guptaNswati guptaNswati requested a review from Copilot June 11, 2026 21:47
@guptaNswati guptaNswati moved this from Backlog to In Progress in DRA Driver for NVIDIA GPUs Jun 11, 2026
@guptaNswati guptaNswati self-assigned this Jun 11, 2026
@guptaNswati guptaNswati requested a review from shivamerla June 11, 2026 21:48

Copilot AI left a comment

Pull request overview

This PR adjusts the driver’s time-slicing behavior to avoid implicitly applying time-slicing defaults (which can break on older/non-supporting GPUs) and to only reset time-slicing during unprepare when it was actually applied during prepare.

Changes:

  • Stop injecting implicit Sharing/TimeSlicing defaults in DefaultGpuConfig, DefaultMigDeviceConfig, and Normalize() when Sharing is unset.
  • Track whether time-slicing was applied during prepare via a persisted TimeSliceApplied flag.
  • Reset time-slicing on unprepare only when TimeSliceApplied is true.
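
The prepare/unprepare behavior described in the list above can be sketched in isolation. Only the `TimeSliceApplied` field name comes from the PR; `configState`, `fakeGPU`, `prepare`, and `unprepare` are hypothetical stand-ins, not the driver's actual types or functions.

```go
package main

import "fmt"

// configState mirrors the idea of persisted per-claim state. Only the
// TimeSliceApplied name is taken from the PR; the struct is hypothetical.
type configState struct {
	TimeSliceApplied bool `json:"timeSliceApplied,omitempty"`
}

// fakeGPU stands in for a real device and records the active time-slice.
type fakeGPU struct {
	timeSlice string
}

// prepare applies a time-slice only when one was requested explicitly and
// records that fact in the returned state.
func prepare(gpu *fakeGPU, requested *string) configState {
	var st configState
	if requested != nil {
		gpu.timeSlice = *requested
		st.TimeSliceApplied = true
	}
	return st
}

// unprepare resets the time-slice only if prepare applied one.
func unprepare(gpu *fakeGPU, st configState) {
	if st.TimeSliceApplied {
		gpu.timeSlice = "Default"
	}
}

func main() {
	gpu := &fakeGPU{timeSlice: "Default"}
	long := "Long"
	st := prepare(gpu, &long)
	fmt.Println(gpu.timeSlice, st.TimeSliceApplied)
	unprepare(gpu, st)
	fmt.Println(gpu.timeSlice)
}
```

With no explicit request, neither function touches the device, which is the behavior that matters for GPUs that reject the time-slice setting.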

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
cmd/gpu-kubelet-plugin/device_state.go Persists whether time-slicing was applied and uses that to conditionally reset time-slice policy during unprepare.
api/nvidia.com/resource/v1beta1/migconfig.go Removes implicit default sharing injection; leaves sharing unset unless explicitly configured.
api/nvidia.com/resource/v1beta1/gpuconfig.go Removes implicit default sharing/time-slicing injection; keeps defaulting of TimeSlicingConfig.Interval only when Sharing is explicitly set to time-slicing.
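
The defaulting rule described for gpuconfig.go might look roughly like this. It is a simplified sketch with hypothetical lowercase types that mirror the names in the PR (`GpuConfig`, `GpuSharing`, `TimeSlicingConfig`); the real `Normalize()` may differ.

```go
package main

import "fmt"

const (
	timeSlicingStrategy = "TimeSlicing"
	mpsStrategy         = "MPS"
	defaultTimeSlice    = "Default"
)

// Hypothetical, simplified mirrors of the API types named in the PR.
type timeSlicingConfig struct {
	Interval *string
}

type gpuSharing struct {
	Strategy          string
	TimeSlicingConfig *timeSlicingConfig
}

type gpuConfig struct {
	Sharing *gpuSharing
}

// normalize leaves Sharing untouched when it is unset, and fills in the
// default interval only when time-slicing was chosen explicitly.
func (c *gpuConfig) normalize() {
	if c.Sharing == nil || c.Sharing.Strategy != timeSlicingStrategy {
		return
	}
	if c.Sharing.TimeSlicingConfig == nil {
		c.Sharing.TimeSlicingConfig = &timeSlicingConfig{}
	}
	if c.Sharing.TimeSlicingConfig.Interval == nil {
		d := defaultTimeSlice
		c.Sharing.TimeSlicingConfig.Interval = &d
	}
}

func main() {
	unset := &gpuConfig{}
	unset.normalize()
	fmt.Println(unset.Sharing == nil)

	explicit := &gpuConfig{Sharing: &gpuSharing{Strategy: timeSlicingStrategy}}
	explicit.normalize()
	fmt.Println(*explicit.Sharing.TimeSlicingConfig.Interval)
}
```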

Comment on lines 34 to 43
 // DefaultGpuConfig provides the default GPU configuration.
 func DefaultGpuConfig() *GpuConfig {
-	config := &GpuConfig{
+	return &GpuConfig{
 		TypeMeta: metav1.TypeMeta{
 			APIVersion: GroupName + "/" + Version,
 			Kind:       GpuConfigKind,
 		},
 	}
-
-	if featuregates.Enabled(featuregates.TimeSlicingSettings) {
-		config.Sharing = &GpuSharing{
-			Strategy: TimeSlicingStrategy,
-			TimeSlicingConfig: &TimeSlicingConfig{
-				Interval: ptr.To(DefaultTimeSlice),
-			},
-		}
-	}
-
-	return config
 }
@guptaNswati

Contributor Author

Test config used

      config:
      - requests: ["ts-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: TimeSlicing
              timeSlicingConfig:
                interval: Long

A30

I0609 23:36:55.605225       1 device_state.go:1061] SetTimeSlice() for full GPUs with UUIDs: [GPU-f50641a9-38d7-6df1-0bec-e661b4ca1847]

nvidia-smi compute-policy -l
+-------------------------------------+
| GPU compute policies:               |
| GPU     Name           Value        |
|=====================================|
|   0   Timeslice       Long          |
+-------------------------------------+
|   1   Timeslice       Default       |

Tesla P4

Events:
  Type     Reason                         Age                 From               Message
  ----     ------                         ----                ----               -------
  Normal   Scheduled                      2m7s                default-scheduler  Successfully assigned gpu-test-timeslice/test-pod to kind-dra-1-worker
  Warning  FailedPrepareDynamicResources  54s (x2 over 2m7s)  kubelet            Failed to prepare dynamic resources: prepare dynamic resources: NodePrepareResources failed for ResourceClaim test-pod-shared-gpu-67n47: error preparing devices for claim gpu-test-timeslice/test-pod-shared-gpu-67n47:7c43438f-c31b-42f8-b0d7-1f399b2b0315: prepare devices failed: error applying config: error setting timeslice config for requests '[ts-gpu]' in claim '7c43438f-c31b-42f8-b0d7-1f399b2b0315': error setting time slice: error running nvidia-smi: exit status 3
  
driver logs 
dra-driver-nvidia-gpu   dra-driver-nvidia-gpu-kubelet-plugin-p7xs9          2/2     Running             0          82m
gpu-test-timeslice      test-pod                                            0/2     ContainerCreating   0          8m37s

Failed to set timeslice policy with value Long for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported
I0611 23:44:55.220775       1 device_state.go:274] Claim gpu-test-timeslice/test-pod-shared-gpu-67n47:7c43438f-c31b-42f8-b0d7-1f399b2b0315 already in PrepareStarted state: attempt rollback before new prepare
I0611 23:44:55.220808       1 device_state.go:540] unprepare noop: preparation started but not completed for claim gpu-test-timeslice/test-pod-shared-gpu-67n47:7c43438f-c31b-42f8-b0d7-1f399b2b0315 (devices: [{ts-gpu gpu.nvidia.com kind-dra-1-worker gpu-0 <nil> [] [] [] <nil> map[]}])
E0611 23:44:55.262023       1 nvlib.go:851] 
Failed to set timeslice policy with value Long for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported

Signed-off-by: Swati Gupta <swatig@nvidia.com>

mig cleanup

Signed-off-by: Swati Gupta <swatig@nvidia.com>
	if err != nil {
		return nil, fmt.Errorf("error setting timeslice config for requests '%v' in claim '%v': %w", requests, claim.UID, err)
	}
	configState.TimeSliceApplied = true

Contributor

I guess that with these newly added fields in the checkpoint, downgrade is not possible at all, because the checksum will no longer match. That seems unavoidable.

@shivamerla

Contributor

When the time-slicing strategy is not explicitly set, can you also verify with an MPS config?

-	if featuregates.Enabled(featuregates.TimeSlicingSettings) {
-		tsc := configapi.DefaultGpuConfig().Sharing.TimeSlicingConfig
+	// Reset time-slice policy only when it was applied during prepare.
+	if featuregates.Enabled(featuregates.TimeSlicingSettings) && group.ConfigState.TimeSliceApplied {

Contributor

@guptaNswati for claims that existed prior to the upgrade, this will skip resetting them to the default setting, because the existing checkpoint file will not have this entry. Instead of this approach, I wonder whether it would be better to handle the error from SetTimeSlice(), log the not-supported error, and continue.
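
The alternative suggested here, tolerating a not-supported error instead of tracking a flag, could be sketched as below. The sentinel `errTimeSliceNotSupported`, the `setter` type, and the three example setters are hypothetical; the real `SetTimeSlice()` currently surfaces a generic nvidia-smi failure, so mapping it to a sentinel error is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errTimeSliceNotSupported is a hypothetical sentinel for GPUs that reject
// the time-slice policy.
var errTimeSliceNotSupported = errors.New("time-slice not supported")

// setter abstracts the call that applies a time-slice to a set of GPUs.
type setter func(uuids []string, interval string) error

// resetTimeSlice resets to the default interval, logging and ignoring the
// not-supported case while still propagating every other error.
func resetTimeSlice(set setter, uuids []string) error {
	err := set(uuids, "Default")
	if errors.Is(err, errTimeSliceNotSupported) {
		log.Printf("skipping time-slice reset for %v: %v", uuids, err)
		return nil
	}
	return err
}

// Example setters used to exercise the three outcomes.
func okSetter(_ []string, _ string) error { return nil }

func unsupportedSetter(_ []string, _ string) error {
	return fmt.Errorf("error setting time slice: %w", errTimeSliceNotSupported)
}

func brokenSetter(_ []string, _ string) error {
	return errors.New("error running nvidia-smi")
}

func main() {
	uuids := []string{"GPU-example"}
	fmt.Println(resetTimeSlice(okSetter, uuids))
	fmt.Println(resetTimeSlice(unsupportedSetter, uuids))
	fmt.Println(resetTimeSlice(brokenSetter, uuids))
}
```

This variant needs no new checkpoint field, so claims prepared before an upgrade are still reset, at the cost of attempting the reset on every unprepare.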

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

When the node GPU does not support setting timeslice, the plugin will crash directly.
