Stop injecting implicit TimeSlicing and reset on unprepare#1183
Stop injecting implicit TimeSlicing and reset on unprepare#1183guptaNswati wants to merge 1 commit into
Conversation
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: guptaNswati The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
👷 Deploy Preview for dra-driver-nvidia-gpu processing.
|
✅ Deploy Preview for dra-driver-nvidia-gpu canceled.
|
|
AI summary of how time slicing is supported from internal and external docs: |
There was a problem hiding this comment.
Pull request overview
This PR adjusts the driver’s time-slicing behavior to avoid implicitly applying time-slicing defaults (which can break on older/non-supporting GPUs) and to only reset time-slicing during unprepare when it was actually applied during prepare.
Changes:
- Stop injecting implicit
Sharing/TimeSlicingdefaults inDefaultGpuConfig,DefaultMigDeviceConfig, andNormalize()whenSharingis unset. - Track whether time-slicing was applied during prepare via a persisted
TimeSliceAppliedflag. - Reset time-slicing on unprepare only when
TimeSliceAppliedis true.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
cmd/gpu-kubelet-plugin/device_state.go |
Persists whether time-slicing was applied and uses that to conditionally reset time-slice policy during unprepare. |
api/nvidia.com/resource/v1beta1/migconfig.go |
Removes implicit default sharing injection; leaves sharing unset unless explicitly configured. |
api/nvidia.com/resource/v1beta1/gpuconfig.go |
Removes implicit default sharing/time-slicing injection; keeps defaulting of TimeSlicingConfig.Interval only when Sharing is explicitly set to time-slicing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // DefaultGpuConfig provides the default GPU configuration. | ||
| func DefaultGpuConfig() *GpuConfig { | ||
| config := &GpuConfig{ | ||
| return &GpuConfig{ | ||
| TypeMeta: metav1.TypeMeta{ | ||
| APIVersion: GroupName + "/" + Version, | ||
| Kind: GpuConfigKind, | ||
| }, | ||
| } | ||
|
|
||
| if featuregates.Enabled(featuregates.TimeSlicingSettings) { | ||
| config.Sharing = &GpuSharing{ | ||
| Strategy: TimeSlicingStrategy, | ||
| TimeSlicingConfig: &TimeSlicingConfig{ | ||
| Interval: ptr.To(DefaultTimeSlice), | ||
| }, | ||
| } | ||
| } | ||
|
|
||
| return config | ||
| } |
|
Test config used A30 Tesla P4 |
Signed-off-by: Swati Gupta <swatig@nvidia.com> mig cleanup Signed-off-by: Swati Gupta <swatig@nvidia.com>
efd159b to
22e7ff7
Compare
| if err != nil { | ||
| return nil, fmt.Errorf("error setting timeslice config for requests '%v' in claim '%v': %w", requests, claim.UID, err) | ||
| } | ||
| configState.TimeSliceApplied = true |
There was a problem hiding this comment.
I guess with these newly added fields in the checkpoint, downgrade is not possible at all due to checksum match. That seems unavaoidable.
When the timeslicing strategy is not explicitly set, can you also verify with MPS config. |
| if featuregates.Enabled(featuregates.TimeSlicingSettings) { | ||
| tsc := configapi.DefaultGpuConfig().Sharing.TimeSlicingConfig | ||
| // Reset time-slice policy only when it was applied during prepare. | ||
| if featuregates.Enabled(featuregates.TimeSlicingSettings) && group.ConfigState.TimeSliceApplied { |
There was a problem hiding this comment.
@guptaNswati for existing claims prior to upgrade, this will skip resetting them to default setting as the existing checkpoint file will not have this entry. Instead of this method, wondering if its better to handle the error from SetTimeSlice() and log the not-supported error and continue.
What type of PR is this?
/kind bug
Which issue(s) this PR is related to:
fixes #81
#363
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE