OCPNODE-3983: Add e2e tests for KubeletEnsureSecretPulledImages feature gate#31102
Chandan9112 wants to merge 1 commit into openshift:main
Conversation
Pipeline controller notification: this repository is configured in automatic mode; optional jobs can be triggered by comment.
Skipping CI for Draft Pull Request.
Walkthrough: Adds a new serial, ordered, disruptive, long-running Ginkgo node test that validates kubelet image-pull credential verification and cached pull-record behavior across five scenarios: multi-tenant isolation, secret rotation, ImagePullPolicy modes, KubeletConfig verification policy transitions (with MCP rollout), and registry availability effects. Includes helpers and a README entry.

Changes: Kubelet Secret Pulled Images Test Suite
```mermaid
sequenceDiagram
    participant Test as Test Harness
    participant APIServer as Kubernetes API
    participant Kubelet as Node Kubelet
    participant Registry as Internal Registry
    participant MCP as MachineConfigPool
    Test->>APIServer: create namespaces, imagestream tag, extract secret
    Test->>APIServer: create pods referencing secrets / imagePullPolicy
    APIServer->>Kubelet: schedule Pod spec
    Kubelet->>Registry: attempt image pull (use secret or cached pull-record)
    Registry-->>Kubelet: respond (success / unauthorized / unavailable)
    alt KubeletConfig change
        Test->>APIServer: apply KubeletConfig (NeverVerify / AlwaysVerify)
        APIServer->>MCP: rollout config
        MCP-->>Kubelet: kubelet restarts with new policy
    end
    Kubelet-->>APIServer: report Pod status (Running / ErrImagePull / ImagePullBackOff)
    Test->>APIServer: scale registry (to zero) and validate cached vs new pulls
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 9 passed | ❌ 3 failed (3 warnings)
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/extended/node/kubelet_secret_pulled_images.go (1)
409-425: ⚡ Quick win: Consider folding the MCP transition polling into the shared helper.
This local helper duplicates MCP condition polling that already exists in test/extended/node/node_utils.go:521-570. Extending the shared wait path to support "must observe Updating=True before waiting for ready" would keep rollout semantics in one place and avoid the two implementations drifting on degraded/updated handling.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/node/kubelet_secret_pulled_images.go` around lines 409 - 425, The function credVerifyWaitForMCPUpdating duplicates MCP polling; remove it and extend the existing shared helper waitForMCP (in node_utils.go) to accept an option/flag (e.g., requireObserveUpdating bool or a policy enum) that first polls for condition Type "Updating" == True and only then proceeds with the existing wait-for-ready semantics; update callers (including where credVerifyWaitForMCPUpdating was used) to call waitForMCP with the new option so all MCP transition logic remains centralized and avoids duplicated/ drifting implementations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/kubelet_secret_pulled_images.go`:
- Around line 27-30: Replace the Docker Hub mutable image references with
pinned, internally-hosted images: change the constant credVerifyPublicImage from
"docker.io/library/nginx:alpine" to a registry-controlled, immutable image
(preferably using a digest) served by the cluster image-registry (use
internalRegistryPrefix as the prefix and point to a specific repo@sha256:...
digest); update the seed import lines referenced around 71-74 likewise to pull
from the internal registry and pin by digest (do not use :latest or other
mutable tags) so all pulls come from the environment-controlled registry and are
immutable.
- Around line 251-257: Before scaling the image-registry down, query and store
its current replica count (e.g., via
oc.AsAdmin().WithoutNamespace().Run("get").Args("deployment/image-registry",
"-n", "openshift-image-registry", "-o", "jsonpath={.spec.replicas}") or
equivalent) and then use that stored value in the g.DeferCleanup closure instead
of the hard-coded "--replicas=2"; in the cleanup closure call
oc.AsAdmin().WithoutNamespace().Run("scale").Args("deployment/image-registry",
"-n", "openshift-image-registry", fmt.Sprintf("--replicas=%d",
originalReplicas)) and then wait for completion with
oc.AsAdmin().WithoutNamespace().Run("rollout").Args("status",
"deployment/image-registry", "-n", "openshift-image-registry",
"--timeout=2m").Execute() to ensure the registry is restored to its exact prior
size and the test waits for rollout.
---
Nitpick comments:
In `@test/extended/node/kubelet_secret_pulled_images.go`:
- Around line 409-425: The function credVerifyWaitForMCPUpdating duplicates MCP
polling; remove it and extend the existing shared helper waitForMCP (in
node_utils.go) to accept an option/flag (e.g., requireObserveUpdating bool or a
policy enum) that first polls for condition Type "Updating" == True and only
then proceeds with the existing wait-for-ready semantics; update callers
(including where credVerifyWaitForMCPUpdating was used) to call waitForMCP with
the new option so all MCP transition logic remains centralized and avoids
duplicated/ drifting implementations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6c10e20a-d25e-473b-9257-be65b6405b54
📒 Files selected for processing (1)
test/extended/node/kubelet_secret_pulled_images.go
@Chandan9112: This pull request references OCPNODE-3983 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@Chandan9112: This pull request references OCPNODE-3983 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.
Force-pushed 179ba2d to 57ed3a2.
Force-pushed 57ed3a2 to 63b9239.
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning |
@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3cb375d0-44a6-11f1-837c-4e7f0d7ef13a-0
Force-pushed 63b9239 to 7e09a28.
@Chandan9112 Can you add the link to the job that actually runs this test case?
The tests are skipped in this job because the cluster doesn't have the KubeletEnsureSecretPulledImages feature gate enabled (requires the TechPreviewNoUpgrade or CustomNoUpgrade FeatureSet).
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning-techpreview |
@Chandan9112: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/39bdf220-484c-11f1-96b2-66597a73dafb-0
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning-techpreview |
@Chandan9112: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0a664830-4868-11f1-9015-6470e27966d2-0
@Chandan9112 this should go into main first and then get backported to release-4.22. Can you change the target branch?
Force-pushed 1e11195 to d5a3719.
Commit message (truncated): "…re gate" (Co-authored-by: Cursor <cursoragent@cursor.com>)
Force-pushed d5a3719 to 8400dc7.
Sure @cpmeadors, I have changed the target branch to main.
Actionable comments posted: 1
♻️ Duplicate comments (1)
test/extended/node/kubelet_secret_pulled_images.go (1)
258-266: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win: Cleanup rollout `--timeout=2m` is likely too short for image-registry to come back up.
Scaling `deployment/image-registry` from 0 back to `originalReplicas` typically takes longer than 2m on real CI clusters (pod scheduling on workers, image pull, leader election, readiness). If the rollout doesn't finish before the deferred cleanup returns, the next test (or post-suite teardown) can race against an unhealthy registry. Bump the timeout (10m matches what was suggested previously) and treat the rollout result as best-effort but observable.
♻️ Proposed change

```diff
 g.DeferCleanup(func() {
 	_ = oc.AsAdmin().WithoutNamespace().Run("scale").Args(
 		"deployment/image-registry", "-n", "openshift-image-registry",
 		fmt.Sprintf("--replicas=%d", originalReplicas),
 	).Execute()
 	_ = oc.AsAdmin().WithoutNamespace().Run("rollout").Args(
-		"status", "deployment/image-registry", "-n", "openshift-image-registry", "--timeout=2m",
+		"status", "deployment/image-registry", "-n", "openshift-image-registry", "--timeout=10m",
 	).Execute()
 })
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/node/kubelet_secret_pulled_images.go` around lines 258 - 266, The deferred cleanup waits only 2 minutes for the image-registry rollout which is too short; update the rollout invocation in the g.DeferCleanup block (the oc.AsAdmin().WithoutNamespace().Run("rollout").Args("status", "deployment/image-registry", "-n", "openshift-image-registry", "--timeout=2m").Execute() call) to use a longer timeout (e.g. "--timeout=10m") and treat the result as best-effort (don’t cause cleanup to fail if the rollout times out—capture/ignore the error or log it instead) so the registry has sufficient time to return to originalReplicas.
🧹 Nitpick comments (2)
test/extended/node/kubelet_secret_pulled_images.go (2)
281-310: 💤 Low value: Map iteration over `credentialprovider.DockerConfig` is non-deterministic.
`for _, auth := range cfg` returns the first entry encountered, but Go map iteration order is randomized. For an SA dockercfg there is normally one auth entry, so this is fine in practice; if a future kubelet/OpenShift change adds a second registry endpoint to the secret, this helper would silently pick a non-deterministic credential. Consider keying by `internalRegistryPrefix` (or a host suffix match) instead of taking the first map entry.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/node/kubelet_secret_pulled_images.go` around lines 281 - 310, The helper credVerifyExtractSAPullSecret iterates a credentialprovider.DockerConfig map (cfg) and picks the first auth entry, which is non-deterministic; change the logic to look up the auth by key that matches internalRegistryPrefix (or by suffix match on the host) instead of taking the first map entry, then pass that matched auth.Username/auth.Password into credVerifyBuildDockerConfigJSON and return that result; ensure the function still returns an error via the o.Eventually Succeed path if no matching entry is found.
386-396: 💤 Low value: Confirm `WaitForPodCondition` will satisfy on transient `ErrImagePull` -> `ImagePullBackOff`.
`credVerifyExpectImagePullError` polls only `ContainerStatuses[*].State.Waiting.Reason`. On a freshly created pod that has not yet been scheduled or had containers created, `ContainerStatuses` may be empty for a window, and on registry-down paths kubelet may transition through `ErrImagePull` -> `ImagePullBackOff` quickly. The 3-minute timeout should be enough in practice, but you may also want to inspect `InitContainerStatuses` and treat `CrashLoopBackOff`-style stalls as failure to surface test issues quickly. Worth a sanity pass against CI logs to make sure this doesn't flake when the registry is fully down (Case 5).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/node/kubelet_secret_pulled_images.go` around lines 386 - 396, The current WaitForPodCondition predicate in credVerifyExpectImagePullError only inspects ContainerStatuses and can miss transient transitions or initial empty statuses; update the predicate used by e2epod.WaitForPodCondition to also iterate p.Status.InitContainerStatuses, treat Waiting reasons "ErrImagePull" and "ImagePullBackOff" as the match you want, and additionally treat "CrashLoopBackOff" (from either ContainerStatuses or InitContainerStatuses) as a failure/early-surface condition so the test doesn't hang waiting silently; ensure the predicate handles empty status arrays by continuing to poll rather than assuming success.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/node/kubelet_secret_pulled_images.go`:
- Around line 230-238: The pod `pod-alwaysverify-nosecret` is inheriting the
default SA's dockercfg so AlwaysVerify isn't actually being tested; fix by
creating a dedicated ServiceAccount with imagePullSecrets: [] (no dockercfg) and
use that SA for the test pod: create e.g. SA "no-pull-sa" in the test namespace
(ensure it has empty imagePullSecrets and optionally
automountServiceAccountToken: false), then modify the call that creates the pod
(credVerifyPod invocation for "pod-alwaysverify-nosecret") to set
serviceAccountName to "no-pull-sa" (either extend credVerifyPod to accept a
serviceAccountName parameter or construct the pod spec inline) and keep the
credVerifyExpectImagePullError call targeting that pod so AlwaysVerify rejection
is exercised properly.
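The dedicated credential-less ServiceAccount described above might look like the following manifest; the name and namespace are illustrative, and note that OpenShift's SA controller may still link a dockercfg secret, which the review's suggestion of an explicitly empty imagePullSecrets list is meant to counter:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: no-pull-sa            # illustrative name from the review suggestion
  namespace: cred-verify-test # placeholder test namespace
imagePullSecrets: []          # no dockercfg, so AlwaysVerify has nothing to satisfy
automountServiceAccountToken: false
```

The test pod for "pod-alwaysverify-nosecret" would then set `serviceAccountName: no-pull-sa` in its spec.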
---
Duplicate comments:
In `@test/extended/node/kubelet_secret_pulled_images.go`:
- Around line 258-266: The deferred cleanup waits only 2 minutes for the
image-registry rollout which is too short; update the rollout invocation in the
g.DeferCleanup block (the
oc.AsAdmin().WithoutNamespace().Run("rollout").Args("status",
"deployment/image-registry", "-n", "openshift-image-registry",
"--timeout=2m").Execute() call) to use a longer timeout (e.g. "--timeout=10m")
and treat the result as best-effort (don’t cause cleanup to fail if the rollout
times out—capture/ignore the error or log it instead) so the registry has
sufficient time to return to originalReplicas.
---
Nitpick comments:
In `@test/extended/node/kubelet_secret_pulled_images.go`:
- Around line 281-310: The helper credVerifyExtractSAPullSecret iterates a
credentialprovider.DockerConfig map (cfg) and picks the first auth entry, which
is non-deterministic; change the logic to look up the auth by key that matches
internalRegistryPrefix (or by suffix match on the host) instead of taking the
first map entry, then pass that matched auth.Username/auth.Password into
credVerifyBuildDockerConfigJSON and return that result; ensure the function
still returns an error via the o.Eventually Succeed path if no matching entry is
found.
- Around line 386-396: The current WaitForPodCondition predicate in
credVerifyExpectImagePullError only inspects ContainerStatuses and can miss
transient transitions or initial empty statuses; update the predicate used by
e2epod.WaitForPodCondition to also iterate p.Status.InitContainerStatuses, treat
Waiting reasons "ErrImagePull" and "ImagePullBackOff" as the match you want, and
additionally treat "CrashLoopBackOff" (from either ContainerStatuses or
InitContainerStatuses) as a failure/early-surface condition so the test doesn't
hang waiting silently; ensure the predicate handles empty status arrays by
continuing to poll rather than assuming success.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9045bc92-642a-449d-9c0a-52b945afadad
📒 Files selected for processing (2)
test/extended/node/README.md
test/extended/node/kubelet_secret_pulled_images.go
✅ Files skipped from review due to trivial changes (1)
- test/extended/node/README.md
Scheduling required tests:
@Chandan9112: The following tests failed, say
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/verified by CI
@Chandan9112: This PR has been marked as verified by
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview
@Chandan9112: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a7d1a480-49f7-11f1-807e-fc6872531416-0
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview
@Chandan9112: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3940ff0-4a20-11f1-9614-0149f060bdbf-0
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Chandan9112, cpmeadors, ngopalak-redhat. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
Summary
Test Cases
Prerequisites
- Requires the KubeletEnsureSecretPulledImages feature gate to be enabled (TechPreviewNoUpgrade or CustomNoUpgrade FeatureSet)
- Tests are tagged [OCPFeatureGate:KubeletEnsureSecretPulledImages] and will be skipped automatically on clusters without the feature gate enabled
Testing Result
Tested on latest 4.22 OCP cluster
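For reproducing this on a disposable cluster, the required feature set can be enabled roughly as follows (a sketch; note that TechPreviewNoUpgrade is irreversible and blocks cluster upgrades):

```shell
# Irreversible: TechPreviewNoUpgrade cannot be unset once applied.
oc patch featuregate cluster --type merge -p '{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'
# Nodes roll out new MachineConfigs; wait for all pools to settle.
oc wait mcp --all --for=condition=Updated --timeout=30m
```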
References
Summary by CodeRabbit