OCPBUGS-85058: monitortests: allow etcd CO blips during TNF jobs on two-node upgrades #31138
eggfoobar wants to merge 2 commits into openshift:main from
Conversation
Teach legacy CVO monitor tests to treat cluster-etcd-operator condition reasons shaped as tnf-*_JobRunning as expected on DualReplica and HighlyAvailableArbiter topologies: Available=False during upgrade, and Progressing=True while machine-config is progressing. Add an isTNFJobClusterOperatorReason helper and unit tests so we only match CEO job-running surfaces, not unrelated etcd failures.

Co-authored-by: Cursor Composer <noreply@cursor.com>
Signed-off-by: ehila <ehila@redhat.com>
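The reason-matching described above can be sketched as a small Go helper. The function name and the tnf-*_JobRunning shape come from the PR description, but the exact regular expression is an assumption, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
)

// tnfJobReasonPattern matches condition reasons shaped like "tnf-<job>_JobRunning".
// The precise pattern used in operators.go may differ; this is illustrative.
var tnfJobReasonPattern = regexp.MustCompile(`^tnf-.+_JobRunning$`)

// isTNFJobClusterOperatorReason reports whether a cluster-etcd-operator
// condition reason corresponds to a running TNF (two-node fencing) job.
func isTNFJobClusterOperatorReason(reason string) bool {
	return tnfJobReasonPattern.MatchString(reason)
}

func main() {
	fmt.Println(isTNFJobClusterOperatorReason("tnf-setup_JobRunning")) // true
	fmt.Println(isTNFJobClusterOperatorReason("EtcdMembersDegraded"))  // false
}
```

Anchoring the pattern at both ends keeps unrelated etcd failure reasons from matching, which is the stated goal of the helper.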
Pipeline controller notification. For optional jobs, comment. This repository is configured in: automatic mode.
@eggfoobar: This pull request references Jira Issue OCPBUGS-85058, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
Walkthrough
The PR changes the upgrade test flow to call a topology-aware progressing-state transitions function (passing the admin REST config and a patch-level flag), adds TNF JobRunning reason detection and two-node topology exceptions for operator conditions, and adds unit tests for the TNF-reason detector.

Changes: Upgrade Flow & Two-Node TNF Handling
🎯 4 (Complex) | ⏱️ ~45 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging; addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (10 passed)
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f7d036b0-4982-11f1-8f62-95a62a82cf83-0
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: eggfoobar. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
@eggfoobar: This pull request references Jira Issue OCPBUGS-85058, which is valid. 3 validation(s) were run on this bug.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go (1)
Lines 95-99: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't abort the whole monitor evaluation when upgrade-level lookup fails.

This new branch turns a transient `ClusterVersion` read failure into `EvaluateTestsFromConstructedIntervals` returning an error, which drops the JUnit cases already built above. Since this value only decides whether the progressing-state suite should be skipped for patch upgrades, please fall back to a conservative skip instead of failing the entire legacy monitor run.

Suggested change:

```diff
-	level, err := getUpgradeLevel(w.adminRESTConfig)
-	if err != nil || level == unknownUpgradeLevel {
-		return nil, fmt.Errorf("failed to determine upgrade level: %w", err)
-	}
-	junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, level == patchUpgradeLevel, w.adminRESTConfig)...)
+	isPatchLevelUpgrade := true // default to skipping this suite if the level cannot be determined
+	if level, err := getUpgradeLevel(w.adminRESTConfig); err == nil && level != unknownUpgradeLevel {
+		isPatchLevelUpgrade = level == patchUpgradeLevel
+	}
+	junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, isPatchLevelUpgrade, w.adminRESTConfig)...)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go` around lines 95 - 99, The code currently returns an error when getUpgradeLevel fails which aborts EvaluateTestsFromConstructedIntervals and drops previously built JUnit cases; instead, swallow the transient failure and conservatively skip the progressing-state suite: call getUpgradeLevel(w.adminRESTConfig), and if it returns an error or unknownUpgradeLevel, do not return—set isPatch := false (or treat as non-patch); otherwise set isPatch := (level == patchUpgradeLevel); then call testUpgradeOperatorProgressingStateTransitions(finalIntervals, isPatch, w.adminRESTConfig) and append its results to junits; ensure no early return on getUpgradeLevel failure so previously accumulated junits are preserved.
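The conservative-fallback pattern the reviewer suggests can be sketched in isolation. Here `getUpgradeLevel` is a stand-in that simulates the transient failure, and the names mirror the suggested change rather than the repository's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

type upgradeLevel int

const (
	unknownUpgradeLevel upgradeLevel = iota
	patchUpgradeLevel
	minorUpgradeLevel
)

// getUpgradeLevel is a stand-in for the real lookup against the admin REST
// config; here it simulates a transient ClusterVersion read failure.
func getUpgradeLevel() (upgradeLevel, error) {
	return unknownUpgradeLevel, errors.New("transient ClusterVersion read failure")
}

func main() {
	// Default to true (skip the patch-only suite) when the level cannot be
	// determined, instead of returning an error and dropping earlier junits.
	isPatchLevelUpgrade := true
	if level, err := getUpgradeLevel(); err == nil && level != unknownUpgradeLevel {
		isPatchLevelUpgrade = level == patchUpgradeLevel
	}
	fmt.Println("isPatchLevelUpgrade:", isPatchLevelUpgrade)
}
```

The key point is that the lookup failure is swallowed inside the `if`, so the monitor keeps whatever JUnit cases it has already accumulated.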
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c4c2d66f-fa68-4d8c-ad75-24b48e3774e1
📒 Files selected for processing (3)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go
pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go
pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators_test.go
Scheduling required tests:
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a2372a0-49a2-11f1-8086-2c655c08170e-0
…low-ups

Allow etcd Progressing while MCO runs when CEO reports EtcdMembers_MembersNotStarted (member still joining during fencing/replacement), in addition to existing tnf-*_JobRunning. On dual-replica, tolerate openshift-apiserver Available blips for APIServices_PreconditionNotReady during upgrade, and MCO-time Progressing for OperatorConfig_NewGeneration (openshift-apiserver) and APIServerDeployment_NewGeneration (authentication) as operators roll the control plane alongside machine-config.

Co-authored-by: Cursor Composer <noreply@cursor.com>
Signed-off-by: ehila <ehila@redhat.com>
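One way to model the tolerations this commit describes is a lookup keyed by operator, condition, and reason. The table below is an illustrative sketch built from the reasons named above; the type and function names are hypothetical, not the PR's actual code:

```go
package main

import "fmt"

// exceptionKey identifies an operator condition blip tolerated on two-node
// (dual-replica) topologies during upgrade. Names here are illustrative.
type exceptionKey struct {
	operator  string
	condition string // "Available" or "Progressing"
	reason    string
}

// twoNodeUpgradeExceptions enumerates the tolerations from the commit message.
var twoNodeUpgradeExceptions = map[exceptionKey]bool{
	{"etcd", "Progressing", "EtcdMembers_MembersNotStarted"}:                 true,
	{"openshift-apiserver", "Available", "APIServices_PreconditionNotReady"}: true,
	{"openshift-apiserver", "Progressing", "OperatorConfig_NewGeneration"}:   true,
	{"authentication", "Progressing", "APIServerDeployment_NewGeneration"}:   true,
}

func isToleratedTwoNodeBlip(operator, condition, reason string) bool {
	return twoNodeUpgradeExceptions[exceptionKey{operator, condition, reason}]
}

func main() {
	fmt.Println(isToleratedTwoNodeBlip("etcd", "Progressing", "EtcdMembers_MembersNotStarted")) // true
	fmt.Println(isToleratedTwoNodeBlip("etcd", "Available", "SomeOtherFailure"))                // false
}
```

Keying on the full (operator, condition, reason) triple keeps the exceptions narrow, so an unrelated etcd failure with a different reason still fails the test.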
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade
1 similar comment
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade
Scheduling required tests:
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ccb215f0-49d9-11f1-8175-d81295c02a18-0
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/84b75970-49db-11f1-8000-23a6f1e43003-0 |
Teach legacy CVO monitor tests to treat cluster-etcd-operator condition reasons shaped as tnf-*_JobRunning as expected on DualReplica and HighlyAvailableArbiter topologies: Available=False during upgrade, and Progressing=True while machine-config is progressing.
Add isTNFJobClusterOperatorReason helper and unit tests so we only match CEO job-running surfaces, not unrelated etcd failures.
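A table-driven unit test in the spirit described above might look like the following. The helper is re-sketched inline so the example is self-contained, and the regex shape is an assumption rather than the code in operators.go:

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative re-sketch of the helper under test; the real version may differ.
var tnfJobReasonPattern = regexp.MustCompile(`^tnf-.+_JobRunning$`)

func isTNFJobClusterOperatorReason(reason string) bool {
	return tnfJobReasonPattern.MatchString(reason)
}

func main() {
	// Table-driven cases: match only CEO job-running reasons, not unrelated
	// etcd failures or other job states.
	cases := []struct {
		reason string
		want   bool
	}{
		{"tnf-fencing_JobRunning", true},
		{"tnf-auth_JobRunning", true},
		{"JobRunning", false},
		{"EtcdMembersDegraded", false},
		{"tnf-fencing_JobFailed", false},
	}
	for _, c := range cases {
		got := isTNFJobClusterOperatorReason(c.reason)
		fmt.Printf("%-25s => %v (want %v)\n", c.reason, got, c.want)
	}
}
```

In the repository this would normally live in operators_test.go as a `func TestIsTNFJobClusterOperatorReason(t *testing.T)` with `t.Errorf` on mismatches; the main-based form here just keeps the sketch runnable standalone.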
Summary by CodeRabbit
Bug Fixes
Tests