
Add endpoint network disruption test for DFBUGS-5838 #15116

Open
sagihirshfeld wants to merge 3 commits into red-hat-storage:master from sagihirshfeld:test-endpoint-network-disruption



@sagihirshfeld
Contributor

Summary

  • Add test verifying noobaa-endpoint pods survive when cloud storage connections are severed mid-stream during upload and download
  • Use a NetworkPolicy to block all external egress from endpoint pods while preserving internal cluster traffic
  • Cover AWS, Azure, GCP, and IBM COS across namespacestore and backingstore configurations (14 parametrized variants)
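
The variant IDs can be pictured as a cross product of operation, store type, and platform. The sketch below is illustrative only: the ID shape follows `download-backingstore-aws`, which appears in the validation runs for this PR, while the other spellings (`ibmcos` in particular) and the helper name `variant_ids` are assumptions. The full product yields 16 combinations; the PR reports 14, and which two are left out is not stated here, so exclusions are a parameter rather than a guess.

```python
import itertools

# Assumed spellings; only "download-backingstore-aws" is confirmed by the PR conversation.
OPERATIONS = ("upload", "download")
STORE_TYPES = ("namespacestore", "backingstore")
PLATFORMS = ("aws", "azure", "gcp", "ibmcos")


def variant_ids(excluded=frozenset()):
    """Build '<operation>-<store type>-<platform>' IDs, skipping any in `excluded`."""
    ids = []
    for operation, store_type, platform in itertools.product(
        OPERATIONS, STORE_TYPES, PLATFORMS
    ):
        variant = f"{operation}-{store_type}-{platform}"
        if variant not in excluded:
            ids.append(variant)
    return ids


if __name__ == "__main__":
    for variant in variant_ids():
        print(variant)
```

A list like this could be passed as the `ids` argument of `pytest.mark.parametrize`, with the real exclusions supplied by whoever knows which combinations are unsupported.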

Details

Automates verification of DFBUGS-5838: noobaa-endpoint pods were crashing with Exit Code 1 due to an unhandled AbortError when TCP connections to cloud storage were severed.

The test:

  1. Creates a bucket backed by cloud storage
  2. Starts a large (2 GB) upload or download in a background thread
  3. Applies a NetworkPolicy that blocks external egress from noobaa-endpoint pods only (internal cluster traffic is preserved via namespaceSelector: {})
  4. Verifies the endpoint pods remain Running, have not restarted, and have no PANIC/uncaughtException in logs
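
Steps 2 to 4 could be sketched roughly as follows. This is not the code from the PR: `run_transfer_with_disruption`, `PodSnapshot`, and `find_endpoint_failures` are hypothetical names, and the transfer and disruption steps are passed in as plain callables so the sketch runs without a cluster.

```python
import threading
import time
from dataclasses import dataclass

# Log markers named in the PR description.
FORBIDDEN_LOG_MARKERS = ("PANIC", "uncaughtException")


@dataclass(frozen=True)
class PodSnapshot:
    """Hypothetical, minimal view of one endpoint pod at a point in time."""
    name: str
    phase: str
    restart_count: int
    logs: str = ""


def run_transfer_with_disruption(transfer, disrupt, head_start=0.0, join_timeout=None):
    """Run `transfer` in a background thread, then call `disrupt` mid-stream.

    Returns the exceptions raised by the transfer. A failed transfer is
    expected once egress is blocked; the pods are what the test judges.
    """
    errors = []

    def worker():
        try:
            transfer()
        except Exception as exc:  # collected, not re-raised
            errors.append(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    time.sleep(head_start)  # let the transfer get going before cutting it
    disrupt()
    thread.join(join_timeout)
    return errors


def find_endpoint_failures(before, after):
    """Compare pod snapshots taken before and after the disruption."""
    baseline = {pod.name: pod.restart_count for pod in before}
    failures = []
    for pod in after:
        if pod.phase != "Running":
            failures.append(f"{pod.name}: phase is {pod.phase}")
        if pod.restart_count > baseline.get(pod.name, 0):
            failures.append(f"{pod.name}: restarted during the disruption")
        for marker in FORBIDDEN_LOG_MARKERS:
            if marker in pod.logs:
                failures.append(f"{pod.name}: found {marker} in logs")
    return failures
```

In a real test, `transfer` would wrap the 2 GB upload or download, `disrupt` would apply the NetworkPolicy, and the test would assert that `find_endpoint_failures` returns an empty list.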

New files

  • tests/functional/object/mcg/test_endpoint_network_disruption.py
  • ocs_ci/templates/mcg/block_egress_network_policy.yaml
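
The contents of the NetworkPolicy template are not shown in this conversation. As a rough sketch of a policy with the behavior described above, the manifest below is built as a Python dict and printed as JSON, which Kubernetes accepts alongside YAML. The policy name and the helper `build_block_egress_policy` are hypothetical, and the `noobaa-s3: noobaa` pod label is an assumption taken from the selector in the validation log. The only allowed egress destination is `namespaceSelector: {}`, which matches pods in every namespace, so traffic to addresses outside the cluster is dropped.

```python
import json


def build_block_egress_policy(name="block-endpoint-external-egress", pod_labels=None):
    """Return a NetworkPolicy manifest that limits egress to in-cluster pods.

    No namespace is set in metadata, so the policy lands in whichever
    namespace it is applied to.
    """
    if pod_labels is None:
        # Assumed label for noobaa-endpoint pods.
        pod_labels = {"noobaa-s3": "noobaa"}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Egress"],
            # An empty namespaceSelector matches all namespaces, so only
            # pod-to-pod traffic inside the cluster remains allowed.
            "egress": [{"to": [{"namespaceSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_block_egress_policy(), indent=2))
```

Deleting the policy afterwards restores external egress, which is why a test built on it would want cleanup in a teardown step.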

Modified files

  • ocs_ci/ocs/constants.py — added TEMPLATE_BLOCK_NB_EGRESS_NETWORK_POLICY

Automate verification that noobaa-endpoint pods survive when TCP
connections to cloud storage are severed mid-stream (upload and
download), covering AWS, Azure, GCP, and IBM COS with both
namespacestore and backingstore configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sagi Hirshfeld <shirshfe@redhat.com>
@sagihirshfeld sagihirshfeld self-assigned this May 11, 2026
@sagihirshfeld sagihirshfeld added the Needs Testing (Run tests and provide logs link), MCG (Multi Cloud Gateway / NooBaa related issues), and Squad/Red labels May 11, 2026
@pull-request-size pull-request-size Bot added the size/L (PR that changes 100-499 lines) label May 11, 2026

@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: shirshfe-22ibm09
Cluster Configuration: conf/deployment/ibmcloud/ipi_3az_rhcos_3m_3w.yaml
PR Test Suite:
PR Test Path: tests/functional/object/mcg/test_endpoint_network_disruption.py
Additional Test Params: --skip-rpm-go-version-collection
OCP VERSION: 4.22
OCS VERSION: 4.22
tested against branch: master

Job UNSTABLE (some or all tests failed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sagi Hirshfeld <shirshfe@redhat.com>
@sagihirshfeld
Contributor Author

In the validation run above, only one test failed, and the cause looks like a transient connectivity problem rather than the test itself:

ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage --selector=noobaa-s3=noobaa -o yaml.
Error is Unable to connect to the server: ... i/o timeout

Rerunning with only the affected test: test_endpoint_survives_cloud_connection_severed[download-backingstore-aws]


@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: shirshfe-22ibm09
Cluster Configuration: conf/deployment/ibmcloud/ipi_3az_rhcos_3m_3w.yaml
PR Test Suite:
PR Test Path: tests/functional/object/mcg/test_endpoint_network_disruption.py::TestEndpointCloudNetworkDisruption:: test_endpoint_survives_cloud_connection_severed[download-backingstore-aws]
Additional Test Params: --skip-rpm-go-version-collection
OCP VERSION: 4.22
OCS VERSION: 4.22
tested against branch: master

Job FAILED (installation failed, tests not executed).


@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: shirshfe-22ibm09
Cluster Configuration: conf/deployment/ibmcloud/ipi_3az_rhcos_3m_3w.yaml
PR Test Suite:
PR Test Path: tests/functional/object/mcg/test_endpoint_network_disruption.py::TestEndpointCloudNetworkDisruption::test_endpoint_survives_cloud_connection_severed[download-backingstore-aws]
Additional Test Params: --skip-rpm-go-version-collection
OCP VERSION: 4.22
OCS VERSION: 4.22
tested against branch: master

Job PASSED.

@sagihirshfeld sagihirshfeld added the Verified (Mark when PR was verified and log provided) label and removed the Needs Testing (Run tests and provide logs link) label May 11, 2026
Remove unused platform parameter from parametrize, fix temp file
leak, consolidate verification loops, and drop hardcoded namespace
from NetworkPolicy template.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Sagi Hirshfeld <shirshfe@redhat.com>

@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: shirshfe-22ibm09
Cluster Configuration: conf/deployment/ibmcloud/ipi_3az_rhcos_3m_3w.yaml
PR Test Suite:
PR Test Path: tests/functional/object/mcg/test_endpoint_network_disruption.py::TestEndpointCloudNetworkDisruption::test_endpoint_survives_cloud_connection_severed[download-backingstore-aws]
Additional Test Params: --skip-rpm-go-version-collection
OCP VERSION: 4.22
OCS VERSION: 4.22
tested against branch: master

Job PASSED.

@sagihirshfeld sagihirshfeld marked this pull request as ready for review May 12, 2026 09:09
@sagihirshfeld sagihirshfeld requested review from a team as code owners May 12, 2026 09:09
Contributor

@ypersky1980 ypersky1980 left a comment


LGTM

@openshift-ci

openshift-ci Bot commented May 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sagihirshfeld, ypersky1980

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci Bot commented May 14, 2026

PR needs rebase.



Labels

lgtm
MCG: Multi Cloud Gateway / NooBaa related issues
needs-rebase
size/L: PR that changes 100-499 lines
Squad/Red
Verified: Mark when PR was verified and log provided


3 participants