
Add topology-aware provisioning tests for external mode (RHSTOR-5525) #15151

Open
AviadP wants to merge 3 commits into red-hat-storage:master from AviadP:worktree-RHSTOR-5525

Conversation

@AviadP
Contributor

@AviadP AviadP commented May 14, 2026

  • Add create_topology_pools() to ExternalCluster for replica-3 pool creation
  • Extract build_exporter_base_params() from get_external_cluster_details()
  • Remove IBM Cloud platform gate on jump_host SSH configuration
  • Fix upload_exporter_script() to use /tmp/ for remote path
  • Add DEFAULT_EXTERNAL_MODE_STORAGECLASS_NON_RESILIENT_RBD constant
  • Add test_topology_awareness.py with 5 tests:
    • SC auto-creation with WaitForFirstConsumer
    • PVC pending without pod
    • Single pod topology placement verification
    • StatefulSet spread across topology pools
    • PVC deletion cleans RBD image

Summary by CodeRabbit

  • New Features

    • Topology-aware provisioning for external mode: automatic pool creation and placement across failure domains.
    • New non-resilient RBD storage class option for external mode.
    • Enhanced exporter handling and deployment configuration for external clusters.
  • Tests

    • Added end-to-end tests validating topology-aware provisioning, PVC/pod placement, StatefulSet spread, and cleanup.
  • Chores

    • Updated secrets baseline entries.


@AviadP AviadP requested a review from a team as a code owner May 14, 2026 10:29
@openshift-ci

openshift-ci Bot commented May 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AviadP

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- Add create_topology_pools() to ExternalCluster for replica-3 pool creation
- Extract build_exporter_base_params() from get_external_cluster_details()
- Remove IBM Cloud platform gate on jump_host SSH configuration
- Fix upload_exporter_script() to use /tmp/ for remote path
- Add DEFAULT_EXTERNAL_MODE_STORAGECLASS_NON_RESILIENT_RBD constant
- Add test_topology_awareness.py with 5 tests:
  - SC auto-creation with WaitForFirstConsumer
  - PVC pending without pod
  - Single pod topology placement verification
  - StatefulSet spread across topology pools
  - PVC deletion cleans RBD image

Signed-off-by: Aviadp <apolak@redhat.com>
@AviadP AviadP force-pushed the worktree-RHSTOR-5525 branch from f23a568 to ada7824 Compare May 14, 2026 10:58
@coderabbitai

coderabbitai Bot commented May 14, 2026

📝 Walkthrough


This PR implements topology-aware provisioning for OCS external mode: refactors exporter parameter building, adds topology pool creation, introduces a non‑resilient RBD storageclass constant, updates a secrets baseline mapping, and adds functional tests validating topology-constrained StorageClass behavior and pod-to-pool placement.

Changes

Topology-Aware Provisioning for OCS External Mode

| Layer | File(s) | Summary |
|---|---|---|
| Storage class constant and baseline updates | ocs_ci/ocs/constants.py, .secrets.baseline | Adds DEFAULT_EXTERNAL_MODE_STORAGECLASS_NON_RESILIENT_RBD constant and updates baseline line numbers. |
| ExternalCluster init and SSH handling | ocs_ci/deployment/helpers/external_cluster_helpers.py | Unifies ssh_jump_host reading and fills missing private_key from deployment config in ExternalCluster.__init__. |
| Exporter parameter builder | ocs_ci/deployment/helpers/external_cluster_helpers.py | Centralizes exporter parameter construction in build_exporter_base_params(), including RGW and CephFS flags. |
| Namespace handling and details refactor | ocs_ci/deployment/helpers/external_cluster_helpers.py | Extends get_external_cluster_details() to create/use RBD namespace when enabled and append dependent exporter flags. |
| Upload exporter script & topology pool creation | ocs_ci/deployment/helpers/external_cluster_helpers.py | upload_exporter_script() uploads to remote /tmp/<basename> and returns remote path; adds create_topology_pools() to idempotently create/configure replicated RBD pools. |
| Test module docs, imports, logger | tests/functional/external_mode/test_topology_awareness.py | Adds module docstring, imports, and module-level logger for the tests. |
| Topology helpers and pool mapping | tests/functional/external_mode/test_topology_awareness.py | Adds helpers to collect worker hostnames, build topology config, pool-to-failure-domain mapping, and read Ceph pool from bound PVCs. |
| Secret save/patch and operator restart orchestration | tests/functional/external_mode/test_topology_awareness.py | Adds functions to save/patch/restore rook-ceph-external-cluster-details, restart operator pods, and wait for StorageCluster readiness. |
| RGW detection and exporter param composition | tests/functional/external_mode/test_topology_awareness.py | Detects RGW presence and composes exporter parameters with topology pools and failure-domain flags. |
| Test class and autouse fixture setup | tests/functional/external_mode/test_topology_awareness.py | Implements TestTopologyAwarenessExternal with a class-scoped autouse fixture that creates topology pools, runs the exporter, patches the external secret, restarts operators, and registers cleanup finalizers. |
| StorageClass auto-creation validation | tests/functional/external_mode/test_topology_awareness.py | Asserts operator auto-creates the non‑resilient topology StorageClass with expected topology fields and WaitForFirstConsumer binding. |
| PVC pending behavior test | tests/functional/external_mode/test_topology_awareness.py | Verifies a PVC using the topology StorageClass stays Pending when unconsumed. |
| Single-pod pool placement test | tests/functional/external_mode/test_topology_awareness.py | Schedules a pod, verifies PV’s Ceph pool matches node failure-domain mapping, and confirms backend RBD image existence. |
| StatefulSet topology spread test | tests/functional/external_mode/test_topology_awareness.py | Creates a 3-replica StatefulSet with topology spread constraints and verifies per-pod PV pool placement across three pools. |
| PVC deletion reclamation test | tests/functional/external_mode/test_topology_awareness.py | Writes data to a PVC, deletes it, and verifies the backend RBD image is removed. |
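
The StatefulSet spread test summarized above relies on standard Kubernetes topology spread constraints. The following is a minimal sketch of how such a constraint and the matching pool check could look. The helper names, the `app` label, and the choice of `kubernetes.io/hostname` as topology key are assumptions for illustration, not code taken from this PR.

```python
def spread_constraints(topology_key, app_label, max_skew=1):
    """Build a topologySpreadConstraints list for a pod template.

    Hypothetical helper: the field names follow the Kubernetes pod spec,
    but the function itself is not part of ocs-ci.
    """
    return [
        {
            "maxSkew": max_skew,
            "topologyKey": topology_key,
            # DoNotSchedule makes the constraint hard, so pods cannot pile up
            # in one failure domain.
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {"matchLabels": {"app": app_label}},
        }
    ]


def pools_are_spread(pod_to_pool, expected_pools):
    """Return True when the pods' PVs cover every expected pool exactly once."""
    used = list(pod_to_pool.values())
    return len(used) == len(set(used)) and set(used) == set(expected_pools)


# Illustrative values only; pool and pod names are made up.
constraints = spread_constraints("kubernetes.io/hostname", "topology-sts")
placement = {"sts-0": "pool-a", "sts-1": "pool-b", "sts-2": "pool-c"}
print(pools_are_spread(placement, ["pool-a", "pool-b", "pool-c"]))
```

With three replicas and three pools, a passing check means each replica's PV landed in a different pool.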

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A rabbit hummed by the storage tree,
"Pools mapped to hosts, as neat as can be,
Three pods hop home to failure domains,
Images vanish when PVCs drain,
External mode sings — topology glee!" 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and specifically describes the main change: adding topology-aware provisioning tests for external mode, with a clear reference to the corresponding ticket (RHSTOR-5525). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 96.15% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ocs_ci/deployment/helpers/external_cluster_helpers.py`:
- Around line 123-127: The code assumes config.DEPLOYMENT["ssh_key_private"]
always exists and may raise KeyError; in the jump-host initialization (where
self.jump_host is set) change the assignment to first check
config.DEPLOYMENT.get("ssh_key_private") and only set
self.jump_host["private_key"] when that value is present (and expanduser it);
otherwise leave self.jump_host as-is so password or other auth fields can be
used. Locate the block that sets self.jump_host in external_cluster_helpers.py
and replace the direct dict access with a safe .get-based conditional.
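
The safe-lookup change requested above can be sketched as follows. This is an illustration under stated assumptions: `resolve_jump_host` is a hypothetical standalone function, and plain dicts stand in for `config.DEPLOYMENT` and `self.jump_host`.

```python
import os


def resolve_jump_host(jump_host, deployment_cfg):
    """Return a copy of jump_host, filling private_key only when one is configured.

    Hypothetical helper: the real logic lives in ExternalCluster.__init__.
    """
    resolved = dict(jump_host)
    key_path = deployment_cfg.get("ssh_key_private")  # no KeyError when absent
    if key_path and not resolved.get("private_key"):
        resolved["private_key"] = os.path.expanduser(key_path)
    # Without a configured key, password or other auth fields stay untouched.
    return resolved


print(resolve_jump_host({"host": "jump", "password": "secret"}, {}))
```

A jump host that authenticates by password passes through unchanged, which is the behavior the finding asks for.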

In `@tests/functional/external_mode/test_topology_awareness.py`:
- Around line 365-374: The TimeoutSampler loop using func=sc_ocp.get to wait for
the StorageClass named sc_name can exit without confirming creation; after the
loop ends, check whether a successful sample was found and if not fail fast
(e.g., raise AssertionError or call pytest.fail) with a clear message indicating
the topology StorageClass sc_name was not reconciled within the timeout so
subsequent tests stop immediately; update the block that contains
TimeoutSampler, sc_ocp.get and sc_name to perform this explicit check and fail
when no sample was returned.
- Around line 78-93: The returned topology config may have mismatched lengths
between pool_names and failure_domain_values; before the return in the
config-generation block (where pool_names, fd_label, fd_values, pool_size,
pg_num are set in test_topology_awareness.py) add a validation that
len(pool_names) == len(fd_values) and raise a clear exception (or use an assert)
if they differ, including both lengths in the message so the test fails with a
meaningful error rather than proceeding with inconsistent mapping.
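
The fail-fast wait requested for the StorageClass check can be illustrated with a small polling loop. This sketch deliberately does not use `TimeoutSampler`; `wait_for_resource` and `get_resource` are hypothetical stand-ins, assuming a getter that returns a falsy value until the object exists.

```python
import time


def wait_for_resource(get_resource, name, timeout=5.0, interval=0.5):
    """Poll get_resource(name) until it returns a truthy value.

    Raises AssertionError when the timeout expires, so later tests do not
    run against a StorageClass that was never created.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = get_resource(name)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise AssertionError(
                f"StorageClass {name!r} was not reconciled within {timeout}s"
            )
        time.sleep(interval)
```

The key point is that the loop has exactly two exits: a successful sample or an explicit failure. It never falls through silently.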

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6dcc0aeb-c3a5-4d9a-81a9-574efd80cd76

📥 Commits

Reviewing files that changed from the base of the PR and between 51b29d0 and ada7824.

📒 Files selected for processing (4)
  • .secrets.baseline
  • ocs_ci/deployment/helpers/external_cluster_helpers.py
  • ocs_ci/ocs/constants.py
  • tests/functional/external_mode/test_topology_awareness.py

…t SC wait

- Use safe .get() for ssh_key_private fallback in jump host init
- Validate pool_names and failure_domain_values length match
- Fail fast with assert if topology SC not created within timeout

Signed-off-by: Aviadp <apolak@redhat.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ocs_ci/deployment/helpers/external_cluster_helpers.py`:
- Around line 243-253: The exporter arguments are built before the side-effect
that sets restricted auth and alias_rbd_name, so build_exporter_base_params()
can miss adding --cephfs-filesystem-name/related flags; move the conditional
that checks rbd_name for "." or "_" and sets
config.ENV_DATA["restricted-auth-permission"] and
config.ENV_DATA["alias_rbd_name"] to run before calling
self.build_exporter_base_params(), ensuring build_exporter_base_params() sees
the updated ENV_DATA and emits the correct flags (refer to rbd_name,
build_exporter_base_params, config.ENV_DATA["restricted-auth-permission"], and
config.ENV_DATA["alias_rbd_name"]).
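
The ordering issue described above can be shown in isolation. This is a simplified sketch, not the PR's code: `build_params` and `exporter_params` are hypothetical, a plain dict stands in for `config.ENV_DATA`, and only the flags named in this review are emitted.

```python
def build_params(env_data, cluster_name):
    """Stand-in for the builder: it only sees what is already in env_data."""
    params = ""
    if env_data.get("restricted-auth-permission"):
        params += f" --restricted-auth-permission true --cluster-name {cluster_name}"
    return params.strip()


def exporter_params(env_data, rbd_name, cluster_name):
    """Apply the rbd_name side effects first, then call the builder."""
    env = dict(env_data)  # copy so the caller's mapping is not mutated
    if "." in rbd_name or "_" in rbd_name:
        env["restricted-auth-permission"] = True
        env["alias_rbd_name"] = rbd_name.replace(".", "-").replace("_", "-")
    return build_params(env, cluster_name)


print(exporter_params({}, "my.pool", "cluster1"))
```

If the `if` block ran after `build_params()`, the first call would return an empty string for `my.pool`, which is the bug the finding describes.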

In `@tests/functional/external_mode/test_topology_awareness.py`:
- Around line 70-85: The topology helper only advertises failure domains for
fd_values = topo_cfg.get("failure_domain_values", hostnames[:3]) but test
pods/StatefulSet creations later can land on unmapped nodes; fix by ensuring
workloads are scheduled only onto those advertised domains: either set fd_values
to include all candidate workers (e.g., use hostnames instead of hostnames[:3])
or, preferably, add a node selector/nodeAffinity to the pod/StatefulSet specs
used in the tests that matches the failure domain label (fd_label) with the
values in fd_values so pods are constrained to those nodes (update the code that
creates pods/StatefulSet to include the nodeAffinity/nodeSelector referencing
fd_label and fd_values).
- Around line 414-416: The test currently only checks for the presence of the
topologyFailureDomainLabel key in params; update the assertion to verify the
actual value matches the expected config by asserting
params["topologyFailureDomainLabel"] == self.topo_config["failure_domain_label"]
(use the existing params, sc_name and self.topo_config symbols) so the
StorageClass contract is fully validated and retain the log.info call afterward.
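
One way to constrain test workloads to the advertised failure domains, as the second finding suggests, is a required node affinity on the pod template. The sketch below builds that fragment as a plain dict. `node_affinity_for` is a hypothetical helper, and the label and hostnames in the usage line are placeholders.

```python
def node_affinity_for(fd_label, fd_values):
    """Build a pod-spec affinity block restricting scheduling to fd_values.

    Hypothetical helper: field names follow the Kubernetes pod spec.
    """
    if not fd_values:
        raise ValueError("fd_values must contain at least one failure domain")
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": fd_label,
                                "operator": "In",
                                "values": list(fd_values),
                            }
                        ]
                    }
                ]
            }
        }
    }


# Placeholder label and hostnames, for illustration only.
affinity = node_affinity_for("kubernetes.io/hostname", ["worker-0", "worker-1", "worker-2"])
```

The resulting dict would go under `spec.affinity` of the pod or StatefulSet template, so pods cannot land on a node that has no mapped pool.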

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 43c95b40-d8e8-40b0-bd58-82b59171afa6

📥 Commits

Reviewing files that changed from the base of the PR and between ada7824 and 469dc1d.

📒 Files selected for processing (2)
  • ocs_ci/deployment/helpers/external_cluster_helpers.py
  • tests/functional/external_mode/test_topology_awareness.py

- Move restricted-auth side effects before build_exporter_base_params()
  so the builder sees alias_rbd_name and restricted-auth flags
- Assert exact topologyFailureDomainLabel value, not just presence

Signed-off-by: Aviadp <apolak@redhat.com>

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ocs_ci/deployment/helpers/external_cluster_helpers.py (1)

204-209: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Duplicate --restricted-auth-permission and --cluster-name flags when rbd_name contains . or _.

When get_external_cluster_details() processes an rbd_name with . or _, it sets config.ENV_DATA["restricted-auth-permission"] = True at lines 248-252 before calling build_exporter_base_params(). This causes both blocks to execute:

  1. Lines 197-202: Adds --restricted-auth-permission true --cluster-name ...
  2. Lines 204-209: Adds another --restricted-auth-permission true --cluster-name ... --alias-rbd-data-pool-name ...

This results in duplicate command-line flags being passed to the exporter script.

Proposed fix: Only add the alias flag when other flags are already present
         if "." in rbd_name or "_" in rbd_name:
             alias_rbd_name = rbd_name.replace(".", "-").replace("_", "-")
-            params = (
-                f"{params} --restricted-auth-permission true --cluster-name {cluster_name} "
-                f"--alias-rbd-data-pool-name {alias_rbd_name}"
-            )
+            if "--restricted-auth-permission" not in params:
+                params = f"{params} --restricted-auth-permission true --cluster-name {cluster_name}"
+            params = f"{params} --alias-rbd-data-pool-name {alias_rbd_name}"
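
The proposed fix can be exercised as a standalone function to confirm that the flags appear only once. This is a sketch under assumptions: `add_alias_flags` is a hypothetical extraction of the alias branch, and `params` is treated as a plain string.

```python
def add_alias_flags(params, rbd_name, cluster_name):
    """Append the alias flag, adding the restricted-auth flags only if missing."""
    if "." in rbd_name or "_" in rbd_name:
        alias_rbd_name = rbd_name.replace(".", "-").replace("_", "-")
        if "--restricted-auth-permission" not in params:
            params = (
                f"{params} --restricted-auth-permission true "
                f"--cluster-name {cluster_name}"
            )
        params = f"{params} --alias-rbd-data-pool-name {alias_rbd_name}"
    return params.strip()


# Case from the review: the flags were already added by an earlier block.
existing = "--restricted-auth-permission true --cluster-name c1"
result = add_alias_flags(existing, "my_pool", "c1")
print(result.count("--restricted-auth-permission"))
```

A substring check is the smallest change. Collecting the flags in a dict and rendering them once at the end would be a sturdier alternative.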
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ocs_ci/deployment/helpers/external_cluster_helpers.py` around lines 204 -
209, The code is appending duplicate flags when rbd_name contains '.' or '_' in
get_external_cluster_details(): after setting
config.ENV_DATA["restricted-auth-permission"]=True and calling
build_exporter_base_params(), the block that constructs alias_rbd_name appends
"--restricted-auth-permission true --cluster-name ..." again; modify the alias
branch in build_exporter_base_params()/get_external_cluster_details() (where
alias_rbd_name is created) to only append "--alias-rbd-data-pool-name
{alias_rbd_name}" if params already contains the
restricted-auth-permission/cluster-name flags (or omit re-adding those flags
entirely), e.g., detect existing flags in params or rely on the earlier addition
so you only add the alias flag instead of duplicating
"--restricted-auth-permission" and "--cluster-name".
🧹 Nitpick comments (1)
tests/functional/external_mode/test_topology_awareness.py (1)

113-113: 💤 Low value

Consider adding strict=True to zip() for defensive programming.

The Ruff B905 rule checks for zip calls without an explicit strict parameter when called with two or more iterables. By default, if the iterables passed to zip are of different lengths, the resulting iterator will be silently truncated.

While lengths are validated at lines 87-90, adding strict=True provides an extra safety net if this function is ever called from a different path.

Suggested improvement
-    return dict(zip(pool_names, fd_values))
+    return dict(zip(pool_names, fd_values, strict=True))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/functional/external_mode/test_topology_awareness.py` at line 113, The
return currently uses dict(zip(pool_names, fd_values)) which can silently
truncate if lengths differ; update the call to zip to use strict=True (i.e.,
dict(zip(pool_names, fd_values, strict=True))) so mismatched lengths raise
immediately; reference the variables pool_names and fd_values in the same
function where lengths are validated (lines around the existing checks) to
locate and change the zip invocation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ba744750-6ecd-48c3-90d0-b8e09794fac7

📥 Commits

Reviewing files that changed from the base of the PR and between 469dc1d and bb17e34.

📒 Files selected for processing (2)
  • ocs_ci/deployment/helpers/external_cluster_helpers.py
  • tests/functional/external_mode/test_topology_awareness.py
