What is the bug?
When `drainDataNodes` is enabled, nodes can remain permanently in the OpenSearch cluster setting `cluster.routing.allocation.exclude._name` during a rolling restart or upgrade. This blocks further restarts or leaves the cluster in a bad allocation state.
How can one reproduce the bug?
This cannot be reproduced reliably; it happens intermittently when the cluster API is briefly unreachable right after a pod is deleted (e.g. endpoint flakiness, a short network blip). That transient failure (e.g. connection refused) leaves the exclude list stuck.
- rolling restart appears stuck: only the first pod (e.g. master-0) is restarted. The next candidate (e.g. master-1) is never restarted; the operator keeps waiting for node draining to finish and never proceeds.
- cluster allocation unhealthy: `GET _cluster/settings` shows multiple node names in `cluster.routing.allocation.exclude._name` (e.g. both master-0 and master-1) even though one of them has already been restarted and is back in the cluster.
- operator logs: a transient error when trying to remove the exclusion, for example:

```
Could not remove allocation exclusion for node <name>
dial tcp ...:9200: connect: connection refused
```

After that error, the operator does not retry removing that node from the exclude list.
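As a workaround while the exclusion is stuck, the setting can be inspected and cleared manually against the cluster API. This is a hedged sketch using the standard OpenSearch cluster settings API; the host, port, and auth are assumptions to adapt to your deployment:

```shell
# Inspect the current allocation exclusions (flat_settings makes the key easy to grep).
curl -s "http://localhost:9200/_cluster/settings?flat_settings=true" \
  | grep 'exclude._name'

# Clear the stale exclusion by setting it to null. The setting may live under
# "transient" or "persistent" depending on how it was written; clear the one you see.
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'
```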
What is the expected behavior?
- if removing a node from the allocation exclude list fails transiently (e.g. connection refused), the operator should retry until it succeeds, or otherwise ensure that the node is eventually removed from `cluster.routing.allocation.exclude._name`.
- no node should remain in the exclude list after its pod has been restarted or upgraded. Rolling restart and upgrade should complete without leaving nodes permanently excluded.
What is your host/environment?
All versions
Do you have any screenshots?
If applicable, add screenshots to help explain your problem.
Do you have any additional context?
Add any other context about the problem.