What is the bug?
When `drainDataNodes` is enabled, nodes can remain permanently in the OpenSearch cluster setting `cluster.routing.allocation.exclude._name` during a rolling restart or upgrade. This blocks further restarts or leaves the cluster in a bad allocation state.
How can one reproduce the bug?
This cannot be reproduced reliably; it happens intermittently when the cluster API is briefly unreachable right after a pod is deleted (e.g. endpoint flakiness, a short network blip). That transient failure (e.g. connection refused) leaves the exclude list stuck.
- rolling restart appears stuck: only the first pod (e.g. master-0) is restarted. The next candidate (e.g. master-1) is never restarted; the operator keeps waiting for node draining to finish and never proceeds.
- cluster allocation unhealthy: `GET _cluster/settings` shows multiple node names in `cluster.routing.allocation.exclude._name` (e.g. both master-0 and master-1) even though one of them has already been restarted and is back in the cluster.
- operator logs: a transient error when trying to remove the exclusion, for example:

```
Could not remove allocation exclusion for node <name>
dial tcp ...:9200: connect: connection refused
```

After that error, the operator does not retry removing that node from the exclude list.
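As a workaround while the exclusion is stuck, the setting can be inspected and cleared manually against the cluster API. This is a hedged sketch using the standard OpenSearch cluster settings API; the host, port, and auth are assumptions to adapt to your deployment:

```shell
# Inspect the current allocation exclusions (flat_settings makes the key easy to grep).
curl -s "http://localhost:9200/_cluster/settings?flat_settings=true" \
  | grep 'exclude._name'

# Clear the stale exclusion by setting it to null. The setting may live under
# "transient" or "persistent" depending on how it was written; clear the one you see.
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'
```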
What is the expected behavior?
- if removing a node from the allocation exclude list fails transiently (e.g. connection refused), the operator should retry until it succeeds, or otherwise ensure that the node is eventually removed from `cluster.routing.allocation.exclude._name`.
- no node should remain in the exclude list after its pod has been restarted or upgraded. Rolling restart and upgrade should complete without leaving nodes permanently excluded.
What is your host/environment?
All versions
Do you have any screenshots?
If applicable, add screenshots to help explain your problem.
Do you have any additional context?
Add any other context about the problem.