
scale-in --force leaves removed TiKV store as Up/Serving in PD #2705

@lhy1024

Description
  1. What did you do?

Run tiup cluster scale-in --force to scale in a TiKV instance.

tiup cluster scale-in <cluster-name> -N <tikv-host>:<tikv-port> --force

The TiKV process/deployment was removed by TiUP, and tiup cluster display no longer showed this TiKV instance.

After that, I checked PD stores:

pd-ctl store

  2. What did you expect to see?

After TiUP reports the force scale-in as completed and removes the TiKV instance from its topology, PD should not keep reporting that same store as an active serving store.

If PD rejects deleting the store/member, TiUP should not continue to stop/destroy the instance and update the topology. At a minimum, TiUP and PD should not end up in this inconsistent state:

  • TiUP topology: the TiKV instance has been removed.
  • PD metadata: the same TiKV store is still Up / Serving.

  3. What did you see instead?

PD still returned the scaled-in TiKV store, and the store state was still active:

{
  "store": {
    "id": 4,
    "address": "<tikv-host>:<tikv-port>",
    "state_name": "Up"
  }
}

In TiFlash client-c, GetAllStores(exclude_tombstone_stores=true) also still returned this scaled-in TiKV node, because from PD's point of view the store had never been tombstoned or removed.
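The same stale listing is reproducible from Go through the official PD client (a minimal sketch, not TiFlash's actual code; the PD address 127.0.0.1:2379 and store ID 4 are assumptions taken from the output above):

package main

import (
	"context"
	"fmt"
	"log"

	pd "github.com/tikv/pd/client"
)

func main() {
	// PD address is a placeholder for this sketch.
	cli, err := pd.NewClient([]string{"127.0.0.1:2379"}, pd.SecurityOption{})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The same query TiFlash client-c issues via
	// GetAllStores(exclude_tombstone_stores=true): all non-tombstone stores.
	stores, err := cli.GetAllStores(context.Background(), pd.WithExcludeTombstone())
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range stores {
		// The force-removed store (id 4 in the output above) still shows up,
		// because PD never moved it out of the Up state.
		fmt.Printf("store %d at %s: %v\n", s.GetId(), s.GetAddress(), s.GetState())
	}
}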

From PD's logs and code path, the store was never marked Offline. PD rejects a normal delete-store request before changing any store state if the number of remaining Up TiKV stores would fall below max-replicas.

Relevant PD code:

// server/cluster/cluster.go
func (c *RaftCluster) RemoveStore(storeID uint64, physicallyDestroyed bool) error {
    ...
    // For a store that is still Preparing or Serving (and not physically
    // destroyed), run the replica check first; if it fails, RemoveStore
    // returns here without changing any store state.
    if (store.IsPreparing() || store.IsServing()) && !physicallyDestroyed {
        if err := c.checkTikvReplicaBeforeOfflineStore(storeID); err != nil {
            return err
        }
    }

    // Only reached if the check above passes: mark the store Offline.
    c.setStore(store.Clone(core.SetStoreState(metapb.StoreState_Offline, physicallyDestroyed)), ...)
}

And the replica check, which fails when removing the store would leave fewer Up TiKV stores than max-replicas (for example, with three Up stores and max-replicas = 3, removing one would leave 2 < 3, so the request is rejected):

func (c *RaftCluster) checkTikvReplicaBeforeOfflineStore(storeID uint64) error {
    upStores := c.getUpTikvStores()
    // Number of Up TiKV stores left after this one goes Offline.
    expectUpStoresNum := len(upStores) - 1
    if expectUpStoresNum < c.opt.GetMaxReplicas() {
        return errs.ErrStoresNotEnough.FastGenByArgs(...)
    }
    return nil
}

So the observed behavior is:

  • TiUP scale-in --force sends a delete-store/member request to PD.
  • PD rejects the delete-store operation because of the replica-count check.
  • TiUP logs the delete error only as a warning, then continues to stop/destroy the TiKV instance and update the topology.
  • The store remains Up / Serving in PD even though TiUP has already removed the actual TiKV instance.

As a result, clients that rely on PD store metadata receive stale node information.
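One way to avoid this inconsistency is for TiUP to treat a PD delete-store failure as fatal rather than a warning. A minimal sketch of such a guard, assuming PD's standard HTTP endpoint DELETE /pd/api/v1/store/{id}; the helper name deleteStoreOrAbort, the address, and the store ID are hypothetical, not TiUP's actual code:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// deleteStoreOrAbort asks PD to delete a store and returns an error on any
// non-2xx response, so the caller can abort the scale-in instead of
// continuing to stop/destroy the instance. Hypothetical helper.
func deleteStoreOrAbort(pdAddr string, storeID uint64) error {
	url := fmt.Sprintf("http://%s/pd/api/v1/store/%d", pdAddr, storeID)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		body, _ := io.ReadAll(resp.Body)
		// PD's replica-count rejection surfaces here; the scale-in must not
		// proceed to destroy the instance or update the TiUP topology.
		return fmt.Errorf("PD refused to delete store %d: %s: %s",
			storeID, resp.Status, string(body))
	}
	return nil
}

func main() {
	// Placeholder PD address and the store ID from the example above.
	if err := deleteStoreOrAbort("127.0.0.1:2379", 4); err != nil {
		log.Fatalf("aborting scale-in: %v", err)
	}
	log.Println("store deletion accepted by PD; safe to continue scale-in")
}

With a guard like this, a rejected delete-store request leaves both TiUP's topology and PD's metadata untouched, instead of only one of them.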

  4. What version of TiUP are you using (tiup --version)?
<please fill tiup --version output here>

    Labels

    type/bug (categorizes issue as related to a bug)