- What did you do?
Run `tiup cluster scale-in --force` to scale in a TiKV instance:

```bash
tiup cluster scale-in <cluster-name> -N <tikv-host>:<tikv-port> --force
```

The TiKV process/deployment was removed by TiUP, and `tiup cluster display` no longer showed this TiKV instance.
After that, I checked PD stores:
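For example, via pd-ctl or PD's HTTP API (the PD address and version tag are placeholders; store ID 4 matches the output below):

```bash
# Query the store that was just force-scaled-in (ID 4 in this report)
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store 4
# or the same via PD's HTTP API
curl http://<pd-host>:2379/pd/api/v1/store/4
```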
- What did you expect to see?
After TiUP reports the force scale-in as completed and removes the TiKV instance from its topology, PD should not keep reporting that same store as an active serving store.
If PD rejects deleting the store/member, TiUP should not continue to stop/destroy the instance and update topology. At minimum, TiUP and PD should not end up in this inconsistent state:
- TiUP topology: the TiKV instance has been removed.
- PD metadata: the same TiKV store is still `Up` / `Serving`.
- What did you see instead?
PD still returned the scaled-in TiKV store, and the store state was still active:
```json
{
  "store": {
    "id": 4,
    "address": "<tikv-host>:<tikv-port>",
    "state_name": "Up"
  }
}
```
In TiFlash client-c, `GetAllStores(exclude_tombstone_stores=true)` also still returned this scaled-in TiKV node, because from PD's view the store was not tombstone/removed.
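TiFlash client-c issues the same `pdpb.GetAllStores` request. A minimal Go sketch of the equivalent call (using the Go PD client `github.com/tikv/pd/client` as a stand-in; the PD address is a placeholder) shows why the removed store is still listed: the filter only excludes `Tombstone` stores.

```go
package main

import (
	"context"
	"fmt"

	pd "github.com/tikv/pd/client"
)

func main() {
	// Connect to PD (address is a placeholder).
	cli, err := pd.NewClient([]string{"<pd-host>:2379"}, pd.SecurityOption{})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// exclude_tombstone_stores=true only filters Tombstone stores; the
	// force-removed store is still Up, so it is still returned here.
	stores, err := cli.GetAllStores(context.Background(), pd.WithExcludeTombstone())
	if err != nil {
		panic(err)
	}
	for _, s := range stores {
		fmt.Printf("store %d %s state=%s\n", s.GetId(), s.GetAddress(), s.GetState())
	}
}
```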
From the PD logs/code path, the store was never marked offline: the normal PD delete-store request can be rejected before any store-state change if the remaining TiKV count would be less than `max-replicas`.
Relevant PD code:
```go
// server/cluster/cluster.go
func (c *RaftCluster) RemoveStore(storeID uint64, physicallyDestroyed bool) error {
    ...
    if (store.IsPreparing() || store.IsServing()) && !physicallyDestroyed {
        if err := c.checkTikvReplicaBeforeOfflineStore(storeID); err != nil {
            return err
        }
    }
    // only reached if the above check passes
    c.setStore(store.Clone(core.SetStoreState(metapb.StoreState_Offline, physicallyDestroyed)), ...)
}
```
And the replica check:
```go
func (c *RaftCluster) checkTikvReplicaBeforeOfflineStore(storeID uint64) error {
    upStores := c.getUpTikvStores()
    expectUpStoresNum := len(upStores) - 1
    if expectUpStoresNum < c.opt.GetMaxReplicas() {
        return errs.ErrStoresNotEnough.FastGenByArgs(...)
    }
    return nil
}
```
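Concretely, with the default `max-replicas = 3` and exactly three Up TiKV stores, offlining one would leave 2 < 3, so PD rejects the request before any state change. A minimal standalone sketch of that arithmetic (not PD's actual code):

```go
package main

import "fmt"

// checkReplicaBeforeOffline mirrors the check above in isolation:
// the store being offlined stops counting toward the Up total.
func checkReplicaBeforeOffline(upTiKVStores, maxReplicas int) error {
	expectUpStoresNum := upTiKVStores - 1
	if expectUpStoresNum < maxReplicas {
		return fmt.Errorf("not enough stores: %d up after offline < max-replicas %d",
			expectUpStoresNum, maxReplicas)
	}
	return nil
}

func main() {
	fmt.Println(checkReplicaBeforeOffline(3, 3)) // rejected: 2 < 3
	fmt.Println(checkReplicaBeforeOffline(4, 3)) // allowed: 3 >= 3, prints <nil>
}
```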
So the observed behavior is:
- TiUP `scale-in --force` sends a delete-store/member request to PD.
- PD rejects the delete-store operation due to the replica-count check.
- TiUP only logs the delete error as a warning, then continues to stop/destroy the TiKV instance and update topology.
- The store remains `Up` / `Serving` in PD even though the actual TiKV instance has already been removed by TiUP.
This makes clients that rely on PD store metadata receive stale node information.
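A hypothetical sketch of this `--force` flow (illustration only; none of these names come from TiUP's source code): the PD rejection is demoted to a warning, so local teardown and the topology update proceed anyway.

```go
package main

import (
	"fmt"
	"log"
)

// PDClient and all functions below are made-up names for illustration.
type PDClient interface {
	DeleteStore(storeID uint64) error
}

func forceScaleIn(pd PDClient, storeID uint64) {
	if err := pd.DeleteStore(storeID); err != nil {
		// With --force, the rejection (e.g. "not enough up stores for
		// max-replicas") is only logged instead of aborting the scale-in...
		log.Printf("warn: failed to delete store %d from PD: %v", storeID, err)
	}
	// ...so the instance is still stopped/destroyed and the topology
	// rewritten, while PD keeps the store as Up.
	fmt.Printf("stopping/destroying store %d and updating topology\n", storeID)
}

type rejectingPD struct{}

func (rejectingPD) DeleteStore(storeID uint64) error {
	return fmt.Errorf("cannot offline store %d: remaining up stores < max-replicas", storeID)
}

func main() {
	forceScaleIn(rejectingPD{}, 4)
}
```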
- What version of TiUP are you using (`tiup --version`)?
<please fill tiup --version output here>