
scale-in --force leaves removed TiKV store as Up/Serving in PD #2705

@lhy1024

Description
  1. What did you do?

Run tiup cluster scale-in --force to scale in a TiKV instance.

tiup cluster scale-in <cluster-name> -N <tikv-host>:<tikv-port> --force

The TiKV process/deployment was removed by TiUP, and tiup cluster display no longer showed this TiKV instance.

After that, I checked PD stores:

pd-ctl store

  2. What did you expect to see?

After TiUP reports the force scale-in as completed and removes the TiKV instance from its topology, PD should not keep reporting that same store as an active serving store.

If PD rejects deleting the store/member, TiUP should not continue to stop/destroy the instance and update the topology. At a minimum, TiUP and PD should not end up in this inconsistent state:

  • TiUP topology: the TiKV instance has been removed.
  • PD metadata: the same TiKV store is still Up / Serving.

  3. What did you see instead?

PD still returned the scaled-in TiKV store, and the store state was still active:

{
  "store": {
    "id": 4,
    "address": "<tikv-host>:<tikv-port>",
    "state_name": "Up"
  }
}

In TiFlash client-c, GetAllStores(exclude_tombstone_stores=true) also still returned this scaled-in TiKV node, because from PD's point of view the store had never been tombstoned or removed.
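The same stale listing is reproducible from Go through the official PD client (a minimal sketch, not TiFlash's actual code; the PD address 127.0.0.1:2379 and store ID 4 are assumptions taken from the output above):

package main

import (
	"context"
	"fmt"
	"log"

	pd "github.com/tikv/pd/client"
)

func main() {
	// PD address is a placeholder for this sketch.
	cli, err := pd.NewClient([]string{"127.0.0.1:2379"}, pd.SecurityOption{})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The same query TiFlash client-c issues via
	// GetAllStores(exclude_tombstone_stores=true): all non-tombstone stores.
	stores, err := cli.GetAllStores(context.Background(), pd.WithExcludeTombstone())
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range stores {
		// The force-removed store (id 4 in the output above) still shows up,
		// because PD never moved it out of the Up state.
		fmt.Printf("store %d at %s: %v\n", s.GetId(), s.GetAddress(), s.GetState())
	}
}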

From PD's logs and code path, the store was never marked Offline. PD rejects a normal delete-store request before changing any store state if the number of remaining Up TiKV stores would fall below max-replicas.

Relevant PD code:

// server/cluster/cluster.go
func (c *RaftCluster) RemoveStore(storeID uint64, physicallyDestroyed bool) error {
    ...
    // For a store that is still Preparing or Serving (and not physically
    // destroyed), run the replica check first; if it fails, RemoveStore
    // returns here without changing any store state.
    if (store.IsPreparing() || store.IsServing()) && !physicallyDestroyed {
        if err := c.checkTikvReplicaBeforeOfflineStore(storeID); err != nil {
            return err
        }
    }

    // Only reached if the check above passes: mark the store Offline.
    c.setStore(store.Clone(core.SetStoreState(metapb.StoreState_Offline, physicallyDestroyed)), ...)
}

And the replica check, which fails when removing the store would leave fewer Up TiKV stores than max-replicas (for example, with three Up stores and max-replicas = 3, removing one would leave 2 < 3, so the request is rejected):

func (c *RaftCluster) checkTikvReplicaBeforeOfflineStore(storeID uint64) error {
    upStores := c.getUpTikvStores()
    // Number of Up TiKV stores left after this one goes Offline.
    expectUpStoresNum := len(upStores) - 1
    if expectUpStoresNum < c.opt.GetMaxReplicas() {
        return errs.ErrStoresNotEnough.FastGenByArgs(...)
    }
    return nil
}

So the observed behavior is:

  • TiUP scale-in --force sends a delete-store/member request to PD.
  • PD rejects the delete-store operation because of the replica-count check.
  • TiUP logs the delete error only as a warning, then continues to stop/destroy the TiKV instance and update the topology.
  • The store remains Up / Serving in PD even though TiUP has already removed the actual TiKV instance.

As a result, clients that rely on PD store metadata receive stale node information.
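One way to avoid this inconsistency is for TiUP to treat a PD delete-store failure as fatal rather than a warning. A minimal sketch of such a guard, assuming PD's standard HTTP endpoint DELETE /pd/api/v1/store/{id}; the helper name deleteStoreOrAbort, the address, and the store ID are hypothetical, not TiUP's actual code:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// deleteStoreOrAbort asks PD to delete a store and returns an error on any
// non-2xx response, so the caller can abort the scale-in instead of
// continuing to stop/destroy the instance. Hypothetical helper.
func deleteStoreOrAbort(pdAddr string, storeID uint64) error {
	url := fmt.Sprintf("http://%s/pd/api/v1/store/%d", pdAddr, storeID)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		body, _ := io.ReadAll(resp.Body)
		// PD's replica-count rejection surfaces here; the scale-in must not
		// proceed to destroy the instance or update the TiUP topology.
		return fmt.Errorf("PD refused to delete store %d: %s: %s",
			storeID, resp.Status, string(body))
	}
	return nil
}

func main() {
	// Placeholder PD address and the store ID from the example above.
	if err := deleteStoreOrAbort("127.0.0.1:2379", 4); err != nil {
		log.Fatalf("aborting scale-in: %v", err)
	}
	log.Println("store deletion accepted by PD; safe to continue scale-in")
}

With a guard like this, a rejected delete-store request leaves both TiUP's topology and PD's metadata untouched, instead of only one of them.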

  4. What version of TiUP are you using (tiup --version)?
<please fill tiup --version output here>

    Labels

    type/bug (categorizes issue as related to a bug)