gameserver-controller: OpsState spec change not reliably synced to Pod label due to reconcile trigger gap #321

@hxzhouh

Summary

We observed a production issue where changes to GameServer.spec.opsState (patched by an external allocator) are not reliably synced to Pod labels by SyncGsToPod(). As a result, game servers read stale OpsState values through the Downward API.

Environment

  • OKG version: v1.0.0 (registry-cn-shenzhen.ack.aliyuncs.com/acs/kruise-game-manager:v1.0.0)
  • Kubernetes: Alibaba Cloud ACK (managed)
  • --sync-period: not configured (default = 10 hours)
  • --gameserver-workers: 100

Observed Behavior

After an external Director (OpenMatch) patches spec.opsState=Allocated and adds the gs-sync/match-id annotation on a GameServer, the Pod label game.kruise.io/gs-opsState sometimes does not update for an extended period. The game server process reads the Downward API file (/etc/podinfo/gs-opsstate), sees the old value (e.g. None or Kill), and therefore never starts the game session.

We deployed a monitoring script to continuously compare spec.opsState and pod.label[game.kruise.io/gs-opsState], and observed frequent SYNC GAPs across all OpsState transitions:

[ALERT] SYNC GAP: gs-exploding-kittens-13 | spec.opsState=Allocated | pod.label=Kill      | state=Ready
[ALERT] SYNC GAP: gs-exploding-kittens-16 | spec.opsState=None      | pod.label=Allocated | state=Ready
[ALERT] SYNC GAP: gs-exploding-kittens-3  | spec.opsState=Kill      | pod.label=Allocated | state=Deleting
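
For reference, the comparison the monitoring script performs can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function name syncGap and the "<missing>" placeholder are assumptions, and the inputs are passed in directly rather than fetched from the cluster. Only the label key and the alert format come from the report above.

```go
package main

import "fmt"

// opsStateLabelKey is the Pod label named in this issue.
const opsStateLabelKey = "game.kruise.io/gs-opsState"

// syncGap (hypothetical helper) compares the GameServer's spec.opsState with
// the Pod label. It returns an alert line in the same shape as the logs above
// and true when the two values differ, or "" and false when they match.
func syncGap(name, specOpsState string, podLabels map[string]string, state string) (string, bool) {
	label, ok := podLabels[opsStateLabelKey]
	if !ok {
		// Assumption: a missing label is reported as a gap with a placeholder value.
		label = "<missing>"
	}
	if label == specOpsState {
		return "", false
	}
	line := fmt.Sprintf("[ALERT] SYNC GAP: %s | spec.opsState=%s | pod.label=%s | state=%s",
		name, specOpsState, label, state)
	return line, true
}

func main() {
	// Values taken from the first alert line above.
	labels := map[string]string{opsStateLabelKey: "Kill"}
	if line, gap := syncGap("gs-exploding-kittens-13", "Allocated", labels, "Ready"); gap {
		fmt.Println(line)
	}
}
```

In a real monitor the two inputs would come from the GameServer object and its Pod, polled on a short interval.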

Root Cause Analysis

1. Reconcile is primarily driven by Node Conditions events

From gameserver_controller.go, the controller registers three watch sources:

// GS Watch
c.Watch(source.Kind(mgr.GetCache(), &GameServer{},
    &handler.TypedEnqueueRequestForObject[*GameServer]{}))

// Pod Watch
watchPod(mgr, c)

// Node Watch — enqueues ALL pods on a node when Node Conditions change
watchNode(mgr, c)

While GS Watch is registered, we found from controller logs that Watch Node Conditions Changed is the dominant reconcile trigger in practice.

2. Node Conditions trigger gaps during peak hours

We analyzed 13 hours of OKG controller logs (2026-04-07 22:00 ~ 2026-04-08 11:00):

Metric                                                   | Value
Node Conditions Changed triggers                         | 55,579
OpsState successfully synced (OpsState turn from X to Y) | 7,411
SyncGsToPod patch failures                               | 0

SyncGsToPod itself never failed in this window: every reconcile that ran succeeded. The problem is how often reconcile is triggered.

Maximum gap between consecutive Node Conditions triggers (during peak hours):

Gap    | Time
139.6s | 07:05
139.6s | 07:15
139.6s | 07:00
138.7s | 06:55
135.8s | 09:35

During peak hours (midnight ~ morning), gaps between reconcile triggers reach about 2 minutes 20 seconds (139.6s). For short game sessions (20~30s), this means the Pod label may never be updated before the session times out.
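
The gap figures above are the differences between consecutive trigger timestamps. A minimal sketch of that calculation, assuming the trigger timestamps have already been extracted from the controller logs as RFC 3339 strings (the function name maxTriggerGap and the sample timestamps are illustrative, not taken from the real logs):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// maxTriggerGap (hypothetical helper) parses RFC 3339 timestamps, sorts them,
// and returns the longest interval between two consecutive triggers.
// With fewer than two timestamps it returns 0.
func maxTriggerGap(stamps []string) (time.Duration, error) {
	times := make([]time.Time, 0, len(stamps))
	for _, s := range stamps {
		t, err := time.Parse(time.RFC3339, s)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", s, err)
		}
		times = append(times, t)
	}
	// Log lines may arrive out of order, so sort before diffing.
	sort.Slice(times, func(i, j int) bool { return times[i].Before(times[j]) })

	var longest time.Duration
	for i := 1; i < len(times); i++ {
		if d := times[i].Sub(times[i-1]); d > longest {
			longest = d
		}
	}
	return longest, nil
}

func main() {
	// Synthetic sample input: gaps of 10s and 2m20s.
	gap, err := maxTriggerGap([]string{
		"2000-01-01T00:00:00Z",
		"2000-01-01T00:00:10Z",
		"2000-01-01T00:02:30Z",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(gap) // prints 2m20s
}
```

Any gap longer than the session timeout (20~30s here) is a window in which an allocation can be missed entirely.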

3. --sync-period default is 10 hours

The default SyncPeriod in controller-runtime is 10 hours, meaning the informer cache will only do a full resync every 10 hours as a fallback. For real-time game allocation, this provides no practical safety net.

Impact

  • gs-sync/match-id annotations leak permanently when the game server never starts the session (because it reads a stale OpsState from the Downward API)
  • GameServers become permanently unallocatable
  • On 2026-04-03: 15 of 19 Poolball GameServers (79%) were stuck, causing severe match allocation degradation

Workaround

Setting --sync-period=30s forces a periodic full reconcile every 30 seconds, reducing the maximum sync delay from 10 hours to 30 seconds. Verified working on staging.

Questions / Feature Request

  1. Why does GS Watch not reliably trigger reconcile after a spec patch? Is there intentional filtering or debouncing that prevents it from firing on every spec change?

  2. Should the recommended default for --sync-period be documented for production deployments, especially for short-session game workloads?

  3. Could the controller proactively reconcile GS after spec changes, rather than relying solely on Node Conditions events as the primary trigger?
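
To make question 3 concrete, here is a rough sketch of the kind of update check we have in mind. It is not OKG code: gameServer is a hypothetical, stripped-down stand-in for the real GameServer type, and shouldEnqueueOnUpdate is an illustrative name for logic that, in the real controller, would presumably live in the GS Watch event handler or a predicate.

```go
package main

import "fmt"

// gameServer is a hypothetical stand-in that keeps only the field
// relevant to this issue; it is not the real GameServer API type.
type gameServer struct {
	Name     string
	OpsState string // mirrors spec.opsState
}

// shouldEnqueueOnUpdate (hypothetical) returns true when an update event
// changed spec.opsState, i.e. when the Pod label needs to be re-synced.
func shouldEnqueueOnUpdate(oldGS, newGS gameServer) bool {
	return oldGS.OpsState != newGS.OpsState
}

func main() {
	before := gameServer{Name: "gs-exploding-kittens-13", OpsState: "None"}
	after := gameServer{Name: "gs-exploding-kittens-13", OpsState: "Allocated"}
	fmt.Println(shouldEnqueueOnUpdate(before, after)) // prints true
}
```

If the GS Watch already behaves this way, the open question is why the resulting reconcile does not show up in the logs soon after the patch.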

References

  • Official demo allocator: kruise-game-open-match-director (does not handle sync gap either)
  • SyncGsToPod source: gameserver_manager.go:105
  • reconcile trigger source: gameserver_controller.go:83-107
