Summary
We observed a production issue where GameServer.spec.opsState changes (patched by an external allocator) are not reliably synced to Pod labels via SyncGsToPod(), causing game servers to read stale OpsState values through the Downward API.
Environment
- OKG version: v1.0.0 (registry-cn-shenzhen.ack.aliyuncs.com/acs/kruise-game-manager:v1.0.0)
- Kubernetes: Alibaba Cloud ACK (managed)
- --sync-period: not configured (default = 10 hours)
- --gameserver-workers: 100
Observed Behavior
After an external Director (OpenMatch) patches spec.opsState=Allocated and sets the gs-sync/match-id annotation on a GameServer, the Pod label game.kruise.io/gs-opsState sometimes does not update for an extended period. The game server process reads the Downward API file (/etc/podinfo/gs-opsstate) and sees the old value (e.g. None or Kill), so it never starts the game session.
We deployed a monitoring script to continuously compare spec.opsState and pod.label[game.kruise.io/gs-opsState], and observed frequent SYNC GAPs across all OpsState transitions:
[ALERT] SYNC GAP: gs-exploding-kittens-13 | spec.opsState=Allocated | pod.label=Kill | state=Ready
[ALERT] SYNC GAP: gs-exploding-kittens-16 | spec.opsState=None | pod.label=Allocated | state=Ready
[ALERT] SYNC GAP: gs-exploding-kittens-3 | spec.opsState=Kill | pod.label=Allocated | state=Deleting
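The check at the heart of that monitoring script reduces to a string comparison between the two fields. A minimal sketch (field and label semantics as in the alerts above; the function name is ours):

```go
package main

import "fmt"

// syncGapAlert compares GameServer.spec.opsState against the pod label
// game.kruise.io/gs-opsState and returns a formatted alert line when they
// diverge, or "" when they are in sync.
func syncGapAlert(gsName, specOpsState, podLabelOpsState, gsState string) string {
	if specOpsState == podLabelOpsState {
		return ""
	}
	return fmt.Sprintf("[ALERT] SYNC GAP: %s | spec.opsState=%s | pod.label=%s | state=%s",
		gsName, specOpsState, podLabelOpsState, gsState)
}

func main() {
	fmt.Println(syncGapAlert("gs-exploding-kittens-13", "Allocated", "Kill", "Ready"))
}
```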
Root Cause Analysis
1. Reconcile is primarily driven by Node Conditions events
From gameserver_controller.go, the controller registers three watch sources:
// GS Watch
c.Watch(source.Kind(mgr.GetCache(), &GameServer{},
&handler.TypedEnqueueRequestForObject[*GameServer]{}))
// Pod Watch
watchPod(mgr, c)
// Node Watch — enqueues ALL pods on a node when Node Conditions change
watchNode(mgr, c)
While GS Watch is registered, we found from controller logs that Watch Node Conditions Changed is the dominant reconcile trigger in practice.
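If event filtering turns out to be the cause, the fix would live in an update-event predicate. The sketch below is ours, not OKG code: GameServer is a minimal stand-in for the real API type, and shouldEnqueue models the decision a controller-runtime update predicate would make if spec.opsState changes were enqueued directly.

```go
package main

import "fmt"

// GameServer is a minimal stand-in for the real API type; only the field
// relevant to this issue is modeled.
type GameServer struct {
	Name     string
	OpsState string // spec.opsState
}

// shouldEnqueue models an update-event predicate that enqueues a reconcile
// whenever spec.opsState differs between the old and new object, so an
// allocator patch is never dropped.
func shouldEnqueue(oldObj, newObj GameServer) bool {
	return oldObj.OpsState != newObj.OpsState
}

func main() {
	before := GameServer{Name: "gs-exploding-kittens-13", OpsState: "None"}
	after := before
	after.OpsState = "Allocated" // what the external allocator patches
	fmt.Println(shouldEnqueue(before, after))
}
```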
2. Node Conditions trigger gaps during peak hours
We analyzed 13 hours of OKG controller logs (2026-04-07 22:00 ~ 2026-04-08 11:00):
| Metric | Value |
| --- | --- |
| Node Conditions Changed triggers | 55,579 |
| OpsState successfully synced (OpsState turn from X to Y) | 7,411 |
| SyncGsToPod patch failures | 0 |
SyncGsToPod itself never fails — every reconcile execution succeeds. The problem is reconcile trigger frequency.
Maximum gap between consecutive Node Conditions triggers (during peak hours):
| Gap | Time |
| --- | --- |
| 139.6s | 07:05 |
| 139.6s | 07:15 |
| 139.6s | 07:00 |
| 138.7s | 06:55 |
| 135.8s | 09:35 |
Peak hours (midnight ~ morning) see reconcile gaps up to 2 minutes 20 seconds. For short game sessions (20~30s), this means the pod label may never be updated before the session times out.
3. --sync-period default is 10 hours
The default SyncPeriod in controller-runtime is 10 hours, meaning the informer cache will only do a full resync every 10 hours as a fallback. For real-time game allocation, this provides no practical safety net.
Impact
- gs-sync/match-id annotations leak permanently when the game server never starts (because it reads stale OpsState from the Downward API)
- GameServers become permanently unallocatable
- On 2026-04-03: 15 of 19 Poolball GameServers (79%) were stuck, causing severe match allocation degradation
Workaround
Setting --sync-period=30s forces a periodic full reconcile every 30 seconds, reducing the maximum sync delay from 10 hours to 30 seconds. Verified working on staging.
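The change itself is one extra flag on the controller-manager Deployment. A sketch of the args patch; the namespace, Deployment, and container names below follow a default kruise-game install and may differ in your environment:

```yaml
# kubectl -n kruise-game-system edit deployment kruise-game-controller-manager
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --sync-period=30s       # was unset (default = 10 hours)
            - --gameserver-workers=100
```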
Questions / Feature Request
- Why does GS Watch not reliably trigger reconcile after a spec patch? Is there intentional filtering or debouncing that prevents it from firing on every spec change?
- Should the recommended default for --sync-period be documented for production deployments, especially for short-session game workloads?
- Could the controller proactively reconcile GS after spec changes, rather than relying solely on Node Conditions events as the primary trigger?
References
- Official demo allocator: kruise-game-open-match-director (does not handle the sync gap either)
- SyncGsToPod source: gameserver_manager.go:105
- Reconcile trigger source: gameserver_controller.go:83-107