Summary
We observed a production issue where GameServer.spec.opsState changes (patched by an external allocator) are not reliably synced to Pod labels via SyncGsToPod(), causing game servers to read stale OpsState values through the Downward API.
Environment
- OKG version: v1.0.0 (registry-cn-shenzhen.ack.aliyuncs.com/acs/kruise-game-manager:v1.0.0)
- Kubernetes: Alibaba Cloud ACK (managed)
- --sync-period: not configured (default = 10 hours)
- --gameserver-workers: 100
Observed Behavior
After an external Director (OpenMatch) patches spec.opsState=Allocated and sets the gs-sync/match-id annotation on a GameServer, the Pod label game.kruise.io/gs-opsState sometimes does not update for an extended period. The game server process reads the Downward API file (/etc/podinfo/gs-opsstate) and sees the old value (e.g. None or Kill), so it never starts the game session.
We deployed a monitoring script to continuously compare spec.opsState and pod.label[game.kruise.io/gs-opsState], and observed frequent SYNC GAPs across all OpsState transitions:
[ALERT] SYNC GAP: gs-exploding-kittens-13 | spec.opsState=Allocated | pod.label=Kill | state=Ready
[ALERT] SYNC GAP: gs-exploding-kittens-16 | spec.opsState=None | pod.label=Allocated | state=Ready
[ALERT] SYNC GAP: gs-exploding-kittens-3 | spec.opsState=Kill | pod.label=Allocated | state=Deleting
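The check at the heart of that monitoring script reduces to a string comparison between the two fields. A minimal sketch (field and label semantics as in the alerts above; the function name is ours):

```go
package main

import "fmt"

// syncGapAlert compares GameServer.spec.opsState against the pod label
// game.kruise.io/gs-opsState and returns a formatted alert line when they
// diverge, or "" when they are in sync.
func syncGapAlert(gsName, specOpsState, podLabelOpsState, gsState string) string {
	if specOpsState == podLabelOpsState {
		return ""
	}
	return fmt.Sprintf("[ALERT] SYNC GAP: %s | spec.opsState=%s | pod.label=%s | state=%s",
		gsName, specOpsState, podLabelOpsState, gsState)
}

func main() {
	fmt.Println(syncGapAlert("gs-exploding-kittens-13", "Allocated", "Kill", "Ready"))
}
```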
Root Cause Analysis
1. Reconcile is primarily driven by Node Conditions events
From gameserver_controller.go, the controller registers three watch sources:
// GS Watch
c.Watch(source.Kind(mgr.GetCache(), &GameServer{},
&handler.TypedEnqueueRequestForObject[*GameServer]{}))
// Pod Watch
watchPod(mgr, c)
// Node Watch — enqueues ALL pods on a node when Node Conditions change
watchNode(mgr, c)
While GS Watch is registered, we found from controller logs that Watch Node Conditions Changed is the dominant reconcile trigger in practice.
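If event filtering turns out to be the cause, the fix would live in an update-event predicate. The sketch below is ours, not OKG code: GameServer is a minimal stand-in for the real API type, and shouldEnqueue models the decision a controller-runtime update predicate would make if spec.opsState changes were enqueued directly.

```go
package main

import "fmt"

// GameServer is a minimal stand-in for the real API type; only the field
// relevant to this issue is modeled.
type GameServer struct {
	Name     string
	OpsState string // spec.opsState
}

// shouldEnqueue models an update-event predicate that enqueues a reconcile
// whenever spec.opsState differs between the old and new object, so an
// allocator patch is never dropped.
func shouldEnqueue(oldObj, newObj GameServer) bool {
	return oldObj.OpsState != newObj.OpsState
}

func main() {
	before := GameServer{Name: "gs-exploding-kittens-13", OpsState: "None"}
	after := before
	after.OpsState = "Allocated" // what the external allocator patches
	fmt.Println(shouldEnqueue(before, after))
}
```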
2. Node Conditions trigger gaps during peak hours
We analyzed 13 hours of OKG controller logs (2026-04-07 22:00 ~ 2026-04-08 11:00):
| Metric | Value |
| --- | --- |
| Node Conditions Changed triggers | 55,579 |
| OpsState successfully synced (OpsState turn from X to Y) | 7,411 |
| SyncGsToPod patch failures | 0 |
SyncGsToPod itself never fails — every reconcile execution succeeds. The problem is reconcile trigger frequency.
Maximum gap between consecutive Node Conditions triggers (during peak hours):
| Gap | Time |
| --- | --- |
| 139.6s | 07:05 |
| 139.6s | 07:15 |
| 139.6s | 07:00 |
| 138.7s | 06:55 |
| 135.8s | 09:35 |
Peak hours (midnight ~ morning) see reconcile gaps up to 2 minutes 20 seconds. For short game sessions (20~30s), this means the pod label may never be updated before the session times out.
3. --sync-period default is 10 hours
The default SyncPeriod in controller-runtime is 10 hours, meaning the informer cache will only do a full resync every 10 hours as a fallback. For real-time game allocation, this provides no practical safety net.
Impact
- gs-sync/match-id annotations leak permanently when the game server never starts (because it reads stale OpsState from the Downward API)
- GameServers become permanently unallocatable
- On 2026-04-03: 15 of 19 Poolball GameServers (79%) were stuck, causing severe match allocation degradation
Workaround
Setting --sync-period=30s forces a periodic full reconcile every 30 seconds, reducing the maximum sync delay from 10 hours to 30 seconds. Verified working on staging.
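The change itself is one extra flag on the controller-manager Deployment. A sketch of the args patch; the namespace, Deployment, and container names below follow a default kruise-game install and may differ in your environment:

```yaml
# kubectl -n kruise-game-system edit deployment kruise-game-controller-manager
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --sync-period=30s       # was unset (default = 10 hours)
            - --gameserver-workers=100
```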
Questions / Feature Request
- Why does GS Watch not reliably trigger reconcile after a spec patch? Is there intentional filtering or debouncing that prevents it from firing on every spec change?
- Should the recommended default for --sync-period be documented for production deployments, especially for short-session game workloads?
- Could the controller proactively reconcile GS after spec changes, rather than relying solely on Node Conditions events as the primary trigger?
References
- Official demo allocator: kruise-game-open-match-director (does not handle the sync gap either)
- SyncGsToPod source: gameserver_manager.go:105
- Reconcile trigger source: gameserver_controller.go:83-107