Context
MCP Runtime already has a manual mcp-runtime cluster doctor command for point-in-time diagnostics, and the operator already reconciles MCPServer resources into Deployments, Services, Ingresses, policy ConfigMaps, and status conditions.
The missing piece is an always-on cluster listener that detects unhealthy platform/runtime states and surfaces them to Sentinel/UI without requiring a user to run cluster doctor manually.
Proposal
Add a sentinel-agent / cluster-watcher component that continuously observes cluster health and emits structured incidents. It should reuse existing cluster doctor checks where possible instead of duplicating diagnostic logic.
Suggested shape:
- Extract reusable checks into a shared package, for example pkg/diagnostics or internal/healthchecks.
- Keep cluster doctor as the human-facing CLI formatter for those checks.
- Add a watcher service/controller that runs checks periodically and/or watches Kubernetes events.
- Emit findings to Sentinel API, Kubernetes Events, and/or metrics so the UI can show active incidents.
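One possible shape for the shared package, as a minimal sketch only. The names Finding, Check, RunAll, and Watch are hypothetical and do not exist in the codebase today, and the stub checks stand in for real cluster queries. The point is that cluster doctor and the watcher call the same checks and differ only in what they do with the findings.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Finding is a hypothetical result type shared by the CLI and the watcher.
type Finding struct {
	Check   string
	Healthy bool
	Message string
}

// Check is a hypothetical interface each reusable diagnostic would implement.
type Check interface {
	Name() string
	Run(ctx context.Context) Finding
}

// funcCheck adapts a plain function to the Check interface.
type funcCheck struct {
	name string
	fn   func(ctx context.Context) (healthy bool, message string)
}

func (c funcCheck) Name() string { return c.name }

func (c funcCheck) Run(ctx context.Context) Finding {
	healthy, msg := c.fn(ctx)
	return Finding{Check: c.name, Healthy: healthy, Message: msg}
}

// RunAll runs every check once, in order. cluster doctor would format the result.
func RunAll(ctx context.Context, checks []Check) []Finding {
	findings := make([]Finding, 0, len(checks))
	for _, c := range checks {
		findings = append(findings, c.Run(ctx))
	}
	return findings
}

// Watch runs the checks immediately and then on every tick, passing
// unhealthy findings to emit, until ctx is cancelled.
func Watch(ctx context.Context, interval time.Duration, checks []Check, emit func(Finding)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		for _, f := range RunAll(ctx, checks) {
			if !f.Healthy {
				emit(f)
			}
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// demoChecks returns stub checks standing in for real cluster queries.
func demoChecks() []Check {
	return []Check{
		funcCheck{"deployment-available", func(context.Context) (bool, string) {
			return true, "all replicas available"
		}},
		funcCheck{"service-endpoints", func(context.Context) (bool, string) {
			return false, "service has no endpoints"
		}},
	}
}

// demoWatchOnce runs a single watcher pass and returns what was emitted.
func demoWatchOnce() []Finding {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // already cancelled: Watch does one pass, then returns
	var emitted []Finding
	Watch(ctx, time.Hour, demoChecks(), func(f Finding) { emitted = append(emitted, f) })
	return emitted
}

func main() {
	// CLI path: print every finding, healthy or not.
	for _, f := range RunAll(context.Background(), demoChecks()) {
		fmt.Printf("%-22s healthy=%v  %s\n", f.Check, f.Healthy, f.Message)
	}
	// Watcher path: only unhealthy findings become incidents.
	for _, f := range demoWatchOnce() {
		fmt.Printf("incident: %s: %s\n", f.Check, f.Message)
	}
}
```

A real watcher would likely combine the periodic loop with Kubernetes watches or informers; the ticker here only illustrates the "runs checks periodically" option.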
Initial Signals
The first version should detect and report:
- MCPServer stuck in Error, Pending, or long-running PartiallyReady.
- Pods in CrashLoopBackOff, ImagePullBackOff, ErrImagePull, or OOMKilled.
- Deployments with unavailable replicas.
- Services with no endpoints.
- Ingress not admitted or missing load balancer status when strict readiness is expected.
- Certificate readiness failures.
- Registry pull/routing failures where detectable.
- Gateway or policy materialization issues.
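As an illustration of the pod signal in the list above, here is a rough sketch of how a shared check might classify container status. ContainerStatus below is a simplified, hypothetical stand-in for the Kubernetes container status fields, not the real client-go type. In Kubernetes, CrashLoopBackOff, ImagePullBackOff, and ErrImagePull appear as waiting reasons, while OOMKilled appears as a terminated reason, so the two need to be read from different fields.

```go
package main

import "fmt"

// ContainerStatus is a simplified, hypothetical stand-in for the fields a
// check would read from a pod's container statuses.
type ContainerStatus struct {
	Name             string
	WaitingReason    string // from state.waiting.reason
	TerminatedReason string // from state.terminated.reason or lastState.terminated.reason
}

// badWaitingReasons lists the waiting reasons named in the initial signals.
var badWaitingReasons = map[string]bool{
	"CrashLoopBackOff": true,
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
}

// UnhealthyReason returns the reason to report for a container, if any.
// OOMKilled is checked first because a container that keeps getting
// OOM-killed usually also shows CrashLoopBackOff, and OOMKilled is the
// more specific cause.
func UnhealthyReason(cs ContainerStatus) (string, bool) {
	if cs.TerminatedReason == "OOMKilled" {
		return "OOMKilled", true
	}
	if badWaitingReasons[cs.WaitingReason] {
		return cs.WaitingReason, true
	}
	return "", false
}

func main() {
	statuses := []ContainerStatus{
		{Name: "server", WaitingReason: "ImagePullBackOff"},
		{Name: "sidecar", WaitingReason: "CrashLoopBackOff", TerminatedReason: "OOMKilled"},
		{Name: "init", WaitingReason: "ContainerCreating"},
	}
	for _, cs := range statuses {
		if reason, bad := UnhealthyReason(cs); bad {
			fmt.Printf("container %s unhealthy: %s\n", cs.Name, reason)
		}
	}
}
```

Whether OOMKilled should take precedence over CrashLoopBackOff is a judgment call; the ordering here is only one option.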
Non-goals for v1
- Do not add broad automatic remediation in the first version.
- Do not duplicate all cluster doctor logic in a second implementation.
- Do not turn the existing MCPServer reconciler into a general cluster incident system.
Acceptance Criteria
- cluster doctor and the watcher share core diagnostic/check logic where practical.
- A deployed watcher can report structured incidents for at least MCPServer readiness, pod image pull failures, and unavailable deployments.
- Incidents include namespace, resource kind/name, severity, reason, human-readable message, and suggested remediation.
- Watcher failures do not block normal MCPServer reconciliation.
- Tests cover the shared check logic and at least one watcher emission path.
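To make the incident fields in the criteria above concrete, a structured incident could look roughly like this. This is a sketch under assumptions: the Incident type, its JSON field names, the severity value, and the example resource are all hypothetical, and the actual Sentinel API schema may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Incident is a hypothetical payload carrying the fields listed in the
// acceptance criteria. JSON field names are assumptions.
type Incident struct {
	Namespace   string `json:"namespace"`
	Kind        string `json:"kind"`
	Name        string `json:"name"`
	Severity    string `json:"severity"`
	Reason      string `json:"reason"`
	Message     string `json:"message"`
	Remediation string `json:"remediation"`
}

// Validate rejects incidents that are missing any required field, so that
// incomplete findings are caught before they are emitted.
func (i Incident) Validate() error {
	fields := []struct{ name, value string }{
		{"namespace", i.Namespace},
		{"kind", i.Kind},
		{"name", i.Name},
		{"severity", i.Severity},
		{"reason", i.Reason},
		{"message", i.Message},
		{"remediation", i.Remediation},
	}
	for _, f := range fields {
		if f.value == "" {
			return errors.New("incident missing field: " + f.name)
		}
	}
	return nil
}

func main() {
	inc := Incident{
		Namespace:   "mcp-servers", // hypothetical namespace
		Kind:        "Pod",
		Name:        "example-server-abc123", // hypothetical pod name
		Severity:    "warning",               // hypothetical severity level
		Reason:      "ImagePullBackOff",
		Message:     "container image could not be pulled",
		Remediation: "check the image reference and registry credentials",
	}
	if err := inc.Validate(); err != nil {
		fmt.Println("invalid incident:", err)
		return
	}
	out, err := json.MarshalIndent(inc, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

A table-driven test over Validate and the emitted JSON would be one way to cover a watcher emission path.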