
Add continuous cluster watcher for runtime health incidents #106

@Agent-Hellboy


Context

MCP Runtime already has a manual mcp-runtime cluster doctor command for point-in-time diagnostics, and the operator already reconciles MCPServer resources into Deployments, Services, Ingresses, policy ConfigMaps, and status conditions.

The missing piece is an always-on cluster listener that detects unhealthy platform/runtime states and surfaces them to Sentinel/UI without requiring a user to run cluster doctor manually.

Proposal

Add a sentinel-agent / cluster-watcher component that continuously observes cluster health and emits structured incidents. It should reuse existing cluster doctor checks where possible instead of duplicating diagnostic logic.

Suggested shape:

  • Extract reusable checks into a shared package, for example pkg/diagnostics or internal/healthchecks.
  • Keep cluster doctor as the human-facing CLI formatter for those checks.
  • Add a watcher service/controller that runs checks periodically and/or watches Kubernetes events.
  • Emit findings to Sentinel API, Kubernetes Events, and/or metrics so the UI can show active incidents.

Initial Signals

The first version should detect and report:

  • MCPServer resources stuck in Error or Pending, or remaining in PartiallyReady for an extended period.
  • Pods in CrashLoopBackOff, ImagePullBackOff, ErrImagePull, or OOMKilled.
  • Deployments with unavailable replicas.
  • Services with no endpoints.
  • Ingress not admitted or missing load balancer status when strict readiness is expected.
  • Certificate readiness failures.
  • Registry pull/routing failures where detectable.
  • Gateway or policy materialization issues.
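As an illustration of the pod-level signals above, the following sketch classifies a container by its reported reasons. In Kubernetes, CrashLoopBackOff, ImagePullBackOff, and ErrImagePull surface as a container's waiting reason, while OOMKilled surfaces as a terminated reason (current or last state). The ContainerStatus struct here is a simplified, hypothetical stand-in for the real API type, and UnhealthyReason is an assumed helper name.

```go
package main

import "fmt"

// ContainerStatus is a simplified, hypothetical stand-in for the subset of
// Kubernetes container status fields this check needs.
type ContainerStatus struct {
	Name                 string
	WaitingReason        string // corresponds to state.waiting.reason
	TerminatedReason     string // corresponds to state.terminated.reason
	LastTerminatedReason string // corresponds to lastState.terminated.reason
}

// waitingFailures lists the waiting reasons treated as incidents.
var waitingFailures = map[string]bool{
	"CrashLoopBackOff": true,
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
}

// UnhealthyReason returns the failure reason and true when the container is
// in one of the states listed in the issue. A waiting failure takes priority
// over an earlier OOM kill, which is a design choice of this sketch.
func UnhealthyReason(cs ContainerStatus) (string, bool) {
	if waitingFailures[cs.WaitingReason] {
		return cs.WaitingReason, true
	}
	if cs.TerminatedReason == "OOMKilled" || cs.LastTerminatedReason == "OOMKilled" {
		return "OOMKilled", true
	}
	return "", false
}

func main() {
	statuses := []ContainerStatus{
		{Name: "server", WaitingReason: "ImagePullBackOff"},
		{Name: "sidecar", LastTerminatedReason: "OOMKilled"},
		{Name: "proxy", WaitingReason: "ContainerCreating"},
	}
	for _, cs := range statuses {
		if reason, bad := UnhealthyReason(cs); bad {
			fmt.Printf("%s: %s\n", cs.Name, reason)
		}
	}
}
```

A real implementation would read these fields from the pod status returned by the Kubernetes API rather than from a hand-built struct.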

Non-goals for v1

  • Do not add broad automatic remediation in the first version.
  • Do not duplicate all cluster doctor logic in a second implementation.
  • Do not turn the existing MCPServer reconciler into a general cluster incident system.

Acceptance Criteria

  • cluster doctor and the watcher share core diagnostic/check logic where practical.
  • A deployed watcher can report structured incidents for at least MCPServer readiness, pod image pull failures, and unavailable deployments.
  • Incidents include namespace, resource kind/name, severity, reason, human-readable message, and suggested remediation.
  • Watcher failures do not block normal MCPServer reconciliation.
  • Tests cover the shared check logic and at least one watcher emission path.
