Add continuous cluster watcher for runtime health incidents #106

## Context

MCP Runtime already has a manual `mcp-runtime cluster doctor` command for point-in-time diagnostics, and the operator already reconciles `MCPServer` resources into Deployments, Services, Ingresses, policy ConfigMaps, and status conditions.

The missing piece is an always-on cluster listener that detects unhealthy platform/runtime states and surfaces them to Sentinel/UI without requiring a user to run `cluster doctor` manually.

## Proposal

Add a watcher component (working name `sentinel-agent` or `cluster-watcher`) that continuously observes cluster health and emits structured incidents. It should reuse the existing `cluster doctor` checks wherever possible instead of duplicating diagnostic logic.

Suggested shape:

- Extract reusable checks into a shared package, for example `pkg/diagnostics` or `internal/healthchecks`.
- Keep `cluster doctor` as the human-facing CLI formatter for those checks.
- Add a watcher service/controller that runs checks periodically and/or watches Kubernetes events.
- Emit findings to Sentinel API, Kubernetes Events, and/or metrics so the UI can show active incidents.

## Initial Signals

The first version should detect and report:

- `MCPServer` stuck in `Error`, `Pending`, or long-running `PartiallyReady`.
- Pods in `CrashLoopBackOff`, `ImagePullBackOff`, `ErrImagePull`, or `OOMKilled`.
- Deployments with unavailable replicas.
- Services with no endpoints.
- Ingresses that are not admitted or that lack load balancer status when strict readiness is expected.
- Certificate readiness failures.
- Registry pull/routing failures where detectable.
- Gateway or policy materialization issues.
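
As a rough illustration of how two of these signals might be detected, the sketch below classifies pod and Deployment status. It assumes simplified local types (`ContainerStatus`, `DeploymentStatus`, `Signal`) instead of the real Kubernetes API structs, and the function names and the `UnavailableReplicas` reason string are hypothetical. In Kubernetes, `CrashLoopBackOff`, `ImagePullBackOff`, and `ErrImagePull` are reported as a container's waiting reason, while `OOMKilled` is reported as a terminated reason, so the sketch reads them from different fields.

```go
package main

import "fmt"

// ContainerStatus mirrors only the container status fields this sketch needs.
// A real watcher would read them from the Kubernetes API types instead.
type ContainerStatus struct {
	Name                 string
	WaitingReason        string // corresponds to state.waiting.reason
	LastTerminatedReason string // corresponds to lastState.terminated.reason
}

// DeploymentStatus mirrors the replica counter relevant to this sketch.
type DeploymentStatus struct {
	Name                string
	UnavailableReplicas int32
}

// Signal is a hypothetical record of one detected condition.
type Signal struct {
	Kind   string
	Name   string
	Reason string
}

// waitingReasons lists the waiting reasons named in the issue.
var waitingReasons = map[string]bool{
	"CrashLoopBackOff": true,
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
}

// PodSignals returns one signal per matching container condition.
func PodSignals(pod string, statuses []ContainerStatus) []Signal {
	var out []Signal
	for _, cs := range statuses {
		if waitingReasons[cs.WaitingReason] {
			out = append(out, Signal{Kind: "Pod", Name: pod, Reason: cs.WaitingReason})
		}
		if cs.LastTerminatedReason == "OOMKilled" {
			out = append(out, Signal{Kind: "Pod", Name: pod, Reason: "OOMKilled"})
		}
	}
	return out
}

// DeploymentSignals flags a Deployment that reports unavailable replicas.
func DeploymentSignals(d DeploymentStatus) []Signal {
	if d.UnavailableReplicas > 0 {
		return []Signal{{Kind: "Deployment", Name: d.Name, Reason: "UnavailableReplicas"}}
	}
	return nil
}

func main() {
	signals := PodSignals("example-pod", []ContainerStatus{
		{Name: "server", WaitingReason: "ImagePullBackOff"},
		{Name: "sidecar", LastTerminatedReason: "OOMKilled"},
	})
	signals = append(signals, DeploymentSignals(DeploymentStatus{Name: "example", UnavailableReplicas: 1})...)
	for _, s := range signals {
		fmt.Printf("%s/%s: %s\n", s.Kind, s.Name, s.Reason)
	}
}
```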

## Non-goals for v1

- Do not add broad automatic remediation in the first version.
- Do not duplicate all `cluster doctor` logic in a second implementation.
- Do not turn the existing `MCPServer` reconciler into a general cluster incident system.

## Acceptance Criteria

- `cluster doctor` and the watcher share core diagnostic/check logic where practical.
- A deployed watcher can report structured incidents for at least `MCPServer` readiness, pod image pull failures, and unavailable deployments.
- Incidents include namespace, resource kind/name, severity, reason, human-readable message, and suggested remediation.
- Watcher failures do not block normal `MCPServer` reconciliation.
- Tests cover the shared check logic and at least one watcher emission path.
