Patent Pending: This system and its underlying methods for GPU cluster validation, variance-based "fail-slow" detection, and ring-topology interconnect profiling are the subject of a pending U.S. Patent Application (Application No. 63/987,090).
All rights reserved. Unauthorized commercial use, reproduction, or distribution of these methods is strictly prohibited under 35 U.S.C. § 111.
A Kubernetes DaemonSet agent that intercepts GPU nodes before Slurm can resume them. When a node transitions to Ready after a reboot or join event, the agent runs a synthetic GPU workload and quarantines the node if it exhibits straggler behavior — before any job ever lands on it.
The taint applied is `sunk.coreweave.com/zombie-quarantine:NoSchedule`. Nodes that pass are cleared immediately.
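The quarantine taint handling can be sketched in-memory with plain structs. This is a hypothetical model for illustration — the real agent patches the Node object through the Kubernetes API — but it shows the idempotent apply/remove semantics the README describes. The `Taint` type and helper names here are assumptions, not the repo's actual code.

```go
package main

import "fmt"

// Plain-struct stand-in for corev1.Taint; the real agent patches nodes via client-go.
type Taint struct {
	Key    string
	Effect string
}

const quarantineKey = "sunk.coreweave.com/zombie-quarantine"

// applyTaint adds the quarantine taint if it is not already present (idempotent).
func applyTaint(taints []Taint) []Taint {
	for _, t := range taints {
		if t.Key == quarantineKey {
			return taints // already quarantined
		}
	}
	return append(taints, Taint{Key: quarantineKey, Effect: "NoSchedule"})
}

// removeTaint clears the quarantine taint, leaving any other taints intact.
func removeTaint(taints []Taint) []Taint {
	out := taints[:0:0]
	for _, t := range taints {
		if t.Key != quarantineKey {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	ts := applyTaint(nil)
	ts = applyTaint(ts) // duplicate apply is a no-op
	fmt.Println(len(ts), ts[0].Key)
	fmt.Println(len(removeTaint(ts)))
}
```

Idempotence matters here because duplicate Ready events can trigger overlapping reconciles; applying the same taint twice must not produce two entries.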
GPU nodes regularly return to a Ready state without being fit for distributed training. Common failure modes:
- Fail-slow: mean GEMM latency looks acceptable but the GPU throttles intermittently, creating high variance across runs. Every AllReduce barrier on a shared job stalls waiting for the slow node.
- Clock derating: the SM clock remains stuck well below P0 after a thermal event, producing consistent slowdowns that compound across thousands of training steps.
- NVLink failure: the GEMM pulse passes but P2P bandwidth between devices is severely degraded, making multi-GPU AllReduce across that node unusable.
- ECC errors / thermal recovery incomplete: detected pre-flight, node quarantined without running the pulse at all.
Single-pass latency checks miss all of these. This agent runs five timed passes per device and evaluates mean, coefficient of variation, NVLink bandwidth, and post-pulse clock state before clearing a node.
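The mean-plus-variance evaluation can be sketched as a small Go function. The numbers in the example are illustrative, not from the repo: they show how a throttling GPU can pass a mean-latency gate while failing the coefficient-of-variation gate.

```go
package main

import (
	"fmt"
	"math"
)

// meanCV returns the mean and coefficient of variation (stddev / mean)
// of the timed GEMM passes — the two statistics the variance check uses.
func meanCV(latencies []float64) (mean, cv float64) {
	for _, v := range latencies {
		mean += v
	}
	mean /= float64(len(latencies))
	var ss float64
	for _, v := range latencies {
		d := v - mean
		ss += d * d
	}
	cv = math.Sqrt(ss/float64(len(latencies))) / mean
	return mean, cv
}

func main() {
	// A fail-slow pattern: mean is 60 ms (passes a 100 ms A100 gate),
	// but one throttled pass blows the 0.20 CV threshold.
	m, cv := meanCV([]float64{40, 41, 39, 40, 140})
	fmt.Printf("mean=%.1f ms cv=%.2f straggler=%v\n", m, cv, cv > 0.20)
	// → mean=60.0 ms cv=0.67 straggler=true
}
```

This is exactly why the single-pass check fails: averaged over five runs the node looks healthy, but the population standard deviation exposes the intermittent throttle.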
For each GPU on the node:
- Pre-flight — queries `nvidia-smi` for uncorrectable ECC errors and idle temperature. Any ECC error or a temperature above 70°C quarantines immediately.
- GEMM pulse — five timed 2048×2048 FP32 matrix multiplications via a CUDA shared library. Computes mean latency and coefficient of variation across runs.
- P2P ring check — 100 MiB `cudaMemcpyPeer` across each adjacent GPU pair in ring order (0→1, 1→2, …, N−1→0). Catches any single broken NVLink segment.
- Clock validation — queries `nvidia-smi` post-pulse. SM clock must be ≥ 50% of device max, confirming the device boosted to P0 under load.
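The ring-order pair enumeration is simple enough to sketch directly. Since every adjacent NVLink segment on the ring appears in exactly one pair, a single broken link fails exactly one copy:

```go
package main

import "fmt"

// ringPairs returns the adjacent device pairs checked by the P2P ring:
// (0→1, 1→2, …, N-1→0). Each link on the ring is exercised exactly once.
func ringPairs(n int) [][2]int {
	pairs := make([][2]int, 0, n)
	for i := 0; i < n; i++ {
		pairs = append(pairs, [2]int{i, (i + 1) % n})
	}
	return pairs
}

func main() {
	fmt.Println(ringPairs(4)) // → [[0 1] [1 2] [2 3] [3 0]]
}
```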
Thresholds are auto-calibrated to the detected GPU architecture:
| Check | H100 / H200 | A100 | B200 / GB200 | Default |
|---|---|---|---|---|
| Mean GEMM latency | 35 ms | 100 ms | 15 ms | 500 ms |
| Coefficient of variation | 20% | 20% | 20% | 20% |
| P2P bandwidth | 5 GB/s | 5 GB/s | 5 GB/s | 5 GB/s |
| Idle temperature | 70°C | 70°C | 70°C | 70°C |
| Post-pulse SM clock | ≥ 50% max | ≥ 50% max | ≥ 50% max | ≥ 50% max |
All thresholds are overridable via environment variables (`PULSE_THRESHOLD_MS`, `PULSE_CV_MAX`, `P2P_MIN_GBS`, `IDLE_TEMP_MAX`).
```
K8s node watch loop (watch.Modified / watch.Added)
└─ edge-detect Ready transition (within 5-minute window)
   └─ ReconcileNode()
      ├─ preflight()         — nvidia-smi ECC + temp
      ├─ runDevicePulse()×N  — per-device GEMM timing
      ├─ checkP2P()          — cudaMemcpyPeer ring topology
      ├─ validateClocks()    — post-pulse SM clock check
      ├─ pass → removeTaint (clear zombie-quarantine)
      └─ fail → applyTaint + GPUStraggler condition
```
Metrics are served on `:9090/metrics`. The agent runs as a DaemonSet with one replica per GPU node; it watches only its own node via the downward-API `NODE_NAME` env var. Per-node reconciliation is guarded by a `sync.Mutex` — duplicate Ready events are discarded while a pulse is in flight.
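The discard-while-in-flight guard can be sketched with `sync.Mutex.TryLock` (Go 1.18+): a duplicate Ready event arriving mid-pulse is dropped rather than queued. The `reconciler` type here is an assumption for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

type reconciler struct {
	mu sync.Mutex
}

// reconcile runs fn unless a pulse is already in flight, in which case it
// returns false immediately — duplicate Ready events are discarded, not queued.
func (r *reconciler) reconcile(fn func()) bool {
	if !r.mu.TryLock() {
		return false
	}
	defer r.mu.Unlock()
	fn()
	return true
}

func main() {
	var r reconciler
	first := r.reconcile(func() {
		// A re-entrant event during the pulse is discarded.
		fmt.Println("duplicate accepted:", r.reconcile(func() {}))
	})
	fmt.Println("first accepted:", first)
}
```

Discarding rather than queuing is the right choice here: a second Ready event during a pulse carries no new information, and the node will be re-evaluated on its next genuine Ready transition.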
Building requires the CUDA toolkit and `nvcc` on the build host.
```
# build the shared library and agent binary
make

# CI / no-GPU builds (stub pulse, always passes)
make go-stub

# GPU architecture targets — set in Makefile NVCC_FLAGS:
#   sm_70   Volta     (V100)
#   sm_80   Ampere    (A100)
#   sm_90   Hopper    (H100) — default
#   sm_100  Blackwell (B200)
```

The CUDA kernel compiles to `cuda/libgpupulse.so`. The Go binary links against it via CGO (`-tags cuda`).
```
kubectl apply -f deploy/rbac.yaml
kubectl apply -f deploy/daemonset.yaml
```

The agent needs:

- `NODE_NAME` set via the downward API
- the GPU device plugin (`nvidia.com/gpu` resource) and `runtimeClassName: nvidia`
- RBAC: `get`, `watch`, `patch` on `nodes` and `nodes/status`
A self-contained script is included for generating structured evidence on a bare-metal GPU instance (RunPod, Lambda Labs, etc.):
```
git clone https://github.com/justin-oleary/straggler-shield
cd straggler-shield
bash scripts/run_hardware_benchmark.sh 10   # 10 runs, writes evidence.json
```

Requires `nvcc`, `sudo`, and a Debian/Ubuntu host. Installs Go if absent.
`sunk.coreweave.com/zombie-quarantine:NoSchedule`
A GPUStraggler node condition is also written to the status subresource with the failure reason and measured values. Both are cleared atomically when a node subsequently passes the pulse.
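The condition upsert can be sketched with plain structs. Field names mirror `corev1.NodeCondition`, but the helper below is a hypothetical model — the real agent patches `nodes/status` through the API.

```go
package main

import "fmt"

// Plain-struct stand-in for corev1.NodeCondition.
type Condition struct {
	Type, Status, Reason, Message string
}

// setStraggler upserts the GPUStraggler condition: Status=True with the
// failure reason and measured values on a fail, Status=False on a pass.
func setStraggler(conds []Condition, failed bool, reason, msg string) []Condition {
	c := Condition{Type: "GPUStraggler", Status: "False", Reason: "PulsePassed"}
	if failed {
		c.Status, c.Reason, c.Message = "True", reason, msg
	}
	for i := range conds {
		if conds[i].Type == "GPUStraggler" {
			conds[i] = c // update in place rather than appending a duplicate
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	conds := setStraggler(nil, true, "high_variance", "cv=1.14 > 0.20")
	fmt.Println(conds[0].Status, conds[0].Reason) // → True high_variance
	conds = setStraggler(conds, false, "", "")
	fmt.Println(len(conds), conds[0].Status) // → 1 False
}
```

Upserting by `Type` (rather than appending) keeps the condition list stable across repeated pulses, matching the atomic clear-on-pass behavior described above.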
| Metric | Type | Labels | Description |
|---|---|---|---|
| `gpu_validator_pulse_duration_seconds` | Histogram | `device` | Mean GEMM latency per device per validation cycle |
| `gpu_validator_pulse_cv` | Gauge | `device` | Coefficient of variation across GEMM runs |
| `gpu_validator_straggler_detected_total` | Counter | `reason` | Quarantine events by failure reason |
Reason values: `latency_threshold_exceeded`, `high_variance`, `interconnect_degraded`, `pre_flight_failure`.
The failure modes mitigated by straggler-shield are actively impacting large-scale training clusters. For context on the community's ongoing efforts to handle silent degradation and ECC-reporting gaps natively, see:
- NVIDIA/k8s-device-plugin #519: GPU health status exposure and remediation methods
- NVIDIA/k8s-device-plugin #267: k8s-device-plugin seems to think gpu healthy when it is not usable due to Uncorrectable ECC Error
- Falcon: A silent fail-slow straggler pattern
Two real-hardware runs are committed to this repo, generated on RunPod A100 SXM4 80GB instances on 2026-02-20:
`evidence.json` — Node 221a215e1c25 (driver 580.126.09). Pre-flight detected 1001 uncorrectable aggregate ECC errors on GPU 0. All 10 runs quarantined without firing the GEMM pulse. This is the HBM row-remapping failure mode documented in the Meta post-mortem.

`evidence_computational_straggler.json` — Node 21cc925acf9c (driver 570.195.03). ECC clean, pre-flight passed. GEMM pulse fired; runs 1 and 2 returned CV=1.136 and CV=1.206 — 5–6× above the 0.20 threshold — on a node with a mean latency of 59 ms (well under the 100 ms A100 threshold). Textbook fail-slow: latency looks acceptable, variance exposes the fault.
Both are raw JSON as emitted by the benchmark binary. No modifications.
- Falcon: arXiv:2410.12588 — fail-slow GPU detection via iteration-time variance
- CoreWeave SUNK — zombie node quarantine taint convention
- Meta infrastructure post-mortem on H100 HBM3 ECC failures driving job stalls