⚖️ Intellectual Property Notice

Patent Pending: This system and its underlying methods for GPU cluster validation, variance-based "fail-slow" detection, and ring-topology interconnect profiling are the subject of a pending U.S. Patent Application (Application No. 63/987,090).

All rights reserved. Unauthorized commercial use, reproduction, or distribution of these methods is strictly prohibited under 35 U.S.C. § 111.


straggler-shield


A Kubernetes DaemonSet agent that intercepts GPU nodes before Slurm can resume them. When a node transitions to Ready after a reboot or join event, the agent runs a synthetic GPU workload and quarantines the node if it exhibits straggler behavior — before any job ever lands on it.

The taint applied is sunk.coreweave.com/zombie-quarantine:NoSchedule. Nodes that pass are cleared immediately.

The problem

GPU nodes regularly return to a Ready state without being fit for distributed training. Common failure modes:

  • Fail-slow: mean GEMM latency looks acceptable but the GPU throttles intermittently, creating high variance across runs. Every AllReduce barrier on a shared job stalls waiting for the slow node.
  • Clock derating: SM clock remains stuck well below P0 after a thermal event, producing consistent slowdowns that compound across thousands of training steps.
  • NVLink failure: the GEMM pulse passes but P2P bandwidth between devices is severely degraded, making multi-GPU AllReduce across that node unusable.
  • ECC errors / thermal recovery incomplete: detected pre-flight, node quarantined without running the pulse at all.

Single-pass latency checks miss all of these. This agent runs five timed passes per device and evaluates mean, coefficient of variation, NVLink bandwidth, and post-pulse clock state before clearing a node.
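
The fail-slow signal in particular reduces to one statistic: the coefficient of variation (standard deviation divided by mean) across the timed passes. A minimal sketch of that computation in Go, using hypothetical helper names rather than the agent's actual internals:

package main

import (
	"fmt"
	"math"
)

// meanAndCV returns the mean and coefficient of variation (stddev / mean)
// across timed GEMM passes. Hypothetical helper for illustration only.
func meanAndCV(latenciesMs []float64) (mean, cv float64) {
	for _, v := range latenciesMs {
		mean += v
	}
	mean /= float64(len(latenciesMs))
	var variance float64
	for _, v := range latenciesMs {
		d := v - mean
		variance += d * d
	}
	variance /= float64(len(latenciesMs))
	return mean, math.Sqrt(variance) / mean
}

func main() {
	// An A100 whose mean (83.6 ms) clears the 100 ms threshold but whose
	// intermittent throttling yields CV of roughly 0.62, far above 0.20.
	mean, cv := meanAndCV([]float64{40, 150, 42, 145, 41})
	fmt.Printf("mean=%.1f ms  cv=%.2f\n", mean, cv)
}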

What it runs

For each GPU on the node:

  1. Pre-flight — queries nvidia-smi for uncorrectable ECC errors and idle temperature. Any uncorrectable ECC error or an idle temperature above 70°C quarantines the node immediately.
  2. GEMM pulse — five timed 2048×2048 FP32 matrix multiplications via a CUDA shared library. Computes mean latency and coefficient of variation across runs.
  3. P2P ring check — 100 MiB cudaMemcpyPeer across each adjacent GPU pair in ring order (0→1, 1→2, …, N-1→0). Catches any single broken NVLink segment; the pair ordering is sketched after this list.
  4. Clock validation — queries nvidia-smi post-pulse. SM clock must be ≥ 50% of device max, confirming the device boosted to P0 under load.

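The ring ordering in step 3 keeps the interconnect check linear: N copies cover every adjacent NVLink segment exactly once, instead of N² pairwise transfers. A sketch of the pair enumeration (hypothetical helper name; the actual cudaMemcpyPeer calls live in the CUDA library):

package main

import "fmt"

// ringPairs enumerates adjacent device pairs in ring order, so each GPU
// appears once as source and once as destination: 0→1, 1→2, …, N-1→0.
func ringPairs(n int) [][2]int {
	pairs := make([][2]int, 0, n)
	for src := 0; src < n; src++ {
		pairs = append(pairs, [2]int{src, (src + 1) % n})
	}
	return pairs
}

func main() {
	fmt.Println(ringPairs(4)) // [[0 1] [1 2] [2 3] [3 0]]
}
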
Thresholds are auto-calibrated to the detected GPU architecture:

Check                      H100 / H200   A100        B200 / GB200   Default
Mean GEMM latency          35 ms         100 ms      15 ms          500 ms
Coefficient of variation   20%           20%         20%            20%
P2P bandwidth              5 GB/s        5 GB/s      5 GB/s         5 GB/s
Idle temperature           70°C          70°C        70°C           70°C
Post-pulse SM clock        ≥ 50% max     ≥ 50% max   ≥ 50% max      ≥ 50% max

All thresholds are overridable via environment variables (PULSE_THRESHOLD_MS, PULSE_CV_MAX, P2P_MIN_GBS, IDLE_TEMP_MAX).
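
A plausible shape for the override logic, shown as an assumption rather than the agent's actual code: each env var, when set and parseable, replaces the architecture-calibrated default.

package main

import (
	"fmt"
	"os"
	"strconv"
)

// envFloat returns the override env var if it parses as a float, else the
// architecture-calibrated default. Sketch only; names are illustrative.
func envFloat(key string, def float64) float64 {
	if v := os.Getenv(key); v != "" {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			return f
		}
	}
	return def
}

func main() {
	// On an A100 node with no override set, this prints 100.
	fmt.Println(envFloat("PULSE_THRESHOLD_MS", 100))
}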

Architecture

K8s node watch loop (watch.Modified/Added)
  └─ edge-detect Ready transition (within 5-minute window)
       └─ ReconcileNode()
            ├─ preflight()            nvidia-smi ECC + temp
            ├─ runDevicePulse() × N   per-device GEMM timing
            ├─ checkP2P() × ring      cudaMemcpyPeer ring topology
            └─ validateClocks()       post-pulse SM clock check
                 ├─ pass → removeTaint (clear zombie-quarantine)
                 └─ fail → applyTaint + GPUStraggler condition

Metrics are served on :9090/metrics. The agent runs as a DaemonSet with one replica per GPU node; it watches only its own node via the downward API NODE_NAME env var. Per-node reconciliation is guarded by a sync.Mutex — duplicate Ready events are discarded while a pulse is in flight.
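
The duplicate-event guard can be written as a non-blocking lock acquisition; a sketch assuming sync.Mutex.TryLock (Go 1.18+), not necessarily how the agent implements it:

package main

import (
	"fmt"
	"sync"
)

// reconcileGuard discards Ready events that arrive while a pulse is in flight.
type reconcileGuard struct{ mu sync.Mutex }

// tryReconcile runs the pulse only if no other reconciliation holds the lock;
// duplicate events return false immediately instead of queueing.
func (g *reconcileGuard) tryReconcile(pulse func()) bool {
	if !g.mu.TryLock() {
		return false
	}
	defer g.mu.Unlock()
	pulse()
	return true
}

func main() {
	var g reconcileGuard
	fmt.Println(g.tryReconcile(func() { fmt.Println("pulse running") }))
}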

Building

Requires CUDA toolkit and nvcc on the build host.

# build the shared library and agent binary
make

# CI / no-GPU builds (stub pulse, always passes)
make go-stub

# GPU architecture targets — set in Makefile NVCC_FLAGS:
#   sm_70  Volta   (V100)
#   sm_80  Ampere  (A100)
#   sm_90  Hopper  (H100) — default
#   sm_100 Blackwell (B200)

The CUDA kernel compiles to cuda/libgpupulse.so. The Go binary links against it via CGO (-tags cuda).
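
One common way to realize the cuda/stub split is a pair of build-tagged Go files. The sketch below is an assumption about the layout; the header name, C symbol, and function signature are illustrative, not taken from the repo.

// pulse_cuda.go: compiled with `go build -tags cuda`

//go:build cuda

package pulse

/*
#cgo LDFLAGS: -L${SRCDIR}/cuda -lgpupulse
#include "gpupulse.h"   // hypothetical header exposing the GEMM pulse
*/
import "C"

// RunPulse times one GEMM pass on the given device via libgpupulse.so.
func RunPulse(device int) float64 {
	return float64(C.run_gemm_pulse(C.int(device))) // hypothetical C symbol
}

// pulse_stub.go: compiled without the tag (`make go-stub`), always passes

//go:build !cuda

package pulse

// RunPulse reports a zero-latency pass so CI / no-GPU builds succeed.
func RunPulse(device int) float64 { return 0 }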

Deploying

kubectl apply -f deploy/rbac.yaml
kubectl apply -f deploy/daemonset.yaml

The agent needs:

  • NODE_NAME set via the downward API
  • GPU device plugin (nvidia.com/gpu resource) and runtimeClassName: nvidia
  • RBAC: get, watch, patch on nodes and nodes/status
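
The single-node watch implied by NODE_NAME is conventionally done with a field selector, so the get/watch RBAC grant never fans out across the cluster. A hedged client-go sketch; the README does not state which client library the agent uses:

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only this node, named via the downward API.
	w, err := client.CoreV1().Nodes().Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "metadata.name=" + os.Getenv("NODE_NAME"),
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		fmt.Println("node event:", ev.Type) // Added / Modified feed the edge detector
	}
}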

Benchmarking on real hardware

A self-contained script is included for generating structured evidence on a bare-metal GPU instance (RunPod, Lambda Labs, etc.):

git clone https://github.com/justin-oleary/straggler-shield
cd straggler-shield
bash scripts/run_hardware_benchmark.sh 10   # 10 runs, writes evidence.json

Requires nvcc, sudo, and a Debian/Ubuntu host. Installs Go if absent.

Quarantine taint

sunk.coreweave.com/zombie-quarantine:NoSchedule

A GPUStraggler node condition is also written to the status subresource with the failure reason and measured values. Both are cleared atomically when a node subsequently passes the pulse.
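
For reference, the taint and condition as they would appear in the node object, expressed with the upstream Go types. Field values below are illustrative, and the patch plumbing is omitted:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var quarantineTaint = corev1.Taint{
	Key:    "sunk.coreweave.com/zombie-quarantine",
	Effect: corev1.TaintEffectNoSchedule,
}

// Illustrative condition payload; reason and message wording are assumptions.
var stragglerCondition = corev1.NodeCondition{
	Type:               "GPUStraggler",
	Status:             corev1.ConditionTrue,
	Reason:             "HighVariance",
	Message:            "GEMM CV 1.14 exceeded 0.20 threshold",
	LastTransitionTime: metav1.Now(),
}

func main() {
	fmt.Println(quarantineTaint.ToString()) // sunk.coreweave.com/zombie-quarantine:NoSchedule
	fmt.Println(stragglerCondition.Type)
}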

Metrics

Metric                                   Type       Labels   Description
gpu_validator_pulse_duration_seconds     Histogram  device   Mean GEMM latency per device per validation cycle
gpu_validator_pulse_cv                   Gauge      device   Coefficient of variation across GEMM runs
gpu_validator_straggler_detected_total   Counter    reason   Quarantine events by failure reason

Reason values: latency_threshold_exceeded, high_variance, interconnect_degraded, pre_flight_failure.
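
Declared with the Prometheus Go client, the table above maps to the following. Bucket choices here are an assumption; the README does not specify them:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	pulseDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "gpu_validator_pulse_duration_seconds",
		Help:    "Mean GEMM latency per device per validation cycle.",
		Buckets: prometheus.ExponentialBuckets(0.005, 2, 12), // assumed buckets
	}, []string{"device"})

	pulseCV = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "gpu_validator_pulse_cv",
		Help: "Coefficient of variation across GEMM runs.",
	}, []string{"device"})

	stragglerTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "gpu_validator_straggler_detected_total",
		Help: "Quarantine events by failure reason.",
	}, []string{"reason"})
)

func main() {
	stragglerTotal.WithLabelValues("high_variance").Inc()
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil) // matches the agent's :9090/metrics endpoint
}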

Context & Prior Art

The failure modes mitigated by straggler-shield are actively impacting large-scale training clusters. For context on the community's ongoing efforts to handle silent degradation and ECC gaps natively, see the References section at the end of this README.

Evidence

Two real-hardware runs are committed to this repo, generated on RunPod A100 SXM4 80GB instances on 2026-02-20:

evidence.json — Node 221a215e1c25 (driver 580.126.09). Pre-flight detected 1001 uncorrectable aggregate ECC errors on GPU 0. All 10 runs quarantined without firing the GEMM pulse. This is the HBM row-remapping failure mode documented in the Meta post-mortem.

evidence_computational_straggler.json — Node 21cc925acf9c (driver 570.195.03). ECC clean, pre-flight passed. GEMM pulse fired; runs 1 and 2 returned CV=1.136 and CV=1.206 — 5–6× above the 0.20 threshold — on a node with a mean latency of 59 ms (well under the 100 ms A100 threshold). Textbook fail-slow: latency looks acceptable, variance exposes the fault.

Both are raw JSON as emitted by the benchmark binary. No modifications.

References

  • Falcon: arXiv:2410.12588 — fail-slow GPU detection via iteration-time variance
  • CoreWeave SUNK — zombie node quarantine taint convention
  • Meta infrastructure post-mortem on H100 HBM3 ECC failures driving job stalls
