[codex] migrate GPU stack to CUDA 13 #729

Draft
bradhilton wants to merge 1 commit into main from codex/cuda13-migration

@bradhilton
Collaborator

Summary

  • migrate ART GPU dependencies, lockfiles, and image config from CUDA 12.8/cu12 to CUDA 13/cu13
  • move Torch/TorchVision/torchao to cu130 wheels and TransformerEngine to transformer-engine-cu13
  • update vLLM runtime pins away from cu128/cu129/cu12 and rebuild CUDA source extensions in the CUDA 13 image
  • update CI/cache/dev image defaults and CUDA/CUDNN/NCCL include/library paths for CUDA 13
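One way to confirm that no CUDA 12 pins survive a migration like this is to scan the lockfiles for leftover `cu12`-family tags. The sketch below is illustrative and not part of this PR: `find_stale_cuda12_tags` is a hypothetical helper, the regex assumes stale pins appear as `cu12` optionally followed by one digit, and the sample lines are made up.

```python
import re

# Matches cu12, cu128, cu129, etc., but not cu13 or cu130.
# Assumption: a stale pin looks like "cu12" optionally followed by one digit.
STALE_TAG = re.compile(r"(?<![A-Za-z0-9])cu12\d?(?![0-9])")


def find_stale_cuda12_tags(text: str) -> list[tuple[int, str]]:
    """Return (line_number, tag) pairs for every CUDA 12 tag found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in STALE_TAG.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Made-up sample lines, not copied from the real lockfiles.
    sample = "\n".join([
        "torch==2.11.0+cu130",
        "transformer-engine-cu13",
        "nvidia-cublas-cu12",
        "somepkg+cu128",
    ])
    for lineno, tag in find_stale_cuda12_tags(sample):
        print(f"line {lineno}: {tag}")
```

In practice the text would come from reading the lockfile contents; an empty result suggests no `cu12`-style tags remain, though it says nothing about packages that do not encode the CUDA version in their name.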

Validation

Validated on CoreWeave ext-collab2 node g122252:

  • GPU: NVIDIA H200
  • Driver: 595.71.05
  • nvidia-smi CUDA: 13.2
  • nvcc: CUDA 13.0
  • NCore image tag: 2.45.0
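The CUDA version shown by `nvidia-smi` (13.2) is the highest runtime version the driver supports, while `nvcc` reports the installed toolkit (13.0), so the two values above are compatible rather than mismatched. A minimal sketch of that check follows; the function names are hypothetical, versions are assumed to be in `major.minor` form, and forward-compatibility packages are ignored.

```python
def parse_cuda_version(text: str) -> tuple[int, int]:
    """Parse a 'major.minor' CUDA version string into a comparable tuple."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def toolkit_supported_by_driver(driver_cuda: str, toolkit_cuda: str) -> bool:
    """True when the toolkit version does not exceed what the driver reports."""
    return parse_cuda_version(toolkit_cuda) <= parse_cuda_version(driver_cuda)


if __name__ == "__main__":
    # Values taken from the validation environment listed above.
    print(toolkit_supported_by_driver(driver_cuda="13.2", toolkit_cuda="13.0"))
```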

Commands/checks run:

  • uv lock
  • cd vllm_runtime && uv lock
  • uv lock --check && (cd vllm_runtime && uv lock --check)
  • uv sync --extra backend --extra megatron --extra tinker
  • cd vllm_runtime && uv sync --no-dev
  • core CUDA import smoke confirming torch==2.11.0+cu130, CUDA 13.0, and H200 visibility
  • extension import smoke for torchao, transformer_engine.pytorch, apex, deep_ep_cpp, quack.layout_utils, and PyTorch flex attention
  • ldd sanity check on 14 CUDA shared objects across Apex, DeepEP, TransformerEngine, and torchao: no "not found" entries, and in particular no missing libcudart.so.13
  • focused Megatron tests: 21 passed, covering packed/flattened attention, compile flags, and provider support
  • tiny Kubernetes dev smoke: uv run --extra backend --extra megatron dev/yes-no-maybe-megatron.py
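The extension import smoke and the ldd check listed above could be scripted roughly as follows. This is a sketch, not the script used for this PR: `smoke_import` and `ldd_missing` are hypothetical helpers, the module names are the ones listed in the checks, and `ldd_missing` assumes the usual `libfoo.so => not found` line format in `ldd` output.

```python
import importlib


def smoke_import(module_names: list[str]) -> dict[str, str]:
    """Try to import each module; map each failed name to its error message."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # compiled extensions can fail with more than ImportError
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


def ldd_missing(ldd_output: str) -> list[str]:
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


if __name__ == "__main__":
    # Module names as listed in the extension import smoke above.
    modules = [
        "torchao",
        "transformer_engine.pytorch",
        "apex",
        "deep_ep_cpp",
        "quack.layout_utils",
    ]
    for name, error in smoke_import(modules).items():
        print(f"FAILED {name}: {error}")
```

Feeding `ldd_missing` the captured output of `ldd` for each shared object and asserting an empty list mirrors the "no not found entries" result reported here.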

Final image built and prewarmed:

  • docker.io/bradhiltonnw/art-gpu:cuda13-codex-20260612-122156
  • digest: sha256:009d5aa9cbe46a5a2855d44adc6bc3c272b268eb264bccc77e0d247f1b4ce7f0

Notes

  • Remaining non-fatal runtime warnings are documented: xFormers is still pulled in by Unsloth, and its current wheel warns about a cu128 build; this path is not used by the Megatron/flex-attn smoke.
  • The tiny dev smoke reached train/eval but skipped gradient updates because one rollout produced no advantage signal.
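The note does not say how advantages are computed. Assuming a group-relative baseline, where each reward is centred on the mean of its group (an assumption made for illustration, not something this PR confirms), a group whose rewards are all equal yields zero advantages and therefore nothing to update on. The function names below are hypothetical.

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Centre each reward on the group mean (no std normalisation in this sketch)."""
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def has_advantage_signal(rewards: list[float], eps: float = 1e-8) -> bool:
    """True when at least one advantage in the group is meaningfully non-zero."""
    return any(abs(a) > eps for a in group_relative_advantages(rewards))


if __name__ == "__main__":
    print(has_advantage_signal([0.5, 0.5, 0.5]))  # identical rewards
    print(has_advantage_signal([0.0, 1.0]))       # differing rewards
```

Under this assumption, skipping the gradient step when no signal is present is expected behaviour for a tiny smoke run rather than a CUDA 13 regression.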
