Programmatic provisioning of RunPod GPU clusters via SkyPilot.
Lets you spin up a pod, stage data, and SSH in with one command — no manual
pod setup, no manual rsync, no stale SSH configs.
This sub-project is infrastructure, not genai code. Treat it as a utility that
any topic under `examples/<topic>/` or `notebooks/<topic>/` can invoke when a
task needs a GPU.
| Scenario | Tool |
|---|---|
| Toy/small tasks that fit on your laptop (PBMC 3k CVAE) | Run locally (CPU or MPS) |
| Anything realistic — Norman 2019 Perturb-seq, JEPA training, diffusion runs | `ops/provision_cluster.py` |
| Multi-GPU distributed training | `ops/provision_cluster.py --gpu a100` (or `h100`) |
If you're reaching for a pod, you should be reaching for this — not the legacy
manual workflow in `runpods/` (see Relationship to `runpods/` below).
- SkyPilot + RunPod credentials installed once:

  ```bash
  pip install "skypilot[runpod]"
  sky check runpod  # should say "enabled"
  ```

- A RunPod network volume named in `configs/gpu_config.yaml` (default: `"AI lab extension"`). Rename it or create one via the RunPod dashboard.
- Data organized as `data/<modality>/<sub-topic>/<dataset>/` at the project root (e.g., `data/scrna/perturb_seq/norman_2019/`, `data/scrna/pbmc/68k/`, `data/bulk/gtex/`).
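To make the layout concrete, here is a small sketch that resolves a `--data-path` value against the project root. The helper name `dataset_dir` is hypothetical — it is not part of the repo, just an illustration of the path convention:

```python
from pathlib import Path

def dataset_dir(root: Path, data_path: str) -> Path:
    """Resolve a <modality>/<sub-topic>[/<dataset>] subpath under data/.

    Hypothetical helper for illustration; not part of ops/.
    """
    parts = data_path.strip("/").split("/")
    if not 2 <= len(parts) <= 3:
        raise ValueError(
            f"expected <modality>/<sub-topic>[/<dataset>], got {data_path!r}"
        )
    return root / "data" / Path(*parts)

print(dataset_dir(Path("."), "scrna/perturb_seq/norman_2019"))
# data/scrna/perturb_seq/norman_2019
```

Note that `--data-path bulk/gtex` (two components) and `--data-path scrna/pbmc/68k` (three) are both valid under this convention.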
```bash
# Default: stages data/scrna/perturb_seq/norman_2019/
python ops/provision_cluster.py --stage-data

# Stage a different dataset
python ops/provision_cluster.py --stage-data --data-path scrna/pbmc/68k
python ops/provision_cluster.py --stage-data --data-path bulk/gtex
```

This uploads the local dataset to the network volume once. Subsequent provisions mount the volume instantly — no re-upload.
```bash
# Default: A40 on RunPod, genailab installed, volume mounted
python ops/provision_cluster.py

# Specific GPU
python ops/provision_cluster.py --gpu a100

# With an extra dependency profile (see gpu_config.yaml)
python ops/provision_cluster.py --model scvi
```

The cluster stays alive until you explicitly tear it down. This is intentional — iterative work on a pod is hundreds of times faster than re-provisioning for each run.
```bash
ssh genai-workspace      # cluster name printed by provision_cluster.py
cd /workspace/genai-lab  # workdir synced from your local repo
python examples/perturbation/P2_cvae_nb_baseline.py
```

```bash
python ops/provision_cluster.py --status    # list running clusters
python ops/provision_cluster.py --down      # interactive teardown
python ops/provision_cluster.py --down-all  # nuke everything
```

Tear down when you're done. Pods cost money whether or not you're using them.
Defaults live in `configs/gpu_config.yaml`.
CLI flags override the file. The most common edits:
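The override order ("flag wins only when given, otherwise the file's default applies") can be sketched as follows. The `DEFAULTS` dict stands in for the parsed `configs/gpu_config.yaml`; the real loading logic lives in `provision_cluster.py` and may differ:

```python
import argparse

# Stand-in for the parsed configs/gpu_config.yaml (values illustrative).
DEFAULTS = {
    "gpu": "a40",
    "data_path": "scrna/perturb_seq/norman_2019",
    "default_model": "none",
}

def effective_config(argv: list[str]) -> dict:
    """Merge CLI flags over file defaults: a flag wins only when it was given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu")
    parser.add_argument("--data-path", dest="data_path")
    parser.add_argument("--model", dest="default_model")
    args = parser.parse_args(argv)
    overrides = {k: v for k, v in vars(args).items() if v is not None}
    return {**DEFAULTS, **overrides}

print(effective_config(["--gpu", "a100"])["gpu"])  # a100
print(effective_config([])["gpu"])                 # a40
```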
| Field | What it does | When to edit |
|---|---|---|
| `gpu` | GPU type (`a40`, `a100`, `h100`, ...) | Change for more/less VRAM |
| `data_path` | Dataset subpath `<modality>/<sub-topic>/<dataset>` | Change for a different dataset |
| `default_model` | Extra pip deps beyond genailab | Set to `scvi`, `scgen`, etc. |
| `use_volume` | Mount the network volume | Set `false` for one-off jobs |
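Put together, a `gpu_config.yaml` shaped like these fields might look as follows. The four field names come from the table; the values, the `volume_name` key, and the exact file layout are illustrative assumptions:

```yaml
# configs/gpu_config.yaml (illustrative sketch, not the real file)
gpu: a40
data_path: scrna/perturb_seq/norman_2019
default_model: none            # or scvi, scgen, cpa, diffusers
use_volume: true
volume_name: "AI lab extension"  # key name hypothetical; see Prerequisites
```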
| Key | GPU | VRAM | ~$/hr |
|---|---|---|---|
| `rtx4000ada` | RTX 4000 Ada | 20 GB | 0.26 |
| `rtxa5000` | RTX A5000 | 24 GB | 0.27 |
| `l4` | L4 | 24 GB | 0.39 |
| `a40` | A40 (default) | 48 GB | 0.39 |
| `rtx4090` | RTX 4090 | 24 GB | 0.59 |
| `rtx5090` | RTX 5090 | 32 GB | 0.89 |
| `a100` | A100 | 80 GB | 1.64 |
| `h100` | H100 | 80 GB | 3.29 |
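For a quick back-of-envelope cost check, the table's approximate rates can be plugged into a one-liner. The rates below are copied from the table and will drift as RunPod pricing changes:

```python
RATES = {  # ~$/hr, from the table above; check live rates before trusting these
    "rtx4000ada": 0.26, "rtxa5000": 0.27, "l4": 0.39, "a40": 0.39,
    "rtx4090": 0.59, "rtx5090": 0.89, "a100": 1.64, "h100": 3.29,
}

def est_cost(gpu_key: str, hours: float) -> float:
    """Approximate cost of keeping one pod of this type alive for `hours`."""
    return round(RATES[gpu_key] * hours, 2)

print(est_cost("a100", 8))  # 13.12 -- a forgotten A100 overnight
print(est_cost("a40", 8))   # 3.12
```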
Pricing is approximate and varies with RunPod availability. Check `sky show-gpus`
for live rates before launching expensive GPUs.
Most genailab tasks just need the stock PyTorch image + the genailab package,
so `default_model: none`. Add optional profiles as your work expands:

- `scvi` — scvi-tools reference single-cell VAE
- `scgen` — perturbation response prediction baseline
- `cpa` — Compositional Perturbation Autoencoder
- `diffusers` — HuggingFace diffusion reference implementations

Invoke with `--model <name>`. Extend profiles by editing `models:` in
`configs/gpu_config.yaml`.
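A `models:` block might be declared roughly like this. The `models:` key is the one named in the text; the nested structure and the exact package lists are assumptions for illustration:

```yaml
# configs/gpu_config.yaml, models section (structure illustrative)
models:
  scvi:
    pip: [scvi-tools]
  diffusers:
    pip: [diffusers, transformers, accelerate]
```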
- `provision_cluster.py` reads `configs/gpu_config.yaml` + CLI overrides
- Generates a SkyPilot YAML into `configs/skypilot/generated/` (git-ignored)
- Invokes `sky launch` — provisions the pod, runs setup (`pip install -e .`, optional model deps), mounts the volume
- Prints SSH instructions + a reminder to tear down
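The generated file is a SkyPilot task spec. The sketch below follows SkyPilot's documented YAML schema (`resources`, `workdir`, `setup`, `run`), but the exact contents emitted for this project are an assumption:

```yaml
# configs/skypilot/generated/genai-workspace.yaml (illustrative sketch)
resources:
  cloud: runpod
  accelerators: A40:1
workdir: .          # synced to /workspace/genai-lab on the pod
setup: |
  pip install -e .
run: |
  echo "pod ready; ssh in and work interactively"
```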
The underlying Python API lives in `gpu_runner.py`:
`GPU_SPECS`, `InfraConfig`, `build_skypilot_config`, `launch`, `stage_data`.
Import these directly if you want to script a pipeline (see docstring).
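A pipeline script following that API might look like the sketch below. `InfraConfig`'s real fields and `build_skypilot_config`'s real signature live in `gpu_runner.py`; the stand-ins here only keep the sketch self-contained and runnable:

```python
from dataclasses import dataclass

# Stand-in for gpu_runner.InfraConfig; the real fields live in ops/gpu_runner.py.
@dataclass
class InfraConfig:
    gpu: str = "a40"
    data_path: str = "scrna/perturb_seq/norman_2019"
    default_model: str = "none"
    use_volume: bool = True

def build_skypilot_config(cfg: InfraConfig) -> dict:
    """Stand-in for gpu_runner.build_skypilot_config: emit a SkyPilot task dict."""
    setup = "pip install -e ."
    if cfg.default_model != "none":
        setup += f" && pip install {cfg.default_model}"  # model profile (simplified)
    return {
        "resources": {"cloud": "runpod", "accelerators": cfg.gpu.upper()},
        "workdir": ".",
        "setup": setup,
    }

task = build_skypilot_config(InfraConfig(gpu="a100", default_model="scvi"))
print(task["resources"]["accelerators"])  # A100
```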
This repo historically had a manual workflow under `runpods/`:

- User provisions a pod via the RunPod web dashboard
- `runpod_ssh_manager.sh` adds an SSH host entry
- `rsync` data manually
That workflow still works, but `ops/` is the recommended approach going forward
because it is:
- Programmatic — no web-dashboard clicking, no hand-written configs
- Repeatable — same provision every time, same data staging every time
- Volume-aware — stages datasets once, reuses them across runs
- Model-aware — optional dependency profiles for reference methods
Keep `runpods/` scripts around for one-off or legacy needs; use `ops/` for
everything new.
Adapted from agentic-spliceai's foundation_models/ GPU runner. The pattern (SkyPilot + network volume + YAML-driven infra config) carries across bio-AI projects. If you end up maintaining the same provisioner across three or more projects, consider extracting it to a standalone repo.