Artifact release for the paper "Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training".
This repository contains everything needed to reproduce the main end-to-end results in the paper:
- Section 6.2.1 — Max-Throughput Comparison (Megatron-LM vs. Megatron+Perseus vs. Nanobatching+Perseus vs. Kareus).
- Section 6.2.2 — Time/Energy Frontier Improvement (Megatron+Perseus vs. Nanobatching+Perseus vs. Kareus, all evaluated as 10-point frontiers).
.
├── 3rdparty/ # Vendored dependencies (see 3rdparty/README.md)
│ ├── Megatron-LM/ # Megatron-Core v0.12.1 with Perseus/Kareus support
│ ├── NeMo/ # NeMo v2.3.1 with Perseus/Kareus support
│ ├── TransformerEngine/ # TransformerEngine v2.4.0 with Perseus/Kareus support
│ ├── mscclpp/ # MSCCL++ v0.7.0 with Perseus/Kareus support
│ └── zeus/ # Zeus v0.15.1
│
├── kareus/ # Kareus runtime (see kareus/README.md)
│ ├── megatron/core/partitions/ # Core partition-overlap execution engine
│ ├── scheduler/ # Per-microbatch communication scheduler
│ ├── nemo/ # Kareus adaptations in the NeMo training-loop layer
│ ├── megatron/ # Kareus adaptations in the Megatron-Core transformer layer
│ ├── transformer_engine/ # Kareus adaptations in the TransformerEngine attention/linear layer
│ ├── apex/ # Kareus adaptations in the Apex transformer-utility layer
│ ├── flash_attn/ # Kareus adaptations in the FlashAttention kernel layer
│ └── msccl/ # Kareus adaptations in the MSCCL++ collective-comm layer
│
├── tests/
│ ├── artifact/ # 16-GPU (2x8 A100) reproduction scripts — main entry point
│ ├── toy/ # 4-GPU (1x4 A40) sanity scripts
│ ├── bayesian/ # Per-partition Bayesian-optimization profilers
│ ├── kareus/ # Utils for Kareus (Kareus-adapted Perseus optimizer)
│ ├── perseus/ # Utils for Megatron+Perseus
│ ├── nanobatch_perseus/ # Utils for Nanobatching+Perseus
│ └── data/ # Dataset preparation
│
├── Dockerfile # Docker image for the entire stack (PyTorch 25.06)
└── videos/ # Screen recordings of each reproduction step
A Kareus end-to-end run is three phases:
- Bayesian optimization of each partition produces a per-partition time–energy frontier. Driver: tests/artifact/kareus_run_bayesian.sh (per-partition profilers in tests/bayesian/partitions/ and tests/bayesian/nonpartition/). Outputs land in tests/bayesian/logs/<model>/cp${CP}-tp${TP}-bs${BS}-seq${SEQ}/.
- Compose the per-partition frontiers and solve for an optimized iteration-level execution schedule. Driver: generate_profile_csv.py + run_optimization.py in tests/kareus/, invoked by tests/artifact/run_kareus.sh. Outputs are pairs of freqs_pipeline_*.py (per-microbatch GPU frequencies) and scheds_pipeline_*.py (per-microbatch communication schedules).
- Execute either (a) the maximum-throughput plan, or (b) all 10 frontier plans, on the real cluster. Driver: tests/artifact/run_kareus.sh (set FRONTIER=true for the frontier sweep).
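Because phases 2 and 3 consume the phase-1 output, it can help to confirm that the BO log directory exists before launching them. The following is a minimal sketch of the path template above, not part of the artifact: the function name `bo_log_dir` and the `MODEL_DIR` variable are hypothetical, and the example values CP=1 and BS=8 are assumptions.

```shell
# Sketch: build the phase-1 (BO) log directory for one configuration.
# The <model> directory name is not fixed here; pass whatever name the
# profilers created under tests/bayesian/logs/ (hypothetical argument).
bo_log_dir() {
    # args: model cp tp bs seq
    printf 'tests/bayesian/logs/%s/cp%s-tp%s-bs%s-seq%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example values: TP=8 and SEQ=4096; CP=1 and BS=8 are assumptions.
# MODEL_DIR is a hypothetical variable; MODEL is only a placeholder default.
dir="$(bo_log_dir "${MODEL_DIR:-MODEL}" 1 8 8 4096)"
if [ -d "$dir" ]; then
    echo "BO results found in $dir"
else
    echo "no BO results in $dir yet; run phase 1 first"
fi
```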
Note
Hardware: experiments in the paper were generated on 2× AWS p4d.24xlarge instances (2x8 NVIDIA A100 SXM4 40GB); all defaults below assume 2x8 A100 GPUs.
For other GPU types or parallelism dimensions, see common GPU and configuration variables.
For example, tests/toy/ shows a small sanity check on 4× NVIDIA A40 GPUs with PP=2, TP=2 (see tests/toy/README.md).
Clone the repository on each node:
git clone -b osdi26ae https://github.com/ml-energy/kareus.git
cd kareus

Build the image (or pull the prebuilt one):
# Option A: pull
docker pull ruofanwu7/kareus-artifact:latest
# Option B: build locally (from the repo root, BuildKit required)
DOCKER_BUILDKIT=1 docker build -t kareus-artifact:latest .

Launch the container on each node, mounting the repository at /workspaces/Kareus:
docker run -it \
--gpus=all --ipc=host --network=host --privileged \
--name=kareus-artifact \
-v /dev/shm:/dev/shm \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-v $(pwd):/workspaces/Kareus \
ruofanwu7/kareus-artifact:latest

Configure the cluster by editing tests/artifact/env.sh once on each node:
| Variable | Purpose |
|---|---|
| MASTER_ADDR | IP/hostname of node 0 (used by torchrun, and as scp target for node 1) |
| REMOTE_USER | SSH user on node 0 (node 1 uses this to scp results back) |
| REMOTE_BASE_DIR | Path to tests/artifact/ on node 0 |
| SSH_KEY_PATH | (Optional) SSH key node 1 uses to reach node 0 |

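
A quick preflight check can catch an unedited env.sh before a multi-hour run. The sketch below is not part of the artifact: `check_cluster_env` is a hypothetical helper, and it assumes that sourcing env.sh makes the four variables visible as ordinary shell variables.

```shell
# Sketch: report which cluster variables are still unset.
# Assumption: tests/artifact/env.sh has been sourced into this shell.
check_cluster_env() {
    local status=0 name val
    for name in MASTER_ADDR REMOTE_USER REMOTE_BASE_DIR; do
        # Portable indirect lookup of the variable named by $name.
        eval "val=\${$name:-}"
        if [ -z "$val" ]; then
            echo "missing: $name" >&2
            status=1
        fi
    done
    # SSH_KEY_PATH is optional, but if set it should be a readable file.
    if [ -n "${SSH_KEY_PATH:-}" ] && [ ! -r "$SSH_KEY_PATH" ]; then
        echo "unreadable SSH key: $SSH_KEY_PATH" >&2
        status=1
    fi
    return "$status"
}

# Usage (from the repo root, after editing env.sh):
#   . tests/artifact/env.sh && check_cluster_env
```

The function only checks that values are present; it does not test that node 1 can actually reach node 0 over SSH.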
Export your Hugging Face token (required by the tokenizer downloads):
export HF_TOKEN=<your_huggingface_token>

Prepare the dataset. Kareus evaluates 10 model configurations; we
recommend reviewers start with CONFIG_MODE=single (Llama 3.2 3B, TP=8,
MBS=8, SEQ=4096), which is the first row of the configuration table in evaluation.
The same CONFIG_MODE threads through all run scripts.
# On each node (one-time; takes a few minutes)
CONFIG_MODE=single bash tests/data/prepare_data.sh

This produces tests/data/llama_dataset_text_document.{bin,idx} plus the
per-config dataset-index files under tests/data/.
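
To confirm that preparation finished on a node, one option is to check that both dataset files exist and are non-empty. This is a minimal sketch with a hypothetical helper name (`check_dataset`); it does not check the per-config dataset-index files, whose names are not listed here.

```shell
# Sketch: verify the two dataset files written by prepare_data.sh.
check_dataset() {
    local dir="${1:-tests/data}" ext status=0
    for ext in bin idx; do
        # -s is true only for files that exist and are non-empty.
        if [ ! -s "$dir/llama_dataset_text_document.$ext" ]; then
            echo "missing or empty: $dir/llama_dataset_text_document.$ext" >&2
            status=1
        fi
    done
    return "$status"
}

if check_dataset tests/data; then
    echo "dataset ready"
else
    echo "run tests/data/prepare_data.sh first"
fi
```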
Each method below is run on both nodes. 0 and 1 are node ranks;
the script uses MASTER_ADDR from env.sh for torchrun rendezvous.
(1) Run Bayesian optimization (one-time per config; only needs 8 GPUs, so it runs on a single node):
CONFIG_MODE=single bash tests/artifact/kareus_run_bayesian.sh

This takes about 4 hours on a single 8× A100 node, or about 2 hours if you distribute the 4 partition groups across both nodes (run different partitions on each node; see tests/bayesian/README.md).
Walkthrough video: video
(2) Run the maximum-throughput execution plan (uses BO results from step 1; runs on both nodes):
# On node 0
CONFIG_MODE=single bash tests/artifact/run_kareus.sh 0
# On node 1
CONFIG_MODE=single bash tests/artifact/run_kareus.sh 1

Outputs land in:
tests/artifact/nemo_experiments/megatron_llama3.2_3b/cp1_tp8_mbs8_seq4096/kareus/
Walkthrough video: video
(1) Megatron-LM (no frequency control):
CONFIG_MODE=single bash tests/artifact/run_megatron.sh 0 # node 0
CONFIG_MODE=single bash tests/artifact/run_megatron.sh 1 # node 1

Walkthrough video: video
(2) Megatron + Perseus:
CONFIG_MODE=single bash tests/artifact/run_perseus.sh 0 # node 0
CONFIG_MODE=single bash tests/artifact/run_perseus.sh 1 # node 1

The first run does a per-config GPU-frequency sweep (~2 hours). We also
provide pre-profiled and precomputed results for Megatron+Perseus and
Nanobatching+Perseus (under tests/perseus/schedules/ and
tests/nanobatch_perseus/schedules/) so you can skip the profiling phases:
SKIP_PROFILING=true CONFIG_MODE=single bash tests/artifact/run_perseus.sh 0
SKIP_PROFILING=true CONFIG_MODE=single bash tests/artifact/run_perseus.sh 1

Walkthrough video: video
(3) Nanobatching + Perseus:
SKIP_PROFILING=true CONFIG_MODE=single bash tests/artifact/run_nanobatch_perseus.sh 0
SKIP_PROFILING=true CONFIG_MODE=single bash tests/artifact/run_nanobatch_perseus.sh 1

Walkthrough video: video
After all four methods have been run, render the §6.2.1 table:
python tests/artifact/compare_method.py

This writes tests/artifact/compare_methods_table.png (and prints a text
table to stdout) showing per-config iteration-time and energy reductions of
Megatron+Perseus, Nanobatching+Perseus, and Kareus relative to the
Megatron-LM baseline.
Walkthrough video: video
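
Since the table needs results from all four methods, a small check before running the comparison can show which runs are still missing. This sketch is not part of the artifact: `missing_methods` is a hypothetical helper, and it assumes each method wrote into a subdirectory named megatron, perseus, nanobatch_perseus, or kareus under the configuration's output directory.

```shell
# Sketch: list the methods that have no output directory yet.
missing_methods() {
    local base="$1" m out=""
    for m in megatron perseus nanobatch_perseus kareus; do
        if [ ! -d "$base/$m" ]; then
            out="$out $m"
        fi
    done
    # Strip the leading space added by the accumulation above.
    echo "${out# }"
}

base=tests/artifact/nemo_experiments/megatron_llama3.2_3b/cp1_tp8_mbs8_seq4096
missing="$(missing_methods "$base")"
if [ -z "$missing" ]; then
    echo "all four methods have outputs"
else
    echo "still missing: $missing"
fi
```

A directory that exists but holds an incomplete run would still pass this check, so treat it as a first filter only.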
For the frontier comparison we run each method as 10 plans spanning its
time–energy frontier. All commands below take the same node-rank argument
(0 or 1) as in §6.2.1.
If you have already run BO in §6.2.1, just flip FRONTIER=true:
# On node 0
FRONTIER=true CONFIG_MODE=single bash tests/artifact/run_kareus.sh 0
# On node 1
FRONTIER=true CONFIG_MODE=single bash tests/artifact/run_kareus.sh 1

Frontier outputs land in:
tests/artifact/nemo_experiments/megatron_llama3.2_3b/cp1_tp8_mbs8_seq4096/kareus/frontier/pipeline_*/
Walkthrough video: video
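
A frontier sweep should leave one pipeline_* directory per plan, so counting them is a cheap way to spot an interrupted sweep. The helper name `count_frontier_plans` below is hypothetical, and the sketch only counts directories; it does not inspect their contents.

```shell
# Sketch: count the pipeline_* plan directories under a frontier directory.
count_frontier_plans() {
    local dir="$1" n=0 d
    for d in "$dir"/pipeline_*/; do
        # If the glob matches nothing, $d is the literal pattern and is skipped.
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

frontier=tests/artifact/nemo_experiments/megatron_llama3.2_3b/cp1_tp8_mbs8_seq4096/kareus/frontier
echo "found $(count_frontier_plans "$frontier") of 10 frontier plans"
```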
(1) Megatron + Perseus:
FRONTIER=true SKIP_PROFILING=true CONFIG_MODE=single bash tests/artifact/run_perseus.sh 0
FRONTIER=true SKIP_PROFILING=true CONFIG_MODE=single bash tests/artifact/run_perseus.sh 1

Walkthrough video: video
(2) Nanobatching + Perseus:
FRONTIER=true SKIP_PROFILING=true CONFIG_MODE=single bash tests/artifact/run_nanobatch_perseus.sh 0
FRONTIER=true SKIP_PROFILING=true CONFIG_MODE=single bash tests/artifact/run_nanobatch_perseus.sh 1

Walkthrough video: video
python tests/artifact/compare_method.py --frontier

This writes tests/artifact/compare_methods_frontier_table.png showing, for
each configuration, the iso-time energy reduction and iso-energy time
reduction of Nanobatching+Perseus and Kareus relative to the Perseus
frontier (10 points each).
Walkthrough video: video
Per-method run outputs (Zeus monitor files, NeMo logs, lowtime solver artefacts) are collected under:
tests/artifact/nemo_experiments/<model>/<config_tag>/<method>/
with <config_tag> = cp${CP}_tp${TP}_mbs${MBS}_seq${SEQ} and <method> ∈
{megatron, nanobatch, perseus, nanobatch_perseus, kareus}. Frontier runs
add a frontier/pipeline_*/ subdirectory per plan. Per-run training stdout
is duplicated to tests/artifact/logs/<model>_<config_tag>_<method>.log.
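
The layout above can be expressed as three small path builders, which is convenient when scripting over several configurations. This is an illustrative sketch: the function names `config_tag`, `run_dir`, and `run_log` are hypothetical and do not come from the artifact scripts.

```shell
# Sketch: compose output paths from the documented layout.
config_tag() {
    # args: cp tp mbs seq
    printf 'cp%s_tp%s_mbs%s_seq%s\n' "$1" "$2" "$3" "$4"
}

run_dir() {
    # args: model config_tag method
    printf 'tests/artifact/nemo_experiments/%s/%s/%s\n' "$1" "$2" "$3"
}

run_log() {
    # args: model config_tag method
    printf 'tests/artifact/logs/%s_%s_%s.log\n' "$1" "$2" "$3"
}

tag="$(config_tag 1 8 8 4096)"
run_dir megatron_llama3.2_3b "$tag" kareus
run_log megatron_llama3.2_3b "$tag" kareus
```

For the CONFIG_MODE=single configuration, the first line printed matches the Kareus output directory given earlier.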
For the full configuration table (10 configs total), the complete list of
environment variables, and the SKIP_PROFILING schedule-directory layout per
method, see tests/artifact/README.md.