[2026.4.16] Thanks to a community contribution from WayneDW, DeepSpeed is now supported via this PR.
[2026.1.11] Code is released.
[2025.12.24] Paper is released on arXiv.
- dUltra: Ultra-Fast Diffusion Large Language Models via Reinforcement Learning
This repository implements dUltra, a learned path planning framework for masked diffusion language models using Group Relative Policy Optimization (GRPO). By training an unmasking planner head with reinforcement learning, we enable diffusion language models to achieve state-of-the-art efficiency in both NFE (number of function evaluations) and TPF (tokens per forward) while maintaining high accuracy.
Figure: dUltra achieves state-of-the-art accuracy-efficiency trade-offs on mathematical reasoning (GSM8K, MATH500) and code generation (HumanEval, MBPP) tasks. Left panels show accuracy vs. number of function evaluations (NFE) for different block sizes. Right panel illustrates the architecture with the unmasking planner head. Here we use a block size of 128 and a generation length of 256.
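To make the two efficiency metrics concrete (the numbers below are purely illustrative, not measured results):

```python
# NFE = number of forward passes through the model during decoding.
# TPF = tokens committed per forward pass = generation length / NFE.
gen_len = 256   # generation length used in the figure
nfe = 32        # illustrative NFE, not a measured result
tpf = gen_len / nfe
print(tpf)  # 8.0
```

Lower NFE and higher TPF both mean fewer forward passes per generated token.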
```bash
pip install uv
uv sync
uv pip install -e .
```

You may need to downgrade to `datasets==3.6.0` to download the APPS dataset.
We provide trained model checkpoints on Hugging Face Hub:
| Model | Block Length | Training Dataset | Description |
|---|---|---|---|
| sengi/dUltra-math | 32 | GSM8K | Optimized for math reasoning tasks |
| sengi/dUltra-math-b128 | 128 | GSM8K | Math model with larger block length for faster inference |
| sengi/dUltra-coding | 32 | APPS | Optimized for code generation tasks |
| sengi/dUltra-coding-b128 | 128 | APPS | Coding model with larger block length for faster inference |
To use a trained dUltra model:
```python
import torch
from transformers import AutoTokenizer

from model.llada.lladou import LLaDOUModelLM

model = LLaDOUModelLM.from_pretrained(
    "sengi/dUltra-math",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("sengi/dUltra-math")
```

All experiments were conducted on NVIDIA H200 GPUs with DDP. DeepSpeed is also supported via this PR.
We initialize the planner head by training it to mimic confidence-based sampling (Fast-dLLM method). This provides a strong baseline and avoids degenerate solutions. The planner head learns when to unmask tokens in parallel during denoising.
```bash
cd planner
accelerate launch planner_train.py
```

This pre-trains the planner to mimic confidence-based sampling before GRPO fine-tuning.
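For intuition, Fast-dLLM-style confidence-based unmasking can be sketched as below. This is a minimal illustration, not the repository's actual code; `confidence_unmask_targets` and its threshold are hypothetical names.

```python
import torch

def confidence_unmask_targets(logits, masked, threshold=0.9):
    """Unmask every masked position whose top-1 probability exceeds
    `threshold`; if none qualifies, unmask the single most confident
    masked position so decoding always makes progress."""
    conf = torch.softmax(logits, dim=-1).max(dim=-1).values  # (batch, seq)
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    targets = (conf > threshold) & masked
    # Fallback: rows where nothing crossed the threshold unmask their best position.
    needs_fallback = ~targets.any(dim=-1)
    best = conf.argmax(dim=-1)
    targets[needs_fallback, best[needs_fallback]] = True
    return targets
```

One plausible initialization then trains the planner head so its unmasking probabilities match such binary targets, which gives GRPO a sensible starting policy.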
After initialization, we jointly optimize the base model and planner head using reinforcement learning to discover task-specific unmasking strategies that improve both accuracy and efficiency. The GRPO trainer (diffu_grpo/diffu_grpo_trainer.py) extends TRL's GRPOTrainer to jointly optimize the diffusion model and unmasking planner head.
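For reference, the group-relative advantage at the heart of GRPO can be sketched as follows. This is a generic sketch of standard GRPO normalization, not the trainer's exact code, and it omits the `adv_clip` thresholding.

```python
import torch

def group_relative_advantages(rewards, eps=1e-4):
    """rewards: (num_prompts, num_generations). Each rollout's advantage is
    its reward standardized against its own prompt group, so no learned
    value function is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Four rollouts of one prompt with pass/fail rewards:
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
adv = group_relative_advantages(rewards)
# Passing rollouts get positive advantage, failing ones negative.
```

Because advantages are computed within each prompt's group of `num_generations` rollouts, the batch layout above mirrors the `num_generations` parameter described below.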
To sweep through configurations:

```bash
cd diffu_grpo
sh sbatch_scripts/run.sh
```

To run a specific configuration:

```bash
sh sbatch_scripts/grpo_exp.sh [dataset] [block_length] [adv_clip] [port] [model_path] [num_generations] [per_device_train_batch_size] [num_processes] [temperature]
```

Parameters:

- `dataset` (default: `"apps"`): Dataset name (`gsm8k`, `apps`)
- `block_length` (default: `32`): Maximum tokens unmasked per denoising step
- `adv_clip` (default: `0.1`): Minimum advantage clipping threshold
- `port` (default: `12345`): Main process port for distributed training
- `model_path` (default: `"sengi/dUltra-coding"`): Path to a model checkpoint or Hugging Face model ID
- `num_generations` (default: `12`): Number of rollout trajectories per prompt
- `per_device_train_batch_size` (default: `12`): Training batch size per GPU
- `num_processes` (default: `1`): Number of GPUs/processes for distributed training
- `temperature` (default: `0.1`): Sampling temperature for token generation
Located in `model/inference/`:

- Standard (`inference_lladou.py`): Iterative denoising with learned unmasking probabilities
- FastDLLM (`inference_fastdllm.py`): Confidence-based parallel decoding for faster inference
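To make the standard inference path concrete, here is a toy planner-guided denoising loop. Everything in it (`toy_model`, `toy_planner`, the vocabulary, the top-k unmasking rule) is an illustrative stand-in; the real implementation in `inference_lladou.py` uses the trained LLaDOU model and planner head.

```python
import torch

torch.manual_seed(0)

VOCAB = 16
MASK_ID = VOCAB  # mask token sits outside the real vocabulary

def toy_model(x):
    # Stand-in for the diffusion LM: random per-position token logits.
    return torch.randn(x.shape[0], x.shape[1], VOCAB)

def toy_planner(x):
    # Stand-in for the planner head: per-position unmasking scores.
    return torch.randn(x.shape[0], x.shape[1])

def denoise(x, steps=8, k=4):
    """Each step: the planner scores masked positions, and the k
    highest-scoring ones are filled with the model's argmax tokens."""
    for _ in range(steps):
        masked = x == MASK_ID
        if not masked.any():
            break
        scores = toy_planner(x).masked_fill(~masked, float("-inf"))
        k_eff = min(k, int(masked.sum().item()))
        idx = scores.topk(k_eff, dim=-1).indices   # positions to unmask
        tokens = toy_model(x).argmax(dim=-1)       # values in [0, VOCAB)
        x.scatter_(1, idx, tokens.gather(1, idx))
    return x

x = denoise(torch.full((1, 16), MASK_ID))  # all positions unmasked in 4 steps
```

The learned planner replaces the fixed confidence heuristic here, which is what lets RL discover task-specific unmasking orders.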
Instead of lm_eval, we use the more robust mathematical expression evaluator math-verify.
```bash
cd eval_math
sh run_eval.sh
```

After running `run_eval.sh`, parse the results:

```bash
cd eval_math
python parse_and_get_acc.py eval_dir/
```

For coding tasks, we evaluate on the HumanEval and MBPP datasets using evaluation code adapted from the Dream-Instruct Evaluation Toolkit.
```bash
cd eval_coding
bash eval_coding.sh
```

If `uv sync` fails while building flash-attn from source, it usually means the build toolchain is missing distutils or your Torch/CUDA combination has no prebuilt wheel.
Common fixes:

```bash
# Prefer setuptools' bundled distutils when the stdlib module is missing.
SETUPTOOLS_USE_DISTUTILS=local uv sync

# Or install the system distutils package (Debian/Ubuntu example).
sudo apt-get install -y python3-distutils
```

If you want to avoid compiling from source, pin Torch to a version that has a matching flash-attn wheel (e.g., Torch 2.8.x) before re-running `uv sync`.
Issue: Error when loading the APPS dataset.

Solution: Downgrade to `datasets==3.6.0`:

```bash
uv pip install datasets==3.6.0
```

If you find this work useful, please cite:
```bibtex
@misc{chen2025dultraultrafastdiffusionlanguage,
      title={dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning},
      author={Shirui Chen and Jiantao Jiao and Lillian J. Ratliff and Banghua Zhu},
      year={2025},
      eprint={2512.21446},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.21446},
}
```

This project builds upon and acknowledges the following excellent works:
