
dUltra: Ultra-Fast Diffusion Large Language Models via Reinforcement Learning


News

[2026.4.16] Thanks to a community contribution from WayneDW, DeepSpeed is now supported with this PR.

[2026.1.11] Code is released.

[2025.12.24] Paper is released on arXiv.


Overview

This repository implements dUltra, a learned path-planning framework for masked diffusion language models trained with Group Relative Policy Optimization (GRPO). By training an unmasking planner head with reinforcement learning, we enable diffusion language models to achieve state-of-the-art efficiency in both NFE (number of function evaluations) and TPF (tokens per forward) while maintaining high accuracy.
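To make the two efficiency metrics concrete, here is a toy illustration (ours, not from the repo) of how NFE and TPF relate to a decoding trace, where each entry records how many tokens a single forward pass unmasked:

```python
# Toy illustration of the two efficiency metrics: NFE counts model forward
# passes, TPF is the average number of tokens decoded per forward pass.

def nfe_and_tpf(tokens_unmasked_per_step):
    """tokens_unmasked_per_step: one entry per denoising step, giving how
    many tokens were unmasked in that forward pass."""
    nfe = len(tokens_unmasked_per_step)        # one forward pass per step
    total_tokens = sum(tokens_unmasked_per_step)
    tpf = total_tokens / nfe                   # average tokens per forward
    return nfe, tpf

# A planner that unmasks many tokens per step lowers NFE and raises TPF:
print(nfe_and_tpf([1] * 256))   # sequential decoding: NFE=256, TPF=1.0
print(nfe_and_tpf([32] * 8))    # parallel decoding:   NFE=8,   TPF=32.0
```

Both runs emit 256 tokens; the parallel one reaches the same output in 32x fewer forward passes, which is the trade-off the planner is trained to exploit.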

Results

Performance Comparison

Figure: dUltra achieves state-of-the-art accuracy-efficiency trade-offs on mathematical reasoning (GSM8K, MATH500) and code generation (HumanEval, MBPP) tasks. Left panels show accuracy vs. number of function evaluations (NFE) for different block sizes. Right panel illustrates the architecture with the unmasking planner head. Here we use a block size of 128 and a generation length of 256.

Environment Setup

pip install uv
uv sync
uv pip install -e .

You may need to downgrade to datasets==3.6.0 to download the APPS dataset.

Model Checkpoints

We provide trained model checkpoints on Hugging Face Hub:

| Model | Block Length | Training Dataset | Description |
| --- | --- | --- | --- |
| sengi/dUltra-math | 32 | GSM8K | Optimized for math reasoning tasks |
| sengi/dUltra-math-b128 | 128 | GSM8K | Math model with larger block length for faster inference |
| sengi/dUltra-coding | 32 | APPS | Optimized for code generation tasks |
| sengi/dUltra-coding-b128 | 128 | APPS | Coding model with larger block length for faster inference |

To use a trained dUltra model:

import torch
from transformers import AutoTokenizer
from model.llada.lladou import LLaDOUModelLM

model = LLaDOUModelLM.from_pretrained(
    "sengi/dUltra-math",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("sengi/dUltra-math")

Hardware Requirements

All experiments were conducted on NVIDIA H200 GPUs with DDP. DeepSpeed is also supported via the community PR linked in the News section.

Training

Planner Training (planner/)

We initialize the planner head by training it to mimic confidence-based sampling (Fast-dLLM method). This provides a strong baseline and avoids degenerate solutions. The planner head learns when to unmask tokens in parallel during denoising.

cd planner
accelerate launch planner_train.py

This pre-trains the planner to mimic confidence-based sampling before GRPO fine-tuning.
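The imitation target can be pictured with a small sketch (our simplification, not the repo's code): confidence-based sampling unmasks the k masked positions where the model's top token probability is highest, and the planner head is pre-trained to reproduce this selection.

```python
# Toy sketch of a confidence-based unmasking target (Fast-dLLM-style):
# among still-masked positions, pick the k with the highest confidence,
# taken here as the model's max token probability at that position.

def confidence_based_targets(probs, masked, k):
    """probs: per-position max token probabilities; masked: bools marking
    still-masked positions; returns indices of the k most confident ones."""
    candidates = [i for i, m in enumerate(masked) if m]
    return sorted(candidates, key=lambda i: probs[i], reverse=True)[:k]

probs = [0.9, 0.2, 0.95, 0.5, 0.8]
masked = [True, True, False, True, True]   # position 2 is already unmasked
print(confidence_based_targets(probs, masked, k=2))  # → [0, 4]
```

Position 2 has the highest confidence but is excluded because it is already unmasked; the two most confident masked positions (0 and 4) are selected.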

GRPO Training (diffu_grpo/)

After initialization, we jointly optimize the base model and planner head using reinforcement learning to discover task-specific unmasking strategies that improve both accuracy and efficiency. The GRPO trainer (diffu_grpo/diffu_grpo_trainer.py) extends TRL's GRPOTrainer to jointly optimize the diffusion model and unmasking planner head.
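The group-relative part of GRPO can be sketched as follows. This is a minimal illustration under our own reading of the method, not the trainer's actual code: each prompt gets a group of rollouts, each rollout's advantage is its reward standardized against the group, and the `adv_clip` parameter (described below as a minimum advantage clipping threshold; our interpretation) zeroes out advantages too small to carry signal.

```python
import statistics

# Hedged sketch of group-relative advantages: standardize each rollout's
# reward against its own group's mean and standard deviation.

def group_relative_advantages(rewards, adv_clip=0.1):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    advs = [(r - mean) / std for r in rewards]
    # zero out advantages whose magnitude falls below the clip threshold
    return [a if abs(a) >= adv_clip else 0.0 for a in advs]

# Four rollouts of one prompt, two correct (reward 1) and two wrong (0):
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed within each group, no learned value network is needed: correct rollouts are pushed up relative to their siblings, wrong ones pushed down.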

To sweep through configurations:

cd diffu_grpo

sh sbatch_scripts/run.sh

To run a specific configuration:

sh sbatch_scripts/grpo_exp.sh [dataset] [block_length] [adv_clip] [port] [model_path] [num_generations] [per_device_train_batch_size] [num_processes] [temperature]

Parameters:

  1. dataset (default: "apps"): Dataset name (gsm8k, apps)
  2. block_length (default: 32): Maximum tokens unmasked per denoising step
  3. adv_clip (default: 0.1): Minimum advantage clipping threshold
  4. port (default: 12345): Main process port for distributed training
  5. model_path (default: "sengi/dUltra-coding"): Path to model checkpoint or HuggingFace model ID
  6. num_generations (default: 12): Number of rollout trajectories per prompt
  7. per_device_train_batch_size (default: 12): Training batch size per GPU
  8. num_processes (default: 1): Number of GPUs/processes for distributed training
  9. temperature (default: 0.1): Sampling temperature for token generation

Evaluation

Inference Strategies

Located in model/inference/:

  • Standard (inference_lladou.py): Iterative denoising with learned unmasking probabilities
  • FastDLLM (inference_fastdllm.py): Confidence-based parallel decoding for faster inference

Math Evaluation

Instead of lm_eval, we use the more robust mathematical expression evaluator math-verify.

cd eval_math
sh run_eval.sh

After running run_eval.sh, we need to parse the results:

cd eval_math
python parse_and_get_acc.py eval_dir/
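To see why an expression-level verifier matters, here is a stdlib-only toy (not math-verify itself) showing that exact string matching rejects answers that are mathematically identical:

```python
from fractions import Fraction

# "0.5", "1/2", and "2/4" are the same answer in different surface forms;
# naive string comparison rejects them, numeric equivalence accepts them.

def answers_match(a, b):
    return Fraction(a) == Fraction(b)

print("0.5" == "1/2")                # string match fails → False
print(answers_match("0.5", "1/2"))   # numeric equivalence → True
print(answers_match("2/4", "1/2"))   # → True
```

math-verify generalizes this idea beyond rationals to symbolic expressions, sets, and intervals, which is why it scores model outputs more reliably than string matching.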

Coding Evaluation

For coding tasks, we evaluate on the HumanEval and MBPP datasets using evaluation code adapted from the Dream-Instruct Evaluation Toolkit.

cd eval_coding
bash eval_coding.sh

Troubleshooting

flash-attn build failures

If uv sync fails building flash-attn from source, it usually means the build toolchain is missing distutils or your Torch/CUDA combo has no prebuilt wheel.

Common fixes:

# Prefer setuptools' bundled distutils when the stdlib module is missing.
SETUPTOOLS_USE_DISTUTILS=local uv sync

# Or install the system distutils package (Debian/Ubuntu example).
sudo apt-get install -y python3-distutils

If you want to avoid compiling from source, pin Torch to a version that has a matching flash-attn wheel (e.g., Torch 2.8.x) before re-running uv sync.

Unable to download APPS dataset

Issue: Error when loading the APPS dataset.

Solution: Downgrade to datasets==3.6.0:

uv pip install datasets==3.6.0

Citation

If you find this work useful, please cite:

@misc{chen2025dultraultrafastdiffusionlanguage,
      title={dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning},
      author={Shirui Chen and Jiantao Jiao and Lillian J. Ratliff and Banghua Zhu},
      year={2025},
      eprint={2512.21446},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.21446},
}

Acknowledgments

This project builds upon and acknowledges the following excellent works:

  • Math Evaluation Code: The evaluation pipeline is adapted from d1
  • Coding Evaluation Code: The HumanEval and MBPP evaluation code is adapted from Dream
  • Model Architecture: The model architecture code is based on LLaDOU
