
dUltra: Ultra-Fast Diffusion Large Language Models via Reinforcement Learning


News

[2026.4.16] Thanks to a community contribution from WayneDW, DeepSpeed is now supported with this PR.

[2026.1.11] Code is released.

[2025.12.24] Paper is released on arXiv.


Overview

This repository implements dUltra, a learned path-planning framework for masked diffusion language models trained with Group Relative Policy Optimization (GRPO). By training an unmasking planner head with reinforcement learning, we enable diffusion language models to achieve state-of-the-art efficiency in both NFE (number of function evaluations) and TPF (tokens per forward) while maintaining high accuracy.
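To make the two efficiency metrics concrete, here is a toy illustration (ours, not from the repo) of how NFE and TPF relate to a decoding trace, where each entry records how many tokens a single forward pass unmasked:

```python
# Toy illustration of the two efficiency metrics: NFE counts model forward
# passes, TPF is the average number of tokens decoded per forward pass.

def nfe_and_tpf(tokens_unmasked_per_step):
    """tokens_unmasked_per_step: one entry per denoising step, giving how
    many tokens were unmasked in that forward pass."""
    nfe = len(tokens_unmasked_per_step)        # one forward pass per step
    total_tokens = sum(tokens_unmasked_per_step)
    tpf = total_tokens / nfe                   # average tokens per forward
    return nfe, tpf

# A planner that unmasks many tokens per step lowers NFE and raises TPF:
print(nfe_and_tpf([1] * 256))   # sequential decoding: NFE=256, TPF=1.0
print(nfe_and_tpf([32] * 8))    # parallel decoding:   NFE=8,   TPF=32.0
```

Both runs emit 256 tokens; the parallel one reaches the same output in 32x fewer forward passes, which is the trade-off the planner is trained to exploit.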

Results

Performance Comparison

Figure: dUltra achieves state-of-the-art accuracy-efficiency trade-offs on mathematical reasoning (GSM8K, MATH500) and code generation (HumanEval, MBPP) tasks. Left panels show accuracy vs. number of function evaluations (NFE) for different block sizes. Right panel illustrates the architecture with the unmasking planner head. Here we use a block size of 128 and a generation length of 256.

Environment Setup

pip install uv
uv sync
uv pip install -e .

You may need to downgrade to datasets==3.6.0 to download the APPS dataset.

Model Checkpoints

We provide trained model checkpoints on Hugging Face Hub:

| Model | Block Length | Training Dataset | Description |
| --- | --- | --- | --- |
| sengi/dUltra-math | 32 | GSM8K | Optimized for math reasoning tasks |
| sengi/dUltra-math-b128 | 128 | GSM8K | Math model with larger block length for faster inference |
| sengi/dUltra-coding | 32 | APPS | Optimized for code generation tasks |
| sengi/dUltra-coding-b128 | 128 | APPS | Coding model with larger block length for faster inference |

To use a trained dUltra model:

import torch
from transformers import AutoTokenizer
from model.llada.lladou import LLaDOUModelLM

model = LLaDOUModelLM.from_pretrained(
    "sengi/dUltra-math",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("sengi/dUltra-math")

Hardware Requirements

All experiments were conducted on NVIDIA H200 GPUs with DDP. DeepSpeed is also supported via the community PR linked in the News section.

Training

Planner Training (planner/)

We initialize the planner head by training it to mimic confidence-based sampling (Fast-dLLM method). This provides a strong baseline and avoids degenerate solutions. The planner head learns when to unmask tokens in parallel during denoising.

cd planner
accelerate launch planner_train.py

This pre-trains the planner to mimic confidence-based sampling before GRPO fine-tuning.
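The imitation target can be pictured with a small sketch (our simplification, not the repo's code): confidence-based sampling unmasks the k masked positions where the model's top token probability is highest, and the planner head is pre-trained to reproduce this selection.

```python
# Toy sketch of a confidence-based unmasking target (Fast-dLLM-style):
# among still-masked positions, pick the k with the highest confidence,
# taken here as the model's max token probability at that position.

def confidence_based_targets(probs, masked, k):
    """probs: per-position max token probabilities; masked: bools marking
    still-masked positions; returns indices of the k most confident ones."""
    candidates = [i for i, m in enumerate(masked) if m]
    return sorted(candidates, key=lambda i: probs[i], reverse=True)[:k]

probs = [0.9, 0.2, 0.95, 0.5, 0.8]
masked = [True, True, False, True, True]   # position 2 is already unmasked
print(confidence_based_targets(probs, masked, k=2))  # → [0, 4]
```

Position 2 has the highest confidence but is excluded because it is already unmasked; the two most confident masked positions (0 and 4) are selected.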

GRPO Training (diffu_grpo/)

After initialization, we jointly optimize the base model and planner head using reinforcement learning to discover task-specific unmasking strategies that improve both accuracy and efficiency. The GRPO trainer (diffu_grpo/diffu_grpo_trainer.py) extends TRL's GRPOTrainer to jointly optimize the diffusion model and unmasking planner head.
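The group-relative part of GRPO can be sketched as follows. This is a minimal illustration under our own reading of the method, not the trainer's actual code: each prompt gets a group of rollouts, each rollout's advantage is its reward standardized against the group, and the `adv_clip` parameter (described below as a minimum advantage clipping threshold; our interpretation) zeroes out advantages too small to carry signal.

```python
import statistics

# Hedged sketch of group-relative advantages: standardize each rollout's
# reward against its own group's mean and standard deviation.

def group_relative_advantages(rewards, adv_clip=0.1):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    advs = [(r - mean) / std for r in rewards]
    # zero out advantages whose magnitude falls below the clip threshold
    return [a if abs(a) >= adv_clip else 0.0 for a in advs]

# Four rollouts of one prompt, two correct (reward 1) and two wrong (0):
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed within each group, no learned value network is needed: correct rollouts are pushed up relative to their siblings, wrong ones pushed down.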

To sweep through configurations:

cd diffu_grpo

sh sbatch_scripts/run.sh

To run a specific configuration:

sh sbatch_scripts/grpo_exp.sh [dataset] [block_length] [adv_clip] [port] [model_path] [num_generations] [per_device_train_batch_size] [num_processes] [temperature]

Parameters:

  1. dataset (default: "apps"): Dataset name (gsm8k, apps)
  2. block_length (default: 32): Maximum tokens unmasked per denoising step
  3. adv_clip (default: 0.1): Minimum advantage clipping threshold
  4. port (default: 12345): Main process port for distributed training
  5. model_path (default: "sengi/dUltra-coding"): Path to model checkpoint or HuggingFace model ID
  6. num_generations (default: 12): Number of rollout trajectories per prompt
  7. per_device_train_batch_size (default: 12): Training batch size per GPU
  8. num_processes (default: 1): Number of GPUs/processes for distributed training
  9. temperature (default: 0.1): Sampling temperature for token generation

Evaluation

Inference Strategies

Located in model/inference/:

  • Standard (inference_lladou.py): Iterative denoising with learned unmasking probabilities
  • FastDLLM (inference_fastdllm.py): Confidence-based parallel decoding for faster inference

Math Evaluation

Instead of lm_eval, we use the more robust mathematical expression evaluator math-verify.

cd eval_math
sh run_eval.sh

After running run_eval.sh, we need to parse the results:

cd eval_math
python parse_and_get_acc.py eval_dir/
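To see why an expression-level verifier matters, here is a stdlib-only toy (not math-verify itself) showing that exact string matching rejects answers that are mathematically identical:

```python
from fractions import Fraction

# "0.5", "1/2", and "2/4" are the same answer in different surface forms;
# naive string comparison rejects them, numeric equivalence accepts them.

def answers_match(a, b):
    return Fraction(a) == Fraction(b)

print("0.5" == "1/2")                # string match fails → False
print(answers_match("0.5", "1/2"))   # numeric equivalence → True
print(answers_match("2/4", "1/2"))   # → True
```

math-verify generalizes this idea beyond rationals to symbolic expressions, sets, and intervals, which is why it scores model outputs more reliably than string matching.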

Coding Evaluation

For coding tasks, we evaluate on the HumanEval and MBPP datasets using evaluation code adapted from the Dream-Instruct Evaluation Toolkit.

cd eval_coding
bash eval_coding.sh

Troubleshooting

flash-attn build failures

If uv sync fails building flash-attn from source, it usually means the build toolchain is missing distutils or your Torch/CUDA combo has no prebuilt wheel.

Common fixes:

# Prefer setuptools' bundled distutils when the stdlib module is missing.
SETUPTOOLS_USE_DISTUTILS=local uv sync

# Or install the system distutils package (Debian/Ubuntu example).
sudo apt-get install -y python3-distutils

If you want to avoid compiling from source, pin Torch to a version that has a matching flash-attn wheel (e.g., Torch 2.8.x) before re-running uv sync.

Unable to download APPS dataset

Issue: Error when loading the APPS dataset.

Solution: Downgrade to datasets==3.6.0:

uv pip install datasets==3.6.0

Citation

If you find this work useful, please cite:

@misc{chen2025dultraultrafastdiffusionlanguage,
      title={dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning},
      author={Shirui Chen and Jiantao Jiao and Lillian J. Ratliff and Banghua Zhu},
      year={2025},
      eprint={2512.21446},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.21446},
}

Acknowledgments

This project builds upon and acknowledges the following excellent works:

  • Math Evaluation Code: The evaluation pipeline is adapted from d1
  • Coding Evaluation Code: The HumanEval and MBPP evaluation code is adapted from Dream
  • Model Architecture: The model architecture code is based on LLaDOU
