Diffusion Transformers (DiT) + Rectified Flow

This directory documents Diffusion Transformers (DiT) combined with rectified flow: a Transformer architecture paired with a simple velocity-regression objective for scalable, flexible generative modeling.

DiT represents the shift from convolutional U-Nets to Transformers, enabling better scaling, flexible conditioning, and modality-agnostic generation.

Core Documentation Series

This series follows the same structure as the DDPM, SDE, and flow matching documentation.

Document	Description
00_dit_overview.md	Overview: Why DiT matters, key concepts, modern stack
01_dit_foundations.md	Foundations: Architecture details, components, design choices
02_dit_training.md	Training: How to train DiT + rectified flow models
03_dit_sampling.md	Sampling: How to generate samples efficiently

Supplementary Documents

Deep dives on specific topics (located in docs/diffusion/DiT/):

Document	Description
diffusion_transformer.md	Comprehensive tutorial with biology applications
time_embeddings_explained.md	Deep dive on time conditioning mechanisms

Quick Navigation

For Beginners

Start with the overview to understand the big picture, then move through foundations and training.

Path: Overview → Foundations → Training

For Implementation

Focus on the practical training and sampling guides with code examples.

Path: Training → Sampling → Supplementary docs

For Theory Deep Dive

Understand the architectural choices and mathematical foundations.

Path: Foundations → Supplementary docs → Flow matching theory

Key Concepts

The Modern Generative Stack

Rectified Flow (objective) + DiT (architecture) + AdaLN (conditioning)

Rectified Flow: Simple regression target $$ \mathcal{L} = \mathbb{E}_{x_0, x_1, t} \left[ \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \right] $$

DiT: Transformer-based architecture

Tokenization: Input → patches → tokens
Self-attention: Global dependencies
AdaLN: Time/condition modulation

Result: Fast, scalable, flexible generation

DiT vs U-Net

Aspect	U-Net	DiT
Architecture	Convolutional	Transformer
Receptive field	Local → Global	Global from start
Input format	Fixed grids	Flexible tokens
Conditioning	Architectural changes	Built-in (AdaLN)
Scaling	Limited	Excellent
Best for	Images, fixed size	Any modality

Core Components

1. Tokenization

Image/Data → Patches → Flatten → Embed → Tokens
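
The tokenization pipeline above can be sketched for image-like inputs as follows. This is a minimal illustration, not the reference DiT code: the class name `PatchEmbed` and the sizes (4 channels, patch size 2, width 64) are assumptions chosen for the example, and position embeddings are omitted.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed each as a token."""

    def __init__(self, in_channels=4, patch_size=2, dim=64):
        super().__init__()
        # A strided convolution is equivalent to "cut into patches, flatten, linear".
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, C, H, W) -> (B, dim, H/p, W/p)
        x = self.proj(x)
        # Flatten the spatial grid into a token axis: (B, N, dim), N = (H/p) * (W/p)
        return x.flatten(2).transpose(1, 2)


embed = PatchEmbed(in_channels=4, patch_size=2, dim=64)
tokens = embed(torch.randn(8, 4, 32, 32))
print(tokens.shape)  # torch.Size([8, 256, 64])
```

A 32x32 input with patch size 2 yields a 16x16 grid, hence 256 tokens. Smaller patches give more tokens and therefore more attention compute.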

2. Time Conditioning (AdaLN)

t → TimeEmbed(t) → MLP → (γ, β) → Modulate features
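
One way the time-conditioning path could look in PyTorch is sketched below. The names `timestep_embedding` and `AdaLN` are hypothetical, the sinusoidal embedding is one common choice of `TimeEmbed`, and the `(1 + γ)` scaling is an assumption borrowed from common practice rather than something this document specifies.

```python
import math
import torch
import torch.nn as nn


def timestep_embedding(t, dim):
    """Sinusoidal embedding of times t with shape (B,) -> (B, dim); dim must be even."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class AdaLN(nn.Module):
    """LayerNorm whose scale and shift are predicted from the time embedding."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x, t_emb):
        # x: (B, N, dim) tokens, t_emb: (B, dim)
        gamma, beta = self.mlp(t_emb).chunk(2, dim=-1)
        # Broadcast the per-sample (gamma, beta) over the token axis.
        return self.norm(x) * (1 + gamma[:, None, :]) + beta[:, None, :]


t = torch.rand(8)                      # continuous times in [0, 1)
t_emb = timestep_embedding(t, 64)
out = AdaLN(64)(torch.randn(8, 256, 64), t_emb)
print(t_emb.shape, out.shape)
```

Some implementations rescale continuous `t` before embedding so the sinusoids cover a wider range; that step is left out here for brevity.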

3. Transformer Blocks

Tokens → Self-Attention → MLP → Updated Tokens
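
A single block along these lines might be written as below. This is a simplified sketch: `DiTBlock` and its default sizes are illustrative assumptions, and the gating terms used by some published DiT variants are deliberately omitted so that only scale/shift modulation is shown.

```python
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    """Pre-norm Transformer block with adaptive scale/shift from a conditioning vector."""

    def __init__(self, dim=64, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        # One projection yields (scale, shift) for each of the two sub-layers.
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 4 * dim))

    def forward(self, x, cond):
        # x: (B, N, dim) tokens, cond: (B, dim) time/condition embedding
        g1, b1, g2, b2 = self.modulation(cond)[:, None, :].chunk(4, dim=-1)
        h = self.norm1(x) * (1 + g1) + b1
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual self-attention
        h = self.norm2(x) * (1 + g2) + b2
        return x + self.mlp(h)                               # residual MLP


block = DiTBlock()
y = block(torch.randn(2, 16, 64), torch.randn(2, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```

The block preserves the token shape, so any number of them can be stacked between the tokenizer and the output projection.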

4. Output Projection

Tokens → Linear → Velocity Field
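
For image-like data, the output step also has to fold tokens back into the input layout so the predicted velocity has the same shape as $x_t$. The sketch below assumes each token carries one `patch_size x patch_size x channels` patch; the helper name `unpatchify` and the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


def unpatchify(tokens, channels, patch_size, height, width):
    """Rearrange (B, N, p*p*C) tokens back into a (B, C, H, W) field."""
    b = tokens.shape[0]
    gh, gw = height // patch_size, width // patch_size
    x = tokens.reshape(b, gh, gw, patch_size, patch_size, channels)
    # (B, gh, gw, p, p, C) -> (B, C, gh, p, gw, p) -> (B, C, H, W)
    x = x.permute(0, 5, 1, 3, 2, 4)
    return x.reshape(b, channels, height, width)


dim, channels, patch_size = 64, 4, 2
head = nn.Linear(dim, patch_size * patch_size * channels)

tokens = torch.randn(8, 256, dim)   # stand-in for the last Transformer block's output
velocity = unpatchify(head(tokens), channels, patch_size, 32, 32)
print(velocity.shape)  # torch.Size([8, 4, 32, 32])
```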

Training Overview

Rectified Flow Loss

Simple regression:

$$ \mathcal{L} = \mathbb{E}_{x_0, x_1, t} \left[ \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \right] $$

where:

$x_0 \sim p_{\text{data}}$ (real data)
$x_1 \sim \mathcal{N}(0, I)$ (noise)
$x_t = t x_1 + (1-t) x_0$ (linear interpolation)
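
The regression target follows directly from the interpolation. Differentiating $x_t$ with respect to $t$ gives a velocity that is constant along each path:

$$ \frac{d x_t}{dt} = \frac{d}{dt} \left[ t x_1 + (1-t) x_0 \right] = x_1 - x_0 $$

With this convention the velocity points from data ($t = 0$) toward noise ($t = 1$), so generation integrates the learned field from $t = 1$ back to $t = 0$. Because many $(x_0, x_1)$ pairs can pass through the same point, the minimizer of the loss is the conditional average $v^*(x, t) = \mathbb{E}\left[ x_1 - x_0 \mid x_t = x \right]$.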

Key advantages:

No noise schedules
No variance parameterization
Direct regression target
Stable training

Training Algorithm

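import torch
import torch.nn.functional as F

# Assumes `model`, `optimizer`, and `dataloader` are already defined.
for batch in dataloader:
    x_0 = batch                                       # Real data, shape (B, ...)
    x_1 = torch.randn_like(x_0)                       # Noise
    t = torch.rand(x_0.shape[0], device=x_0.device)   # One random time per sample

    # Reshape t to (B, 1, ..., 1) so it broadcasts over the data dimensions
    t_b = t.view(-1, *([1] * (x_0.dim() - 1)))

    # Linear interpolation
    x_t = t_b * x_1 + (1 - t_b) * x_0

    # Predict velocity
    v_pred = model(x_t, t)

    # Compute loss
    target = x_1 - x_0
    loss = F.mse_loss(v_pred, target)

    # Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()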

Sampling Overview

ODE Integration

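Sampling ODE (noise → data):

$$ \frac{dx}{dt} = v_\theta(x, t) $$

Because $x_1$ is noise and $x_0$ is data, sampling starts at $t = 1$ and integrates this ODE backward in time to $t = 0$.

Euler discretization:

import torch

@torch.no_grad()
def sample(model, shape, num_steps=50):
    x = torch.randn(shape)  # Start from noise at t = 1
    dt = 1.0 / num_steps

    for k in range(num_steps):
        t = torch.full((shape[0],), 1.0 - k * dt)  # Current time, one entry per sample
        v = model(x, t)
        x = x - v * dt  # Step from t to t - dt

    return x  # Generated sample at t = 0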

Properties:

Deterministic (same noise → same output)
Fast (20-50 steps)
Straight paths (rectified flow)
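
The "straight paths" property can be checked on a toy case where the exact velocity is known. The sketch below assumes all data sits at a single point `c` and uses the convention $x_0$ = data, $x_1$ = noise, so sampling runs from $t = 1$ down to $t = 0$. The function names are hypothetical and no neural network is involved.

```python
def ideal_velocity(x, t, c):
    """Exact rectified-flow velocity when all data sits at the single point c."""
    # From x_t = t * x_1 + (1 - t) * c we get x_1 - c = (x_t - c) / t.
    return (x - c) / t


def euler_sample(x_noise, c, num_steps):
    """Integrate dx/dt = v with Euler steps from t = 1 (noise) down to t = 0 (data)."""
    x, dt = x_noise, 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt                      # current time, always > 0
        x = x - ideal_velocity(x, t, c) * dt  # step from t to t - dt
    return x


for n in (1, 4, 20):
    print(n, euler_sample(x_noise=-1.7, c=3.0, num_steps=n))
```

Every step count, including a single step, lands on `c` up to floating-point rounding, because the path from noise to data is a straight line. Learned velocity fields are only approximately straight, which is why practical samplers still use tens of steps.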

Applications

Vision

Images: Stable Diffusion 3, DALL-E 3
Videos: Sora, Goku
3D: Point clouds, meshes

Audio

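Music: MusicGen (an autoregressive Transformer rather than a diffusion model; listed as related work)
Text-to-audio: AudioLDM (latent diffusion with a U-Net backbone rather than a DiT)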
Sound effects: Foley generation

Biology

Gene expression: Cell state generation
Perturbations: Predict intervention effects
Trajectories: Developmental paths
Molecules: Protein structure

Other

Robotics: Trajectory planning
Physics: Simulation
Design: CAD, architecture

Why DiT for Biology?

Challenges with Traditional Approaches

Gene expression data:

High-dimensional (10K-30K genes)
Unordered (no natural sequence)
Sparse (many zeros)
Compositional (relative values matter)

U-Net limitations:

Assumes spatial structure
Fixed input sizes
Hard to condition on perturbations

DiT Advantages

Flexibility:

Genes/cells/regions as tokens
Variable-length sequences
Natural conditioning on perturbations

Global interactions:

Self-attention captures gene-gene dependencies
No locality bias
Learn regulatory networks

Scalability:

Handle large gene panels
Batch different experiments
Scale to billions of parameters
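
As one hypothetical reading of "genes as tokens", each gene could contribute a token built from a learned gene-identity embedding plus a projection of its expression value. The class `GeneTokenizer`, the panel size, and the normalization of the expression values are all assumptions made for illustration, not a recommended design.

```python
import torch
import torch.nn as nn


class GeneTokenizer(nn.Module):
    """Hypothetical tokenizer: one token per gene = gene identity + expression value."""

    def __init__(self, num_genes=2000, dim=64):
        super().__init__()
        self.gene_embed = nn.Embedding(num_genes, dim)   # learned identity per gene
        self.value_proj = nn.Linear(1, dim)              # embeds the scalar expression

    def forward(self, gene_ids, expression):
        # gene_ids: (B, G) integer indices, expression: (B, G) float values
        return self.gene_embed(gene_ids) + self.value_proj(expression[..., None])


tokenizer = GeneTokenizer(num_genes=2000, dim=64)
gene_ids = torch.randint(0, 2000, (4, 512))   # a 512-gene panel per cell
expression = torch.rand(4, 512)               # assumed already normalized
tokens = tokenizer(gene_ids, expression)
print(tokens.shape)  # torch.Size([4, 512, 64])
```

No positional encoding is added, so reordering the genes simply reorders the tokens. Combined with self-attention, this treats a cell as an unordered set of genes, and panels of different lengths can share one model.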

Open Questions

Tokenization: How to represent genes as tokens?
- Rank by expression? (Geneformer approach)
- Gene embeddings? (learned representations)
- Set-based? (permutation invariant)
Latent space: Better to work in latent space?
- Encode expression → latent → diffusion
- Avoids sparsity issues
- More stable training
Architecture: DiT vs alternatives?
- State-space models (Mamba, S4)
- Hyena (long convolutions)
- Hybrid approaches

See: Supplementary documents for deeper exploration.

Learning Path

Conceptual Understanding

DiT Overview — Why DiT matters
- Architectural shift from U-Net
- Modern generative stack
- Key advantages
Flow Matching Basics — Rectified flow theory
- Velocity fields
- Linear interpolation
- ODE sampling
DiT Foundations — Architecture details
- Tokenization strategies
- Transformer blocks
- Time conditioning

Practical Implementation

DiT Training — Training pipeline
- Data preparation
- Model architecture
- Training loop
- Hyperparameters
DiT Sampling — Generation strategies
- ODE solvers
- Conditional generation
- Quality vs speed

Advanced Topics

Comprehensive Tutorial — Deep dive
- Alternative backbones
- Biology applications
- State-space models
Time Embeddings — Conditioning mechanisms
- Sinusoidal embeddings
- AdaLN details
- FiLM modulation

Comparison with Other Methods

DiT vs DDPM

Aspect	DDPM	DiT + Rectified Flow
Architecture	U-Net	Transformer
Objective	Noise prediction	Velocity prediction
Training	Noise schedule needed	Simple regression
Sampling	1000 steps (SDE)	20-50 steps (ODE)
Conditioning	Concatenation/FiLM	AdaLN/Cross-attention
Flexibility	Images mainly	Any modality

DiT vs Flow Matching (U-Net)

Aspect	Flow Matching + U-Net	DiT + Rectified Flow
Objective	Same (velocity)	Same (velocity)
Architecture	Convolutional	Transformer
Scaling	Limited	Excellent
Conditioning	Moderate	Excellent
Speed	Fast convolutions	Slower attention (cost grows quadratically with token count)

Key insight: DiT is an architectural choice, orthogonal to the training objective.
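
This separation can be made concrete by writing the loss against any callable `model(x_t, t)` and swapping backbones. The two toy models below (`MLPVelocity`, `AttnVelocity`) are illustrative stand-ins, not a U-Net or a full DiT, and all sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rectified_flow_loss(model, x_0):
    """Velocity-regression loss for any callable model(x_t, t) -> velocity."""
    x_1 = torch.randn_like(x_0)                      # noise endpoint
    t = torch.rand(x_0.shape[0], device=x_0.device)  # one time per sample
    t_b = t.view(-1, *([1] * (x_0.dim() - 1)))       # broadcastable over data dims
    x_t = t_b * x_1 + (1 - t_b) * x_0
    return F.mse_loss(model(x_t, t), x_1 - x_0)


class MLPVelocity(nn.Module):
    """Toy fully connected backbone (illustrative only)."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1))


class AttnVelocity(nn.Module):
    """Toy attention backbone treating each coordinate as a token (illustrative only)."""

    def __init__(self, width=32):
        super().__init__()
        self.inp = nn.Linear(2, width)   # (value, time) per token
        self.attn = nn.MultiheadAttention(width, 4, batch_first=True)
        self.out = nn.Linear(width, 1)

    def forward(self, x, t):
        h = self.inp(torch.stack([x, t[:, None].expand_as(x)], dim=-1))
        h = h + self.attn(h, h, h, need_weights=False)[0]
        return self.out(h).squeeze(-1)


x_0 = torch.randn(16, 8)
for model in (MLPVelocity(8), AttnVelocity()):
    print(type(model).__name__, rectified_flow_loss(model, x_0).item())
```

The loss function never inspects the model's internals, so changing the backbone requires no change to the objective or the sampler.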

Key Takeaways

Conceptual

DiT = Transformer architecture for diffusion/flow models
Rectified flow = simple objective (velocity regression)
Together = modern stack for state-of-the-art generation
Tokenization enables modality-agnostic modeling

Practical

Training is simple: Regression on $v = x_1 - x_0$
Sampling is fast: 20-50 ODE steps
Conditioning is easy: Tokens or AdaLN
Scales well: Proven to billions of parameters

For Biology

Flexible representation: Genes, cells, perturbations
Global interactions: Attention captures dependencies
Conditional generation: Model interventions
Active research: Best practices still emerging

References

Key Papers

DiT:

Peebles & Xie (2023): "Scalable Diffusion Models with Transformers"

Rectified Flow:

Liu et al. (2022): "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow"
Liu et al. (2023): "InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation"

Transformers:

Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words" (ViT)
Vaswani et al. (2017): "Attention is All You Need"

Conditioning:

Perez et al. (2018): "FiLM: Visual Reasoning with a General Conditioning Layer"

Modern Implementations

Stable Diffusion 3: DiT-based text-to-image
Sora: DiT for video generation
Hugging Face Diffusers: DiT implementations
OpenAI: DALL-E 3

Summary

Diffusion Transformers (DiT) combined with rectified flow represent the modern approach to generative modeling:

Architecture: Transformers replace U-Nets

Global attention from the start
Flexible tokenization
First-class conditioning

Objective: Rectified flow simplifies training

Direct velocity regression
No noise schedules
Fast ODE sampling

Result: State-of-the-art generation

Images, video, audio
Scalable to billions of parameters
Emerging applications in biology

The modern stack:

Rectified Flow + DiT + AdaLN = Powerful, flexible generation

This combination has become the foundation for cutting-edge generative models and is particularly promising for computational biology applications.
