This tutorial explains Diffusion Transformers (DiT) — the architectural shift from convolutional U-Nets to Transformers for generative modeling. We cover why this shift happened, how DiT works, and why it generalizes beyond images.
Prerequisites: Familiarity with rectified flow or diffusion models (see docs/flow_matching/rectifying_flow.md).
A Diffusion Transformer (DiT) is not a new diffusion theory — it's an architectural choice.
DiT is simply a Transformer used to parameterize the function learned in diffusion or flow-based models: the time-conditioned network $v_\theta(x_t, t)$ (or, equivalently, a noise or score predictor).
The objective (what to learn) and the architecture (how to learn it) are orthogonal design choices.
Historically, diffusion models used U-Net architectures because:
| Strength | Why It Helped |
|---|---|
| Local structure | Images have strong spatial correlations |
| Multiscale features | Downsampling captures global context |
| Efficient | Convolutions are fast and well-optimized |
| Inductive bias | Spatial structure is built into the architecture |
U-Net learns:
- Local interactions in early layers
- Global interactions via progressive downsampling
- Fine details preserved via skip connections
This worked extremely well for images, but came with limitations:
- Fixed grid assumptions: Inputs must be regular grids
- Awkward conditioning: Adding new conditions requires architectural changes
- Limited flexibility: Hard to apply to non-image data
- Special handling: Time and modality need custom integration
Transformers operate on tokens, not grids. The key conceptual move in DiT:
Represent the input $x_t$ as a sequence of tokens.
For images:
- Split image into patches (e.g., 16×16 pixels)
- Flatten each patch into a vector
- Embed into token space
For other domains:
- Genes, cells, regions, timepoints → tokens
- Patches are a metaphor, not a requirement
Let $x_t$ be the (noisy) input at time $t$.

Tokenization:

$$x_t \;\longmapsto\; \big(x_t^{(1)},\, x_t^{(2)},\, \ldots,\, x_t^{(N)}\big)$$

where:

- $x_t^{(i)} \in \mathbb{R}^{d_{\text{patch}}}$ is the $i$-th patch
- $N$ is the number of tokens
Embedding:

$$h^{(i)} = W_E\, x_t^{(i)} + p^{(i)}$$

where $W_E$ is a learned projection and $p^{(i)}$ is a positional encoding.

The Transformer input is the token sequence:

$$H_0 = \big(h^{(1)},\, h^{(2)},\, \ldots,\, h^{(N)}\big)$$
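As a minimal PyTorch sketch of patch tokenization (the patch size, channel count, and embedding width below are illustrative choices, not values from any particular model):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each one to a token."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # A strided conv is equivalent to "flatten each patch, then apply a shared linear layer".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, D): one token per patch

tokens = PatchEmbed()(torch.randn(2, 3, 64, 64))   # -> shape (2, 16, 768)
```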
Diffusion models are time-conditioned. DiT handles this elegantly through modulation, not concatenation.
Standard Transformer block:

$$h \leftarrow h + \mathrm{Attn}\big(\mathrm{LN}(h)\big), \qquad h \leftarrow h + \mathrm{MLP}\big(\mathrm{LN}(h)\big)$$

DiT with Adaptive LayerNorm (AdaLN):

$$h \leftarrow h + \mathrm{Attn}\big(\gamma_1(t) \odot \mathrm{LN}(h) + \beta_1(t)\big), \qquad h \leftarrow h + \mathrm{MLP}\big(\gamma_2(t) \odot \mathrm{LN}(h) + \beta_2(t)\big)$$

where the scale and shift parameters $\gamma_i(t)$ and $\beta_i(t)$ are produced by a small MLP applied to the time embedding.
Deep dive: For a detailed explanation of how time embeddings work and why the MLP doesn't "perturb ordering," see time_embeddings_explained.md.
Key insight: Time controls the behavior of the network at every layer, not just its input.
This is the FiLM (Feature-wise Linear Modulation) pattern, which is much cleaner than concatenating the time embedding to the input or to every token.
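A sketch of one AdaLN-modulated block in PyTorch. The layer sizes and the name `to_modulation` are illustrative, and the published DiT block additionally regresses per-branch gates (adaLN-Zero), which this sketch omits for brevity:

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """One Transformer block whose LayerNorms are modulated by the time embedding."""
    def __init__(self, dim=768, n_heads=12, cond_dim=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # One small MLP maps the conditioning embedding to (gamma1, beta1, gamma2, beta2).
        self.to_modulation = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 4 * dim))

    def forward(self, h, t_emb):                       # h: (B, N, D), t_emb: (B, cond_dim)
        g1, b1, g2, b2 = self.to_modulation(t_emb).unsqueeze(1).chunk(4, dim=-1)
        x = g1 * self.norm1(h) + b1                    # FiLM-style modulation before attention
        h = h + self.attn(x, x, x, need_weights=False)[0]
        x = g2 * self.norm2(h) + b2                    # and again before the MLP
        return h + self.mlp(x)
```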
The same AdaLN mechanism handles arbitrary conditions:
- Class labels
- Text embeddings
- Perturbation tokens
- Experimental conditions
Two approaches:

- Modulation: Embed the condition $c \mapsto e_c$ and use it for the AdaLN parameters
- Cross-attention: Append condition tokens and attend to them
Transformers make adding new conditions trivial — no architectural surgery required.
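For example, the modulation route can fuse a class label with the time embedding by simple addition before it reaches AdaLN. A small sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

cond_dim, num_classes = 256, 10
class_embed = nn.Embedding(num_classes, cond_dim)

t_emb = torch.randn(2, cond_dim)        # time embedding for a batch of 2 (placeholder)
labels = torch.tensor([3, 7])           # class labels for the same batch
cond = t_emb + class_embed(labels)      # fused conditioning vector
# `cond` is then passed wherever the time embedding was used for AdaLN modulation
```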
Inside the Transformer:

$$H_{\ell+1} = \mathrm{DiTBlock}\big(H_\ell,\, t\big), \qquad \ell = 0, \ldots, L-1$$

Then project back to output space:

$$v_\theta(x_t, t) = \mathrm{Unpatchify}\big(W_O\, H_L\big)$$
Conceptually:
- Self-attention: Learns global dependencies between all tokens
- MLPs: Refine local nonlinearities
- Time modulation: Tells the network where it is along the trajectory
This works regardless of training objective (score matching, noise prediction, or rectified flow).
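Putting the pieces together, a minimal forward pass might look like the sketch below. It reuses the `PatchEmbed` and `AdaLNBlock` sketches above; the depth, widths, and time-embedding MLP are illustrative and not the exact DiT paper configuration:

```python
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Patchify -> AdaLN Transformer blocks -> project tokens back to patch pixels."""
    def __init__(self, img_size=64, patch=16, in_ch=3, dim=768, depth=4, cond_dim=256):
        super().__init__()
        self.patch = patch
        self.embed = PatchEmbed(patch, in_ch, dim)                  # from the sketch above
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))      # learned positional encoding
        self.time_mlp = nn.Sequential(nn.Linear(1, cond_dim), nn.SiLU(),
                                      nn.Linear(cond_dim, cond_dim))
        self.blocks = nn.ModuleList(AdaLNBlock(dim, 12, cond_dim) for _ in range(depth))
        self.out = nn.Linear(dim, patch * patch * in_ch)            # back to pixel space per patch

    def forward(self, x, t):                        # x: (B, C, H, W), t: (B,) in [0, 1]
        B, C, H, W = x.shape
        h = self.embed(x) + self.pos
        t_emb = self.time_mlp(t[:, None])
        for blk in self.blocks:
            h = blk(h, t_emb)
        v = self.out(h)                             # (B, N, p*p*C)
        p, g = self.patch, H // self.patch          # unpatchify back to (B, C, H, W)
        v = v.view(B, g, g, p, p, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return v
```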
Combining DiT with rectified flow is particularly elegant.
Recall the rectified flow target: with $x_t = (1-t)\,x_0 + t\,x_1$, the velocity to regress is $x_1 - x_0$.

DiT training loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, x_1,\, t}\Big[\big\|\, v_\theta(x_t, t) - (x_1 - x_0) \,\big\|^2\Big]$$
where:

- $v_\theta$ is a Transformer
- $x_t$ is tokenized
- $t$ modulates every layer via AdaLN
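A minimal training-step sketch, assuming the convention that $x_0$ is Gaussian noise and $x_1$ is data (swap the endpoints if your rectified-flow convention runs the other way):

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x1):
    """One training step for a velocity model v_theta.
    Assumed convention: x0 ~ N(0, I), x1 = data, x_t = (1 - t) x0 + t x1, target = x1 - x0."""
    x0 = torch.randn_like(x1)                          # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)      # one time per sample
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))           # broadcast over remaining dims
    xt = (1 - t_) * x0 + t_ * x1                       # point on the straight path
    v_target = x1 - x0                                 # constant velocity along that path
    return F.mse_loss(model(xt, t), v_target)
```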
Why this combination works well:
| Component | Contribution |
|---|---|
| Transformers | Model long-range structure via attention |
| Rectified flow | Simple, stable regression target |
| AdaLN | Clean time/condition integration |
| ODE sampling | Fast, deterministic generation |
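For instance, sampling can be a plain Euler integration of the learned ODE, using the same noise-at-$t=0$ convention as the training sketch above (the step count is arbitrary, and higher-order solvers work too):

```python
import torch

@torch.no_grad()
def sample(model, shape, steps=50, device="cpu"):
    """Integrate dx/dt = v_theta(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * model(x, t)
    return x
```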
Three structural reasons:
Self-attention is global by default. No need for deep pyramids to propagate information across the image.
With packing/masking tricks (Patch-n-Pack):
- Variable image sizes in same batch
- Variable video lengths
- Heterogeneous biological objects
This is impossible to do cleanly with CNNs.
Adding a new condition:
- Add tokens, or
- Add modulation parameters
No architectural changes needed.
Once you think of DiT as "a Transformer learning a time-dependent vector field," it becomes a general-purpose continuous generative engine.
Applications:
- Images (Stable Diffusion 3, DALL-E 3)
- Videos (Sora, Goku)
- Audio (AudioLDM)
- Molecules (protein structure)
- Trajectories (robotics)
- Latent biological states (gene expression)
Key insight:
- Rectified flow removes density assumptions
- Transformers remove grid assumptions
- Together, they're highly portable
A Diffusion Transformer is a Transformer trained to predict time-conditioned vector fields, replacing convolutional inductive bias with global token interaction.
Key components:
| Component | Purpose |
|---|---|
| Patch embedding | Convert input to tokens |
| Positional encoding | Preserve spatial/sequential structure |
| AdaLN | Time and condition modulation |
| Self-attention | Global dependencies |
| Output projection | Map back to target space |
The modern generative stack:
Rectified Flow (objective) + DiT (architecture) + AdaLN (conditioning)
- Peebles & Xie (2023) - "Scalable Diffusion Models with Transformers" (DiT paper)
- Perez et al. (2018) - "FiLM: Visual Reasoning with a General Conditioning Layer"
- Dosovitskiy et al. (2020) - "An Image is Worth 16x16 Words" (ViT)
The following sections explore alternatives to Transformers for biological applications, where tokenization may not be natural.
Strip away the branding. A diffusion or rectified-flow model needs a function:

$$f_\theta : (x_t,\, t,\, c) \;\longmapsto\; v, \qquad v \in \mathbb{R}^{\dim(x_t)}$$
Requirements:
- Accept a state representation
- Condition on time
- Optionally condition on context
- Output a vector of the same dimensionality as the state
The real requirement:
A model capable of learning global dependencies and time-conditioned transformations.
Transformers satisfy this — but they are not unique.
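To make the interface concrete, here is a deliberately plain backbone that satisfies all four requirements with nothing but an MLP (a sketch; the names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class VelocityMLP(nn.Module):
    """State + time (+ optional context) in, vector of the same dimensionality out."""
    def __init__(self, dim, cond_dim=0, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),                 # output lives in the state space
        )

    def forward(self, x, t, c=None):                # x: (B, dim), t: (B,), c: (B, cond_dim)
        parts = [x, t[:, None]] if c is None else [x, t[:, None], c]
        return self.net(torch.cat(parts, dim=-1))
```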
Can SSMs (Mamba, S4) or long convolutions (Hyena) be diffusion backbones?
Yes. In fact, this is a natural pairing.
Why?
- Rectified flow defines continuous-time dynamics
- State-space models are literally designed to model dynamics
Architectures like:
- Long convolution models
- SSMs (S4, Mamba)
- Hyena-style implicit sequence operators
are philosophically aligned with flow-based generative modeling.
Why Transformers won historically:
- Easy to scale
- Clean conditioning via cross-attention
- Unified modalities early
- Infrastructure exists
But this is historical inertia, not a fundamental requirement.
Gene expression vectors:

$$x \in \mathbb{R}^G$$

where $G$ is the number of measured genes.
Properties:
- Unordered (no natural sequence)
- Dense (most genes have non-zero expression)
- Compositional (relative, not absolute)
- Population-relative
The problem with "genes as tokens":
Approaches like Geneformer rank genes by expression and treat them as a sequence. This works, but feels ontologically wrong:
Ranking genes is not a natural ordering of biological state — it's an engineering trick.
Treat expression as a single state vector:

- State: $x_t \in \mathbb{R}^G$
- Backbone: MLP, SSM, or continuous-time operator
- Time-conditioning via FiLM
This aligns beautifully with rectified flow — you're learning a velocity in gene-expression space.
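A minimal sketch of this state-vector view in PyTorch; the gene count, widths, and the `FiLMMLP` name are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLMMLP(nn.Module):
    """Velocity field over a whole expression vector x_t in R^G,
    with time injected through FiLM (scale and shift) at every hidden layer."""
    def __init__(self, n_genes, hidden=1024, cond_dim=128):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, cond_dim), nn.SiLU(),
                                        nn.Linear(cond_dim, cond_dim))
        self.layers = nn.ModuleList([nn.Linear(n_genes, hidden), nn.Linear(hidden, hidden)])
        self.films = nn.ModuleList([nn.Linear(cond_dim, 2 * hidden) for _ in self.layers])
        self.out = nn.Linear(hidden, n_genes)

    def forward(self, x, t):                       # x: (B, G), t: (B,)
        c = self.time_embed(t[:, None])
        h = x
        for layer, film in zip(self.layers, self.films):
            gamma, beta = film(c).chunk(2, dim=-1)
            h = F.silu(gamma * layer(h) + beta)    # FiLM: time scales and shifts each layer
        return self.out(h)                         # velocity in gene-expression space
```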
Instead of tokenizing raw expression:

- Encode expression into a latent state $z \in \mathbb{R}^d$
- Run diffusion/rectified flow in latent space
- Decode only if necessary
The backbone sees:
- Smooth, lower-dimensional states
- No artificial ordering
- No sparsity pathologies
This is where JEPA, VAEs, and diffusion naturally converge.
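A compact sketch of the latent route; the encoder and decoder here are plain MLPs standing in for a trained VAE, and all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

n_genes, latent_dim = 2000, 64
encoder = nn.Sequential(nn.Linear(n_genes, 512), nn.SiLU(), nn.Linear(512, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.SiLU(), nn.Linear(512, n_genes))

x = torch.randn(8, n_genes)           # expression profiles (placeholder data)
z1 = encoder(x)                       # latent "data" endpoint
z0 = torch.randn_like(z1)             # latent noise endpoint
t = torch.rand(8, 1)
zt = (1 - t) * z0 + t * z1            # rectified-flow interpolation in latent space
# a velocity model (e.g. the FiLMMLP above with latent_dim inputs) regresses z1 - z0
```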
If you insist on tokens, do it honestly:
- Represent expression as a set (unordered)
- Genes have embeddings
- Expression value modulates them
- Use permutation-invariant operators
- Attention without positional encoding
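A sketch of this set-based view: gene identity embeddings modulated by expression values, mixed by attention with no positional encoding, so the operator is permutation-equivariant (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class GeneSetEncoder(nn.Module):
    """Genes as an unordered set: learned identity embeddings, value-modulated,
    mixed with attention that uses no positional encoding."""
    def __init__(self, n_genes, dim=128, n_heads=4):
        super().__init__()
        self.gene_id = nn.Embedding(n_genes, dim)       # one identity embedding per gene
        self.value_proj = nn.Linear(1, dim)             # expression value -> modulation
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, expr):                            # expr: (B, G), G == n_genes
        B, G = expr.shape
        ids = self.gene_id.weight.unsqueeze(0).expand(B, G, -1)
        tokens = ids * (1 + self.value_proj(expr.unsqueeze(-1)))   # value-modulated identities
        out, _ = self.attn(tokens, tokens, tokens)      # no positional encoding anywhere
        return out                                      # (B, G, dim), order-equivariant
```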
If your data is time-series, perturb-seq, or trajectories:
- The sequence is time, not genes
- Each timestep holds a full expression state or latent
- Backbone models temporal evolution
This is where SSMs and Hyena-style operators shine.
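A toy temporal backbone in this spirit: a depthwise 1D convolution over the time axis stands in for a long-convolution/SSM operator (a real Hyena or Mamba block would replace it; names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBackbone(nn.Module):
    """Sequence axis = time. Each timestep carries a full latent state."""
    def __init__(self, latent_dim=64, kernel=5):
        super().__init__()
        self.mix_time = nn.Conv1d(latent_dim, latent_dim, kernel, padding=kernel // 2,
                                  groups=latent_dim)          # per-channel temporal mixing (non-causal)
        self.mix_state = nn.Linear(latent_dim, latent_dim)    # per-timestep state mixing

    def forward(self, z):                                     # z: (B, T, latent_dim)
        h = self.mix_time(z.transpose(1, 2)).transpose(1, 2)  # mix along the time axis
        return z + self.mix_state(F.silu(h))
```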
Combining the insights above:
```
Expression → Encoder → Latent State
              ↓
  SSM/Hyena modeling latent dynamics
              ↓
  Rectified flow in latent space
              ↓
  Decoder → Expression (if needed)
```
Properties:
- No fake tokens
- No gene ranking
- Natural temporal modeling
- Proper count handling via VAE decoder
Tokenization is a convenience for architectures, not a requirement of the data.
Once you internalize this:
- DiT becomes "Transformer-as-backbone"
- Rectified flow becomes "state evolution"
- Hyena/SSMs become first-class alternatives
- Gene expression stops being forced into unnatural formats
See docs/incubation/ for explorations of:
- Latent rectified-flow + SSM architectures for perturb-seq
- Transformer vs SSM inductive biases for biological dynamics
- When tokenization is biologically meaningful (pathways, modules)