
Open Research: Tokenization in Diffusion Transformers

Status: Active research area (as of January 2026)

Core Question: What is the "right" way to tokenize complex objects (images, gene expression, molecules) for transformer-based generative models?


The Problem

Transformers operate on sequences of tokens. For natural language and DNA/RNA, tokenization is natural — these are inherently sequential. But for other modalities, tokenization feels contrived and arbitrary.

The Uncomfortable Truth

Patch-based tokenization (16×16, 8×8, etc.) is a pragmatic hack that works, but lacks principled justification.

Engineering Reality: "Does it work?" ✅
Theoretical Satisfaction: "Is it right?" ❌

This document explores why current approaches feel unsatisfying and outlines open research directions.


1. Images: The Patch Problem

Current Approach

Standard practice (ViT, DiT, Stable Diffusion 3):

# Split a (B, C, H, W) image into non-overlapping 16×16 patches
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)                    # (B, C, H/16, W/16, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)   # (B, N, C*16*16)
tokens = embed(patches)        # linear projection to the model dimension
output = transformer(tokens)
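
In practice, ViT- and DiT-style codebases usually implement this patchify step as a strided convolution, which cuts and projects patches in a single operation. A minimal sketch, assuming a (B, 3, 256, 256) input (the PatchEmbed name and sizes are illustrative, not any specific library's API):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify via a convolution with kernel = stride = patch size."""
    def __init__(self, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 256, 256))  # → (1, 256, 768)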

Why This Feels Wrong

Q1: Why 16×16?

  • Not "right" — just empirically tuned
  • Different models use different sizes (2×2, 4×4, 8×8, 14×14, 16×16)
  • No principled way to choose

Q2: Should patch size depend on content?

  • Medical images (smooth gradients): Large patches OK
  • Text images (fine details): Small patches needed
  • Satellite images: Depends on scale of features
  • Current approach: One size for all!

Q3: Do patches respect semantic boundaries?

  • A 16×16 patch might contain:
    • Half a face, half background
    • Part of an object, part of another
    • Arbitrary image regions
  • Our visual cortex doesn't work this way

Trade-offs

| Patch Size | Pros | Cons |
|---|---|---|
| Small (2×2, 4×4) | Fine details, local structure | More tokens, O(n²) attention cost |
| Large (16×16, 32×32) | Fewer tokens, faster | Loss of detail, coarse representation |

The problem: This is a hyperparameter, not a principled design choice.


2. Gene Expression: Even Less Obvious

Gene expression vectors: $x \in \mathbb{R}^{20000}$ (20K genes)

Properties:

  • Unordered: No natural sequence (unlike DNA)
  • Dense: Most genes have non-zero expression
  • Compositional: Relative values matter
  • High-dimensional: 10K-30K genes typical

Current Approaches (2023-2026)

Approach 1: Rank by Expression (Geneformer)

import numpy as np

# Rank-value encoding: order gene IDs by expression, highest first
# (Geneformer additionally normalizes each gene by its corpus-wide median before ranking)
order = np.argsort(gene_expression)[::-1]   # gene indices, most expressed first
tokens = [gene_ids[i] for i in order]       # one token per gene → ~20,000 tokens
output = transformer(tokens)

Problems:

  • Ranking is arbitrary — not biological
  • 20K tokens = huge sequences (O(n²) = 400M operations)
  • What about genes with same expression?
  • Loses biological structure

Approach 2: Gene Modules/Pathways

# Group genes by function
modules = {
    "glycolysis": [gene_1, gene_5, ...],
    "cell_cycle": [gene_2, gene_15, ...],
}
tokens = [module_embeddings]  # ~500 pathways
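
One simple way to turn such modules into tokens is to pool expression within each predefined pathway and embed one value per module; a minimal sketch, where pathway_to_genes and embed are hypothetical helpers:

import numpy as np

def module_tokens(expr, pathway_to_genes, embed):
    """expr: (n_genes,) expression vector; pathway_to_genes: dict mapping pathway → gene indices."""
    pooled = np.array([expr[idx].mean() for idx in pathway_to_genes.values()])  # (~500,)
    return embed(pooled)   # embed each pooled pathway value → one token per pathway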

Problems:

  • How to define modules? (Also arbitrary!)
  • Loses individual gene information
  • Ignores within-module correlations

Approach 3: No Explicit Tokenization

# Direct embedding to latent space
z = encoder(gene_expression)  # (20000,) → (512,)
output = model(z)  # No tokens!

Problems:

  • Less interpretable
  • Loses biological structure
  • Black box

Approach 4: Graph-Structured (GRN-aware)

# Use gene regulatory network
grn = load_gene_regulatory_network()
output = graph_transformer(gene_expression, grn)
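
A common way to make attention "GRN-aware" is to mask it so each gene attends only to its regulators and targets. A minimal sketch under that assumption (grn_adj is a hypothetical boolean adjacency matrix, assumed to include self-loops so no row is fully masked):

import torch
import torch.nn.functional as F

def grn_masked_attention(q, k, v, grn_adj):
    """q, k, v: (n_genes, d); grn_adj: (n_genes, n_genes) bool, True where an edge or self-loop exists."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (n_genes, n_genes)
    scores = scores.masked_fill(~grn_adj, float("-inf"))     # genes attend only along GRN edges
    return F.softmax(scores, dim=-1) @ v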

Problems:

  • GRN knowledge incomplete
  • Still 20K nodes to handle
  • Which GRN to use?

The Core Issue

There is no natural "tokenization" for gene expression.

Unlike images (spatial structure) or language (sequential structure), gene expression is:

  • A set (unordered)
  • A vector (continuous)
  • A network (interconnected)

Forcing it into a sequence feels wrong because it is wrong.


3. Why Do Patches Work Despite Being Arbitrary?

Pragmatic Reasons

1. Computational Efficiency

256×256 image = 65,536 pixels
With 16×16 patches = 256 tokens
Attention: 65,536² → 256² (65,000× reduction!)
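
The arithmetic is easy to verify; a quick sketch of how token count and pairwise attention cost scale with patch size for a 256×256 image:

image_size = 256
for patch in (4, 8, 16, 32):
    n_tokens = (image_size // patch) ** 2
    print(f"patch {patch:>2}: {n_tokens:>5} tokens, attention pairs = {n_tokens**2:,}")
# patch 16 → 256 tokens and 65,536 attention pairs, vs 65,536² ≈ 4.3e9 for per-pixel tokens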

2. Transfer from NLP

  • Transformers proven for sequences
  • Patches make images "sequence-like"
  • Can reuse architectures

3. Good Enough in Practice

  • ImageNet SOTA achieved
  • Stable Diffusion works
  • Empirical success

4. Implementation Simplicity

  • Easy to code
  • GPU-efficient
  • Standard operations

But This Doesn't Make It "Right"

Engineering success ≠ Principled design

The field has optimized for what works, not what makes sense.


4. Alternative Approaches (Research Frontiers)

4.1 Hierarchical Tokenization

Idea: Learn local semantics first, then group into "super tokens"

Swin Transformer (2021):

Image → Small patches → Local attention → Merge
              ↓
      "Super tokens" (hierarchical)
              ↓
      Global attention

Status: Works well, but still uses fixed patch sizes at each level.
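
The mechanism that builds the hierarchy is patch merging between stages: each 2×2 neighborhood of tokens is concatenated and linearly projected to twice the channel width. A minimal sketch following the paper's description (not the exact reference implementation):

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style merging: concatenate each 2×2 neighborhood, then project 4C → 2C."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                      # x: (B, H, W, C), H and W even
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(x)               # (B, H/2, W/2, 2C)

Stacking a few of these stages gives the "super token" hierarchy sketched above, though the 2×2 grouping is still a fixed, content-agnostic choice.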

4.2 Learned Tokenization

Idea: Don't fix patch size — learn how to tokenize!

BEiT, VQGAN, MaskGIT:

# Instead of fixed patches
tokens = split_into_patches(image, size=16)  # Fixed

# Learn tokenization
tokens = learned_tokenizer(image)  # Adaptive!

Advantages:

  • Content-aware
  • Can adapt to different regions
  • Potentially more semantic

Challenges:

  • How to train the tokenizer?
  • Discrete vs continuous tokens?
  • Computational cost
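
For the discrete route (VQGAN, MaskGIT), the encoder's features are snapped to the nearest entry of a learned codebook, so the "tokens" are code indices rather than fixed patches. A minimal sketch of that quantization step, with illustrative codebook size and dimensions:

import torch

def vq_quantize(features, codebook):
    """features: (N, d) encoder outputs; codebook: (K, d) learned embeddings."""
    dists = torch.cdist(features, codebook)   # (N, K) pairwise distances
    indices = dists.argmin(dim=-1)            # discrete token ids
    return codebook[indices], indices         # quantized features + token sequence

codebook = torch.randn(1024, 256)             # K = 1024 codes of dimension 256
quantized, token_ids = vq_quantize(torch.randn(196, 256), codebook)

Training such a codebook end-to-end needs extra machinery (e.g. a straight-through estimator or EMA codebook updates), which is part of why learned tokenization costs more than fixed patches.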

4.3 Convolutional Stem

Idea: Use CNNs for local features, Transformers for global

import torch.nn as nn

class HybridModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_stem = ResNet(...)          # CNN extracts local semantics
        self.transformer = Transformer(...)   # Transformer on CNN features

    def forward(self, x):
        feats = self.conv_stem(x)                   # (B, C, H', W') feature map
        tokens = feats.flatten(2).transpose(1, 2)   # (B, H'*W', C) token sequence
        return self.transformer(tokens)

Status: Used in some models, but not standard for DiT.

4.4 No Tokenization (Continuous)

Idea: Work directly in continuous space

For images:

  • Latent diffusion (VAE → continuous latent → diffusion)
  • No explicit tokens

For gene expression:

  • Direct MLP/attention on expression vector
  • Treat as continuous state, not sequence

Advantage: No arbitrary discretization

Disadvantage: May lose interpretability
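
As a concrete example of the continuous route for gene expression, the whole vector can be treated as a single state and denoised by an MLP conditioned on the diffusion time; a minimal sketch with hypothetical names and sizes, not a published model:

import torch
import torch.nn as nn

class ExpressionDenoiser(nn.Module):
    """Predicts the noise added to a gene-expression vector at diffusion time t."""
    def __init__(self, n_genes=20000, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, n_genes),
        )

    def forward(self, x_t, t):                        # x_t: (B, n_genes), t: (B, 1)
        return self.net(torch.cat([x_t, t], dim=-1))  # no tokenization anywhere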


5. Biological Inspiration: How Should We Think About This?

How Visual Cortex Works

Retina → V1 (edges) → V2 (contours) → V3 (form) → V4 (color, shape) → IT (objects)

Key properties:

  1. Hierarchical: Simple → complex features
  2. Local receptive fields that grow
  3. Specialization: Different areas for different features
  4. Sparse coding: Neurons fire selectively
  5. Feedback: Top-down and bottom-up

Current Models vs Biology

| Aspect | Biology | Patch-based DiT |
|---|---|---|
| Hierarchy | Yes (V1→V2→V3→V4) | Flat (all patches equal) |
| Local first | Yes (small receptive fields) | No (global attention) |
| Adaptive | Yes (attention, feedback) | No (fixed patches) |
| Sparse | Yes (selective firing) | No (dense attention) |

Conclusion: Current approaches are not biologically inspired.

Should We Care?

Two perspectives:

Pragmatic: "Biology is slow, backprop works, patches work — who cares?"

  • Valid for engineering
  • Gets SOTA results

Principled: "Understanding biology might lead to better architectures"

  • Valid for research
  • May unlock new capabilities

Reality: Field is mostly pragmatic (for now).


6. Open Research Questions

For Images

Q1: What is the optimal tokenization strategy?

  • Fixed patches? Learned? Hierarchical?
  • Content-adaptive?
  • Task-specific?

Q2: Can we learn tokenization end-to-end?

  • Jointly with the generative model?
  • Discrete vs continuous?

Q3: How important is biological plausibility?

  • Should we model V1→V2→V3→V4?
  • Or is attention enough?

For Gene Expression

Q4: What is the "right" representation?

  • Tokens (if so, what kind)?
  • Continuous embeddings?
  • Graph structure?

Q5: Should tokenization respect biological structure?

  • Gene modules/pathways?
  • Regulatory networks?
  • Or learn from data?

Q6: How to handle high dimensionality?

  • 20K genes → how many tokens?
  • Latent space diffusion?
  • Hierarchical representation?

General Questions

Q7: Is tokenization necessary at all?

  • Can we do generative modeling without tokens?
  • Continuous-space alternatives?

Q8: Should tokenization be modality-specific?

  • Images: Patches
  • Audio: Time patches
  • Gene expression: ???
  • Or unified approach?

Q9: How to evaluate tokenization quality?

  • Reconstruction error?
  • Downstream task performance?
  • Interpretability?

7. Current State of the Field (January 2026)

What's Working

For images:

  • Fixed patches (8×8, 16×16) are standard
  • Empirically tuned per model
  • Stable Diffusion 3, Sora use patch-based approaches

For gene expression:

  • Multiple approaches being explored
  • No clear winner yet
  • Geneformer (ranking), scPPDM (tabular), others

What's Being Researched

Active areas:

  1. Learned tokenization (VQ-VAE, MaskGIT)
  2. Hierarchical models (Swin, PVT)
  3. Hybrid CNN-Transformer
  4. Graph-structured attention
  5. Continuous-space alternatives

What's Still Unknown

Open problems:

  • Principled way to choose patch size
  • Optimal tokenization for non-image modalities
  • Whether biological inspiration helps
  • Unified tokenization across modalities

8. Recommendations for Practitioners

For Image Generation (DiT)

Current best practice:

# Use empirically tuned patch sizes (DiT patchifies the VAE latent, not raw pixels)
patch_size = 8  # e.g. DiT-XL/8 for 256×256 images (32×32 latents → 16 tokens)
# or
patch_size = 4  # smaller patches → more tokens, higher quality, more compute

Experiment with:

  • Different patch sizes for your data
  • Hierarchical approaches if quality matters
  • Latent diffusion (VAE + diffusion) to avoid tokenization

For Gene Expression

Recommended approach (as of 2026):

# Option 1: No explicit tokenization
z = encoder(gene_expression)  # (20000,) → (512,)
output = diffusion_model(z)

# Option 2: Biologically-structured
grn = load_gene_regulatory_network()
output = graph_diffusion(gene_expression, grn)

# Option 3: Learned modules
modules = learn_gene_modules(data)  # Data-driven
tokens = embed_by_modules(gene_expression, modules)
output = transformer(tokens)

Then:

  • Compare approaches empirically
  • Publish ablation study
  • Let performance guide you

General Advice

Start simple:

  1. Use standard approaches (patches for images, embeddings for other modalities)
  2. Get baseline working
  3. Then experiment with alternatives

Don't overthink:

  • If patches work for your task, use them
  • Principled design is nice, but results matter

But do explore:

  • This is an open research area
  • Novel tokenization strategies could be publishable
  • Especially for non-image modalities

9. Future Directions

Near-term (2026-2027)

Likely developments:

  1. More learned tokenization methods
  2. Better hierarchical models
  3. Modality-specific tokenization strategies
  4. Improved understanding of why patches work

Medium-term (2027-2029)

Possible breakthroughs:

  1. Unified tokenization framework
  2. Biologically-inspired alternatives that match SOTA
  3. Continuous-space generative models (no tokens)
  4. Neural architecture search for tokenization

Long-term (2029+)

Speculative:

  1. Fundamental rethinking of tokenization
  2. New architectures that don't need tokens
  3. True biological plausibility
  4. Modality-agnostic generative models

10. The Bigger Picture

The Tension

Engineering Pragmatism     vs     Principled Design
"Does it work?"            vs     "Is it right?"
Empirical tuning           vs     Theory-driven
Fast iteration             vs     Deep understanding

Current state: Pragmatism dominates

  • Patch sizes: Empirically tuned
  • Architecture choices: What works on benchmarks
  • Limited theoretical understanding

Future direction: More principled approaches

  • Understanding WHY things work
  • Biologically-inspired designs
  • Learned, adaptive strategies

Why This Matters

For science:

  • Understanding principles leads to better models
  • Biological inspiration may unlock new capabilities
  • Theory guides experimentation

For engineering:

  • Principled designs generalize better
  • Less hyperparameter tuning
  • More robust to distribution shift

For biology applications:

  • Gene expression needs better representations
  • Biological structure should inform design
  • Interpretability matters

11. Conclusion

The honest assessment:

Patch-based tokenization is arbitrary and unnatural.

  • 16×16 is not "right" — it's empirically tuned
  • Should vary by resolution, task, data
  • Doesn't respect semantic boundaries
  • Not biologically inspired

But it works.

  • Achieves SOTA on many tasks
  • Computationally efficient
  • Easy to implement

For gene expression, it's even worse.

  • No natural tokenization exists
  • Current approaches are hacks
  • Open research problem

The field is still figuring this out.

  • Active research area
  • No consensus
  • Your skepticism is warranted

Recommendations:

  1. Use standard approaches to get started
  2. Experiment with alternatives
  3. Let performance guide you
  4. Contribute to the research!

References

Tokenization Approaches

Patch-based:

  • Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words" (ViT)
  • Peebles & Xie (2023): "Scalable Diffusion Models with Transformers" (DiT)

Hierarchical:

  • Liu et al. (2021): "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
  • Wang et al. (2021): "Pyramid Vision Transformer"

Learned Tokenization:

  • Bao et al. (2021): "BEiT: BERT Pre-Training of Image Transformers"
  • Esser et al. (2021): "Taming Transformers for High-Resolution Image Synthesis" (VQGAN)
  • Chang et al. (2022): "MaskGIT: Masked Generative Image Transformer"

Gene Expression:

  • Theodoris et al. (2023): "Transfer learning enables predictions in network biology" (Geneformer)
  • Cui et al. (2024): "scGPT: Toward Building a Foundation Model for Single-Cell Multi-omics"

Biological Inspiration

  • Sabour, Frosst & Hinton (2017): "Dynamic Routing Between Capsules" (Capsule Networks)
  • Rao & Ballard (1999): "Predictive coding in the visual cortex"

Discussion Questions

For researchers:

  1. Can we develop a principled theory of tokenization?
  2. Should tokenization be learned end-to-end with the model?
  3. How important is biological plausibility?
  4. Can we unify tokenization across modalities?

For practitioners:

  1. How to choose patch size for my data?
  2. When should I use hierarchical models?
  3. Is learned tokenization worth the complexity?
  4. How to tokenize gene expression data?

For the field:

  1. Are we over-engineering tokenization?
  2. Should we move beyond tokens entirely?
  3. What can biology teach us?
  4. How to balance pragmatism and principles?

Status: Open research area — contribute your ideas!

Last updated: January 13, 2026