
Open Research: Tokenization in Diffusion Transformers

Status: Active research area (as of January 2026)

Core Question: What is the "right" way to tokenize complex objects (images, gene expression, molecules) for transformer-based generative models?


The Problem

Transformers operate on sequences of tokens. For natural language and DNA/RNA, tokenization is natural — these are inherently sequential. But for other modalities, tokenization feels contrived and arbitrary.

The Uncomfortable Truth

Patch-based tokenization (16×16, 8×8, etc.) is a pragmatic hack that works, but lacks principled justification.

Engineering Reality: "Does it work?" ✅
Theoretical Satisfaction: "Is it right?" ❌

This document explores why current approaches feel unsatisfying and outlines open research directions.


1. Images: The Patch Problem

Current Approach

Standard practice (ViT, DiT, Stable Diffusion 3):

# Split a (B, C, H, W) image into non-overlapping 16×16 patches
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)                    # (B, C, H/16, W/16, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)   # (B, N, C*16*16)
tokens = embed(patches)        # linear projection to the model dimension
output = transformer(tokens)
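
In practice, ViT- and DiT-style codebases usually implement this patchify step as a strided convolution, which cuts and projects patches in a single operation. A minimal sketch, assuming a (B, 3, 256, 256) input (the PatchEmbed name and sizes are illustrative, not any specific library's API):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify via a convolution with kernel = stride = patch size."""
    def __init__(self, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 256, 256))  # → (1, 256, 768)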

Why This Feels Wrong

Q1: Why 16×16?

  • Not "right" — just empirically tuned
  • Different models use different sizes (2×2, 4×4, 8×8, 14×14, 16×16)
  • No principled way to choose

Q2: Should patch size depend on content?

  • Medical images (smooth gradients): Large patches OK
  • Text images (fine details): Small patches needed
  • Satellite images: Depends on scale of features
  • Current approach: One size for all!

Q3: Do patches respect semantic boundaries?

  • A 16×16 patch might contain:
    • Half a face, half background
    • Part of an object, part of another
    • Arbitrary image regions
  • Our visual cortex doesn't work this way

Trade-offs

| Patch Size | Pros | Cons |
|---|---|---|
| Small (2×2, 4×4) | Fine details, local structure | More tokens, O(n²) attention cost |
| Large (16×16, 32×32) | Fewer tokens, faster | Loss of detail, coarse representation |

The problem: This is a hyperparameter, not a principled design choice.


2. Gene Expression: Even Less Obvious

Gene expression vectors: $x \in \mathbb{R}^{20000}$ (20K genes)

Properties:

  • Unordered: No natural sequence (unlike DNA)
  • Dense: Most genes have non-zero expression
  • Compositional: Relative values matter
  • High-dimensional: 10K-30K genes typical

Current Approaches (2023-2026)

Approach 1: Rank by Expression (Geneformer)

import numpy as np

# Rank-value encoding: order gene IDs by expression, highest first
# (Geneformer additionally normalizes each gene by its corpus-wide median before ranking)
order = np.argsort(gene_expression)[::-1]   # gene indices, most expressed first
tokens = [gene_ids[i] for i in order]       # one token per gene → ~20,000 tokens
output = transformer(tokens)

Problems:

  • Ranking is arbitrary — not biological
  • 20K tokens = huge sequences (O(n²) = 400M operations)
  • What about genes with same expression?
  • Loses biological structure

Approach 2: Gene Modules/Pathways

# Group genes by function
modules = {
    "glycolysis": [gene_1, gene_5, ...],
    "cell_cycle": [gene_2, gene_15, ...],
}
tokens = [module_embeddings]  # ~500 pathways
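
One simple way to turn such modules into tokens is to pool expression within each predefined pathway and embed one value per module; a minimal sketch, where pathway_to_genes and embed are hypothetical helpers:

import numpy as np

def module_tokens(expr, pathway_to_genes, embed):
    """expr: (n_genes,) expression vector; pathway_to_genes: dict mapping pathway → gene indices."""
    pooled = np.array([expr[idx].mean() for idx in pathway_to_genes.values()])  # (~500,)
    return embed(pooled)   # embed each pooled pathway value → one token per pathway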

Problems:

  • How to define modules? (Also arbitrary!)
  • Loses individual gene information
  • Ignores within-module correlations

Approach 3: No Explicit Tokenization

# Direct embedding to latent space
z = encoder(gene_expression)  # (20000,) → (512,)
output = model(z)  # No tokens!

Problems:

  • Less interpretable
  • Loses biological structure
  • Black box

Approach 4: Graph-Structured (GRN-aware)

# Use gene regulatory network
grn = load_gene_regulatory_network()
output = graph_transformer(gene_expression, grn)
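
A common way to make attention "GRN-aware" is to mask it so each gene attends only to its regulators and targets. A minimal sketch under that assumption (grn_adj is a hypothetical boolean adjacency matrix, assumed to include self-loops so no row is fully masked):

import torch
import torch.nn.functional as F

def grn_masked_attention(q, k, v, grn_adj):
    """q, k, v: (n_genes, d); grn_adj: (n_genes, n_genes) bool, True where an edge or self-loop exists."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (n_genes, n_genes)
    scores = scores.masked_fill(~grn_adj, float("-inf"))     # genes attend only along GRN edges
    return F.softmax(scores, dim=-1) @ v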

Problems:

  • GRN knowledge incomplete
  • Still 20K nodes to handle
  • Which GRN to use?

The Core Issue

There is no natural "tokenization" for gene expression.

Unlike images (spatial structure) or language (sequential structure), gene expression is:

  • A set (unordered)
  • A vector (continuous)
  • A network (interconnected)

Forcing it into a sequence feels wrong because it is wrong.


3. Why Do Patches Work Despite Being Arbitrary?

Pragmatic Reasons

1. Computational Efficiency

256×256 image = 65,536 pixels
With 16×16 patches = 256 tokens
Attention: 65,536² → 256² (65,000× reduction!)
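
The arithmetic is easy to verify; a quick sketch of how token count and pairwise attention cost scale with patch size for a 256×256 image:

image_size = 256
for patch in (4, 8, 16, 32):
    n_tokens = (image_size // patch) ** 2
    print(f"patch {patch:>2}: {n_tokens:>5} tokens, attention pairs = {n_tokens**2:,}")
# patch 16 → 256 tokens and 65,536 attention pairs, vs 65,536² ≈ 4.3e9 for per-pixel tokens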

2. Transfer from NLP

  • Transformers proven for sequences
  • Patches make images "sequence-like"
  • Can reuse architectures

3. Good Enough in Practice

  • ImageNet SOTA achieved
  • Stable Diffusion works
  • Empirical success

4. Implementation Simplicity

  • Easy to code
  • GPU-efficient
  • Standard operations

But This Doesn't Make It "Right"

Engineering success ≠ Principled design

The field has optimized for what works, not what makes sense.


4. Alternative Approaches (Research Frontiers)

4.1 Hierarchical Tokenization

Idea: Learn local semantics first, then group into "super tokens"

Swin Transformer (2021):

Image → Small patches → Local attention → Merge
              ↓
      "Super tokens" (hierarchical)
              ↓
      Global attention

Status: Works well, but still uses fixed patch sizes at each level.
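
The mechanism that builds the hierarchy is patch merging between stages: each 2×2 neighborhood of tokens is concatenated and linearly projected to twice the channel width. A minimal sketch following the paper's description (not the exact reference implementation):

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style merging: concatenate each 2×2 neighborhood, then project 4C → 2C."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                      # x: (B, H, W, C), H and W even
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(x)               # (B, H/2, W/2, 2C)

Stacking a few of these stages gives the "super token" hierarchy sketched above, though the 2×2 grouping is still a fixed, content-agnostic choice.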

4.2 Learned Tokenization

Idea: Don't fix patch size — learn how to tokenize!

BEiT, VQGAN, MaskGIT:

# Instead of fixed patches
tokens = split_into_patches(image, size=16)  # Fixed

# Learn tokenization
tokens = learned_tokenizer(image)  # Adaptive!

Advantages:

  • Content-aware
  • Can adapt to different regions
  • Potentially more semantic

Challenges:

  • How to train the tokenizer?
  • Discrete vs continuous tokens?
  • Computational cost
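
For the discrete route (VQGAN, MaskGIT), the encoder's features are snapped to the nearest entry of a learned codebook, so the "tokens" are code indices rather than fixed patches. A minimal sketch of that quantization step, with illustrative codebook size and dimensions:

import torch

def vq_quantize(features, codebook):
    """features: (N, d) encoder outputs; codebook: (K, d) learned embeddings."""
    dists = torch.cdist(features, codebook)   # (N, K) pairwise distances
    indices = dists.argmin(dim=-1)            # discrete token ids
    return codebook[indices], indices         # quantized features + token sequence

codebook = torch.randn(1024, 256)             # K = 1024 codes of dimension 256
quantized, token_ids = vq_quantize(torch.randn(196, 256), codebook)

Training such a codebook end-to-end needs extra machinery (e.g. a straight-through estimator or EMA codebook updates), which is part of why learned tokenization costs more than fixed patches.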

4.3 Convolutional Stem

Idea: Use CNNs for local features, Transformers for global

import torch.nn as nn

class HybridModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_stem = ResNet(...)          # CNN extracts local semantics
        self.transformer = Transformer(...)   # Transformer on CNN features

    def forward(self, x):
        feats = self.conv_stem(x)                   # (B, C, H', W') feature map
        tokens = feats.flatten(2).transpose(1, 2)   # (B, H'*W', C) token sequence
        return self.transformer(tokens)

Status: Used in some models, but not standard for DiT.

4.4 No Tokenization (Continuous)

Idea: Work directly in continuous space

For images:

  • Latent diffusion (VAE → continuous latent → diffusion)
  • No explicit tokens

For gene expression:

  • Direct MLP/attention on expression vector
  • Treat as continuous state, not sequence

Advantage: No arbitrary discretization

Disadvantage: May lose interpretability
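
As a concrete example of the continuous route for gene expression, the whole vector can be treated as a single state and denoised by an MLP conditioned on the diffusion time; a minimal sketch with hypothetical names and sizes, not a published model:

import torch
import torch.nn as nn

class ExpressionDenoiser(nn.Module):
    """Predicts the noise added to a gene-expression vector at diffusion time t."""
    def __init__(self, n_genes=20000, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, n_genes),
        )

    def forward(self, x_t, t):                        # x_t: (B, n_genes), t: (B, 1)
        return self.net(torch.cat([x_t, t], dim=-1))  # no tokenization anywhere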


5. Biological Inspiration: How Should We Think About This?

How Visual Cortex Works

Retina → V1 (edges) → V2 (contours) → V3 (form) → V4 (color, shape) → IT (objects)

Key properties:

  1. Hierarchical: Simple → complex features
  2. Local receptive fields that grow
  3. Specialization: Different areas for different features
  4. Sparse coding: Neurons fire selectively
  5. Feedback: Top-down and bottom-up

Current Models vs Biology

| Aspect | Biology | Patch-based DiT |
|---|---|---|
| Hierarchy | Yes (V1→V2→V3→V4) | Flat (all patches equal) |
| Local first | Yes (small receptive fields) | No (global attention) |
| Adaptive | Yes (attention, feedback) | No (fixed patches) |
| Sparse | Yes (selective firing) | No (dense attention) |

Conclusion: Current approaches are not biologically inspired.

Should We Care?

Two perspectives:

Pragmatic: "Biology is slow, backprop works, patches work — who cares?"

  • Valid for engineering
  • Gets SOTA results

Principled: "Understanding biology might lead to better architectures"

  • Valid for research
  • May unlock new capabilities

Reality: Field is mostly pragmatic (for now).


6. Open Research Questions

For Images

Q1: What is the optimal tokenization strategy?

  • Fixed patches? Learned? Hierarchical?
  • Content-adaptive?
  • Task-specific?

Q2: Can we learn tokenization end-to-end?

  • Jointly with the generative model?
  • Discrete vs continuous?

Q3: How important is biological plausibility?

  • Should we model V1→V2→V3→V4?
  • Or is attention enough?

For Gene Expression

Q4: What is the "right" representation?

  • Tokens (if so, what kind)?
  • Continuous embeddings?
  • Graph structure?

Q5: Should tokenization respect biological structure?

  • Gene modules/pathways?
  • Regulatory networks?
  • Or learn from data?

Q6: How to handle high dimensionality?

  • 20K genes → how many tokens?
  • Latent space diffusion?
  • Hierarchical representation?

General Questions

Q7: Is tokenization necessary at all?

  • Can we do generative modeling without tokens?
  • Continuous-space alternatives?

Q8: Should tokenization be modality-specific?

  • Images: Patches
  • Audio: Time patches
  • Gene expression: ???
  • Or unified approach?

Q9: How to evaluate tokenization quality?

  • Reconstruction error?
  • Downstream task performance?
  • Interpretability?

7. Current State of the Field (January 2026)

What's Working

For images:

  • Fixed patches (8×8, 16×16) are standard
  • Empirically tuned per model
  • Stable Diffusion 3, Sora use patch-based approaches

For gene expression:

  • Multiple approaches being explored
  • No clear winner yet
  • Geneformer (ranking), scPPDM (tabular), others

What's Being Researched

Active areas:

  1. Learned tokenization (VQ-VAE, MaskGIT)
  2. Hierarchical models (Swin, PVT)
  3. Hybrid CNN-Transformer
  4. Graph-structured attention
  5. Continuous-space alternatives

What's Still Unknown

Open problems:

  • Principled way to choose patch size
  • Optimal tokenization for non-image modalities
  • Whether biological inspiration helps
  • Unified tokenization across modalities

8. Recommendations for Practitioners

For Image Generation (DiT)

Current best practice:

# Use empirically tuned patch sizes (DiT patchifies the VAE latent, not raw pixels)
patch_size = 8  # e.g. DiT-XL/8 for 256×256 images (32×32 latents → 16 tokens)
# or
patch_size = 4  # smaller patches → more tokens, higher quality, more compute

Experiment with:

  • Different patch sizes for your data
  • Hierarchical approaches if quality matters
  • Latent diffusion (VAE + diffusion) to avoid tokenization

For Gene Expression

Recommended approach (as of 2026):

# Option 1: No explicit tokenization
z = encoder(gene_expression)  # (20000,) → (512,)
output = diffusion_model(z)

# Option 2: Biologically-structured
grn = load_gene_regulatory_network()
output = graph_diffusion(gene_expression, grn)

# Option 3: Learned modules
modules = learn_gene_modules(data)  # Data-driven
tokens = embed_by_modules(gene_expression, modules)
output = transformer(tokens)

Then:

  • Compare approaches empirically
  • Publish ablation study
  • Let performance guide you

General Advice

Start simple:

  1. Use standard approaches (patches for images, embeddings for other modalities)
  2. Get baseline working
  3. Then experiment with alternatives

Don't overthink:

  • If patches work for your task, use them
  • Principled design is nice, but results matter

But do explore:

  • This is an open research area
  • Novel tokenization strategies could be publishable
  • Especially for non-image modalities

9. Future Directions

Near-term (2026-2027)

Likely developments:

  1. More learned tokenization methods
  2. Better hierarchical models
  3. Modality-specific tokenization strategies
  4. Improved understanding of why patches work

Medium-term (2027-2029)

Possible breakthroughs:

  1. Unified tokenization framework
  2. Biologically-inspired alternatives that match SOTA
  3. Continuous-space generative models (no tokens)
  4. Neural architecture search for tokenization

Long-term (2029+)

Speculative:

  1. Fundamental rethinking of tokenization
  2. New architectures that don't need tokens
  3. True biological plausibility
  4. Modality-agnostic generative models

10. The Bigger Picture

The Tension

Engineering Pragmatism     vs     Principled Design
"Does it work?"            vs     "Is it right?"
Empirical tuning           vs     Theory-driven
Fast iteration             vs     Deep understanding

Current state: Pragmatism dominates

  • Patch sizes: Empirically tuned
  • Architecture choices: What works on benchmarks
  • Limited theoretical understanding

Future direction: More principled approaches

  • Understanding WHY things work
  • Biologically-inspired designs
  • Learned, adaptive strategies

Why This Matters

For science:

  • Understanding principles leads to better models
  • Biological inspiration may unlock new capabilities
  • Theory guides experimentation

For engineering:

  • Principled designs generalize better
  • Less hyperparameter tuning
  • More robust to distribution shift

For biology applications:

  • Gene expression needs better representations
  • Biological structure should inform design
  • Interpretability matters

11. Conclusion

The honest assessment:

Patch-based tokenization is arbitrary and unnatural.

  • 16×16 is not "right" — it's empirically tuned
  • Should vary by resolution, task, data
  • Doesn't respect semantic boundaries
  • Not biologically inspired

But it works.

  • Achieves SOTA on many tasks
  • Computationally efficient
  • Easy to implement

For gene expression, it's even worse.

  • No natural tokenization exists
  • Current approaches are hacks
  • Open research problem

The field is still figuring this out.

  • Active research area
  • No consensus
  • Your skepticism is warranted

Recommendations:

  1. Use standard approaches to get started
  2. Experiment with alternatives
  3. Let performance guide you
  4. Contribute to the research!

References

Tokenization Approaches

Patch-based:

  • Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words" (ViT)
  • Peebles & Xie (2023): "Scalable Diffusion Models with Transformers" (DiT)

Hierarchical:

  • Liu et al. (2021): "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
  • Wang et al. (2021): "Pyramid Vision Transformer"

Learned Tokenization:

  • Bao et al. (2021): "BEiT: BERT Pre-Training of Image Transformers"
  • Esser et al. (2021): "Taming Transformers for High-Resolution Image Synthesis" (VQGAN)
  • Chang et al. (2022): "MaskGIT: Masked Generative Image Transformer"

Gene Expression:

  • Theodoris et al. (2023): "Transfer learning enables predictions in network biology" (Geneformer)
  • Cui et al. (2024): "scGPT: Toward Building a Foundation Model for Single-Cell Multi-omics"

Biological Inspiration

  • Sabour, Frosst & Hinton (2017): "Dynamic Routing Between Capsules" (Capsule Networks)
  • Rao & Ballard (1999): "Predictive coding in the visual cortex"

Discussion Questions

For researchers:

  1. Can we develop a principled theory of tokenization?
  2. Should tokenization be learned end-to-end with the model?
  3. How important is biological plausibility?
  4. Can we unify tokenization across modalities?

For practitioners:

  1. How to choose patch size for my data?
  2. When should I use hierarchical models?
  3. Is learned tokenization worth the complexity?
  4. How to tokenize gene expression data?

For the field:

  1. Are we over-engineering tokenization?
  2. Should we move beyond tokens entirely?
  3. What can biology teach us?
  4. How to balance pragmatism and principles?

Status: Open research area — contribute your ideas!

Last updated: January 13, 2026