Status: Active research area (as of January 2026)
Core Question: What is the "right" way to tokenize complex objects (images, gene expression, molecules) for transformer-based generative models?
Transformers operate on sequences of tokens. For natural language and DNA/RNA, tokenization is natural — these are inherently sequential. But for other modalities, tokenization feels contrived and arbitrary.
Patch-based tokenization (16×16, 8×8, etc.) is a pragmatic hack that works, but lacks principled justification.
Engineering Reality: "Does it work?" ✅
Theoretical Satisfaction: "Is it right?" ❌
This document explores why current approaches feel unsatisfying and outlines open research directions.
Standard practice (ViT, DiT, Stable Diffusion 3):
```python
# Split image into fixed-size 16×16 patches
# (image: C×H×W tensor; embed and transformer are assumed callables)
C, H, W = image.shape
patches = image.unfold(1, 16, 16).unfold(2, 16, 16)   # (C, H/16, W/16, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * 16 * 16)
tokens = embed(patches)          # per-patch linear projection
output = transformer(tokens)
```

Q1: Why 16×16?
- Not "right" — just empirically tuned
- Different models use different sizes (2×2, 4×4, 8×8, 14×14, 16×16)
- No principled way to choose
Q2: Should patch size depend on content?
- Medical images (smooth gradients): Large patches OK
- Text images (fine details): Small patches needed
- Satellite images: Depends on scale of features
- Current approach: One size for all!
Q3: Do patches respect semantic boundaries?
- A 16×16 patch might contain:
- Half a face, half background
- Part of an object, part of another
- Arbitrary image regions
- Our visual cortex doesn't work this way
| Patch Size | Pros | Cons |
|---|---|---|
| Small (2×2, 4×4) | Fine details, local structure | More tokens, O(n²) attention cost |
| Large (16×16, 32×32) | Fewer tokens, faster | Loss of detail, coarse representation |
The problem: This is a hyperparameter, not a principled design choice.
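To make the trade-off concrete, here is a tiny dependency-free sketch of how token count and attention cost scale with patch size for a 256×256 image:

```python
# Token count and O(n²) attention cost vs. patch size (256×256 image)
for p in (2, 4, 8, 16, 32):
    n = (256 // p) ** 2   # number of patch tokens
    print(f"{p:>2}×{p:<2} patches: {n:>5} tokens, {n**2:>13,} attention pairs")
```

Halving the patch size quadruples the token count and raises attention cost 16×, which is why the choice ends up tuned per model rather than derived.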
Gene expression vectors:
Properties:
- Unordered: No natural sequence (unlike DNA)
- Dense: Most genes have non-zero expression
- Compositional: Relative values matter
- High-dimensional: 10K-30K genes typical
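For concreteness, a hypothetical single-cell profile might look like this (the normalization below is illustrative, not taken from any specific pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.lognormal(mean=0.0, sigma=2.0, size=20_000)  # ~20K genes, dense, unordered
expr = expr / expr.sum() * 1e4                          # library-size normalization:
log_expr = np.log1p(expr)                               # only relative values matter
```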
Approach 1: Rank by Expression (Geneformer)
```python
import numpy as np

# Geneformer-style: rank genes by expression (gene_ids: assumed id array)
order = np.argsort(gene_expression)[::-1]   # most expressed first
tokens = gene_ids[order]                    # ~20K-token sequence per cell
output = transformer(embed(tokens))
```

Problems:
- Ranking is arbitrary — not biologically grounded
- 20K tokens = very long sequences (n² ≈ 400M attention pairs)
- Genes with equal expression have no well-defined order
- Loses biological structure (pathways, co-regulation)
Approach 2: Gene Modules/Pathways
```python
# Group genes by function into pathway-level tokens
modules = {
    "glycolysis": [gene_1, gene_5, ...],
    "cell_cycle": [gene_2, gene_15, ...],
}
tokens = [module_embeddings]   # ~500 pathway tokens
```

Problems:
- How to define modules? (Also arbitrary!)
- Loses individual gene information
- Ignores within-module correlations
Approach 3: No Explicit Tokenization
```python
# Embed the full expression vector directly into a continuous latent
z = encoder(gene_expression)   # (20000,) → (512,)
output = model(z)              # no tokens!
```

Problems:
- Less interpretable
- Loses biological structure
- Black box
Approach 4: Graph-Structured (GRN-aware)
# Use gene regulatory network
grn = load_gene_regulatory_network()
output = graph_transformer(gene_expression, grn)Problems:
- GRN knowledge incomplete
- Still 20K nodes to handle
- Which GRN to use?
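As an illustration of what "GRN-aware" attention could mean in practice (a sketch, not the method of any particular paper): mask self-attention so each gene attends only to its regulatory neighbors.

```python
import torch
import torch.nn.functional as F

def grn_masked_attention(x, adj):
    """x: (N, D) per-gene features; adj: (N, N) boolean GRN adjacency."""
    adj = adj | torch.eye(adj.size(0), dtype=torch.bool)  # self-loops: no empty rows
    scores = (x @ x.T) / x.shape[-1] ** 0.5               # (N, N) attention logits
    scores = scores.masked_fill(~adj, float("-inf"))      # cut non-edges
    return F.softmax(scores, dim=-1) @ x                  # neighborhood-weighted mix
```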
There is no natural "tokenization" for gene expression.
Unlike images (spatial structure) or language (sequential structure), gene expression is:
- A set (unordered)
- A vector (continuous)
- A network (interconnected)
Forcing it into a sequence feels wrong because it is wrong.
1. Computational Efficiency
256×256 image = 65,536 pixels
With 16×16 patches → 256 tokens
Attention pairs: 65,536² → 256² (a 65,536× reduction)
2. Transfer from NLP
- Transformers proven for sequences
- Patches make images "sequence-like"
- Can reuse architectures
3. Good Enough in Practice
- ImageNet SOTA achieved
- Stable Diffusion works
- Empirical success
4. Implementation Simplicity
- Easy to code
- GPU-efficient
- Standard operations
Engineering success ≠ Principled design
The field has optimized for what works, not what makes sense.
Idea: Learn local semantics first, then group into "super tokens"
Swin Transformer (2021):
```
Image → Small patches → Local attention → Merge
                      ↓
        "Super tokens" (hierarchical)
                      ↓
              Global attention
```
Status: Works well, but still uses fixed patch sizes at each level.
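A minimal sketch of the central operation, Swin-style patch merging: each 2×2 neighborhood of tokens is concatenated and projected into one coarser "super token" (window attention and the rest of the model omitted):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style merging: fuse each 2×2 group of tokens into one coarser token."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(4 * dim, 2 * dim)   # concat 4 neighbors, project

    def forward(self, x):                          # x: (B, H, W, C) token grid
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.proj(x)                        # (B, H/2, W/2, 2C)
```

Quartering the token count while doubling the channel width is what keeps the hierarchy computationally viable.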
Idea: Don't fix patch size — learn how to tokenize!
BEiT, VQGAN, MaskGIT:
```python
# Instead of fixed patches...
tokens = split_into_patches(image, size=16)   # fixed grid
# ...learn the tokenization
tokens = learned_tokenizer(image)             # content-adaptive
```

Advantages:
- Content-aware
- Can adapt to different regions
- Potentially more semantic
Challenges:
- How to train the tokenizer?
- Discrete vs continuous tokens?
- Computational cost
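For the discrete variant, the core of a VQGAN/MaskGIT-style tokenizer is vector quantization: snap each continuous feature to its nearest codebook entry and use the index as the token. A minimal sketch (encoder, decoder, and codebook losses omitted):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):   # z: (B, N, dim) continuous encoder features
        w = self.codebook.weight
        # squared distances to every code: (B, N, num_codes)
        d = z.pow(2).sum(-1, keepdim=True) - 2 * z @ w.T + w.pow(2).sum(-1)
        ids = d.argmin(dim=-1)         # discrete token ids (B, N)
        z_q = self.codebook(ids)       # quantized vectors
        z_q = z + (z_q - z).detach()   # straight-through gradient to encoder
        return ids, z_q
```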
Idea: Use CNNs for local features, Transformers for global
```python
import torch.nn as nn

class HybridModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_stem = ResNet(...)          # CNN extracts local semantics
        self.transformer = Transformer(...)   # Transformer on CNN features

    def forward(self, x):
        return self.transformer(self.conv_stem(x))
```

Status: Used in some models, but not standard for DiT.
Idea: Work directly in continuous space
For images:
- Latent diffusion (VAE → continuous latent → diffusion)
- No explicit tokens
For gene expression:
- Direct MLP/attention on expression vector
- Treat as continuous state, not sequence
Advantage: No arbitrary discretization
Disadvantage: May lose interpretability
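A hedged sketch of the token-free route, showing only the forward-noising step of diffusion in a continuous latent; `z` stands in for whatever `encoder(x)` produces (a VAE latent for images, an MLP embedding for expression vectors):

```python
import torch

def add_noise(z, t, T=1000):
    """Forward diffusion with a linear beta schedule: z_t = √ᾱ·z + √(1−ᾱ)·ε."""
    beta = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - beta, dim=0)[t]
    eps = torch.randn_like(z)
    return alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * eps, eps

z = torch.randn(8, 512)          # stand-in for encoder(x): continuous, no tokens
z_t, eps = add_noise(z, t=500)   # noised latent the denoiser learns to invert
```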
Retina → V1 (edges) → V2 (contours) → V3 (form) → V4 (color, shape) → IT (objects)
Key properties:
- Hierarchical: Simple → complex features
- Local receptive fields that grow
- Specialization: Different areas for different features
- Sparse coding: Neurons fire selectively
- Feedback: Top-down and bottom-up
| Aspect | Biology | Patch-based DiT |
|---|---|---|
| Hierarchy | Yes (V1→V2→V3→V4) | Flat (all patches equal) |
| Local first | Yes (small receptive fields) | No (global attention) |
| Adaptive | Yes (attention, feedback) | No (fixed patches) |
| Sparse | Yes (selective firing) | No (dense attention) |
Conclusion: Current approaches are not biologically inspired.
Two perspectives:
Pragmatic: "Biology is slow, backprop works, patches work — who cares?"
- Valid for engineering
- Gets SOTA results
Principled: "Understanding biology might lead to better architectures"
- Valid for research
- May unlock new capabilities
Reality: Field is mostly pragmatic (for now).
Q1: What is the optimal tokenization strategy?
- Fixed patches? Learned? Hierarchical?
- Content-adaptive?
- Task-specific?
Q2: Can we learn tokenization end-to-end?
- Jointly with the generative model?
- Discrete vs continuous?
Q3: How important is biological plausibility?
- Should we model V1→V2→V3→V4?
- Or is attention enough?
Q4: What is the "right" representation?
- Tokens (if so, what kind)?
- Continuous embeddings?
- Graph structure?
Q5: Should tokenization respect biological structure?
- Gene modules/pathways?
- Regulatory networks?
- Or learn from data?
Q6: How to handle high dimensionality?
- 20K genes → how many tokens?
- Latent space diffusion?
- Hierarchical representation?
Q7: Is tokenization necessary at all?
- Can we do generative modeling without tokens?
- Continuous-space alternatives?
Q8: Should tokenization be modality-specific?
- Images: Patches
- Audio: Time patches
- Gene expression: ???
- Or unified approach?
Q9: How to evaluate tokenization quality?
- Reconstruction error? (a minimal proxy is sketched below)
- Downstream task performance?
- Interpretability?
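The first criterion is the easiest to operationalize. A hedged sketch of a reconstruction-error probe, where `tokenize` and `detokenize` are hypothetical stand-ins for any tokenizer pair under test:

```python
import torch

def reconstruction_error(x, tokenize, detokenize):
    """MSE surviving a tokenize → detokenize round trip (lower is better)."""
    return torch.mean((x - detokenize(tokenize(x))) ** 2).item()
```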
For images:
- Fixed patches (8×8, 16×16) are standard
- Empirically tuned per model
- Stable Diffusion 3, Sora use patch-based approaches
For gene expression:
- Multiple approaches being explored
- No clear winner yet
- Geneformer (ranking), scPPDM (tabular), others
Active areas:
- Learned tokenization (VQ-VAE, MaskGIT)
- Hierarchical models (Swin, PVT)
- Hybrid CNN-Transformer
- Graph-structured attention
- Continuous-space alternatives
Open problems:
- Principled way to choose patch size
- Optimal tokenization for non-image modalities
- Whether biological inspiration helps
- Unified tokenization across modalities
Current best practice:
```python
# Use empirically tuned patch sizes
patch_size = 8   # for 256×256 images (DiT-XL/8)
# or
patch_size = 4   # higher quality, more compute
```

Experiment with:
- Different patch sizes for your data
- Hierarchical approaches if quality matters
- Latent diffusion (VAE + diffusion) to avoid tokenization
Recommended approach (as of 2026):
```python
# Option 1: no explicit tokenization
z = encoder(gene_expression)   # (20000,) → (512,)
output = diffusion_model(z)

# Option 2: biologically structured
grn = load_gene_regulatory_network()
output = graph_diffusion(gene_expression, grn)

# Option 3: learned modules
modules = learn_gene_modules(data)                    # data-driven grouping
tokens = embed_by_modules(gene_expression, modules)
output = transformer(tokens)
```

Then:
- Compare approaches empirically
- Publish ablation study
- Let performance guide you
Start simple:
- Use standard approaches (patches for images, embeddings for other)
- Get baseline working
- Then experiment with alternatives
Don't overthink:
- If patches work for your task, use them
- Principled design is nice, but results matter
But do explore:
- This is an open research area
- Novel tokenization strategies could be publishable
- Especially for non-image modalities
Likely developments:
- More learned tokenization methods
- Better hierarchical models
- Modality-specific tokenization strategies
- Improved understanding of why patches work
Possible breakthroughs:
- Unified tokenization framework
- Biologically-inspired alternatives that match SOTA
- Continuous-space generative models (no tokens)
- Neural architecture search for tokenization
Speculative:
- Fundamental rethinking of tokenization
- New architectures that don't need tokens
- True biological plausibility
- Modality-agnostic generative models
Engineering Pragmatism vs Principled Design
"Does it work?" vs "Is it right?"
Empirical tuning vs Theory-driven
Fast iteration vs Deep understanding
Current state: Pragmatism dominates
- Patch sizes: Empirically tuned
- Architecture choices: What works on benchmarks
- Limited theoretical understanding
Future direction: More principled approaches
- Understanding WHY things work
- Biologically-inspired designs
- Learned, adaptive strategies
For science:
- Understanding principles leads to better models
- Biological inspiration may unlock new capabilities
- Theory guides experimentation
For engineering:
- Principled designs generalize better
- Less hyperparameter tuning
- More robust to distribution shift
For biology applications:
- Gene expression needs better representations
- Biological structure should inform design
- Interpretability matters
The honest assessment:
Patch-based tokenization is arbitrary and unnatural.
- 16×16 is not "right" — it's empirically tuned
- Should vary by resolution, task, data
- Doesn't respect semantic boundaries
- Not biologically inspired
But it works.
- Achieves SOTA on many tasks
- Computationally efficient
- Easy to implement
For gene expression, it's even worse.
- No natural tokenization exists
- Current approaches are hacks
- Open research problem
The field is still figuring this out.
- Active research area
- No consensus
- Your skepticism is warranted
Recommendations:
- Use standard approaches to get started
- Experiment with alternatives
- Let performance guide you
- Contribute to the research!
Patch-based:
- Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words" (ViT)
- Peebles & Xie (2023): "Scalable Diffusion Models with Transformers" (DiT)
Hierarchical:
- Liu et al. (2021): "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
- Wang et al. (2021): "Pyramid Vision Transformer"
Learned Tokenization:
- Bao et al. (2021): "BEiT: BERT Pre-Training of Image Transformers"
- Esser et al. (2021): "Taming Transformers for High-Resolution Image Synthesis" (VQGAN)
- Chang et al. (2022): "MaskGIT: Masked Generative Image Transformer"
Gene Expression:
- Theodoris et al. (2023): "Transfer learning enables predictions in network biology" (Geneformer)
- Cui et al. (2024): "scGPT: Toward Building a Foundation Model for Single-Cell Multi-omics"
Biologically-Inspired:
- Sabour, Frosst & Hinton (2017): "Dynamic Routing Between Capsules" (Capsule Networks)
- Rao & Ballard (1999): "Predictive coding in the visual cortex"
For researchers:
- Can we develop a principled theory of tokenization?
- Should tokenization be learned end-to-end with the model?
- How important is biological plausibility?
- Can we unify tokenization across modalities?
For practitioners:
- How to choose patch size for my data?
- When should I use hierarchical models?
- Is learned tokenization worth the complexity?
- How to tokenize gene expression data?
For the field:
- Are we over-engineering tokenization?
- Should we move beyond tokens entirely?
- What can biology teach us?
- How to balance pragmatism and principles?
Status: Open research area — contribute your ideas!
Last updated: January 13, 2026