This document explores a fundamental challenge in modern deep learning: how to represent continuous numerical values in neural networks, particularly in contexts where tokenization breaks down. We'll examine:
- The problem of numerical representation in language models
- Recent research directions (2024-2026) for numerical embeddings
- Connections to time embedding in diffusion models
- Whether these techniques are relevant for computational biology applications
Language models excel at discrete tokens (words, subwords, characters) but struggle with continuous numerical values. Here's why:
The Tokenization Problem:
- Numbers like `3.14159` get tokenized as `["3", ".", "14", "159"]`
- This destroys numerical relationships: `3.14` and `3.15` are close numerically but may have completely different token sequences (see the toy illustration after this list)
- The model can't learn that `3.14159 ≈ π` or that `1000 > 999` in a meaningful way
- Arithmetic operations become nearly impossible: the model can't reliably compute `2 + 2 = 4`
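To make the first two points concrete, here is a toy illustration with a hypothetical digit-level tokenizer (real subword tokenizers fragment numbers differently, but the effect is similar): token-level similarity does not track numerical distance.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Hypothetical digit-level split; real subword tokenizers fragment
    # numbers in similarly arbitrary ways.
    return re.findall(r"\d|\.", text)

print(toy_tokenize("3.14"))   # ['3', '.', '1', '4']
print(toy_tokenize("3.15"))   # ['3', '.', '1', '5']  differs in one token
print(toy_tokenize("9.14"))   # ['9', '.', '1', '4']  also differs in one token,
                              # yet is ~6 units away instead of 0.01
print(toy_tokenize("999"))    # ['9', '9', '9']
print(toy_tokenize("1000"))   # ['1', '0', '0', '0']  nothing encodes 1000 > 999
```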
The Scale Problem:
- Small numbers (`0.001`) and large numbers (`1,000,000`) are treated as unrelated tokens
- No inherent understanding of magnitude, order, or relationships
- Scientific notation (`1.23e-4`) is even more fragmented
The Precision Problem:
- Floating-point precision is lost in tokenization: `3.141592653589793` and `3.141592653589794` might tokenize identically
- Fine-grained distinctions disappear
Numbers appear everywhere:
- Scientific literature: Measurements, statistics, experimental results
- Code: Variables, constants, calculations
- Financial data: Prices, quantities, percentages
- Biological data: Expression levels, concentrations, measurements
- Time series: Timestamps, durations, intervals
If LLMs can't handle numbers well, they can't truly understand these domains.
Approach: Treat numbers as a special token type with learned embeddings.
Methods:
- Number-aware tokenization: Split numbers into components (integer part, decimal part, exponent) and learn embeddings for each
- Magnitude-aware embeddings: Embed numbers in a way that preserves scale relationships
- Hybrid approaches: Combine tokenization with learned numerical representations
Example Architecture:
```python
import torch
import torch.nn as nn

class NumericalEmbedding(nn.Module):
    """Learned embedding for numerical values."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        # Separate embeddings for different number components
        self.int_embedding = nn.Embedding(10000, embedding_dim)  # Integer part
        self.dec_embedding = nn.Embedding(1000, embedding_dim)   # Decimal part
        self.exp_embedding = nn.Embedding(100, embedding_dim)    # Exponent

    def parse_number(self, number):
        # One illustrative decomposition (the exact scheme is a design choice):
        # integer part, three decimal digits, and a shifted base-10 exponent,
        # each clamped to the vocabulary sizes above
        mag = number.abs().float()
        exp_part = (torch.floor(torch.log10(mag + 1e-12)) + 50).long().clamp(0, 99)
        int_part = mag.long().clamp(0, 9999)
        dec_part = ((mag - mag.floor()) * 1000).long().clamp(0, 999)
        return int_part, dec_part, exp_part

    def forward(self, number):
        # Parse the number into components and sum the component embeddings
        int_part, dec_part, exp_part = self.parse_number(number)
        return (self.int_embedding(int_part) + self.dec_embedding(dec_part)
                + self.exp_embedding(exp_part))
```

Pros:
- Fully learnable, can adapt to task
- Can capture domain-specific numerical patterns
Cons:
- Doesn't generalize to unseen numbers well
- Requires careful design of number decomposition
- Still loses some precision
Approach: Use sinusoidal functions (similar to positional encoding) to represent numbers.
Key Insight: This is exactly the same idea as time embedding in diffusion models!
Mathematical Form:
For a number $n$, the embedding is

$$\gamma(n) = \big[\sin(\omega_1 n),\ \cos(\omega_1 n),\ \ldots,\ \sin(\omega_{d/2}\, n),\ \cos(\omega_{d/2}\, n)\big]$$

where frequencies $\omega_i$ form a geometric series spanning multiple scales (e.g., $\omega_i = 10000^{-2i/d}$, as in the Transformer positional encoding).
Why This Works:
- Bounded: Values stay in $[-1, 1]$, so training is stable
- Smooth: Differentiable, allows interpolation
- Multi-scale: Different frequencies capture different magnitudes
- Relative relationships: Can represent that $n_1$ is close to $n_2$
Connection to Time Embedding:
This is identical to the time embedding used in diffusion models! Time $t$ and a generic number $n$ are both continuous scalars, and both are mapped through the same sinusoidal formula $\gamma(\cdot)$; only the input changes.
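A minimal sketch of this embedding (the helper name `sinusoidal_embedding` and the base 10000 are conventions carried over from Transformer positional encoding, not a fixed standard); later sketches in this document reuse this helper:

```python
import math
import torch

def sinusoidal_embedding(n: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Map scalars n (any shape) to sinusoidal features of shape [..., dim]."""
    half_dim = dim // 2
    # Geometric series of frequencies, as in the Transformer positional encoding
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half_dim, dtype=torch.float32) / (half_dim - 1)
    )
    angles = n[..., None].float() * freqs                              # [..., half_dim]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # [..., dim]
```

With `dim=64`, the frequencies range from 1 down to 1/10000, so the same vector simultaneously encodes coarse magnitude and fine differences.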
Approach: Feed numbers directly as floating-point values, but with special processing.
Methods:
- Normalization: Scale numbers to a standard range (e.g., $[0, 1]$ or $[-1, 1]$)
- Log scaling: Use $\log(n + \epsilon)$ for wide-ranging values
- Quantization: Discretize into bins, then use embeddings
- Feature engineering: Extract magnitude, sign, precision as separate features
Example:
```python
import torch

def encode_number(n: torch.Tensor) -> torch.Tensor:
    """Direct encoding with sign, log scaling, and normalization."""
    sign = torch.sign(n)
    magnitude = n.abs()
    # Log scale to compress wide-ranging values
    log_mag = torch.log(magnitude + 1e-8)
    # Squash to [-1, 1] so downstream layers see bounded inputs
    normalized = torch.tanh(log_mag / 10.0)
    # Two features per value: signed normalized magnitude and raw log magnitude
    return torch.stack([sign * normalized, log_mag], dim=-1)
```

Pros:
- Simple, interpretable
- Preserves exact values (within precision)
- Can handle arbitrary ranges with normalization
Cons:
- Requires careful normalization
- May not capture complex numerical relationships
- Less expressive than learned embeddings
Approach: Extend Rotary Position Embedding (RoPE) to numerical values.
RoPE for Positions: RoPE rotates query/key vectors by an angle proportional to position, preserving relative distances.
RoPE for Numbers: Apply a similar rotation, but with the angle driven by the numerical value $n$: each pair of feature dimensions $(x_{2i}, x_{2i+1})$ is rotated by angle $n\,\omega_i$,

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos(n\omega_i) & -\sin(n\omega_i) \\ \sin(n\omega_i) & \cos(n\omega_i) \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
$$

so the relative rotation between two vectors depends only on the difference of their values.
Why This Might Work:
- Preserves relative relationships: numbers close in value have similar rotations
- Can be applied to attention mechanisms
- Naturally handles different scales through frequency selection
Status (2026): This is an active area of research, with some promising results for mathematical reasoning tasks.
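A minimal sketch of that rotation (a direct adaptation of the RoPE formula with the token position replaced by a numerical value; names and the frequency schedule are illustrative, not taken from a specific paper):

```python
import math
import torch

def rotate_by_value(x: torch.Tensor, n: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs of x by angles proportional to the value n.

    x: [..., dim] query/key vectors (dim must be even)
    n: [...]      numerical value attached to each vector
    """
    half = x.shape[-1] // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = n[..., None].float() * freqs            # [..., half]
    cos, sin = torch.cos(angles), torch.sin(angles)
    # Pairs are formed by splitting the vector in half (equivalent to the
    # adjacent-pair convention up to a permutation of dimensions)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

As with positional RoPE, the relative angle between two rotated vectors depends only on the difference of their values, which is what gives nearby numbers similar attention geometry.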
Approach: Combine multiple methods.
Example Architecture:
```python
import torch
import torch.nn as nn

class HybridNumericalEmbedding(nn.Module):
    """Combines multiple numerical representation strategies.

    Assumes a `SinusoidalEmbedding` module implementing the sinusoidal formula above.
    Output dimension is embedding_dim // 2 + embedding_dim // 2 + embedding_dim // 4.
    """
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.sinusoidal = SinusoidalEmbedding(embedding_dim // 2)
        self.learned = nn.Linear(1, embedding_dim // 2)
        self.magnitude_embedding = nn.Embedding(100, embedding_dim // 4)

    def get_magnitude_bucket(self, n):
        # Coarse bucket from the order of magnitude, shifted and clamped to [0, 99]
        return (torch.floor(torch.log10(n.abs() + 1e-8)) + 50).long().clamp(0, 99)

    def forward(self, n):
        # Sinusoidal component (multi-scale)
        sin_emb = self.sinusoidal(n)
        # Learned component (task-specific)
        learned_emb = self.learned(n.unsqueeze(-1))
        # Magnitude bucket (coarse scale)
        mag_emb = self.magnitude_embedding(self.get_magnitude_bucket(n))
        # Concatenate the three views of the same scalar
        return torch.cat([sin_emb, learned_emb, mag_emb], dim=-1)
```

Time embedding in diffusion models and numerical embeddings in LLMs solve the same fundamental problem: representing a continuous scalar in a way that neural networks can process effectively.
Time Embedding (Diffusion):
- Input: Continuous time $t \in [0, 1]$
- Challenge: Network needs to distinguish $t=0.5$ from $t=0.51$ and learn time-dependent behavior
- Solution: Sinusoidal embedding $\gamma(t)$ with multiple frequencies
Numerical Embedding (LLMs):
- Input: Continuous number $n \in \mathbb{R}$
- Challenge: Network needs to understand $n=3.14$ vs. $n=3.15$ and numerical relationships
- Solution: Similar sinusoidal embedding $\gamma(n)$ with multiple frequencies
- Multi-scale representation: Different frequencies capture different scales
  - Low frequencies: Coarse magnitude (thousands vs. millions)
  - High frequencies: Fine distinctions (3.14 vs. 3.15)
- Bounded and stable: Values in $[-1, 1]$ prevent training instability
- Smooth interpolation: Can represent values between training examples
- Relative relationships: Embeddings for close values are similar
- Time embedding: Usually $t \in [0, 1]$, a known and bounded range
- Numerical embedding: $n$ may span many orders of magnitude and can be negative, so preprocessing matters more
For numerical embeddings, you typically need (the three steps are combined in the sketch after this list):
- Log scaling: $\log(|n| + \epsilon)$ for wide ranges
- Normalization: $\tanh(n / \text{scale})$ to bound values
- Sign handling: Separate representation for positive/negative values
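A minimal sketch that chains these three steps in front of the `sinusoidal_embedding` helper sketched earlier (the scale constant 10.0 is an arbitrary illustrative choice):

```python
import torch

def embed_value(n: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Log-scale, bound, and sign-tag a value, then apply the sinusoidal embedding."""
    log_mag = torch.log(n.abs() + 1e-8)      # log scaling for wide ranges
    bounded = torch.tanh(log_mag / 10.0)     # normalization to [-1, 1]
    sign = torch.sign(n)[..., None]          # separate sign feature
    # Output has dim + 1 features: sinusoidal features plus the sign
    return torch.cat([sinusoidal_embedding(bounded, dim), sign], dim=-1)
```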
- Gene Expression: Expression levels (TPM, FPKM, counts)
- Concentrations: Protein concentrations, drug doses
- Measurements: Cell counts, viability, size
- Time: Time points in time-series experiments
- Coordinates: Genomic positions, spatial coordinates
- Scores: Prediction scores, p-values, fold changes
If you're building a joint latent space (as discussed in joint_latent_space_and_JEPA.md), you need to embed:
- Discrete: Gene IDs, cell types, perturbations
- Continuous: Expression values, concentrations, time
Numerical embedding techniques could help create a unified representation.
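A minimal sketch of one such unified token, combining a learned embedding for the discrete gene ID with a sinusoidal embedding of its continuous expression value (class and argument names are hypothetical; `sinusoidal_embedding` is the helper sketched earlier):

```python
import torch
import torch.nn as nn

class GeneExpressionToken(nn.Module):
    """One token per (gene, expression) pair: discrete ID plus continuous value."""

    def __init__(self, num_genes: int, dim: int = 128):
        super().__init__()
        self.gene_embedding = nn.Embedding(num_genes, dim // 2)  # discrete part
        self.value_dim = dim // 2                                 # continuous part

    def forward(self, gene_ids: torch.Tensor, expression: torch.Tensor) -> torch.Tensor:
        # gene_ids:   [batch, num_tokens] integer gene indices
        # expression: [batch, num_tokens] continuous expression values
        gene_emb = self.gene_embedding(gene_ids)
        value_emb = sinusoidal_embedding(torch.log1p(expression), dim=self.value_dim)
        return torch.cat([gene_emb, value_emb], dim=-1)           # [batch, num_tokens, dim]
```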
For time-series data (Perturb-seq, lineage tracing), you need to embed:
- Time points: $t \in [0, T]$
- Expression values: $x(t) \in \mathbb{R}^d$
Both benefit from sinusoidal embeddings, potentially sharing the same embedding architecture.
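For example, the same helper can embed both quantities (a sketch; shapes and the log transform are illustrative choices):

```python
import torch

def embed_trajectory_point(t: torch.Tensor, x: torch.Tensor, dim: int = 64):
    """Embed a time point t and its expression vector x with the same function.

    t: [batch]            time points in [0, T]
    x: [batch, num_genes] expression values
    """
    t_emb = sinusoidal_embedding(t, dim)               # [batch, dim]
    x_emb = sinusoidal_embedding(torch.log1p(x), dim)  # [batch, num_genes, dim]
    return t_emb, x_emb
```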
In diffusion models for gene expression:
- Time embedding: For the noise level $t$ in the diffusion process
- Expression embedding: For continuous expression values $x$
These could use similar sinusoidal embedding strategies, creating architectural consistency.
If using Transformers for biological sequences:
- Positional encoding: For sequence position
- Numerical encoding: For expression values, scores, measurements
RoPE-style approaches could unify these.
Good for:
- Continuous values that need smooth interpolation
- Values with known ranges (can normalize)
- When you want multi-scale representation
- When you need to generalize to unseen values
Not ideal for:
- Very sparse or discrete values (learned embeddings better)
- Values with complex, non-smooth relationships
- When exact precision is critical (may need direct encoding)
- Normalize your inputs: Scale to a reasonable range before embedding
- Choose frequencies carefully: Too many high frequencies can cause instability
- Combine with learned components: Hybrid approaches often work best
- Test interpolation: Verify that close values have similar embeddings (a quick check is sketched below)
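The last point can be checked directly; a quick sketch using the `sinusoidal_embedding` helper from earlier (the values and the comparison are arbitrary illustrations):

```python
import torch
import torch.nn.functional as F

def check_interpolation():
    emb = lambda v: sinusoidal_embedding(torch.tensor([v]), dim=64)
    close = F.cosine_similarity(emb(3.14), emb(3.15), dim=-1)
    far = F.cosine_similarity(emb(3.14), emb(300.0), dim=-1)
    # Numerically close values should embed more similarly than distant ones
    assert close.item() > far.item()
```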
For gene expression values, the same sinusoidal embedding can be applied after a log transform:

```python
import math
import torch
import torch.nn as nn

class ExpressionEmbedding(nn.Module):
    """Embedding for gene expression values."""
    def __init__(self, embedding_dim=64):
        super().__init__()
        self.embedding_dim = embedding_dim

    def forward(self, expression):
        """
        Args:
            expression: Expression values [batch_size, num_genes]
        Returns:
            embeddings: [batch_size, num_genes, embedding_dim]
        """
        # Log transform (expression is typically log-normal)
        log_expr = torch.log(expression + 1e-8)
        # Normalize to a reasonable range
        normalized = torch.tanh(log_expr / 10.0)
        # Sinusoidal embedding (same as time embedding!)
        return self.sinusoidal_embedding(normalized)

    def sinusoidal_embedding(self, x):
        """Same implementation as the diffusion-model time embedding."""
        half_dim = self.embedding_dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=x.device) * -emb)
        emb = x[..., None] * emb[None, ...]
        return torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
```

- Mathematical reasoning: Improving LLM performance on math problems
- Scientific literature: Better understanding of numerical data in papers
- Code generation: Handling numerical constants and calculations
- Multi-modal learning: Combining numerical and textual information
While specific 2026 papers are still emerging, the following directions are active:
- Number-aware tokenization: Better ways to split numbers into tokens
- Magnitude-preserving embeddings: Maintaining scale relationships
- RoPE extensions: Applying rotary embeddings to numerical values
- Hybrid architectures: Combining multiple representation strategies
- Optimal frequency selection: How to choose $\omega_i$ for different domains?
- Normalization strategies: Best practices for different value ranges?
- Integration with attention: How to effectively use numerical embeddings in Transformers?
- Domain adaptation: Can embeddings learned on one domain transfer to another?
- The Problem: Continuous numerical values are hard for token-based models (LLMs) because tokenization destroys numerical relationships.
- The Solution: Sinusoidal embeddings (like time embedding) provide a natural way to represent continuous values with:
  - Multi-scale representation
  - Smooth interpolation
  - Bounded, stable values
- The Connection: Time embedding in diffusion models and numerical embeddings in LLMs solve the same problem: representing continuous scalars effectively.
- The Relevance: For computational biology, numerical embeddings could help:
  - Create unified representations of discrete and continuous biological data
  - Improve time-series modeling
  - Enhance score networks for biological data
  - Enable better attention mechanisms in biological Transformers
- The Future: This is an active area of research, with promising directions including RoPE-style approaches and hybrid architectures.
- Experiment with sinusoidal embeddings for gene expression values in your diffusion models
- Compare sinusoidal vs. learned vs. direct encoding for your specific biological data
- Explore RoPE-style approaches if using Transformers for biological sequences
- Monitor research in this area—it's rapidly evolving
- Vaswani et al. (2017). "Attention Is All You Need" - Original sinusoidal positional encoding
- Su et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding"
- Active research area (2024-2026) - watch for recent papers on:
- Number-aware tokenization
- Magnitude-preserving embeddings
- Mathematical reasoning improvements
- Time Embedding: docs/diffusion/score_network/time_embedding_and_film.md
- Joint Latent Spaces: docs/incubation/joint_latent_space_and_JEPA.md
- Score Networks: docs/diffusion/score_network/advanced_architectures.md