Embedding models convert text into high-dimensional vectors. In late 2025, we have moved beyond "static vectors" to multi-resolution and late-interaction representations.
- The Embedding Frontier (Matryoshka)
- Late Interaction (ColBERT v2)
- Binary and Int8 Quantization
- Model Selection Criteria
- Multimodal Embeddings (Vision + Text)
- Interview Questions
- References
Traditionally, if you embedded text into 1,536 dimensions, you were stuck using all 1,536 dimensions for search.
2025 Innovation: Matryoshka Representation Learning (MRL)
- Models are trained to "store" the most important info in the first few dimensions.
- The Win: You can embed at 1,536 dims but index only the first 64 for a fast first pass, then refine the top results with the full 1,536 dims (a two-pass scheme, sketched below).
- Efficiency: roughly a 20x reduction in memory/index size for under a 2% drop in accuracy.
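A minimal two-pass sketch of this idea in NumPy. Random vectors stand in for real MRL embeddings, and the corpus size, candidate count, and 64/1,536 dimension split are illustrative assumptions:

```python
import numpy as np

def truncate_normalize(vecs, dims):
    """Keep the first `dims` Matryoshka dimensions, then re-normalize
    so dot products behave like cosine similarity."""
    v = vecs[..., :dims]
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

corpus = np.random.randn(100_000, 1536).astype(np.float32)  # stand-in embeddings
query = np.random.randn(1536).astype(np.float32)

# Pass 1: coarse search over the first 64 dims only.
scores_64 = truncate_normalize(corpus, 64) @ truncate_normalize(query, 64)
candidates = np.argsort(scores_64)[::-1][:1000]

# Pass 2: rescore just the top 1,000 candidates with all 1,536 dims.
scores_full = truncate_normalize(corpus[candidates], 1536) @ truncate_normalize(query, 1536)
top10 = candidates[np.argsort(scores_full)[::-1][:10]]
```

Note that only the 1,000 surviving candidates ever need their full-resolution vectors loaded, which is where the memory savings come from.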
Standard embedding models are "bi-encoders": they produce one vector per chunk. ColBERT (Contextualized Late Interaction over BERT) instead uses a token-level approach.
- How: Instead of 1 vector per chunk, ColBERT stores 1 vector per token.
- Interaction: At query time, the model compares every token in your query to every token in each candidate document (the "MaxSim" operation, sketched after this list).
- 2025 Status: ColBERT v2 compresses the token-level index drastically (residual compression plus PLAID indexing), making it feasible for production. It achieves much higher precision on "needle in a haystack" technical queries.
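For intuition, here is the MaxSim scoring step in isolation as a small NumPy sketch. The token counts are arbitrary, the 128-dim token embeddings mirror the ColBERT papers, and a real deployment would use the reference ColBERT implementation with a PLAID index rather than this loop-free toy:

```python
import numpy as np

def maxsim(query_toks, doc_toks):
    """ColBERT-style late interaction: build the similarity matrix between
    every query token and every doc token; each query token keeps its best
    match; the document's score is the sum of those maxima."""
    sim = query_toks @ doc_toks.T          # (n_q, dim) @ (dim, n_d) -> (n_q, n_d)
    return float(sim.max(axis=1).sum())    # max over doc tokens, sum over query tokens

# Unit-normalized stand-in token embeddings.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.standard_normal((200, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```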
Storing float32 vectors is expensive. In 2025, we use in-model quantization: models are trained so their embeddings stay accurate after aggressive quantization.
- Binary Embeddings: Convert each dimension to a single bit (1 for positive values, 0 otherwise); see the sketch after this list.
- Memory: 32x reduction versus float32.
- Speed: Hamming distance (XOR plus a popcount) is ~10x faster than cosine similarity on modern CPUs.
- Int8/Binary: offered natively by some providers; Cohere's Embed v3, for example, can return int8 or binary vectors directly from the API.
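A sketch of sign-based binary quantization and Hamming-distance search in NumPy (the vector count and dimensionality are arbitrary; a real system would binarize model outputs, not random floats):

```python
import numpy as np

def binarize(vecs):
    """Sign-based binary quantization: 1 bit per dimension,
    packed 8 dims per byte (32x smaller than float32)."""
    return np.packbits(vecs > 0, axis=-1)

def hamming(codes, query_code):
    """Hamming distance via XOR + bit count over packed bytes."""
    return np.unpackbits(codes ^ query_code, axis=-1).sum(axis=-1)

corpus = np.random.randn(10_000, 1024).astype(np.float32)
query = np.random.randn(1, 1024).astype(np.float32)

codes = binarize(corpus)               # (10_000, 128) uint8: 128 bytes per vector
dists = hamming(codes, binarize(query))
nearest = np.argsort(dists)[:10]       # smallest Hamming distance = closest
```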
| Model | Provider | Key Features | Max Context (tokens) |
|---|---|---|---|
| Text-Embedding-4 | OpenAI | Matryoshka, Native Int8 | 32k |
| Cohere Embed v3.5 | Cohere | Binary quantization, "Compressible" | 1M |
| BGE-M3 | Open Source | Multilingual, Multi-granularity | 8k |
| Jina-Embeddings-v3 | Jina AI | Late-interaction support | 128k |
In late 2025, "Text-only RAG" is dying.
- CLIP-style models (2025 versions): embed images and text into the same vector space.
- Use case: search a library of schematics (images) with a natural-language query ("Where is the emergency shutoff valve?"); a minimal sketch follows.
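A cross-modal search sketch using the existing sentence-transformers CLIP checkpoint clip-ViT-B-32 (the schematic filenames are hypothetical; swap in whatever multimodal model you deploy):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP-style model that maps images and text into one space.
model = SentenceTransformer("clip-ViT-B-32")

# Hypothetical schematic library.
schematics = ["boiler_room.png", "valve_panel.png", "hvac_layout.png"]
img_embs = model.encode([Image.open(p) for p in schematics])

query_emb = model.encode("Where is the emergency shutoff valve?")
hits = util.semantic_search(query_emb, img_embs, top_k=1)[0]
print(schematics[hits[0]["corpus_id"]], hits[0]["score"])
```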
Q: Why do embedding models struggle with queries that use brand-new terminology, and how do you mitigate it?
Strong answer: Embeddings rely on the semantic space learned during training. If a user query uses a newer term (e.g., "DeepSeek-V3") that wasn't in the embedding model's training data, the model may assign it a generic "AI" vector, missing the specific nuances. In 2025, we solve this with Hybrid Search (using BM25 to catch the exact keyword; see the sketch below) or Cross-Encoder Reranking, which handles out-of-distribution vocabulary better by looking at the query and document tokens simultaneously.
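A minimal hybrid-search sketch along these lines, using the rank-bm25 package for the sparse leg and a random stand-in for the dense scores (the toy corpus and the fusion weight alpha are assumptions):

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

docs = [
    "DeepSeek-V3 is a mixture-of-experts language model.",
    "Transformers use self-attention over token sequences.",
    "Our cluster runs nightly embedding jobs.",
]
query = "What architecture does DeepSeek-V3 use?"

# Sparse leg: BM25 catches the exact out-of-vocabulary keyword.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = np.array(bm25.get_scores(query.lower().split()))

# Dense leg: stand-in for cosine scores from your embedding model.
dense = np.random.rand(len(docs))

def minmax(x):
    """Rescale scores to [0, 1] so the two legs are comparable."""
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # fusion weight; tune on your own eval set
final = alpha * minmax(sparse) + (1 - alpha) * minmax(dense)
print(docs[int(final.argmax())])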
Q: How would you scale retrieval to one billion vectors without an enormous RAM budget?
Strong answer:
Scaling to 1 billion vectors with standard float32 1536-dim embeddings requires ~6TB of high-speed RAM for an HNSW index, which is prohibitively expensive. With a Matryoshka model, I can use the first 128 dimensions (Binary quantized) for the initial retrieval. This reduces the memory footprint by over 90%, allowing the "Top 1,000" candidates to be found on significantly cheaper hardware. I can then fetch the full-resolution vectors for just those 1,000 candidates to perform the final reranking.
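A quick back-of-envelope check of the numbers in that answer:

```python
# Memory math for 1B vectors: full float32 index vs. coarse binary index.
n = 1_000_000_000                 # 1B vectors

full_bytes   = n * 1536 * 4       # float32 x 1536 dims
coarse_bytes = n * 128 // 8       # 128-dim binary codes (1 bit per dim)

print(f"full-resolution index : {full_bytes / 1e12:.1f} TB")   # ~6.1 TB
print(f"coarse binary index   : {coarse_bytes / 1e9:.1f} GB")  # ~16 GB
```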
- Kusupati et al. "Matryoshka Representation Learning" (NeurIPS 2022)
- Khattab & Zaharia. "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" (SIGIR 2020)
- Santhanam et al. "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction" (NAACL 2022)
- OpenAI. "Introducing New Embedding Models with Matryoshka Support" (2024)
Next: Vector Databases