Scaling CellMapper to 1B cells (out-of-core / lazy-loaded k-NN mapping)

### Description of feature

## Goal

Make k-NN mapping work when the query (and/or reference) no longer fits in host RAM — target ~1 billion cells. Today both representations are materialized as dense in-memory arrays, so the ceiling is set by host RAM, not by the algorithm.

This issue scopes what's required, separates **what we can already do with existing tools** from **what's still open**, and proposes milestones. Related: #70 (JZ-Tree / faster GPU k-NN) is the orthogonal *speed* axis; this issue is about *memory / out-of-core*.

## Current state (what the code does today)

- `CellMapper.compute_neighbors` materializes both representations from AnnData into dense host arrays — `xrep = reference.obsm[use_rep]`, `yrep = query.obsm[use_rep]` (or `.X`), `np.ascontiguousarray(...)` — and hands them to `Kernel`.
- `Kernel.__init__` just stores them (`self.xrep`, `self.yrep`): **full dense numpy arrays resident in host RAM**.
- `Kernel.compute_neighbors` builds the reference index **once** (`backend.fit(xrep)`), then queries. Cross-mapping builds **two** indices and runs **four** batched queries (xx, yy, xy, yx).
- `_batched_query` slices an **already-in-memory** array (`points[start:end]`), runs `backend.query(batch, k)`, accumulates results in Python lists, and `np.vstack`-es them at the end.
- `batch_size` exists, but its **sole purpose is GPU-OOM avoidance** (rapids/faiss-gpu) — it does not touch the on-disk file and does not reduce host-memory footprint.
- Backends: `sklearn` (exact CPU), `pynndescent` (approx CPU), `faiss-cpu`, `faiss-gpu`, `rapids` (cuML GPU).

**Takeaway:** there is no lazy/on-disk path anywhere in the k-NN pipeline. Both reps are fully in host RAM before mapping.

## Where the memory actually goes at 1B scale

It is *not* primarily the embeddings:
- Query embedding, 1B × 50-D float32 ≈ **200 GB** (large, but the *smallest* of the three costs).
- Reference index — must hold all reference vectors (+ index structure); flat is infeasible at 1B, so this forces an approximate/compressed index.
- **Output is the real killer:** neighbor indices + distances at 1B × k=30 ≈ **360 GB** (`int64` idx + `float32` dist), and the sparse mapping matrix `M` (n_ref × n_query) has ~30B nonzeros. We currently `np.vstack` the full result and build `M` in one shot.

So lazy-loading the *input* is necessary but not sufficient — the output/transfer path has to stream too.

## What we can already do with existing tools (no research needed)

1. **Stream the query** through the existing `_batched_query` seam. It already iterates query rows in slices; point it at a lazy on-disk reader (zarr/dask slicing of the `.obsm` array, or AnnData backed mode) instead of an in-memory array. Order-preserving sequential batches → write each batch's results out before moving on.
2. **Out-of-core / compressed reference index** via faiss: `IndexIVFFlat` / `IndexIVFPQ`, optionally `OnDiskInvertedLists`, or GPU IVF via `rapids`/cuVS. This removes the "reference must fit in RAM as a flat array" constraint and is the established way to index ≥10⁸ vectors.
3. **Approximate search** (IVF/HNSW/CAGRA) rather than exact — mandatory at 1B; we already expose pynndescent/faiss/rapids.
4. **Stream the output:** write neighbor idx/dist to zarr per batch instead of accumulating + `np.vstack`; build `M` as row-blocks (CSR per query-chunk) and apply the transfer (`M · one-hot(x)` / `M · x`) chunk-wise, writing results incrementally. Never hold the full `M`.
5. **Multi-GPU sharding** for the index (faiss sharded index / cuVS) — available today, just not wired up.

## What's open (needs design / resolution)

1. **Lazy reference reader interface.** The reference index build (`backend.fit`) currently needs the whole reference resident. Decide: (a) out-of-core index build (faiss train on a sample + add in chunks read from disk), and (b) a clean abstraction for "embedding lives on disk" — store the embedding as the `X` of a derived on-disk AnnData/zarr, or slice `.obsm` via zarr/dask? Pick one and thread it through `compute_neighbors`/`Kernel.__init__`.
2. **annbatch fit?** annbatch (scverse) is a *training* dataloader: shuffle-then-stream, `.X`-focused. Two mismatches for us: (a) k-NN query streaming must be **order-preserving** (results map back to specific cells), but annbatch's chunked access requires shuffling; (b) we map on `.obsm` embeddings, not `.X`. It *could* serve order-preserving query batches if we store the embedding as `X` and disable shuffling — needs a spike to confirm it's not fighting the library. Likely a thinner zarr/dask reader is the better fit; annbatch only clearly pays off once even the query embedding doesn't fit in RAM.
3. **Self-mapping (all-vs-all) case.** For self-mapping query == reference. We never materialize the full N×N (good) — but the single index over all N must be out-of-core too, and the streamed query passes over the same on-disk array. Confirm the chunking logic handles the shared-array case.
4. **Transfer / downstream at scale.** Presence scores, row-normalization, categorical majority vote, numerical diffusion — all operate on `M`. Re-express them as query-chunked streaming ops with incremental writes. Repeated diffusion (MAGIC-style, t>1) is the hardest: it needs the operator applied multiple times, so either keep `M` for the chunk across iterations or recompute.
5. **Index that doesn't fit even on one node.** IVF-PQ compression vs. multi-GPU/multi-node sharding — decide the supported path and which backends provide it (faiss vs. cuVS).
6. **Exact GPU k-NN in high-D (#70 / JZ-Tree).** JZ-Tree's wins are benchmarked in 3-D (cosmology); in 30–50-D embeddings exact tree search likely loses to ANN (curse of dimensionality). Treat as exploratory; default to ANN at 1B.
7. **API surface & defaults.** A streaming/out-of-core mode (flag or separate path), chunk-size defaults, where outputs get written (zarr store), and a backend capability matrix (which backends support on-disk/sharded add). Plus a defined 1B benchmark (dataset + hardware) to track progress.

## Proposed milestones

- [ ] M1 — Lazy **query** streaming from disk through `_batched_query` + streamed output writes (reference still in RAM). Unblocks "huge query, normal reference."
- [ ] M2 — Out-of-core / IVF(-PQ) **reference** index (faiss/cuVS), built in chunks. Unblocks "huge reference."
- [ ] M3 — Streamed **transfer** (categorical + numerical, presence scores) with chunked `M`; never materialize full `M`.
- [ ] M4 — Multi-GPU sharded index; 1B-cell end-to-end benchmark.
- [ ] M5 — (exploratory) evaluate cuVS/CAGRA and JZ-Tree (#70) for the search kernel.

## Open questions for discussion

- Lazy-reader abstraction: zarr/dask slicing of `.obsm` vs. embedding-as-`X` on-disk AnnData vs. annbatch — preference?
- Minimum viable target: is M1 (huge query, in-RAM reference) enough for the near-term use cases, or do we need M2 immediately?
- Exact vs. approximate contract at scale — are we OK making ANN the default above some N?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scaling CellMapper to 1B cells (out-of-core / lazy-loaded k-NN mapping) #73

Description of feature

Goal

Current state (what the code does today)

Where the memory actually goes at 1B scale

What we can already do with existing tools (no research needed)

What's open (needs design / resolution)

Proposed milestones

Open questions for discussion

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Scaling CellMapper to 1B cells (out-of-core / lazy-loaded k-NN mapping) #73

Description

Description of feature

Goal

Current state (what the code does today)

Where the memory actually goes at 1B scale

What we can already do with existing tools (no research needed)

What's open (needs design / resolution)

Proposed milestones

Open questions for discussion

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions