You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Make k-NN mapping work when the query (and/or reference) no longer fits in host RAM — target ~1 billion cells. Today both representations are materialized as dense in-memory arrays, so the ceiling is set by host RAM, not by the algorithm.
This issue scopes what's required, separates what we can already do with existing tools from what's still open, and proposes milestones. Related: #70 (JZ-Tree / faster GPU k-NN) is the orthogonal speed axis; this issue is about memory / out-of-core.
Current state (what the code does today)
CellMapper.compute_neighbors materializes both representations from AnnData into dense host arrays — xrep = reference.obsm[use_rep], yrep = query.obsm[use_rep] (or .X), np.ascontiguousarray(...) — and hands them to Kernel.
Kernel.__init__ just stores them (self.xrep, self.yrep): full dense numpy arrays resident in host RAM.
Kernel.compute_neighbors builds the reference index once (backend.fit(xrep)), then queries. Cross-mapping builds two indices and runs four batched queries (xx, yy, xy, yx).
_batched_query slices an already-in-memory array (points[start:end]), runs backend.query(batch, k), accumulates results in Python lists, and np.vstack-es them at the end.
batch_size exists, but its sole purpose is GPU-OOM avoidance (rapids/faiss-gpu) — it does not touch the on-disk file and does not reduce host-memory footprint.
Takeaway: there is no lazy/on-disk path anywhere in the k-NN pipeline. Both reps are fully in host RAM before mapping.
Where the memory actually goes at 1B scale
It is not primarily the embeddings:
Query embedding, 1B × 50-D float32 ≈ 200 GB (large, but the smallest of the three costs).
Reference index — must hold all reference vectors (+ index structure); flat is infeasible at 1B, so this forces an approximate/compressed index.
Output is the real killer: neighbor indices + distances at 1B × k=30 ≈ 360 GB (int64 idx + float32 dist), and the sparse mapping matrix M (n_ref × n_query) has ~30B nonzeros. We currently np.vstack the full result and build M in one shot.
So lazy-loading the input is necessary but not sufficient — the output/transfer path has to stream too.
What we can already do with existing tools (no research needed)
Stream the query through the existing _batched_query seam. It already iterates query rows in slices; point it at a lazy on-disk reader (zarr/dask slicing of the .obsm array, or AnnData backed mode) instead of an in-memory array. Order-preserving sequential batches → write each batch's results out before moving on.
Out-of-core / compressed reference index via faiss: IndexIVFFlat / IndexIVFPQ, optionally OnDiskInvertedLists, or GPU IVF via rapids/cuVS. This removes the "reference must fit in RAM as a flat array" constraint and is the established way to index ≥10⁸ vectors.
Approximate search (IVF/HNSW/CAGRA) rather than exact — mandatory at 1B; we already expose pynndescent/faiss/rapids.
Stream the output: write neighbor idx/dist to zarr per batch instead of accumulating + np.vstack; build M as row-blocks (CSR per query-chunk) and apply the transfer (M · one-hot(x) / M · x) chunk-wise, writing results incrementally. Never hold the full M.
Multi-GPU sharding for the index (faiss sharded index / cuVS) — available today, just not wired up.
What's open (needs design / resolution)
Lazy reference reader interface. The reference index build (backend.fit) currently needs the whole reference resident. Decide: (a) out-of-core index build (faiss train on a sample + add in chunks read from disk), and (b) a clean abstraction for "embedding lives on disk" — store the embedding as the X of a derived on-disk AnnData/zarr, or slice .obsm via zarr/dask? Pick one and thread it through compute_neighbors/Kernel.__init__.
annbatch fit? annbatch (scverse) is a training dataloader: shuffle-then-stream, .X-focused. Two mismatches for us: (a) k-NN query streaming must be order-preserving (results map back to specific cells), but annbatch's chunked access requires shuffling; (b) we map on .obsm embeddings, not .X. It could serve order-preserving query batches if we store the embedding as X and disable shuffling — needs a spike to confirm it's not fighting the library. Likely a thinner zarr/dask reader is the better fit; annbatch only clearly pays off once even the query embedding doesn't fit in RAM.
Self-mapping (all-vs-all) case. For self-mapping query == reference. We never materialize the full N×N (good) — but the single index over all N must be out-of-core too, and the streamed query passes over the same on-disk array. Confirm the chunking logic handles the shared-array case.
Transfer / downstream at scale. Presence scores, row-normalization, categorical majority vote, numerical diffusion — all operate on M. Re-express them as query-chunked streaming ops with incremental writes. Repeated diffusion (MAGIC-style, t>1) is the hardest: it needs the operator applied multiple times, so either keep M for the chunk across iterations or recompute.
Index that doesn't fit even on one node. IVF-PQ compression vs. multi-GPU/multi-node sharding — decide the supported path and which backends provide it (faiss vs. cuVS).
Exact GPU k-NN in high-D (Add support for JZ-Tree #70 / JZ-Tree). JZ-Tree's wins are benchmarked in 3-D (cosmology); in 30–50-D embeddings exact tree search likely loses to ANN (curse of dimensionality). Treat as exploratory; default to ANN at 1B.
API surface & defaults. A streaming/out-of-core mode (flag or separate path), chunk-size defaults, where outputs get written (zarr store), and a backend capability matrix (which backends support on-disk/sharded add). Plus a defined 1B benchmark (dataset + hardware) to track progress.
Proposed milestones
M1 — Lazy query streaming from disk through _batched_query + streamed output writes (reference still in RAM). Unblocks "huge query, normal reference."
M2 — Out-of-core / IVF(-PQ) reference index (faiss/cuVS), built in chunks. Unblocks "huge reference."
M3 — Streamed transfer (categorical + numerical, presence scores) with chunked M; never materialize full M.
Description of feature
Goal
Make k-NN mapping work when the query (and/or reference) no longer fits in host RAM — target ~1 billion cells. Today both representations are materialized as dense in-memory arrays, so the ceiling is set by host RAM, not by the algorithm.
This issue scopes what's required, separates what we can already do with existing tools from what's still open, and proposes milestones. Related: #70 (JZ-Tree / faster GPU k-NN) is the orthogonal speed axis; this issue is about memory / out-of-core.
Current state (what the code does today)
CellMapper.compute_neighborsmaterializes both representations from AnnData into dense host arrays —xrep = reference.obsm[use_rep],yrep = query.obsm[use_rep](or.X),np.ascontiguousarray(...)— and hands them toKernel.Kernel.__init__just stores them (self.xrep,self.yrep): full dense numpy arrays resident in host RAM.Kernel.compute_neighborsbuilds the reference index once (backend.fit(xrep)), then queries. Cross-mapping builds two indices and runs four batched queries (xx, yy, xy, yx)._batched_queryslices an already-in-memory array (points[start:end]), runsbackend.query(batch, k), accumulates results in Python lists, andnp.vstack-es them at the end.batch_sizeexists, but its sole purpose is GPU-OOM avoidance (rapids/faiss-gpu) — it does not touch the on-disk file and does not reduce host-memory footprint.sklearn(exact CPU),pynndescent(approx CPU),faiss-cpu,faiss-gpu,rapids(cuML GPU).Takeaway: there is no lazy/on-disk path anywhere in the k-NN pipeline. Both reps are fully in host RAM before mapping.
Where the memory actually goes at 1B scale
It is not primarily the embeddings:
int64idx +float32dist), and the sparse mapping matrixM(n_ref × n_query) has ~30B nonzeros. We currentlynp.vstackthe full result and buildMin one shot.So lazy-loading the input is necessary but not sufficient — the output/transfer path has to stream too.
What we can already do with existing tools (no research needed)
_batched_queryseam. It already iterates query rows in slices; point it at a lazy on-disk reader (zarr/dask slicing of the.obsmarray, or AnnData backed mode) instead of an in-memory array. Order-preserving sequential batches → write each batch's results out before moving on.IndexIVFFlat/IndexIVFPQ, optionallyOnDiskInvertedLists, or GPU IVF viarapids/cuVS. This removes the "reference must fit in RAM as a flat array" constraint and is the established way to index ≥10⁸ vectors.np.vstack; buildMas row-blocks (CSR per query-chunk) and apply the transfer (M · one-hot(x)/M · x) chunk-wise, writing results incrementally. Never hold the fullM.What's open (needs design / resolution)
backend.fit) currently needs the whole reference resident. Decide: (a) out-of-core index build (faiss train on a sample + add in chunks read from disk), and (b) a clean abstraction for "embedding lives on disk" — store the embedding as theXof a derived on-disk AnnData/zarr, or slice.obsmvia zarr/dask? Pick one and thread it throughcompute_neighbors/Kernel.__init__..X-focused. Two mismatches for us: (a) k-NN query streaming must be order-preserving (results map back to specific cells), but annbatch's chunked access requires shuffling; (b) we map on.obsmembeddings, not.X. It could serve order-preserving query batches if we store the embedding asXand disable shuffling — needs a spike to confirm it's not fighting the library. Likely a thinner zarr/dask reader is the better fit; annbatch only clearly pays off once even the query embedding doesn't fit in RAM.M. Re-express them as query-chunked streaming ops with incremental writes. Repeated diffusion (MAGIC-style, t>1) is the hardest: it needs the operator applied multiple times, so either keepMfor the chunk across iterations or recompute.Proposed milestones
_batched_query+ streamed output writes (reference still in RAM). Unblocks "huge query, normal reference."M; never materialize fullM.Open questions for discussion
.obsmvs. embedding-as-Xon-disk AnnData vs. annbatch — preference?