Skip to content

Scaling CellMapper to 1B cells (out-of-core / lazy-loaded k-NN mapping) #73

@Marius1311

Description

@Marius1311

Description of feature

Goal

Make k-NN mapping work when the query (and/or reference) no longer fits in host RAM — target ~1 billion cells. Today both representations are materialized as dense in-memory arrays, so the ceiling is set by host RAM, not by the algorithm.

This issue scopes what's required, separates what we can already do with existing tools from what's still open, and proposes milestones. Related: #70 (JZ-Tree / faster GPU k-NN) is the orthogonal speed axis; this issue is about memory / out-of-core.

Current state (what the code does today)

  • CellMapper.compute_neighbors materializes both representations from AnnData into dense host arrays — xrep = reference.obsm[use_rep], yrep = query.obsm[use_rep] (or .X), np.ascontiguousarray(...) — and hands them to Kernel.
  • Kernel.__init__ just stores them (self.xrep, self.yrep): full dense numpy arrays resident in host RAM.
  • Kernel.compute_neighbors builds the reference index once (backend.fit(xrep)), then queries. Cross-mapping builds two indices and runs four batched queries (xx, yy, xy, yx).
  • _batched_query slices an already-in-memory array (points[start:end]), runs backend.query(batch, k), accumulates results in Python lists, and np.vstack-es them at the end.
  • batch_size exists, but its sole purpose is GPU-OOM avoidance (rapids/faiss-gpu) — it does not touch the on-disk file and does not reduce host-memory footprint.
  • Backends: sklearn (exact CPU), pynndescent (approx CPU), faiss-cpu, faiss-gpu, rapids (cuML GPU).

Takeaway: there is no lazy/on-disk path anywhere in the k-NN pipeline. Both reps are fully in host RAM before mapping.

Where the memory actually goes at 1B scale

It is not primarily the embeddings:

  • Query embedding, 1B × 50-D float32 ≈ 200 GB (large, but the smallest of the three costs).
  • Reference index — must hold all reference vectors (+ index structure); flat is infeasible at 1B, so this forces an approximate/compressed index.
  • Output is the real killer: neighbor indices + distances at 1B × k=30 ≈ 360 GB (int64 idx + float32 dist), and the sparse mapping matrix M (n_ref × n_query) has ~30B nonzeros. We currently np.vstack the full result and build M in one shot.

So lazy-loading the input is necessary but not sufficient — the output/transfer path has to stream too.

What we can already do with existing tools (no research needed)

  1. Stream the query through the existing _batched_query seam. It already iterates query rows in slices; point it at a lazy on-disk reader (zarr/dask slicing of the .obsm array, or AnnData backed mode) instead of an in-memory array. Order-preserving sequential batches → write each batch's results out before moving on.
  2. Out-of-core / compressed reference index via faiss: IndexIVFFlat / IndexIVFPQ, optionally OnDiskInvertedLists, or GPU IVF via rapids/cuVS. This removes the "reference must fit in RAM as a flat array" constraint and is the established way to index ≥10⁸ vectors.
  3. Approximate search (IVF/HNSW/CAGRA) rather than exact — mandatory at 1B; we already expose pynndescent/faiss/rapids.
  4. Stream the output: write neighbor idx/dist to zarr per batch instead of accumulating + np.vstack; build M as row-blocks (CSR per query-chunk) and apply the transfer (M · one-hot(x) / M · x) chunk-wise, writing results incrementally. Never hold the full M.
  5. Multi-GPU sharding for the index (faiss sharded index / cuVS) — available today, just not wired up.

What's open (needs design / resolution)

  1. Lazy reference reader interface. The reference index build (backend.fit) currently needs the whole reference resident. Decide: (a) out-of-core index build (faiss train on a sample + add in chunks read from disk), and (b) a clean abstraction for "embedding lives on disk" — store the embedding as the X of a derived on-disk AnnData/zarr, or slice .obsm via zarr/dask? Pick one and thread it through compute_neighbors/Kernel.__init__.
  2. annbatch fit? annbatch (scverse) is a training dataloader: shuffle-then-stream, .X-focused. Two mismatches for us: (a) k-NN query streaming must be order-preserving (results map back to specific cells), but annbatch's chunked access requires shuffling; (b) we map on .obsm embeddings, not .X. It could serve order-preserving query batches if we store the embedding as X and disable shuffling — needs a spike to confirm it's not fighting the library. Likely a thinner zarr/dask reader is the better fit; annbatch only clearly pays off once even the query embedding doesn't fit in RAM.
  3. Self-mapping (all-vs-all) case. For self-mapping query == reference. We never materialize the full N×N (good) — but the single index over all N must be out-of-core too, and the streamed query passes over the same on-disk array. Confirm the chunking logic handles the shared-array case.
  4. Transfer / downstream at scale. Presence scores, row-normalization, categorical majority vote, numerical diffusion — all operate on M. Re-express them as query-chunked streaming ops with incremental writes. Repeated diffusion (MAGIC-style, t>1) is the hardest: it needs the operator applied multiple times, so either keep M for the chunk across iterations or recompute.
  5. Index that doesn't fit even on one node. IVF-PQ compression vs. multi-GPU/multi-node sharding — decide the supported path and which backends provide it (faiss vs. cuVS).
  6. Exact GPU k-NN in high-D (Add support for JZ-Tree #70 / JZ-Tree). JZ-Tree's wins are benchmarked in 3-D (cosmology); in 30–50-D embeddings exact tree search likely loses to ANN (curse of dimensionality). Treat as exploratory; default to ANN at 1B.
  7. API surface & defaults. A streaming/out-of-core mode (flag or separate path), chunk-size defaults, where outputs get written (zarr store), and a backend capability matrix (which backends support on-disk/sharded add). Plus a defined 1B benchmark (dataset + hardware) to track progress.

Proposed milestones

  • M1 — Lazy query streaming from disk through _batched_query + streamed output writes (reference still in RAM). Unblocks "huge query, normal reference."
  • M2 — Out-of-core / IVF(-PQ) reference index (faiss/cuVS), built in chunks. Unblocks "huge reference."
  • M3 — Streamed transfer (categorical + numerical, presence scores) with chunked M; never materialize full M.
  • M4 — Multi-GPU sharded index; 1B-cell end-to-end benchmark.
  • M5 — (exploratory) evaluate cuVS/CAGRA and JZ-Tree (Add support for JZ-Tree #70) for the search kernel.

Open questions for discussion

  • Lazy-reader abstraction: zarr/dask slicing of .obsm vs. embedding-as-X on-disk AnnData vs. annbatch — preference?
  • Minimum viable target: is M1 (huge query, in-RAM reference) enough for the near-term use cases, or do we need M2 immediately?
  • Exact vs. approximate contract at scale — are we OK making ANN the default above some N?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions