Conversation
Replace the previous `Vec<(u64, bool)>`-materializing comparison-sort
build with a STAR-faithful sentinel-transform + caps-sa wrapper:
- src/index/sa_build.rs (new): per-segment sentinel transform of T =
forward || RC into a byte alphabet `{0..5} \cup {5+i}`, runs
caps-sa's `build_in_memory` / `build_ext_mem` on the transformed
text, then streams the resulting SA through a closure that filters
to ACGT starts + the `genomeSAsparseD` stride and packs into the
existing PackedArray with the strand-bit encoding. Use-ext-mem
decision is env-var gated for now via RUSTAR_USE_EXT_MEM.
- src/index/mod.rs: declare the new sa_build module.
- src/index/suffix_array.rs: `SuffixArray::build` now delegates to
sa_build::build. The pre-existing `compare_suffixes` is retained
behind #[cfg(test)] as a STAR-faithful differential oracle (the
previous build's `< 5` filter was relaxed to `< 4` to match
STAR's `G[ii] < 4` — a latent divergence that didn't surface on
yeast because yeast contains no N). Two new differential tests
(caps_sa_matches_naive_oracle_*) assert the caps-sa-backed builder
produces a PackedArray byte-identical to the oracle on small
synthetic genomes.
- Cargo.toml: depend on caps-sa via a path = "../caps-sa" dep (the
caps-sa crate sits next to rustar-aligner inside the
/scratch1/rob/STAR2/ workspace). The [profile.release] section was
moved to the workspace root Cargo.toml.
- Cargo.lock removed (moved to the workspace root).
Yeast end-to-end build: the resulting SA file is byte-identical to
the prior naive-oracle output (and therefore to STAR's, since the
prior oracle was byte-identical to STAR on yeast per CLAUDE.md
Phase G3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the caps-sa dependency from a workspace path to the published crates.io coordinate, and restore the [profile.release] section that was moved out during the workspace introduction. Together these make rustar-aligner self-sufficient as a standalone clone: cloning just this repository now resolves caps-sa from registry+crates.io and applies the same release-profile tuning as before. - Cargo.toml: caps-sa = "0.1" (was: path = "../caps-sa"). [profile.release] restored — keeps lto="fat", codegen-units=1, strip=true for standalone release builds. The workspace root has its own copy of the profile; cargo emits a cosmetic "profile shadowed" warning during workspace builds, which we accept in exchange for self-sufficient standalone builds. - Cargo.lock (new): regenerated outside the workspace by running `cargo generate-lockfile` on a temp snapshot of the manifest + sources, so the lockfile resolves caps-sa from registry+https://github.com/rust-lang/crates.io-index rather than from the workspace path. 154 packages locked. Local-development workflow: the workspace root Cargo.toml at /scratch1/rob/STAR2/Cargo.toml carries a [patch.crates-io] entry that redirects caps-sa to the local workspace member, so edits to caps-sa/ propagate into rustar-aligner without a re-publish. That patch lives outside both git repos and only affects in-workspace builds. Verified: cargo test --workspace passes 438 tests (26 caps-sa unit + 394 rustar-aligner unit + 14 integration + 4 doc). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ny inputs)
`use_ext_mem(text_len)` now defaults to the caps-sa ext-mem path for
production-scale genomes (transformed text >= 16 MB) and falls back
to the in-memory path below that threshold.
Why a threshold rather than a strict default:
Bin-padding rounds n_genome up to the next 256 KB boundary by default
(`genomeChrBinNbits = 18`). On the synthetic 20 kb integration test
fixtures, >92% of the transformed text is constant spacer bytes that
caps-sa sorts and we discard afterwards. The sort itself takes
`O(spacer_run_len * subarray_size)` per subarray because every
spacer-starting suffix shares a near-maximal LCP with every other
spacer-starting suffix — measurably ~30x slower than in-memory at
this scale (14 s vs ~0.5 s for one `genomeGenerate` invocation in
local testing). On real genomes the spacer fraction is <1% (yeast ~5%,
human <0.01%), where ext-mem is the right default — bounded peak RAM
matters and the constant-region pathology doesn't trigger. The
threshold puts the synthetic tests on in-mem and yeast / chr22 / human
on ext-mem.
Explicit overrides (decision order, taken before the threshold):
1. RUSTAR_USE_IN_MEM={1,true,yes,on} forces in-memory.
2. RUSTAR_USE_EXT_MEM={0,false,no,off} forces in-memory (backward
compatibility alias).
3. RUSTAR_USE_EXT_MEM={1,true,yes,on} forces ext-memory.
Threshold chosen at 16 MB after measuring local pathology on the
20 kb synthetic test (524 KB transformed text). Yeast (~30 MB
transformed text) and any larger reference still goes through ext-mem.
Verified: full `cargo test --workspace` still passes (438 tests,
unchanged from before this change) in normal time — the threshold
keeps the synthetic tests on the in-memory path without any change to
the test harness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
caps-sa 0.2.0 ships the subset-sort API (build_ext_mem_for_positions, build_in_memory_for_positions), the pooled ext-mem buckets, the AVX-512 hybrid LCP path, and the generic SaLcp<I> ext-mem index type. Switch our SA build to the _for_positions variants so the spacer / N / terminal-sentinel suffixes never enter the sort. This removes the O(spacer_run_len^2) LCP-comparison tax that motivated use_ext_mem's in-memory fallback for padding-dominated test fixtures, and lets the ext-mem path scale predictably regardless of the genomeChrBinNbits padding ratio. The pack_one closure simplifies to just the strand-bit encoding because every position caps-sa emits is by construction one we wanted to keep. The count_kept_positions helper is replaced by build_kept_positions (returns the Vec<u64> we now feed to caps-sa rather than just the count). 408 tests pass; both caps_sa_matches_naive_oracle_* differential tests continue to produce byte-identical packed arrays vs the brute-force oracle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the caps-sa 0.2.0 release: the SA build now uses caps-sa's pooled ext-mem buckets, _for_positions subset API, AVX-512 hybrid LCP, and u32 ext-mem index type. The bump signals the additive feature surface (in-memory + ext-mem SA construction, configurable bin-padding-safe filter) to downstream consumers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor the suffix array + SAindex construction pipeline on the human
genome (GRCh38, 32 threads on a 256-core AMD EPYC box) so the peak
resident set is dominated by genuine working sets, not by allocator
slack or in-RAM intermediates that just get written to disk after.
## Changes
### Stream the SA to disk; build SAindex on the fly (then in parallel)
- New `index::packed_stream::PackedStreamWriter` — bit-for-bit-compatible
streaming writer for STAR's `PackedArray` format. Used by the
streaming SA emit path; never materialises the 25 GB-class in-RAM
`PackedArray` the previous in-memory build held.
- New `sa_build::build_streaming` — caps-sa drives an `FnMut(packed) ->
Result<(), Error>` callback once per SA entry in lex order. Returns
`(gstrand_bit, gstrand_mask, n_entries)`. The existing
`SuffixArray::build` stays as a thin wrapper for tests / callers that
need the SA in RAM.
- New `GenomeIndex::generate_streaming(params)` orchestrator.
`genomeGenerate` calls this directly; `GenomeIndex::build` +
`GenomeIndex::write` remain for tests.
### SAindex now parallel, no longer inline with the SA emit
Previously every SAindex k-mer extraction (`SaIndexBuilder::add`) sat
on caps-sa's single-threaded phase-4 emit loop. On the human genome
that's ~16 min of pure serial work added to caps-sa's otherwise-
parallel SA build (~7 min phase 1-4 with 32 threads).
- `SaIndex::build_parallel(&genome, &sa_file, ...)` reads the on-disk SA
via chunked `pread` (1 M entries per chunk = ~4 MB buffer per worker,
~128 MB peak across 32 workers). SA pages live in *kernel* page
cache, **not** process RSS — `pread` instead of `memmap2::Mmap`
matters here: an mmap of the 24 GB SA file would pull the touched
pages back into `RssFile`, blowing the peak by 24 GB.
- Each worker decodes packed values, extracts k-mers from
`genome.sequence`, and `AtomicU64::fetch_min`s its `sa_idx` into a
`Vec<AtomicU64>` keyed by `(k, kmer_idx)` — first-occurrence is
monotonic under `fetch_min`, so the final value is exactly the
smallest `sa_idx` covering each k-mer.
- Final sequential pass packs the `Vec<AtomicU64>` into the SAindex's
`PackedArray`, then drops the AtomicU64 array.
- Inner k-mer loop is incremental (one base read per k, break on N)
instead of re-scanning the prefix for every k=1..=nbases. Applied
to the in-memory `SaIndex::build` too.
### Pass `&genome.sequence` directly to caps-sa (drop `text_sf`)
The previous segmented arm built a spacer-free copy of `genome.sequence`
(~6.3 GB on the human genome) so caps-sa's `SegmentedText` could model
contiguous segments. That copy is unnecessary — the predicate
`|p| genome.sequence[p] < 4` already filters spacer (5) and N (4)
positions out of the sort, so caps-sa's `lim_at` is only ever called
on ACGT positions. Pass the original spacer-bordered text with a
`SegmentedText::from_ends(spacer_positions)` instead. Saves the 6.3 GB
copy. Verified byte-identical to the spacer-free arm via the existing
`segmented_arm_matches_sentinel_arm_byte_for_byte_*` differential
tests across five fixtures.
### Bump caps-sa to 0.5
caps-sa 0.5.0 adds `build_ext_mem_for_filter` / `_with` — predicate-
driven entries that build a 1-bit-per-position bitmap + popcount
prefix sum (≈ 770 MB on the human genome) instead of requiring a
materialised `Vec<u64>` of kept positions (~47 GB before this change).
~60× memory reduction for STAR's ACGT-only sampling pattern.
### Rayon pool sized from `--runThreadN` at the `run()` dispatcher
Previously only `align_reads` configured the rayon pool;
`genomeGenerate` fell through to rayon's default of
`num_cpus::get()` *logical* cores. On a 256-core machine that was 256
worker threads (vs the requested 32), and with mimalloc's per-thread
heaps each holding some MB of segment cache, this added ~15 GB of
allocator slack to peak RSS for no benefit. Move the
`ThreadPoolBuilder` call into `run()` so both run-modes honour
`--runThreadN`.
### mimalloc as the global allocator
`mimalloc` (0.1, default-features = false) replaces glibc's malloc.
Lower per-allocation cost on the millions of small allocations
caps-sa makes during phase 1; per-thread heaps return whole segments
to the OS when abandoned, so the cache size stays bounded.
## Bench numbers (GRCh38, 32 threads)
```
peak RSS wall
baseline (_for_positions path) ~113 GB 30+ min
+ caps-sa 0.5 filter API ~47 GB 25 min
+ streaming SA + on-the-fly SAindex ~32 GB 22:46 (256t)
+ Stage 1 (no text_sf) + rayon-thread fix 12.0 GB 28:07 (32t)
+ decoupled parallel SAindex via pread 10.7 GB 18:14
```
For reference: caps-sa CLI on the same input is ~5 GB / ~12 min. The
remaining gap is genome.sequence (6.3 GB) + SAindex `Vec<AtomicU64>`
working set (2.86 GB transient) + caps-sa scratch + SAindex output
(1.5 GB).
## Tests + verification
- 407/407 lib tests pass.
- `streaming_build_matches_in_memory_build` test asserts the streaming
emit's byte stream is bit-for-bit identical to the in-memory
`SuffixArray::build`'s `PackedArray::data()`.
- `segmented_arm_matches_sentinel_arm_byte_for_byte_*` tests still
pass across all 5 fixtures, confirming Stage 1 (drop `text_sf`)
preserves the byte-identical-to-STAR SA.
- `clippy --release` clean (one transitive caps-sa doc-comment nit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes to `SaIndex::build_parallel`, both ported from STAR's
`genomeSAindex.cpp::genomeSAindexChunk`:
## Phase 1: isaStep + binary search skip (~80× SAindex speedup)
Consecutive `SA[isa]` entries share monotonically non-decreasing
k-mer prefixes (the SA is sorted lex). Walking every entry — which
the previous parallel build did — is `O(nSA × nbases)` work, but
distinct k-mers number at most `4^nbases = 268 M` for nbases=14, so
~22 consecutive entries share their full 14-mer on the human genome.
STAR's algorithm visits ~boundary entries only:
1. Keep per-worker `ind0_local[il]` = last-written k-mer index at
level `il`. Skip writes when `ind_pref <= ind0_local[il]`.
2. After each visit, jump by `isa_step = nSA / 4^nbases`; when
`(indFull, iL4)` changes, binary-search inside the last
`isa_step` window for the exact boundary.
Total visited entries per worker chunk ≈ `chunk_size / isa_step ×
log isa_step` (~1.7 M for a 1 M-entry chunk on the human genome) —
17 × less work than visiting every entry, and ~95 % of the
`fetch_min` ops are gone because we only emit at boundaries.
Result: SAindex phase **10:50 → 0:08** (~80× speedup) on GRCh38 /
32 threads. Total rustar `genomeGenerate` wall **18:14 → 7:36**.
## Phase 2: STAR's absent-slot encoding
Absent slots used to carry a constant sentinel
`(1 << gstrand_bit+2) | ((1 << gstrand_bit) - 1)`. STAR instead
fills each absent slot with `next_present_sa_idx | absent_mask`
(or `nSA | absent_mask` for tail-gaps after a level's last present
slot). Implemented as a sequential **backward** pass per SAindex
level inside `firsts[]`, replacing `u64::MAX` sentinels with the
correct value before the final pack into `PackedArray`. ~3 s for
all 358 M slots; small enough to keep sequential.
`hierarchical_lookup` is unchanged because it only consults the
absent flag bit, not the slot's value. This change makes the
on-disk `SAindex` bytes closer to STAR's; the only remaining
divergence is the N flag bit at `gstrand_bit + 1`, which we don't
track (and which `hierarchical_lookup` also ignores).
## Per-worker SA buffering
Per-worker chunk pre-read of the SA byte range stays the same
(4 MB per worker), but accesses are now random within the buffer
(STAR's skip algorithm jumps + binary-searches inside the chunk).
The buffer keeps every visit in L1/L2; only `genome.sequence`
remains cache-cold, and STAR's algorithm minimises those reads to
~boundaries only.
## Benchmark (GRCh38, 32 threads on a 256-core EPYC)
Final head-to-head vs upstream STAR 2.7.11b:
rustar-aligner: 7:36 wall, 10.66 GB peak RSS
STAR (C++): 11:26 wall, 31.42 GB peak RSS
rustar is now both faster and ~3× lighter on the human genome.
407/407 lib tests + 14/14 integration tests pass; the integration
tests exercise `generate_streaming` → `SaIndex::build_parallel` via
`assert_cmd::cargo_bin_cmd`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replaces the in-house suffix-array builder with caps-sa-backed construction and rewrites genomeGenerate to stream the SA to disk, cutting genome-index build memory and time by an order of magnitude. Also switches the global allocator to mimalloc. Branch is merged up to date with main.
Motivation
Building the suffix array for large genomes held the entire ~25 GB PackedArray (plus allocator slack) in RAM, pushing peak RSS past 110 GB on the human genome and taking 30+ minutes — a barrier to running genomeGenerate on commodity machines.
What's in this PR
rustarindex a drop in replacement for STAR).Testing
Merge note
main was merged into this branch; conflicts in Cargo.toml, Cargo.lock, src/lib.rs, and src/index/mod.rs were resolved (new deps unioned; the --limitGenomeGenerateRAM warning kept alongside the streaming build; main's new sjdb_overhang field wired into the refactored GenomeIndex construction). All of main's commits are included, so this merges cleanly.