Skip to content

Add thrifty scoring backend to larch-usher CLI#113

Open
davidrich27 wants to merge 13 commits into
pixi-environmentfrom
add-ml-scoring-cli
Open

Add thrifty scoring backend to larch-usher CLI#113
davidrich27 wants to merge 13 commits into
pixi-environmentfrom
add-ml-scoring-cli

Conversation

@davidrich27

@davidrich27 davidrich27 commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Expose Thrifty/netam model scoring through larch-usher CLI flags, enabling thrifty likelihood-guided DAG optimization alongside or instead of parsimony scoring.

  • Add --scoring-backend [parsimony|ml] to select scoring method
  • Add --model-config and --model-weights for ML model paths
  • Add --move-coeff-ml to weight ML score relative to parsimony
  • ML scoring computes per-edge delta log-likelihood (new LL - old LL) for fragment edges affected by each SPR move
  • Scoring coefficients default based on backend: parsimony mode defaults to pscore=1, ml=0; ML mode defaults to pscore=0, ml=1; blended scoring available via explicit coefficients
  • Log best-tree log-likelihood at each iteration when ML model is loaded
  • Gate debug artifacts (before_optimize.pb, after_optimize.pb, intermediate_newick*.pb.gz) behind USE_TEST_LOGS cmake option (default: OFF)
  • Refactor optimization logging into OptimizationLogger class

Related: matsengrp/bcr-larch-experiments#6

CLI API changes

New flags

Flag Description
--scoring-backend [parsimony|ml] Select scoring method (default: parsimony). ML requires -DUSE_NETAM=yes build.
--model-config FILE Path to YAML model config (required with --scoring-backend ml)
--model-weights FILE Path to libtorch .pth weights (required with --scoring-backend ml)
--move-coeff-ml FLOAT ML log-likelihood coefficient (default: 1.0 with ml backend, 0.0 otherwise)

Changed flags

Flag Change
-l,--logpath DIR Now opt-in (was on by default). Enables logfile.csv, intermediate DAGs, and --inter-save snapshots — all written inside the given directory.
--inter-save N Now requires -l to be set.

Removed flags

Flag Replacement
--quiet No longer needed — logging is off by default. Use -l to opt in.

Logging directory layout (when -l DIR is used)

DIR/
  logfile.csv              # per-iteration metrics (+ BestTreeLL column when ML is active)
  intermediate.pb          # latest intermediate DAG (overwritten each iteration)
  snapshot.5.pb            # numbered snapshots (when --inter-save is used)
  snapshot.10.pb

Default coefficient behavior

Backend --move-coeff-pscore --move-coeff-nodes --move-coeff-ml
parsimony (default) 1 1 0
ml 0 0 1

All coefficients can be overridden explicitly for blended scoring.

Example usage

# Parsimony only (unchanged default)
larch-usher -i input.pb -o output.pb -c 10

# ML scoring with Thrifty model
larch-usher -i input.pb -o output.pb -c 10 \
  --scoring-backend ml \
  --model-config data/linearham/ThriftyHumV0.2-45.yml \
  --model-weights data/linearham/ThriftyHumV0.2-45-libtorch.pth

# Blended parsimony + ML
larch-usher -i input.pb -o output.pb -c 10 \
  --scoring-backend ml \
  --model-config data/linearham/ThriftyHumV0.2-45.yml \
  --model-weights data/linearham/ThriftyHumV0.2-45-libtorch.pth \
  --move-coeff-pscore 1

# With optimization logging
larch-usher -i input.pb -o output.pb -c 10 \
  -l my_optimization_log

Test plan

  • Parsimony-only mode unchanged behavior
  • ML-only mode accepts/rejects moves based on likelihood delta
  • Blended mode combines parsimony + ML scores
  • USE_TEST_LOGS=OFF suppresses debug protobuf artifacts
  • Logging off by default; -l enables it

🤖 Generated with Claude Code

davidrich27 and others added 11 commits March 18, 2026 21:49
Expose Thrifty/netam ML scoring through larch-usher CLI flags:
- --scoring-backend [parsimony|ml] to select scoring method
- --model-config FILE and --model-weights FILE for ML model paths
- --move-coeff-ml FLOAT to weight ML score relative to parsimony

When ML backend is active, fragment edges are scored using
netam::crepe log-likelihood and blended with parsimony scores
in the move accept/reject decision. The ML score is computed
per-edge via poisson_context_log_likelihood and subtracted from
the effective score (higher LL = better = lower score).

CMakeLists.txt links netam to larch-usher when USE_NETAM=yes.
All ML code paths are gated behind #ifdef USE_NETAM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change ML scoring from absolute fragment LL to a proper delta:
compute edge log-likelihoods for both new (post-move) and old
(pre-move) parent-child relationships in the fragment, return
the difference. Only changed edges are scored (~5-10 per move),
not full tree traversals.

Refactor ComputeFragmentMLScore into ComputeEdgeLL helper and
delta computation. For each fragment edge, the old LL uses
GetOld() to access pre-move compact genomes and parent topology.
MoveNew nodes (created by SPR) have no old edge and contribute
only to new LL.

Scoring coefficients now default based on --scoring-backend:
- parsimony (default): pscore=1, nodes=1, ml=0
- ml: pscore=0, nodes=1, ml=1.0
User-specified values always override. Validate that
--move-coeff-ml requires --scoring-backend ml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When --scoring-backend ml is active, compute and display the total
log-likelihood of the best parsimony tree at each iteration in the
optimization log. Samples the min-parsimony tree from the DAG and
sums per-edge log-likelihoods using the loaded netam model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move before_optimize.pb and after_optimize.pb writes behind
#ifdef USE_TEST_LOGS (default: OFF). Previously after_optimize.pb
was always written unconditionally, and before_optimize.pb was
tied to KEEP_ASSERTS (which also gates the MAT/MADAG equality
check, kept separate).

Add USE_TEST_LOGS option to CMakeLists.txt, larch_compile_opts,
and the pixi build env template.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When --quiet is passed, skip creating the optimization_log/
directory, writing logfile.csv, and writing intermediate DAGs.
The logfile open/write/close and directory creation are now
inside the existing write_intermediate_dag guard. Stdout logging
(parsimony scores, tree counts, etc.) is unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sampled tree from MinWeightSampleTree needs
RecomputeCompactGenomes to populate node compact genomes
from edge mutations. Without this, all nodes had empty CGs
(equal to reference), producing LL=0 for all edges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The new-node penalty is scaled for parsimony (1 node ≈ 1 mutation)
but is not meaningful on the log-likelihood scale. Default to
nodes=0 for ML mode so only LL delta drives accept/reject.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract the logging lambda and related setup into an OptimizationLogger
class that owns the logfile, timer, intermediate DAG saves, and ML
scoring. This replaces the large lambda with scattered #ifdef blocks.

CLI changes:
- Logging is now opt-in via -l,--logpath DIR (was on by default)
- --quiet flag removed (no longer needed)
- --inter-save now requires -l
- Intermediate DAGs and snapshots are written inside the log directory
  (e.g. <logdir>/intermediate.pb, <logdir>/snapshot.5.pb)
- BestTreeLL column added to logfile.csv when ML backend is active

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract MLScoringConfig struct with AdjustScore() method, replacing
  duplicated ML scoring blocks and loose ml_model_/ml_coeff_ members
  across three callback structs
- Branch upfront in ComputeMetrics for use_ua_free_parsimony to avoid
  wasting a full BinaryParsimonyScore traversal
- Fix BestTreeLL column missing from logfile header by setting ML model
  before opening the logfile

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --list-tags to larch-test to print all available test tags
- Add pixi run test-netam task for running all netam-tagged tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- WriteHeader/WriteRow: each line now ends with newline (CSV well-formed mid-run)
- Rename ComputeBestTreeLL -> ComputeBestParsimonyTreeLL
- FormatExt: use switch with Fail() for unsupported formats
- Comment on Merge_All_Moves_Found_Callback explaining why it skips ML scoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davidrich27

Copy link
Copy Markdown
Contributor Author

Clean Code Review

Ran a clean-code-reviewer analysis on the PR. Here's a summary of issues found and how they were addressed:

# Issue Resolution
1 ComputeMetrics wasted a full BinaryParsimonyScore traversal when use_ua_free_parsimony_ was true Branch upfront with if/else; added comment that max_parsimony always uses MaxBinaryParsimonyScore
2 ML scoring block duplicated across 3 callback structs (identical code + two loose members each) Introduced MLScoringConfig struct with AdjustScore() method; each callback holds a single ml_config_ member
3 set_ml_scoring lambda broke encapsulation by accessing members directly Resolved by #2ml_config_ is set via assignment
4 timer_ never reset between iterations — SecondsElapsed cumulative? No change needed — verified this matches old behavior. SecondsElapsed is cumulative wall clock from start; per-iteration time derivable by diffing rows
5 WriteHeader() had no trailing newline; CSV malformed mid-run Header and rows now each end with '\n'
6 ComputeBestTreeLL misleading — computes LL of best-parsimony tree, not best-LL tree Renamed to ComputeBestParsimonyTreeLL. DAG-level LL optimization (Felsenstein-like DP) not yet implemented
7 FormatExt() silently defaulted to .pb for unknown formats Replaced with switch + Fail() on default
8 Merge_All_Moves_Found_Callback silently skipped ML scoring with no explanation Added comment: this callback is for max DAG exploration (--callback-option all-moves), intentionally skips all scoring

Strengths noted

  • OptimizationLogger class is a clear improvement over the captured lambda
  • -l opt-in default is the right design
  • Coefficient resolution with std::optional and backend-dependent defaults is clean
  • ComputeFragmentMLScore documentation explains the non-obvious delta LL well
  • Fail-fast argument validation gives actionable error messages

@davidrich27 davidrich27 changed the title WIP: Add ML scoring backend to larch-usher CLI Add thrifty scoring backend to larch-usher CLI Mar 25, 2026
@davidrich27 davidrich27 marked this pull request as ready for review March 25, 2026 11:37
@davidrich27 davidrich27 requested a review from matsen March 25, 2026 11:37
@matsen

matsen commented Mar 26, 2026

Copy link
Copy Markdown
Contributor

🤖

PR Review: Add thrifty scoring backend to larch-usher CLI

Summary

This PR adds ML-based (Thrifty/netam) scoring to larch-usher's SPR move accept/reject decisions, enabling likelihood-guided DAG optimization. Key changes:

  1. New CLI flags: --scoring-backend [parsimony|ml], --model-config, --model-weights, --move-coeff-ml
  2. Delta LL scoring: ComputeFragmentMLScore computes (new_LL - old_LL) for edges affected by each SPR move — only changed edges, not full tree traversals
  3. OptimizationLogger class: Refactored from a large lambda; owns logfile, timer, intermediate DAG saves, ML scoring
  4. MLScoringConfig struct: Deduplicates ML scoring logic across 3 callback structs via AdjustScore() method
  5. Logging changes: -l,--logpath is now opt-in (was on by default), --quiet removed, --inter-save requires -l
  6. USE_TEST_LOGS cmake option: Gates before_optimize.pb/after_optimize.pb debug artifacts (default OFF)
  7. Test utilities: --list-tags flag for larch-test, pixi run test-netam task

No benchmarks or performance data are included in the PR description.

Concerns

1. Thread safety of ML scoring in callbacks

ComputeFragmentMLScore is called from OnMove in the callback structs. OnMove runs inside moves_batch_.AddElement() which holds only a shared_lock on mat_mtx_ — meaning multiple OnMove calls (and thus multiple model->forward() calls) can execute concurrently. The netam::crepe model is accessed via a shared raw pointer. Is netam::crepe::forward() thread-safe for concurrent reads? If not, this is a data race. The torch::NoGradGuard is thread-local, but the model weights/buffers are shared.

2. Per-edge forward passes (performance note)

Each ComputeEdgeLL call does a full model forward pass (encode → forward → softmax). For ~5-10 fragment edges per move × thousands of moves per iteration, this could be a significant bottleneck. Batching edges into a single forward pass would help. Acceptable for an initial implementation, but worth flagging for future optimization.

3. Score sign convention deserves a comment

In MLScoringConfig::AdjustScore:

return base_score - coeff * fragment_ll;

This is correct (positive delta LL → lower effective score → accepted), but the sign inversion relative to the LL convention is non-obvious. A one-line comment explaining the sign would help readers.

4. ComputeBestParsimonyTreeLL ignores UA-free parsimony setting

ComputeBestParsimonyTreeLL always uses BinaryParsimonyScore to sample the "best" tree, even when use_ua_free_parsimony_ is true. ComputeMetrics correctly branches between UAFreeParsimonyScore and BinaryParsimonyScore, but the LL computation doesn't follow suit. When --ignore-root_edge_mutations is set, the tree sampled for LL computation may not be the one reported as having the best parsimony score.

5. std::remove("intermediate_newick1.pb.gz") cleanup is fragile

In produce_mat_impl.hpp, the code removes a hardcoded filename that matOptimize writes unconditionally. The comment explains the rationale well, but if matOptimize changes this filename or writes additional files, the cleanup silently becomes incomplete. Pragmatic but brittle.

Minor note (pre-existing)

The CSV header writes MaxParsimony for the min-parsimony column and WorstParsimony for max-parsimony. This is inherited from the old code — not introduced by this PR — but confusing.

Overall

Well-structured PR. The OptimizationLogger refactor and MLScoringConfig deduplication are clean improvements. The delta-LL scoring approach (only scoring changed edges) is the right design. The CLI API with backend-dependent defaults and explicit override is well-designed. David's clean code review comment already addressed several quality issues.

Main items to resolve: (1) thread safety of concurrent model access, and (4) UA-free parsimony mismatch in LL computation.

davidrich27 and others added 2 commits June 3, 2026 20:39
The kmer indices feed libtorch embedding lookups using weights trained
in Python netam, so the C++ encoder's index layout must match netam's.
netam.sequences.generate_kmers prepends "N" (index 0) and places the
4^k ACGT kmers at indices 1..4^k; previously the C++ encoder put the
N-kmer placeholder at the last index (4^k), misaligning every kmer's
embedding row. Prepend the N placeholder instead and update the test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants