Skip to content

Latest commit

 

History

History
134 lines (91 loc) · 14.2 KB

File metadata and controls

134 lines (91 loc) · 14.2 KB

Roadmap

Where the pipeline is headed. Items are grouped by priority and roughly ordered within each tier.


v0.2.0 — Variant caller benchmarking & alternative tools ✅

Alternative callers, benchmarking infrastructure, and tool rationale documentation. Thanks to @madmolecularman for pushing this direction.

  • Alternative aligner: BWA-MEM2 (scripts/02a-alignment-bwamem2.shaligned_bwamem2/) — produces XS tags needed by Strelka2. All alternative caller scripts accept ALIGN_DIR=aligned_bwamem2
  • Alternative SNP callers: GATK HaplotypeCaller + FreeBayes + Strelka2 — three optional callers alongside DeepVariant, each writing to isolated output directories (vcf_gatk/, vcf_freebayes/, vcf_strelka2/). Note: Strelka2 is a small variant caller (SNVs + indels ≤49bp), not an SV caller
  • Concordance benchmarking script (scripts/benchmark-variants.sh) — two modes: pairwise concordance (bcftools isec with PASS filter + normalization) and truth set benchmarking (hap.py). Auto-discovers all caller VCFs
  • Alternative SV caller: TIDDIT (scripts/04a-tiddit.shsv_tiddit/) — excels at large inversions and translocations; auto-detects BWA index for assembly mode
  • Documented tool rationale (docs/tool-rationale.md) — per-step rationale with references to benchmarking data and decision matrices

v0.3.0 — Tool upgrades, QC, & expanded coverage

Upgrade pinned tools, add pre-alignment QC (#14), fill coverage gaps, and add new sequencing platform support. Thanks to @madmolecularman for driving the QC discussion.

  • fastp QC + adapter trimming (scripts/01b-fastp-qc.shfastq_trimmed/) — pre-alignment step: adapter removal, quality trimming, polyG tail removal, JSON+HTML reports. Default on, skippable with SKIP_TRIM=true. Addresses #14
  • mosdepth coverage statistics (scripts/16b-mosdepth.sh) — fast per-base and per-region depth from BAM; coverage distributions, threshold reports, and WES on-target rate
  • MultiQC aggregation (scripts/28-multiqc.sh) — single HTML report combining fastp, samtools flagstat, mosdepth, and other QC outputs
  • ExpansionHunter upgrade to v5.0.0 — new --variant-catalog flag replacing --variant-catalog-format, biocontainers image replacing deprecated weisburd image
  • PharmCAT upgrade to 3.2.0 — preprocessor renamed (no .py), explicit reporter flags (-reporterJson -reporterHtml), wildtypeAllelereferenceAllele in JSON, NAT2 calling added. Step 7 and step 27 updated together
  • PCGR/CPSR upgrade to 2.2.5 — upgraded from 1.4.1 to 2.2.5 with new ref data bundle (20250314), separate VEP cache mount, and completely rewritten CLI
  • Octopus variant caller (scripts/03d-octopus.shvcf_octopus/) — haplotype-aware Bayesian caller as a 5th benchmarking alternative. Auto-discovered by benchmark-variants.sh
  • GRIDSS structural variant caller (scripts/04b-gridss.shsv_gridss/) — assembly-based SV caller for complex rearrangements; strengthens SV consensus alongside Manta/Delly. Requires BWA index and 32GB RAM
  • CYP2D6 star allele calling — evaluated Aldy v4.8.3 (best CYP2D6 SV caller per Twesigomwe 2020), StellarPGx (broken Docker), and BCyrius (no public repo). Aldy documented as recommended optional replacement for Cyrius in docs/21-cyrius.md. Note: Aldy uses an academic-only license (IURTC) incompatible with GPL-3.0, so it cannot be a required dependency. pypgx (step 32) is the GPL-compatible alternative included in the pipeline
  • Long-read support (scripts/02b-alignment-longread.sh, scripts/03e-clair3.sh, scripts/04c-sniffles2.sh) — ONT and PacBio HiFi alignment (minimap2 map-ont/map-hifi), Clair3 v2.0.0 variant calling, Sniffles2 SV calling. Comprehensive guide in docs/long-read-guide.md
  • Whole exome sequencing (WES) entry path — comprehensive guide in docs/wes-guide.md covering per-step compatibility, capture BED files, DATA_TYPE=WES env var, coverage QC metrics, and limitations. Thanks to @madmolecularman for domain expertise here
  • Somatic variant calling (scripts/29-mutect2-somatic.sh) — [EXPERIMENTAL] tumor-only Mutect2 mode with gnomAD germline resource and Panel of Normals filtering. Marked experimental due to high false positive rate without matched normal

v0.4.0 — Expanded annotation & clinical interpretation ✅

Deep pathogenicity scoring, structured variant querying, and broader pharmacogenomics. All new annotation tracks are optional — scripts detect which databases are present and degrade gracefully.

  • CADD scores — Combined Annotation Dependent Depletion scores for all variants via vcfanno. Pre-scored whole-genome SNVs (~81.5 GB) + gnomAD indels (~1.2 GB). PHRED >= 20 flagged as clinically interesting
  • SpliceAI — deep learning splice-site variant predictions via vcfanno. Pre-scored files (~20 GB). Delta score >= 0.2 flagged for cryptic splice variants
  • REVEL scores — ensemble missense pathogenicity scoring via vcfanno (~526 MB). ClinGen-recommended thresholds: >= 0.644 (PP3_Moderate), >= 0.932 (PP3_Very Strong)
  • AlphaMissense — DeepMind's protein-structure-informed missense classifier via vcfanno (~613 MB). Thresholds: < 0.34 benign, > 0.564 pathogenic
  • gnomAD v4 constraint metrics — per-gene pLI, LOEUF, and missense Z-scores (~91 MB). Integrated into clinical filter summary TSV and slivar output
  • vcfanno annotation engine (scripts/30-vcfanno.sh) — adds CADD, SpliceAI, REVEL, AlphaMissense to VEP VCFs via TOML config in a single pass. Handles CADD chr prefix mismatch with two-pass approach
  • Variant prioritization with inheritance queries (scripts/31-slivar.sh) — slivar (GEMINI successor) for streaming VCF filtering with JS expressions. Rare HIGH/MODERATE variants, ClinVar pathogenic, compound het detection, gene constraint enrichment
  • pypgx alongside PharmCAT (scripts/32-pypgx.sh) — 23-gene curated star allele calling including CYP2D6 structural variation from BAM read depth. Cross-validates with PharmCAT on shared genes

v0.5.0 — Nextflow workflow engine

The bash scripts work but lack built-in parallelism, resume-on-failure, and HPC portability. v0.5.0 adds a Nextflow DSL2 execution path alongside the existing bash scripts (which remain first-class).

Why Nextflow over Snakemake?

Both are mature workflow engines. We chose Nextflow because:

  • nf-core ecosystem: 147 community pipelines including sarek (WGS variant calling) and raredisease (clinical genomics). Sarek's output is our primary input — channel compatibility matters.
  • nf-core/modules: 1000+ reusable modules. We can both use existing modules and contribute novel ones (PharmCAT, pypgx, slivar) under MIT.
  • Container-first design: Nextflow's container directive maps directly to our Docker-based architecture. Singularity support is automatic.
  • Resume: Content-hash caching is more robust than file-existence checks.
  • Industry momentum: Seqera/Nextflow has commercial backing; major sequencing centers standardize on Nextflow.

Snakemake's Python DSL and HPC scheduler integration are genuine strengths, but the nf-core ecosystem size and sarek compatibility are decisive.

Scope: Post-processing focus

Steps 1-6 (alignment, variant calling) are already covered by nf-core/sarek. Rather than duplicate that work, this pipeline focuses on what sarek does NOT cover: pharmacogenomics, PRS, ancestry, telomere, repeat expansions, clinical interpretation, and reporting. The Nextflow pipeline accepts sarek output (VCF + BAM) as its primary input.

Delivery

All 27 modules across 6 workflows are implemented. The stub-testable subset (tools that do not require external databases) is CI-validated; database-dependent tools (vep, cpsr, clinvar, expansion_hunter) are validated manually. The Nextflow path is usable for post-calling interpretation and produces biologically equivalent results to the bash scripts. See docs/nextflow.md for known limitations.

  • PR #17 — Full Nextflow pipeline (v0.5.0): All 6 workflows (PGX, ANNOTATION, CLINICAL, BAM_ANALYSIS, SV, REPORTING) with 27 modules, --tools gating, stub CI, Docker + Singularity profiles

Parallel track: nf-core module contributions

PharmCAT, pypgx, and slivar modules will be contributed to nf-core/modules under MIT license — independent of the pipeline's GPL-3.0 license. Once merged, these modules will be available to all nf-core pipelines.

Bash scripts

The bash scripts remain in scripts/ as a maintained, simpler alternative for users who do not need workflow orchestration. After PR 3 validates the Nextflow path end-to-end, new features will be Nextflow-first. Bash scripts will continue to receive bug fixes and tool version bumps but not new analysis steps.

v0.6.0 — Multi-sample & joint analysis

Every step currently runs on a single sample in isolation. v0.6.0 focuses on making the pipeline useful for families and cohorts.

  • Joint PCA with 1000 Genomes reference panel — project sample PCs onto a reference PCA, replacing the current single-sample ancestry stub (step 26) with real population placement
  • Multi-sample SV merging — merge Manta/Delly calls across 2+ samples (e.g., partners, parent-child) to identify shared and private structural variants
  • Carrier cross-check automation — given two VCFs, automatically check shared autosomal recessive carrier status (currently manual; see docs/multi-sample.md)
  • PRS percentile estimation — use a public reference cohort (e.g., UK Biobank summary stats) to convert raw PRS scores into approximate percentiles
  • Somalier sample identity QC — ultra-fast relatedness and sample-swap detection from BAM/VCF; replaces ad-hoc sex-check with proper identity QC for multi-sample runs
  • GLNexus joint genotyping — merge per-sample gVCFs into joint-called cohort VCFs; requires switching DeepVariant to --output_gvcf mode
  • Trio analysis support — de novo variant calling and compound heterozygote phasing for parent-child trios, with slivar inheritance model queries (de novo, compound het, X-linked recessive, autosomal recessive)

v0.7.0 — Reporting & user experience

Make results more accessible to non-bioinformaticians.

  • Interactive HTML dashboard — single-page HTML report combining all step outputs with collapsible sections, variant tables, PRS charts, and pharmacogenomics summaries
  • PDF clinical summary — one-page printable summary designed to hand to a healthcare provider (PharmCAT results, ClinVar pathogenic hits, key carrier findings)
  • Automated database update scriptscripts/update-databases.sh that downloads the latest ClinVar, VEP cache, and PGS Catalog scoring files, with version tracking
  • Progress dashboard — real-time terminal UI showing step status, elapsed time, and resource usage during run-all.sh execution
  • Conda/Bioconda alternative — offer a non-Docker installation path for HPC environments where Docker is not available

v1.0.0 — Local-first distributed platform

The long-term vision: a self-hostable, open-source health data platform — the "Home Assistant of personal genomics."

  • Local worker agent — lightweight daemon that runs the pipeline on the user's own hardware, reporting progress to a central coordinator. Users with powerful machines run their own analysis; users without can opt into a shared compute pool
  • Central web UI — result visualization, step management, multi-sample comparison dashboard. No raw genomic data stored centrally — only aggregate results and metadata
  • Data sovereignty by design — raw FASTQ/BAM/VCF never leaves the user's machine. Only aggregate, non-re-identifiable results (PRS scores, PGx diplotypes, carrier status) are optionally shared with the coordinator for cross-sample features
  • Health integration layer — combine genomic findings with clinical history, lab results, and supplement plans in a unified personal health record. Structured data import from common lab formats (HL7 FHIR, PDF extraction)
  • Consumer DNA one-click import — streamlined import from 23andMe, AncestryDNA, MyHeritage, and other consumer genotyping platforms, building on the existing chip-to-vcf.sh converter
  • Open-source everything, self-hostable, no vendor lock-in

Ongoing maintenance

These are not versioned milestones but continuous responsibilities.

  • ClinVar monthly refresh — re-download on the first Thursday of each month; re-run step 6 on at least one sample to verify
  • VEP cache refresh — upgrade with each Ensembl release (~every 6 months; release 116 expected Apr 2026)
  • PGS Catalog quarterly check — verify scoring file versions haven't changed; treat any change as a result-changing event
  • CPIC guideline monitoring — recheck when PharmCAT bumps versions or when a drug-gene pair used in step 27 gets updated upstream
  • CI health — keep ShellCheck, markdown-links, and contract-validation passing; add new contract checks as steps are added

Not planned

These are out of scope for this pipeline and unlikely to be added.

  • Cloud-only execution — the pipeline is local-first by design. v1.0.0 envisions optional cloud portability via workflow engines, but the default path will always be local consumer hardware.
  • Clinical validation / CLIA compliance — this pipeline is for personal exploration, not clinical diagnostics. It will never carry a clinical validation stamp.
  • Somatic-only (tumor without normal) — requires a panel of normals and is methodologically fraught without matched germline.

Suggestions and contributions welcome via issues or pull requests.