Roadmap

Where the pipeline is headed. Items are grouped by priority and roughly ordered within each tier.

v0.2.0 — Variant caller benchmarking & alternative tools ✅

Alternative callers, benchmarking infrastructure, and tool rationale documentation. Thanks to @madmolecularman for pushing this direction.

Alternative aligner: BWA-MEM2 (scripts/02a-alignment-bwamem2.sh → aligned_bwamem2/) — produces XS tags needed by Strelka2. All alternative caller scripts accept ALIGN_DIR=aligned_bwamem2
Alternative SNP callers: GATK HaplotypeCaller + FreeBayes + Strelka2 — three optional callers alongside DeepVariant, each writing to isolated output directories (vcf_gatk/, vcf_freebayes/, vcf_strelka2/). Note: Strelka2 is a small variant caller (SNVs + indels ≤49bp), not an SV caller
Concordance benchmarking script (scripts/benchmark-variants.sh) — two modes: pairwise concordance (bcftools isec with PASS filter + normalization) and truth set benchmarking (hap.py). Auto-discovers all caller VCFs
Alternative SV caller: TIDDIT (scripts/04a-tiddit.sh → sv_tiddit/) — excels at large inversions and translocations; auto-detects BWA index for assembly mode
Documented tool rationale (docs/tool-rationale.md) — per-step rationale with references to benchmarking data and decision matrices

v0.3.0 — Tool upgrades, QC, & expanded coverage

Upgrade pinned tools, add pre-alignment QC (#14), fill coverage gaps, and add new sequencing platform support. Thanks to @madmolecularman for driving the QC discussion.

v0.4.0 — Expanded annotation & clinical interpretation ✅

Deep pathogenicity scoring, structured variant querying, and broader pharmacogenomics. All new annotation tracks are optional — scripts detect which databases are present and degrade gracefully.

CADD scores — Combined Annotation Dependent Depletion scores for all variants via vcfanno. Pre-scored whole-genome SNVs (~81.5 GB) + gnomAD indels (~1.2 GB). PHRED >= 20 flagged as clinically interesting
SpliceAI — deep learning splice-site variant predictions via vcfanno. Pre-scored files (~20 GB). Delta score >= 0.2 flagged for cryptic splice variants
REVEL scores — ensemble missense pathogenicity scoring via vcfanno (~526 MB). ClinGen-recommended thresholds: >= 0.644 (PP3_Moderate), >= 0.932 (PP3_Very Strong)
AlphaMissense — DeepMind's protein-structure-informed missense classifier via vcfanno (~613 MB). Thresholds: < 0.34 benign, > 0.564 pathogenic
gnomAD v4 constraint metrics — per-gene pLI, LOEUF, and missense Z-scores (~91 MB). Integrated into clinical filter summary TSV and slivar output
vcfanno annotation engine (scripts/30-vcfanno.sh) — adds CADD, SpliceAI, REVEL, AlphaMissense to VEP VCFs via TOML config in a single pass. Handles CADD chr prefix mismatch with two-pass approach
Variant prioritization with inheritance queries (scripts/31-slivar.sh) — slivar (GEMINI successor) for streaming VCF filtering with JS expressions. Rare HIGH/MODERATE variants, ClinVar pathogenic, compound het detection, gene constraint enrichment
pypgx alongside PharmCAT (scripts/32-pypgx.sh) — 23-gene curated star allele calling including CYP2D6 structural variation from BAM read depth. Cross-validates with PharmCAT on shared genes

v0.5.0 — Nextflow workflow engine

The bash scripts work but lack built-in parallelism, resume-on-failure, and HPC portability. v0.5.0 adds a Nextflow DSL2 execution path alongside the existing bash scripts (which remain first-class).

Why Nextflow over Snakemake?

Both are mature workflow engines. We chose Nextflow because:

nf-core ecosystem: 147 community pipelines including sarek (WGS variant calling) and raredisease (clinical genomics). Sarek's output is our primary input — channel compatibility matters.
nf-core/modules: 1000+ reusable modules. We can both use existing modules and contribute novel ones (PharmCAT, pypgx, slivar) under MIT.
Container-first design: Nextflow's container directive maps directly to our Docker-based architecture. Singularity support is automatic.
Resume: Content-hash caching is more robust than file-existence checks.
Industry momentum: Seqera/Nextflow has commercial backing; major sequencing centers standardize on Nextflow.

Snakemake's Python DSL and HPC scheduler integration are genuine strengths, but the nf-core ecosystem size and sarek compatibility are decisive.

Scope: Post-processing focus

Steps 1-6 (alignment, variant calling) are already covered by nf-core/sarek. Rather than duplicate that work, this pipeline focuses on what sarek does NOT cover: pharmacogenomics, PRS, ancestry, telomere, repeat expansions, clinical interpretation, and reporting. The Nextflow pipeline accepts sarek output (VCF + BAM) as its primary input.

Delivery

All 27 modules across 6 workflows are implemented. The stub-testable subset (tools that do not require external databases) is CI-validated; database-dependent tools (vep, cpsr, clinvar, expansion_hunter) are validated manually. The Nextflow path is usable for post-calling interpretation and produces biologically equivalent results to the bash scripts. See docs/nextflow.md for known limitations.

PR #17 — Full Nextflow pipeline (v0.5.0): All 6 workflows (PGX, ANNOTATION, CLINICAL, BAM_ANALYSIS, SV, REPORTING) with 27 modules, --tools gating, stub CI, Docker + Singularity profiles

Parallel track: nf-core module contributions

PharmCAT, pypgx, and slivar modules will be contributed to nf-core/modules under MIT license — independent of the pipeline's GPL-3.0 license. Once merged, these modules will be available to all nf-core pipelines.

Bash scripts

The bash scripts remain in scripts/ as a maintained, simpler alternative for users who do not need workflow orchestration. After PR 3 validates the Nextflow path end-to-end, new features will be Nextflow-first. Bash scripts will continue to receive bug fixes and tool version bumps but not new analysis steps.

v0.6.0 — Multi-sample & joint analysis

Every step currently runs on a single sample in isolation. v0.6.0 focuses on making the pipeline useful for families and cohorts.

Joint PCA with 1000 Genomes reference panel — project sample PCs onto a reference PCA, replacing the current single-sample ancestry stub (step 26) with real population placement
Multi-sample SV merging — merge Manta/Delly calls across 2+ samples (e.g., partners, parent-child) to identify shared and private structural variants
Carrier cross-check automation — given two VCFs, automatically check shared autosomal recessive carrier status (currently manual; see docs/multi-sample.md)
PRS percentile estimation — use a public reference cohort (e.g., UK Biobank summary stats) to convert raw PRS scores into approximate percentiles
Somalier sample identity QC — ultra-fast relatedness and sample-swap detection from BAM/VCF; replaces ad-hoc sex-check with proper identity QC for multi-sample runs
GLNexus joint genotyping — merge per-sample gVCFs into joint-called cohort VCFs; requires switching DeepVariant to --output_gvcf mode
Trio analysis support — de novo variant calling and compound heterozygote phasing for parent-child trios, with slivar inheritance model queries (de novo, compound het, X-linked recessive, autosomal recessive)

v0.7.0 — Reporting & user experience

Make results more accessible to non-bioinformaticians.

Interactive HTML dashboard — single-page HTML report combining all step outputs with collapsible sections, variant tables, PRS charts, and pharmacogenomics summaries
PDF clinical summary — one-page printable summary designed to hand to a healthcare provider (PharmCAT results, ClinVar pathogenic hits, key carrier findings)
Automated database update script — scripts/update-databases.sh that downloads the latest ClinVar, VEP cache, and PGS Catalog scoring files, with version tracking
Progress dashboard — real-time terminal UI showing step status, elapsed time, and resource usage during run-all.sh execution
Conda/Bioconda alternative — offer a non-Docker installation path for HPC environments where Docker is not available

v1.0.0 — Local-first distributed platform

The long-term vision: a self-hostable, open-source health data platform — the "Home Assistant of personal genomics."

Local worker agent — lightweight daemon that runs the pipeline on the user's own hardware, reporting progress to a central coordinator. Users with powerful machines run their own analysis; users without can opt into a shared compute pool
Central web UI — result visualization, step management, multi-sample comparison dashboard. No raw genomic data stored centrally — only aggregate results and metadata
Data sovereignty by design — raw FASTQ/BAM/VCF never leaves the user's machine. Only aggregate, non-re-identifiable results (PRS scores, PGx diplotypes, carrier status) are optionally shared with the coordinator for cross-sample features
Health integration layer — combine genomic findings with clinical history, lab results, and supplement plans in a unified personal health record. Structured data import from common lab formats (HL7 FHIR, PDF extraction)
Consumer DNA one-click import — streamlined import from 23andMe, AncestryDNA, MyHeritage, and other consumer genotyping platforms, building on the existing chip-to-vcf.sh converter
Open-source everything, self-hostable, no vendor lock-in

Ongoing maintenance

These are not versioned milestones but continuous responsibilities.

ClinVar monthly refresh — re-download on the first Thursday of each month; re-run step 6 on at least one sample to verify
VEP cache refresh — upgrade with each Ensembl release (~every 6 months; release 116 expected Apr 2026)
PGS Catalog quarterly check — verify scoring file versions haven't changed; treat any change as a result-changing event
CPIC guideline monitoring — recheck when PharmCAT bumps versions or when a drug-gene pair used in step 27 gets updated upstream
CI health — keep ShellCheck, markdown-links, and contract-validation passing; add new contract checks as steps are added

Not planned

These are out of scope for this pipeline and unlikely to be added.

Cloud-only execution — the pipeline is local-first by design. v1.0.0 envisions optional cloud portability via workflow engines, but the default path will always be local consumer hardware.
Clinical validation / CLIA compliance — this pipeline is for personal exploration, not clinical diagnostics. It will never carry a clinical validation stamp.
Somatic-only (tumor without normal) — requires a panel of normals and is methodologically fraught without matched germline.

Suggestions and contributions welcome via issues or pull requests.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Roadmap

v0.2.0 — Variant caller benchmarking & alternative tools ✅

v0.3.0 — Tool upgrades, QC, & expanded coverage

v0.4.0 — Expanded annotation & clinical interpretation ✅

v0.5.0 — Nextflow workflow engine

Why Nextflow over Snakemake?

Scope: Post-processing focus

Delivery

Parallel track: nf-core module contributions

Bash scripts

v0.6.0 — Multi-sample & joint analysis

v0.7.0 — Reporting & user experience

v1.0.0 — Local-first distributed platform

Ongoing maintenance

Not planned

FilesExpand file tree

ROADMAP.md

Latest commit

History

ROADMAP.md

File metadata and controls

Roadmap

v0.2.0 — Variant caller benchmarking & alternative tools ✅

v0.3.0 — Tool upgrades, QC, & expanded coverage

v0.4.0 — Expanded annotation & clinical interpretation ✅

v0.5.0 — Nextflow workflow engine

Why Nextflow over Snakemake?

Scope: Post-processing focus

Delivery

Parallel track: nf-core module contributions

Bash scripts

v0.6.0 — Multi-sample & joint analysis

v0.7.0 — Reporting & user experience

v1.0.0 — Local-first distributed platform

Ongoing maintenance

Not planned