Where the pipeline is headed. Items are grouped by priority and roughly ordered within each tier.
Alternative callers, benchmarking infrastructure, and tool rationale documentation. Thanks to @madmolecularman for pushing this direction.
- Alternative aligner: BWA-MEM2 (
scripts/02a-alignment-bwamem2.sh→aligned_bwamem2/) — produces XS tags needed by Strelka2. All alternative caller scripts acceptALIGN_DIR=aligned_bwamem2 - Alternative SNP callers: GATK HaplotypeCaller + FreeBayes + Strelka2 — three optional callers alongside DeepVariant, each writing to isolated output directories (
vcf_gatk/,vcf_freebayes/,vcf_strelka2/). Note: Strelka2 is a small variant caller (SNVs + indels ≤49bp), not an SV caller - Concordance benchmarking script (
scripts/benchmark-variants.sh) — two modes: pairwise concordance (bcftools isecwith PASS filter + normalization) and truth set benchmarking (hap.py). Auto-discovers all caller VCFs - Alternative SV caller: TIDDIT (
scripts/04a-tiddit.sh→sv_tiddit/) — excels at large inversions and translocations; auto-detects BWA index for assembly mode - Documented tool rationale (
docs/tool-rationale.md) — per-step rationale with references to benchmarking data and decision matrices
Upgrade pinned tools, add pre-alignment QC (#14), fill coverage gaps, and add new sequencing platform support. Thanks to @madmolecularman for driving the QC discussion.
- fastp QC + adapter trimming (
scripts/01b-fastp-qc.sh→fastq_trimmed/) — pre-alignment step: adapter removal, quality trimming, polyG tail removal, JSON+HTML reports. Default on, skippable withSKIP_TRIM=true. Addresses #14 - mosdepth coverage statistics (
scripts/16b-mosdepth.sh) — fast per-base and per-region depth from BAM; coverage distributions, threshold reports, and WES on-target rate - MultiQC aggregation (
scripts/28-multiqc.sh) — single HTML report combining fastp, samtools flagstat, mosdepth, and other QC outputs - ExpansionHunter upgrade to v5.0.0 — new
--variant-catalogflag replacing--variant-catalog-format, biocontainers image replacing deprecated weisburd image - PharmCAT upgrade to 3.2.0 — preprocessor renamed (no
.py), explicit reporter flags (-reporterJson -reporterHtml),wildtypeAllele→referenceAllelein JSON, NAT2 calling added. Step 7 and step 27 updated together - PCGR/CPSR upgrade to 2.2.5 — upgraded from 1.4.1 to 2.2.5 with new ref data bundle (
20250314), separate VEP cache mount, and completely rewritten CLI - Octopus variant caller (
scripts/03d-octopus.sh→vcf_octopus/) — haplotype-aware Bayesian caller as a 5th benchmarking alternative. Auto-discovered bybenchmark-variants.sh - GRIDSS structural variant caller (
scripts/04b-gridss.sh→sv_gridss/) — assembly-based SV caller for complex rearrangements; strengthens SV consensus alongside Manta/Delly. Requires BWA index and 32GB RAM - CYP2D6 star allele calling — evaluated Aldy v4.8.3 (best CYP2D6 SV caller per Twesigomwe 2020), StellarPGx (broken Docker), and BCyrius (no public repo). Aldy documented as recommended optional replacement for Cyrius in
docs/21-cyrius.md. Note: Aldy uses an academic-only license (IURTC) incompatible with GPL-3.0, so it cannot be a required dependency. pypgx (step 32) is the GPL-compatible alternative included in the pipeline - Long-read support (
scripts/02b-alignment-longread.sh,scripts/03e-clair3.sh,scripts/04c-sniffles2.sh) — ONT and PacBio HiFi alignment (minimap2map-ont/map-hifi), Clair3 v2.0.0 variant calling, Sniffles2 SV calling. Comprehensive guide indocs/long-read-guide.md - Whole exome sequencing (WES) entry path — comprehensive guide in
docs/wes-guide.mdcovering per-step compatibility, capture BED files,DATA_TYPE=WESenv var, coverage QC metrics, and limitations. Thanks to @madmolecularman for domain expertise here - Somatic variant calling (
scripts/29-mutect2-somatic.sh) — [EXPERIMENTAL] tumor-only Mutect2 mode with gnomAD germline resource and Panel of Normals filtering. Marked experimental due to high false positive rate without matched normal
Deep pathogenicity scoring, structured variant querying, and broader pharmacogenomics. All new annotation tracks are optional — scripts detect which databases are present and degrade gracefully.
- CADD scores — Combined Annotation Dependent Depletion scores for all variants via vcfanno. Pre-scored whole-genome SNVs (~81.5 GB) + gnomAD indels (~1.2 GB). PHRED >= 20 flagged as clinically interesting
- SpliceAI — deep learning splice-site variant predictions via vcfanno. Pre-scored files (~20 GB). Delta score >= 0.2 flagged for cryptic splice variants
- REVEL scores — ensemble missense pathogenicity scoring via vcfanno (~526 MB). ClinGen-recommended thresholds: >= 0.644 (PP3_Moderate), >= 0.932 (PP3_Very Strong)
- AlphaMissense — DeepMind's protein-structure-informed missense classifier via vcfanno (~613 MB). Thresholds: < 0.34 benign, > 0.564 pathogenic
- gnomAD v4 constraint metrics — per-gene pLI, LOEUF, and missense Z-scores (~91 MB). Integrated into clinical filter summary TSV and slivar output
- vcfanno annotation engine (
scripts/30-vcfanno.sh) — adds CADD, SpliceAI, REVEL, AlphaMissense to VEP VCFs via TOML config in a single pass. Handles CADD chr prefix mismatch with two-pass approach - Variant prioritization with inheritance queries (
scripts/31-slivar.sh) — slivar (GEMINI successor) for streaming VCF filtering with JS expressions. Rare HIGH/MODERATE variants, ClinVar pathogenic, compound het detection, gene constraint enrichment - pypgx alongside PharmCAT (
scripts/32-pypgx.sh) — 23-gene curated star allele calling including CYP2D6 structural variation from BAM read depth. Cross-validates with PharmCAT on shared genes
The bash scripts work but lack built-in parallelism, resume-on-failure, and HPC portability. v0.5.0 adds a Nextflow DSL2 execution path alongside the existing bash scripts (which remain first-class).
Both are mature workflow engines. We chose Nextflow because:
- nf-core ecosystem: 147 community pipelines including sarek (WGS variant calling) and raredisease (clinical genomics). Sarek's output is our primary input — channel compatibility matters.
- nf-core/modules: 1000+ reusable modules. We can both use existing modules and contribute novel ones (PharmCAT, pypgx, slivar) under MIT.
- Container-first design: Nextflow's
containerdirective maps directly to our Docker-based architecture. Singularity support is automatic. - Resume: Content-hash caching is more robust than file-existence checks.
- Industry momentum: Seqera/Nextflow has commercial backing; major sequencing centers standardize on Nextflow.
Snakemake's Python DSL and HPC scheduler integration are genuine strengths, but the nf-core ecosystem size and sarek compatibility are decisive.
Steps 1-6 (alignment, variant calling) are already covered by nf-core/sarek. Rather than duplicate that work, this pipeline focuses on what sarek does NOT cover: pharmacogenomics, PRS, ancestry, telomere, repeat expansions, clinical interpretation, and reporting. The Nextflow pipeline accepts sarek output (VCF + BAM) as its primary input.
All 27 modules across 6 workflows are implemented. The stub-testable subset (tools that do not require external databases) is CI-validated; database-dependent tools (vep, cpsr, clinvar, expansion_hunter) are validated manually. The Nextflow path is usable for post-calling interpretation and produces biologically equivalent results to the bash scripts. See docs/nextflow.md for known limitations.
- PR #17 — Full Nextflow pipeline (v0.5.0): All 6 workflows (PGX, ANNOTATION, CLINICAL, BAM_ANALYSIS, SV, REPORTING) with 27 modules,
--toolsgating, stub CI, Docker + Singularity profiles
PharmCAT, pypgx, and slivar modules will be contributed to nf-core/modules under MIT license — independent of the pipeline's GPL-3.0 license. Once merged, these modules will be available to all nf-core pipelines.
The bash scripts remain in scripts/ as a maintained, simpler alternative for users who do not need workflow orchestration. After PR 3 validates the Nextflow path end-to-end, new features will be Nextflow-first. Bash scripts will continue to receive bug fixes and tool version bumps but not new analysis steps.
Every step currently runs on a single sample in isolation. v0.6.0 focuses on making the pipeline useful for families and cohorts.
- Joint PCA with 1000 Genomes reference panel — project sample PCs onto a reference PCA, replacing the current single-sample ancestry stub (step 26) with real population placement
- Multi-sample SV merging — merge Manta/Delly calls across 2+ samples (e.g., partners, parent-child) to identify shared and private structural variants
- Carrier cross-check automation — given two VCFs, automatically check shared autosomal recessive carrier status (currently manual; see
docs/multi-sample.md) - PRS percentile estimation — use a public reference cohort (e.g., UK Biobank summary stats) to convert raw PRS scores into approximate percentiles
- Somalier sample identity QC — ultra-fast relatedness and sample-swap detection from BAM/VCF; replaces ad-hoc sex-check with proper identity QC for multi-sample runs
- GLNexus joint genotyping — merge per-sample gVCFs into joint-called cohort VCFs; requires switching DeepVariant to
--output_gvcfmode - Trio analysis support — de novo variant calling and compound heterozygote phasing for parent-child trios, with slivar inheritance model queries (de novo, compound het, X-linked recessive, autosomal recessive)
Make results more accessible to non-bioinformaticians.
- Interactive HTML dashboard — single-page HTML report combining all step outputs with collapsible sections, variant tables, PRS charts, and pharmacogenomics summaries
- PDF clinical summary — one-page printable summary designed to hand to a healthcare provider (PharmCAT results, ClinVar pathogenic hits, key carrier findings)
- Automated database update script —
scripts/update-databases.shthat downloads the latest ClinVar, VEP cache, and PGS Catalog scoring files, with version tracking - Progress dashboard — real-time terminal UI showing step status, elapsed time, and resource usage during
run-all.shexecution - Conda/Bioconda alternative — offer a non-Docker installation path for HPC environments where Docker is not available
The long-term vision: a self-hostable, open-source health data platform — the "Home Assistant of personal genomics."
- Local worker agent — lightweight daemon that runs the pipeline on the user's own hardware, reporting progress to a central coordinator. Users with powerful machines run their own analysis; users without can opt into a shared compute pool
- Central web UI — result visualization, step management, multi-sample comparison dashboard. No raw genomic data stored centrally — only aggregate results and metadata
- Data sovereignty by design — raw FASTQ/BAM/VCF never leaves the user's machine. Only aggregate, non-re-identifiable results (PRS scores, PGx diplotypes, carrier status) are optionally shared with the coordinator for cross-sample features
- Health integration layer — combine genomic findings with clinical history, lab results, and supplement plans in a unified personal health record. Structured data import from common lab formats (HL7 FHIR, PDF extraction)
- Consumer DNA one-click import — streamlined import from 23andMe, AncestryDNA, MyHeritage, and other consumer genotyping platforms, building on the existing
chip-to-vcf.shconverter - Open-source everything, self-hostable, no vendor lock-in
These are not versioned milestones but continuous responsibilities.
- ClinVar monthly refresh — re-download on the first Thursday of each month; re-run step 6 on at least one sample to verify
- VEP cache refresh — upgrade with each Ensembl release (~every 6 months; release 116 expected Apr 2026)
- PGS Catalog quarterly check — verify scoring file versions haven't changed; treat any change as a result-changing event
- CPIC guideline monitoring — recheck when PharmCAT bumps versions or when a drug-gene pair used in step 27 gets updated upstream
- CI health — keep ShellCheck, markdown-links, and contract-validation passing; add new contract checks as steps are added
These are out of scope for this pipeline and unlikely to be added.
- Cloud-only execution — the pipeline is local-first by design. v1.0.0 envisions optional cloud portability via workflow engines, but the default path will always be local consumer hardware.
- Clinical validation / CLIA compliance — this pipeline is for personal exploration, not clinical diagnostics. It will never carry a clinical validation stamp.
- Somatic-only (tumor without normal) — requires a panel of normals and is methodologically fraught without matched germline.
Suggestions and contributions welcome via issues or pull requests.