
Commit d12979f

feat: v0.5.0 — Nextflow DSL2 post-calling pipeline (27 modules)
* feat: v0.5.0 — Nextflow DSL2 scaffold with PharmCAT and ClinVar modules

  Add Nextflow execution path alongside existing bash scripts. This is the architecture validation PR (PR 1a) for the Nextflow conversion:

  - main.nf entry point with samplesheet parsing and channel branching
  - PharmCAT module (preprocess + star allele calling, 2-step process)
  - ClinVar screen module (PASS filter + bcftools isec intersection)
  - PGX workflow composing both modules in parallel (fan-out pattern)
  - nextflow.config with Docker/Singularity profiles and resource limits
  - nextflow_schema.json for parameter validation
  - conf/base.config with process labels (low/medium/high)
  - conf/test.config and conf/test_full.config profiles
  - .nf-core.yml lint configuration (external pipeline, skip branding)
  - Samplesheet schema accepting sarek output (VCF+BAM)
  - docs/nextflow.md with quickstart, sarek integration, bash comparison
  - ROADMAP updated with Nextflow vs Snakemake rationale and staged PRs
  - README updated with Nextflow badge and quickstart section

  Bash scripts remain first-class — this adds a parallel execution path for bioinformaticians who want DAG parallelism, resume, and HPC support.

* fix: address review findings — stubs, normalization, CI, docs

  - Add stub blocks to all 3 Nextflow processes for -stub dry runs
  - Add stub test data (reference, VCF) and fix test.config to be self-contained with reference param
  - Add VCF left-normalization (bcftools norm) before ClinVar isec to prevent representation-sensitive misses
  - Add Nextflow CI workflow (config validation, syntax check, container tag consistency with versions.env)
  - Tone down ClinVar docs: "pathogenic hits" → "positions overlapping ClinVar pathogenic entries", add limitations section about review status, conflict flags, and research-grade disclaimer
  - Clarify ROADMAP: PR 1a is alpha scaffold (not usable execution path), PR 3 SV scope for standalone users only, bash scripts stay in scripts/ (no legacy/ move)

* fix: CPIC report now surfaces uncallable genes explicitly

  - No Result/N/A/Indeterminate genes no longer silently dropped — they appear in gene results table and in a new "Uncallable Genes" section at the end of the report
  - Docs: "require no dosing changes" → "no dosing adjustment signal detected from currently callable variants" with note that absence from recommendations ≠ normal function

* fix: address second-round review findings

  - CPIC: detect PharmCAT parse failure before uncalled genes section, preventing false "all genes successfully called" on NO_JSON_FOUND
  - Nextflow CI: add stub test run (nextflow run -profile test,docker -stub) so module wiring is actually exercised in CI
  - Container tag drift guard: read from temp file instead of piped subshell so FAIL=1 propagates correctly
  - Align Nextflow messaging: README and docs/nextflow.md now say "alpha scaffold" matching ROADMAP, not "runnable alternative"

* feat(pgx): add PyPGx and CPIC lookup modules

  PyPGx: BAM-based star allele calling for CYP2D6/CYP2A6/GSTM1/GSTT1 with SV detection. CPIC: parses PharmCAT JSON for drug-gene recommendations using built-in static CPIC guideline table. Both gated on params.tools.

* feat(annotation): add VEP, vcfanno, slivar, clinical filter modules

  Sequential pipeline: VEP → vcfanno (CADD/SpliceAI/REVEL/AlphaMissense) → slivar (prioritization + compound hets) / clinical_filter (parallel). Handles chr-prefix mismatch via two-pass vcfanno annotation.

* feat(clinical): add CPSR, ROH, PRS, ancestry, mito haplogroup modules

  Cancer predisposition (CPSR/PCGR), runs of homozygosity (bcftools), polygenic risk scores (plink2), ancestry PCA, and mitochondrial haplogroup classification (bcftools extract → haplogrep3). All gated on params.tools. Mito uses two-step VCF-based approach.

* feat(bam_analysis): add HLA, STR, telomere, coverage, mito, CYP2D6 modules

  Six parallel BAM-based analyses: T1K HLA typing, ExpansionHunter STR detection, TelomereHunter length estimation, mosdepth coverage stats, GATK Mutect2 mitochondrial variants, and Cyrius CYP2D6 star alleles.

* feat(sv): add Manta, Delly, CNVnator, duphold, AnnotSV, SURVIVOR modules

  Opt-in SV workflow: three callers in parallel (Manta, Delly, CNVnator), duphold depth annotation on Manta output, AnnotSV ACMG classification, and bcftools-based consensus merge requiring 2+ caller support.

* feat(reporting): add HTML report and MultiQC modules

  Per-sample consolidated HTML report (ClinVar, PharmCAT, CPSR, slivar, clinical filter status) and cross-sample MultiQC QC dashboard. Both gated on params.tools.

* feat: wire all 6 workflows into main.nf with --tools parameter

  Orchestrates PGX, ANNOTATION, CLINICAL, BAM_ANALYSIS, SV, and REPORTING workflows from main.nf. Adds params.tools (sarek-style comma-separated tool gating), 20+ new annotation/reference data params, NO_FILE/EMPTY sentinel pattern for optional inputs, and stub BAM/FAI/DICT test data.

* fix: replace NO_FILE sentinel with [] to prevent path name collisions

  CPSR failed in stub test because pcgr_data and vep_cache_cpsr both resolved to the same NO_FILE file, causing Nextflow staging collision. VCFANNO had the same latent issue with 12 optional annotation inputs. Switch to standard nf-core pattern: pass [] (empty list) for absent optional path inputs. Processes check truthiness instead of filename. Also removes redundant .first() on value channels (eliminates 30+ Nextflow warnings).

* fix: disable trace/dag in test profile to fix Docker permission issue

  The ensembl-vep image runs as non-root user 'vep'. When trace is enabled, Nextflow's wrapper creates .command.trace inside the container, but the non-root user cannot write to the host-owned work directory on GitHub Actions. Disable trace and dag (needs graphviz) for CI stub tests.

* fix: use touch instead of bgzip/tabix in all stub blocks

  Some Docker images (e.g. staphb/bcftools) don't include bgzip/tabix. Stub blocks only need placeholder files, not valid VCF/index content. Replace bgzip/tabix with touch across all 12 affected modules.

* fix: disable Nextflow trace/reports in CI via override config

  The test.config trace.enabled=false was overridden by the global trace block in nextflow.config (loaded after profiles). Use -c ci.config in the CI workflow to ensure overrides take effect last.

* fix: run CI Docker containers as root to fix permission errors

  Non-root bioinformatics images (ensembl-vep, telomere_hunter) cannot write to Nextflow work directories on GitHub Actions runners. Override with -u 0:0 in CI config.

* fix: address CodeRabbit review findings across 12 files

  - Fix haplogrep3 Docker image (genepi/ removed from Hub → jtb114/)
  - Fix Cyrius single-quote interpolation bug (echo '${bam}' → "${bam}")
  - Add IMPACT guard to clinical_filter (fail fast on unannotated VCF)
  - Emit header-only VCF for MODERATE tier when gnomAD AF absent
  - Add reference_fai input to CNVnator, remove || true error hiding
  - Thread ch_reference_fai into PGX workflow instead of local derivation
  - Emit clinvar_dir from PGX and wire into REPORTING
  - Add vcf_index to required samplesheet validation
  - Add dependency warnings for duphold/annotsv/clinical_filter
  - Fix contradictory comment in test_full.config
  - Fix Java version range in docs, future tense in ROADMAP

* fix: achieve bash↔nextflow parity for PRS, PyPGx, ancestry, slivar

  PRS: fix shell pipeline precedence — `zcat || cat | grep` binds wrong; wrap in subshell so `(zcat || cat) | grep | awk` chains correctly.

  PyPGx: add VCF input alongside BAM (was referencing nonexistent `${bam.baseName}.vcf.gz`); join BAM+VCF channels in pgx.nf workflow; link pypgx-bundle to /root/pypgx-bundle for container compatibility.

  Ancestry: use ref_panel input for plink2 variant intersection before LD pruning (was declared but unused); convert VCF to pgen first for efficient downstream operations; report shared SNP count in summary.

  Slivar: add two-pass MODERATE filtering — pass 1 extracts rare MODERATE via split-vep, pass 2 gates on INFO-level deleteriousness predictors (CADD>=20, REVEL>=0.5, AlphaMissense, SpliceAI) matching the bash script's vcfanno-aware behavior. Add gnomAD constraint enrichment (LOEUF, pLI, mis_z) to summary TSV with CONSTRAINED flag.

* fix: CNVnator PASS marking, CPIC safety, sample ID sanitization

  CNVnator: change FILTER=PASS to FILTER=. (unfiltered) since CNVnator has no confidence filtering — marking all calls as PASS is misleading.

  CPIC: exit with error when PharmCAT JSON format is unrecognized instead of silently producing a reassuring "no affected medications" report.

  main.nf: validate sample IDs against [a-zA-Z0-9._-] regex to prevent shell injection via crafted samplesheet entries.

* docs: update nextflow.md and README for full v0.5.0 coverage

  Remove alpha/scaffold language — Nextflow path now covers all 31 pipeline steps across 6 workflows with 24 modules. Update output structure to show all per-sample directories. Add quick-start command to README Nextflow section.

* fix: slivar gnomAD constraint enrichment graceful python3 fallback

  The slivar biocontainer may not include python3. Check for python3 availability before attempting constraint enrichment; fall back to simpler summary without LOEUF/pLI/mis_z columns if unavailable.

* fix: address review findings — XSS, schema, contigs, docs

  - Escape ClinVar-derived HTML fields (GENEINFO, CLNSIG) in report
  - Fix ancestry_ref schema from directory-path to file-path
  - Add reference FAI contig headers to survivor_merge VCF output
  - Fix ancestry shared count to use actual pvar intersection
  - Use versions.env HAPLOGREP3_IMAGE in bash mito script
  - Correct 5 publishDir mismatches in docs output tree
  - Add BAM requirement footnote to samplesheet docs
  - Clarify post-calling scope in README and nextflow docs
  - Add Known Limitations & Design Decisions section

* fix: report semantics, HTML hardening, tools consistency

  - Rename Slivar card from "Compound-Het / De Novo" to "Variant Prioritization" — it shows prioritized variants, not compound hets
  - Escape all VCF-derived HTML fields (CHROM, REF, ALT, GENEINFO, CLNSIG) in both Nextflow and bash HTML reports
  - Add html_escape helper to bash report for SAMPLE, haplogroup, telomere, and ExpansionHunter values
  - Add .collect{it.trim()} to all --tools parsing in sv.nf (was the only workflow missing whitespace trimming)
  - Gate PharmCAT behind --tools like all other modules — no longer runs unconditionally; CPIC requires pharmcat to also be enabled
  - Fix ancestry metric: label as "autosomal_snps" when no ref panel used, "ref_panel_shared_snps" only with actual panel intersection
  - Pin Cyrius to 1.1.1 instead of floating pip install
  - Document Cyrius network dependency in Known Limitations

* fix: startup contract, messaging, ClinVar gating, CI accuracy

  - Add fail-fast validation in main.nf: warn when enabled tools lack required databases (vep_cache, pcgr_data, expansion_catalog), error on unpaired clinvar/clinvar_index
  - Gate ClinVar behind --tools like all other modules
  - Add 'clinvar' to default tools list in nextflow.config and test.config
  - Fix README mermaid graph: remove GRIDSS (not in NF pipeline), move ExpansionHunter from VCF to BAM, fix duphold flow (manta->duphold not consensus->duphold)
  - Update ROADMAP.md staged delivery: replace stale PR 1a/1b/2/3 checklist with single completed PR #17 entry
  - Pin nf-core/setup-nextflow to SHA in CI workflow
  - Rename JSON validation steps to "syntax" (honest about scope)
  - Update test.config comment to document what's covered vs not
  - Fix HTML report header: "key results" not "all results", note BAM-analysis and SV outputs are not included

* fix: fail-fast validation, VEP deps, haplogrep3, module count

  - Change database validation from log.warn to error — vep, cpsr, expansion_hunter now fail at startup without their databases
  - Add checkIfExists for reference .fai and .dict sidecars
  - Add BAM/BAI paired validation in samplesheet parser
  - Enforce VEP dependency: slivar and clinical_filter now error at startup if vep is not also in --tools
  - Remove vep/slivar/clinical_filter/cpsr/expansion_hunter/clinvar from stub test tools (they now require databases to start)
  - Fix schema: add clinvar to default tools, replace "skip gracefully" with accurate database requirement description
  - Fix module count 24 → 27 in docs/nextflow.md and ROADMAP.md
  - Replace all stale genepi/haplogrep3 refs with jtb114/haplogrep3 in README, docs, and renovate.json

* fix: EMPTY sentinel collision in HTML_REPORT staging

  When multiple report inputs are absent (e.g. clinvar, cpsr, slivar all disabled), they all resolved to the same EMPTY file, causing Nextflow staging collisions. Use distinct per-slot sentinel files (EMPTY_CLINVAR, EMPTY_PHARMCAT, etc.) and check with startsWith instead of equals.

* fix: python:3.11-slim → python:3.11 for Nextflow metrics, review fixes

  - Cyrius and CPIC_LOOKUP containers changed from python:3.11-slim to python:3.11 (slim lacks procps/ps, causing Nextflow task metrics failure)
  - Default --tools reduced to database-free baseline (vep, cpsr, clinvar, expansion_hunter now opt-in)
  - Added cpic+pharmcat fail-fast validation in main.nf
  - README: added full database-requiring tools example
  - ROADMAP/docs: accurate CI stub coverage claims
  - CodeRabbit: disable auto_pause_after_reviewed_commits

* fix: ancestry ID mismatch, HLA file naming, docker.userEmulation

  - Ancestry: split VCF→plink conversion and ref panel extraction into two steps — plink2 applies --extract before --set-all-var-ids, so DeepVariant '.' IDs never matched the position-based snplist
  - HLA: use glob to find t1k-build output files — naming convention changed between T1K versions (_dna_seq.fa vs prefix_dna_seq.fa)
  - Remove deprecated docker.userEmulation from nextflow.config
  - Update versions.env python:3.11-slim → python:3.11 (slim lacks procps/ps required by Nextflow task metrics)

* fix: address CodeRabbit findings and telomerehunter permissions

  - telomere_hunter: add containerOptions --user root:root (container runs as uid=1000 but Nextflow work dirs are root-owned)
  - vep: add --af_gnomade for explicit gnomAD exome frequencies (--everything includes genome but not exome AF)
  - main.nf: add .fna to reference extension regex for .dict lookup
  - pypgx: remove || true from bundle symlink (fail-fast on errors)
  - sv.nf: upgrade duphold/annotsv dependency warnings to errors (match cpic/pharmcat fail-fast pattern)

* fix: global docker --user root:root for non-root containers

  VEP (uid=999) and telomerehunter (uid=1000) both fail with "cannot touch .command.trace: Permission denied" because Nextflow work dirs are root-owned. Set docker.runOptions globally instead of per-module containerOptions.

* fix(cyrius): use correct CLI entry point name

  Cyrius 1.1.1 registers its console script as 'cyrius', not 'star_caller'. This caused exit 127 (command not found) on every run.

* fix(cpic,haplogrep): fix PharmCAT 3.x JSON parsing and add --tree flag

  CPIC_LOOKUP: PharmCAT 3.2.0 uses a flat genes dict {gene_name: data}, not a nested {source: {gene_name: data}} structure. Removed extra iteration level.

  MITO_HAPLOGROUP: haplogrep3 requires --tree; use phylotree-fu-rcrs@1.2 (PhyloTree 17 Forensic Update, latest available).

* fix(hla,pypgx): use pre-downloaded HLA database, fix pypgx quoting

  HLA_TYPING: t1k-build.pl cannot reach ftp.ebi.ac.uk from Docker containers. Accept pre-downloaded hla.dat via --hla_dat parameter instead of downloading at runtime.

  PYPGX: escaped double quotes inside Nextflow triple-quoted string broke bash quoting for python3 -c argument. Split into separate variable assignment to avoid the issue.

* fix(ancestry): match chr prefix when building ref panel snplist

  plink2 --output-chr chrM ensures ref panel variant IDs use chr1:pos format matching the sample VCF, instead of bare 1:pos which caused zero intersection.

* ci: sync container-test matrix with versions.env

  python:3.11-slim in the matrix did not match python:3.11 in versions.env and the actual Nextflow modules.

* fix(slivar): use bcftools container with pre-built slivar binary

  The slivar biocontainer lacks bcftools which the module heavily uses. Switch to staphb/bcftools:1.21 and accept slivar as a pre-downloaded static binary via --slivar_bin parameter.

* fix(multiqc): match output dir name to MultiQC 1.33 actual output

  MultiQC 1.33 writes data to multiqc_report_data/ not multiqc_data/. Nextflow declared the old name, causing task failure despite exit 0.

* fix: sync schema defaults, add fail-fast for hla_dat/slivar_bin, fix multiqc stub

  - nextflow_schema.json tools.default now matches nextflow.config DB-light default
  - Add hla_dat and slivar_bin to schema and startup validation
  - Fix MULTIQC stub to create multiqc_report_data (matching output declaration)

* fix: move hla_typing to opt-in tools (requires --hla_dat)

  hla_typing needs a pre-downloaded hla.dat file, so it doesn't belong in the zero-database default set. Sync config, schema, and test profile.

* fix: align bash/docs with Nextflow, add clinvar/pypgx fail-fast

  - Add --tree phylotree-fu-rcrs@1.2 to bash haplogrep3 script and docs
  - Add clinvar + clinvar_index and pypgx_bundle to startup validation
  - README: add --hla_dat, --slivar_bin, --pypgx_bundle, --ancestry_ref
  - docs/nextflow.md: fix vcfanno output dir (publishes to vep/), update BAM-required tool list
  - docs/interpreting-results.md: PharmCAT output path for both paths

* fix: address remaining CodeRabbit review findings

  - MultiQC: guard against VCF-only runs with no QC inputs
  - slivar: copy binary to workdir for Singularity compatibility
  - cpic_lookup: use word-boundary matching for phenotype filtering
  - cyrius: remove unused reference input
  - docs/nextflow.md: update fail-fast contract language
  - mito-haplogroup: align bash/docs mount paths to /genome convention

* fix(slivar): chmod staged binary instead of copying to same path

  Nextflow stages slivar_bin into workdir as 'slivar', so cp to ./slivar fails with 'same file'. Just chmod +x the staged file and add PWD to PATH.
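
The PRS fix described in the message above comes down to shell operator precedence: `|` binds tighter than `||`, so `zcat f || cat f | grep ...` is parsed as `zcat f || (cat f | grep ...)` and the filter never sees the output of a successful `zcat`. The following is a minimal sketch of that behavior, not the pipeline's actual code; the file names and the helpers `read_scores_buggy` and `read_scores` are made up for illustration, and GNU `gzip`/`zcat` is assumed.

```shell
#!/usr/bin/env bash
# Sketch: why `zcat f || cat f | grep ...` does not filter zcat's output.
# File names and helper names are illustrative, not taken from the pipeline.
set -u

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# One header line plus two data lines, stored both plain and gzipped.
printf '#header\nrs1 0.5\nrs2 0.7\n' > "$workdir/scores.txt"
gzip -c "$workdir/scores.txt" > "$workdir/scores.txt.gz"

# Buggy form: parsed as `zcat f || (cat f | grep ...)`.
# When zcat succeeds, grep never runs and the header leaks through.
read_scores_buggy() {
  zcat "$1" 2>/dev/null || cat "$1" | grep -v '^#'
}

# Fixed form: group the fallback so the filter applies to whichever reader succeeded.
read_scores() {
  (zcat "$1" 2>/dev/null || cat "$1") | grep -v '^#'
}
```

With these definitions, `read_scores` yields the two data lines for both the plain and the gzipped file, while `read_scores_buggy` also emits the header line when given the gzipped file.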
1 parent f59f250 commit d12979f

73 files changed

Lines changed: 4933 additions & 47 deletions

Some content is hidden


.coderabbit.yaml

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ reviews:
   auto_review:
     enabled: true
     drafts: false
+    auto_pause_after_reviewed_commits: 0
     ignore_title_keywords:
       - "WIP"
       - "DO NOT REVIEW"

.github/workflows/container-test.yml

Lines changed: 2 additions & 3 deletions
@@ -39,12 +39,11 @@ jobs:
             cmd: "java -version"
           - image: pgkb/pharmcat:3.2.0
             cmd: "pharmcat_pipeline --version"
-          - image: python:3.11-slim
+          - image: python:3.11
             cmd: "python3 --version"
           - image: quay.io/biocontainers/vcfanno:0.3.7--he881be0_0
             cmd: "bash -c 'vcfanno 2>&1 | grep -q \"vcfanno version\"'"
-          - image: quay.io/biocontainers/slivar:0.3.3--h5f107b1_0
-            cmd: "bash -c 'slivar 2>&1 | grep -q \"slivar version\"'"
+          # slivar uses staphb/bcftools + pre-built binary (no dedicated container)
           - image: quay.io/biocontainers/pypgx:0.26.0--pyh7e72e81_0
             cmd: "pypgx --version"
     steps:

.github/workflows/nextflow.yml

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+name: Nextflow
+
+on:
+  push:
+    branches: [main]
+    paths: ['main.nf', 'nextflow.config', 'modules/**', 'workflows/**', 'conf/**', 'nextflow_schema.json', '.github/workflows/nextflow.yml']
+  pull_request:
+    branches: [main]
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  nextflow-validate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
+
+      - uses: actions/setup-java@c5195efecf7bdfc987ee8bae7a71cb8b11521c00 # v4
+        with:
+          distribution: 'temurin'
+          java-version: '17'
+
+      - uses: nf-core/setup-nextflow@561fcfc7146dcb12e3871909b635ab092a781f34 # v2.0.0
+
+      - name: Validate Nextflow config
+        run: nextflow config main.nf -profile docker
+
+      - name: Validate pipeline entry point
+        run: nextflow run main.nf -help
+
+      - name: Validate nextflow_schema.json syntax
+        run: python3 -m json.tool nextflow_schema.json > /dev/null
+
+      - name: Validate samplesheet schema syntax
+        run: python3 -m json.tool assets/schema_input.json > /dev/null
+
+      - name: Run stub test
+        run: |
+          # Override Docker user + disable reports for CI:
+          # Many bioinformatics images run as non-root, causing permission
+          # issues with Nextflow work directories on GitHub Actions runners.
+          cat > ci.config <<'CICONF'
+          trace.enabled = false
+          dag.enabled = false
+          timeline.enabled = false
+          report.enabled = false
+          docker.userEmulation = false
+          docker.runOptions = '-u 0:0'
+          CICONF
+          nextflow run main.nf -profile test,docker -stub -c ci.config
+
+      - name: Check Nextflow module container tags match versions.env
+        run: |
+          echo "Checking Nextflow module container tags against versions.env..."
+          . ./versions.env
+          FAIL=0
+
+          for nf_file in modules/local/*/main.nf; do
+            [ -f "$nf_file" ] || continue
+            # Extract container directives and check against versions.env
+            # Use temp file to propagate failures out of the loop
+            grep -oP "container\s+'[^']+'" "$nf_file" | sed "s/container '//;s/'//" > /tmp/nf_images.txt
+            while read -r image; do
+              base="${image%%:*}"
+              tag="${image##*:}"
+              [ "$base" = "$tag" ] && continue
+
+              match=$(grep -F "$base" versions.env | head -1 || true)
+              if [ -n "$match" ]; then
+                env_tag=$(echo "$match" | grep -oP ':\K[^"]+' | tr -d '"')
+                if [ "$tag" != "$env_tag" ]; then
+                  echo "FAIL: $nf_file uses ${base}:${tag} but versions.env has ${base}:${env_tag}"
+                  FAIL=1
+                fi
+              fi
+            done < /tmp/nf_images.txt
+          done
+
+          [ "$FAIL" -eq 0 ] && echo "OK: Nextflow container tags match versions.env"
+          exit "$FAIL"
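
The tag check above feeds the inner loop with `done < /tmp/nf_images.txt` rather than piping `grep` straight into `while`. In bash (with default settings) each part of a pipeline runs in a subshell, so a `FAIL=1` assigned inside `... | while read` would be lost once the loop finishes. A small sketch of the difference follows; the image names and the two functions are invented for illustration and are not part of the workflow.

```shell
#!/usr/bin/env bash
# Sketch: variable assignments inside a piped `while` loop do not survive the loop.
# Image names and function names are invented for illustration.

images=$(printf 'tool/a:1.0\ntool/b:2.0')

check_piped() {
  local fail=0
  # The loop body runs in a subshell, so this assignment is discarded.
  printf '%s\n' "$images" | while read -r image; do
    [ "${image##*:}" = "1.0" ] || fail=1
  done
  echo "$fail"
}

check_redirected() {
  local fail=0 tmp
  tmp=$(mktemp)
  printf '%s\n' "$images" > "$tmp"
  # The loop runs in the current shell, so the assignment persists.
  while read -r image; do
    [ "${image##*:}" = "1.0" ] || fail=1
  done < "$tmp"
  rm -f "$tmp"
  echo "$fail"
}
```

Here `check_piped` reports `0` even though `tool/b:2.0` fails the comparison, while `check_redirected` reports `1`. Shells or options that run the last pipeline stage in the current shell (for example `shopt -s lastpipe` in bash) behave differently, which is one reason the redirect form is the more portable choice.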

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -30,6 +30,9 @@ __pycache__/
 *.mmi
 *.dict
 
+# Stub test data (override genomics ignores above)
+!assets/stub/**
+
 # Archives
 *.tar.gz
 *.tgz

.nf-core.yml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+# nf-core lint configuration
+# This pipeline uses nf-core template patterns but is NOT an official nf-core pipeline.
+# Skip branding and naming checks that only apply to nf-core organization pipelines.
+
+repository_type: pipeline
+nf_core_version: '3.2.0'
+
+lint:
+  # Skip checks that require nf-core organization membership
+  pipeline_name_conventions: false
+  # Skip checks for nf-core-specific CI workflows
+  actions_ci: false
+  actions_awstest: false
+  actions_awsfulltest: false
+  # Skip nf-core branding requirements
+  readme: false
+  # Skip multiqc (will be added in PR 3)
+  multiqc_config: false

README.md

Lines changed: 32 additions & 7 deletions
@@ -14,6 +14,7 @@
 <a href="https://github.com/GeiserX/Personal-Genome-Pipeline/actions/workflows/lint.yml"><img src="https://img.shields.io/github/actions/workflow/status/GeiserX/Personal-Genome-Pipeline/lint.yml?style=flat-square&label=CI" alt="CI"></a>
 <a href="https://github.com/GeiserX/Personal-Genome-Pipeline/stargazers"><img src="https://img.shields.io/github/stars/GeiserX/Personal-Genome-Pipeline?style=flat-square&logo=github" alt="GitHub Stars"></a>
 <a href="https://www.docker.com/"><img src="https://img.shields.io/badge/runs%20with-Docker-0db7ed?style=flat-square&logo=docker&logoColor=white" alt="Docker"></a>
+<a href="https://www.nextflow.io/"><img src="https://img.shields.io/badge/runs%20with-Nextflow-3ac486?style=flat-square&logo=data:image/svg%2bxml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCIgZmlsbD0id2hpdGUiPjxwYXRoIGQ9Ik0xMiAyTDIgMTlIMjJMMTIgMloiLz48L3N2Zz4=&logoColor=white" alt="Nextflow"></a>
 <a href="https://geiserx.github.io/personal-genome-pipeline"><img src="https://img.shields.io/badge/docs-GitHub%20Pages-blue?style=flat-square&logo=github" alt="Docs"></a>
 </p>
 
@@ -79,23 +80,21 @@ graph LR
 vcfanno --> slivar["slivar<br/><small>Prioritization</small>"]
 vcfanno --> clinical["Clinical Filter"]
 VCF --> cpsr["CPSR<br/><small>Cancer predisposition</small>"]
-VCF --> eh["ExpansionHunter<br/><small>STRs</small>"]
 VCF --> roh["ROH Analysis"]
 VCF --> prs["PRS<br/><small>Polygenic risk</small>"]
 VCF --> ancestry["Ancestry SNPs"]
 
 %% BAM-based analyses
 BAM --> manta["Manta<br/><small>SVs</small>"]
 BAM --> delly["Delly<br/><small>SVs</small>"]
-BAM --> gridss["GRIDSS<br/><small>SVs</small>"]
 BAM --> cnvnator["CNVnator<br/><small>CNVs</small>"]
+manta --> duphold["duphold"]
+duphold --> annotsv["AnnotSV"]
 manta --> consensus["SV Consensus"]
 delly --> consensus
-gridss --> consensus
 cnvnator --> consensus
-consensus --> duphold["duphold"]
-duphold --> annotsv["AnnotSV"]
 
+BAM --> eh["ExpansionHunter<br/><small>STRs</small>"]
 BAM --> pypgx["pypgx<br/><small>23-gene PGx<br/>+ CYP2D6 SV</small>"]
 BAM --> cyrius["Cyrius<br/><small>CYP2D6</small>"]
 BAM --> telomere["TelomereHunter"]
@@ -121,7 +120,7 @@ graph LR
 class FASTQ,BAM,VCF input
 class fastp,align,DV core
 class clinvar,pharmcat,cpic,cpsr,eh,roh,prs,ancestry,pypgx,cyrius,telomere,coverage,mito,haplo analysis
-class manta,delly,gridss,cnvnator,consensus,duphold,annotsv sv
+class manta,delly,cnvnator,consensus,duphold,annotsv sv
 class vep,vcfanno,slivar,clinical annotation
 class report report
 ```
@@ -142,7 +141,7 @@ graph LR
 | 9 | [STR Expansions](docs/09-str-expansions.md) | ExpansionHunter | `quay.io/biocontainers/expansionhunter:5.0.0` | ~15 min | Recommended |
 | 10 | [Telomere Length](docs/10-telomere-analysis.md) | TelomereHunter | `lgalarno/telomerehunter:latest` | ~1 hr | Optional |
 | 11 | [ROH Analysis](docs/11-roh-analysis.md) | bcftools roh | `staphb/bcftools:1.21` | ~5 min | Recommended |
-| 12 | [Mito Haplogroup](docs/12-mito-haplogroup.md) | haplogrep3 | `genepi/haplogrep3:latest`\* | ~1 min | Optional |
+| 12 | [Mito Haplogroup](docs/12-mito-haplogroup.md) | haplogrep3 | `jtb114/haplogrep3:latest`\* | ~1 min | Optional |
 | 13 | [VEP Annotation](docs/13-vep-annotation.md) | VEP | `ensemblorg/ensembl-vep:release_112.0` | ~2-4 hr | Recommended |
 | 14 | [Imputation Prep](docs/14-imputation-prep.md) | bcftools | `staphb/bcftools:1.21` | ~10 min | Optional |
 | 15 | [SV Quality](docs/15-duphold.md) | duphold | `brentp/duphold:v0.2.3` | ~20 min | If step 4 run |
@@ -286,6 +285,32 @@ ORA is Illumina's proprietary compressed FASTQ format. Decompress first, then fo
 # ... continue as Path A
 ```
 
+### Nextflow
+
+A Nextflow DSL2 execution path (v0.5.0) covers post-calling interpretation and clinical analysis — it accepts VCF + BAM from any upstream caller and runs the same pharmacogenomics, annotation, and clinical steps as the bash scripts. Both paths are maintained and produce biologically equivalent results (output file names and report scope may differ).
+
+```bash
+# Minimal run — default tools need no external databases
+nextflow run main.nf --input samplesheet.csv --reference /path/to/GRCh38.fasta -profile docker
+
+# Enable database-requiring tools (VEP, CPSR, ClinVar, ExpansionHunter)
+nextflow run main.nf --input samplesheet.csv --reference /path/to/GRCh38.fasta \
+  --tools 'pharmcat,cpic,vcfanno,roh,prs,mito_haplogroup,hla_typing,telomere_hunter,mosdepth,mito_variants,cyrius,html_report,multiqc,vep,slivar,clinical_filter,cpsr,clinvar,expansion_hunter,pypgx,ancestry' \
+  --vep_cache /path/to/vep_cache \
+  --pcgr_data /path/to/pcgr_data \
+  --vep_cache_cpsr /path/to/vep_cache_113 \
+  --clinvar /path/to/clinvar.vcf.gz \
+  --clinvar_index /path/to/clinvar.vcf.gz.tbi \
+  --expansion_catalog /path/to/variant_catalog.json \
+  --hla_dat /path/to/hla.dat \
+  --slivar_bin /path/to/slivar \
+  --pypgx_bundle /path/to/pypgx-bundle \
+  --ancestry_ref /path/to/1kg_common_snps.vcf.gz \
+  -profile docker
+```
+
+See [docs/nextflow.md](docs/nextflow.md) for samplesheet format, tool selection, sarek integration, and bash vs Nextflow comparison.
+
 ---
 
 ## Prerequisites
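
The `--tools` value shown in the README hunk is a comma-separated list, and the commit message states that entries are whitespace-trimmed and that some tools require others (CPIC needs pharmcat; slivar and clinical_filter need vep). The real checks live in main.nf and the workflow files, which are not shown on this page, so the following is only a rough shell rendering of that gating logic under those stated rules. The function names `has_tool` and `check_tool_deps` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of --tools gating in shell; the pipeline's own checks are written in Nextflow/Groovy.
# Function names are hypothetical. Dependency pairs are taken from the commit message.

# Return 0 if tool $2 appears in the comma-separated list $1
# (whitespace around each entry is ignored).
has_tool() {
  local list=$1 want=$2 entry
  local IFS=','
  for entry in $list; do
    # Trim leading, then trailing, whitespace from the entry.
    entry="${entry#"${entry%%[![:space:]]*}"}"
    entry="${entry%"${entry##*[![:space:]]}"}"
    [ "$entry" = "$want" ] && return 0
  done
  return 1
}

# Print an error and return 1 when a tool is enabled without its prerequisite.
check_tool_deps() {
  local tools=$1 pair tool dep
  for pair in cpic:pharmcat slivar:vep clinical_filter:vep; do
    tool=${pair%%:*}
    dep=${pair##*:}
    if has_tool "$tools" "$tool" && ! has_tool "$tools" "$dep"; then
      echo "ERROR: --tools includes '$tool' but not '$dep'" >&2
      return 1
    fi
  done
  return 0
}
```

For example, `check_tool_deps 'pharmcat, cpic'` succeeds, while `check_tool_deps 'cpic,roh'` fails because pharmcat is missing. Failing before any task is scheduled mirrors the fail-fast startup contract the commit message describes.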

ROADMAP.md

Lines changed: 31 additions & 6 deletions
@@ -44,14 +44,39 @@ Deep pathogenicity scoring, structured variant querying, and broader pharmacogen
 - [x] **Variant prioritization with inheritance queries** (`scripts/31-slivar.sh`) — slivar (GEMINI successor) for streaming VCF filtering with JS expressions. Rare HIGH/MODERATE variants, ClinVar pathogenic, compound het detection, gene constraint enrichment
 - [x] **pypgx alongside PharmCAT** (`scripts/32-pypgx.sh`) — 23-gene curated star allele calling including CYP2D6 structural variation from BAM read depth. Cross-validates with PharmCAT on shared genes
 
-## v0.5.0 — Workflow engine integration
+## v0.5.0 — Nextflow workflow engine
 
-The 44 bash scripts work but lack built-in parallelism, resume-on-failure, and HPC portability. The [nf-core](https://nf-co.re/) ecosystem (147 community pipelines including [sarek](https://nf-co.re/sarek) with 15 variant callers and [raredisease](https://github.com/nf-core/raredisease) for clinical genomics) demonstrates the community standard.
+The bash scripts work but lack built-in parallelism, resume-on-failure, and HPC portability. v0.5.0 adds a [Nextflow](https://www.nextflow.io/) DSL2 execution path alongside the existing bash scripts (which remain first-class).
 
-- [ ] **Nextflow DSL2 wrapper** — convert the pipeline into a Nextflow workflow with channels and processes, preserving the current Docker-based execution model
-- [ ] **nf-core module compatibility** — use [nf-core/modules](https://github.com/nf-core/modules) where they exist (BWA-MEM2, DeepVariant, VEP, Manta, bcftools) for community-maintained containers and automated testing
-- [ ] **Snakemake alternative** — optional Snakemake wrapper for HPC environments that prefer it over Nextflow
-- [ ] This unlocks: automatic parallelism via DAG-based step ordering, resume on failure, Singularity/Apptainer for HPC clusters, and optional cloud portability
+### Why Nextflow over Snakemake?
+
+Both are mature workflow engines. We chose Nextflow because:
+
+- **nf-core ecosystem**: 147 community pipelines including [sarek](https://nf-co.re/sarek) (WGS variant calling) and [raredisease](https://github.com/nf-core/raredisease) (clinical genomics). Sarek's output is our primary input — channel compatibility matters.
+- **nf-core/modules**: 1000+ reusable modules. We can both use existing modules and contribute novel ones (PharmCAT, pypgx, slivar) under MIT.
+- **Container-first design**: Nextflow's `container` directive maps directly to our Docker-based architecture. Singularity support is automatic.
+- **Resume**: Content-hash caching is more robust than file-existence checks.
+- **Industry momentum**: Seqera/Nextflow has commercial backing; major sequencing centers standardize on Nextflow.
+
+Snakemake's Python DSL and HPC scheduler integration are genuine strengths, but the nf-core ecosystem size and sarek compatibility are decisive.
+
+### Scope: Post-processing focus
+
+Steps 1-6 (alignment, variant calling) are already covered by nf-core/sarek. Rather than duplicate that work, this pipeline focuses on what sarek does NOT cover: pharmacogenomics, PRS, ancestry, telomere, repeat expansions, clinical interpretation, and reporting. The Nextflow pipeline accepts sarek output (VCF + BAM) as its primary input.
+
+### Delivery
+
+All 27 modules across 6 workflows are implemented. The stub-testable subset (tools that do not require external databases) is CI-validated; database-dependent tools (vep, cpsr, clinvar, expansion_hunter) are validated manually. The Nextflow path is usable for post-calling interpretation and produces biologically equivalent results to the bash scripts. See [docs/nextflow.md](docs/nextflow.md) for known limitations.
+
+- [x] **PR #17 — Full Nextflow pipeline** (v0.5.0): All 6 workflows (PGX, ANNOTATION, CLINICAL, BAM_ANALYSIS, SV, REPORTING) with 27 modules, `--tools` gating, stub CI, Docker + Singularity profiles
+
+### Parallel track: nf-core module contributions
+
+PharmCAT, pypgx, and slivar modules will be contributed to [nf-core/modules](https://github.com/nf-core/modules) under MIT license — independent of the pipeline's GPL-3.0 license. Once merged, these modules will be available to all nf-core pipelines.
+
+### Bash scripts
+
+The bash scripts remain in `scripts/` as a maintained, simpler alternative for users who do not need workflow orchestration. After PR 3 validates the Nextflow path end-to-end, new features will be Nextflow-first. Bash scripts will continue to receive bug fixes and tool version bumps but not new analysis steps.
 
 ## v0.6.0 — Multi-sample & joint analysis

assets/schema_input.json

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+{
+    "$schema": "http://json-schema.org/draft-07/schema",
+    "$id": "https://github.com/GeiserX/Personal-Genome-Pipeline/blob/main/assets/schema_input.json",
+    "title": "Personal Genome Pipeline — Samplesheet Schema",
+    "description": "Schema for the input samplesheet CSV",
+    "type": "array",
+    "items": {
+        "type": "object",
+        "properties": {
+            "sample": {
+                "type": "string",
+                "description": "Sample identifier (used as output prefix)",
+                "pattern": "^[a-zA-Z0-9._-]+$"
+            },
+            "vcf": {
+                "type": "string",
+                "description": "Path to sample VCF file (bgzipped)",
+                "pattern": "^\\S+\\.vcf\\.gz$"
+            },
+            "vcf_index": {
+                "type": "string",
+                "description": "Path to VCF tabix index (.tbi)",
+                "pattern": "^\\S+\\.vcf\\.gz\\.tbi$"
+            },
+            "bam": {
+                "type": "string",
+                "description": "Path to aligned BAM file (optional, needed for BAM-based steps)",
+                "pattern": "^\\S+\\.bam$"
+            },
+            "bam_index": {
+                "type": "string",
+                "description": "Path to BAM index (.bai)",
+                "pattern": "^\\S+\\.bam\\.bai$"
+            }
+        },
+        "required": ["sample", "vcf", "vcf_index"]
+    }
+}
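
The schema above constrains each samplesheet row with regular expressions, and the `sample` pattern is the same character class the commit message cites for preventing shell injection through crafted sample IDs. As a rough illustration of what those patterns accept and reject, here is a bash sketch that checks one row. `validate_row` is a hypothetical helper rather than pipeline code, the `\S` from the JSON Schema patterns is rewritten as `[^[:space:]]` for POSIX ERE, and treating an empty `bam` or `bam_index` as "absent" is an assumption about how empty CSV cells are handled.

```shell
#!/usr/bin/env bash
# Sketch: check one samplesheet row against the patterns in schema_input.json.
# validate_row is a hypothetical helper; empty optional fields are assumed to mean "absent".

validate_row() {
  local sample=$1 vcf=$2 vcf_index=$3 bam=${4:-} bam_index=${5:-}
  local re_sample='^[a-zA-Z0-9._-]+$'
  local re_vcf='^[^[:space:]]+\.vcf\.gz$'
  local re_tbi='^[^[:space:]]+\.vcf\.gz\.tbi$'
  local re_bam='^[^[:space:]]+\.bam$'
  local re_bai='^[^[:space:]]+\.bam\.bai$'

  # Required columns: sample, vcf, vcf_index.
  [[ $sample =~ $re_sample ]] || { echo "invalid sample: $sample" >&2; return 1; }
  [[ $vcf =~ $re_vcf ]] || { echo "invalid vcf: $vcf" >&2; return 1; }
  [[ $vcf_index =~ $re_tbi ]] || { echo "invalid vcf_index: $vcf_index" >&2; return 1; }

  # Optional columns are only checked when a value is present.
  if [[ -n $bam ]]; then
    [[ $bam =~ $re_bam ]] || { echo "invalid bam: $bam" >&2; return 1; }
  fi
  if [[ -n $bam_index ]]; then
    [[ $bam_index =~ $re_bai ]] || { echo "invalid bam_index: $bam_index" >&2; return 1; }
  fi
  return 0
}
```

A row such as `validate_row sample_1 /data/s.vcf.gz /data/s.vcf.gz.tbi` passes, whereas a sample ID like `bad;id` is rejected because `;` falls outside the allowed character class.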

assets/stub/EMPTY

Whitespace-only changes.

assets/stub/EMPTY_CLINICAL

Whitespace-only changes.
