Lessons Learned

Every failure encountered during pipeline development (Mar 2026), documented so they don't happen again.

Docker Image Issues

AnnotSV: Official image doesn't exist

Failed: bioinfochrustrasbourg/annotsv:3.4.4 — no such image on Docker Hub
Fix: Use getwilds/annotsv:3.4.4 instead (Fred Hutch maintained)

SnpEff/SnpSift: Combined package

Failed: quay.io/biocontainers/snpsift:5.2--hdfd78af_1 — no such manifest
Fix: Use quay.io/biocontainers/snpeff:5.2--hdfd78af_1 — SnpSift is bundled inside the SnpEff package

SnpEff Database: Azure blob storage outage

Failed: snpEff download GRCh38.105 and all database names — Azure blob storage returned 0-byte files for ALL URLs including GRCh38.105, GRCh38.mane.1.2.ensembl
Root cause: SnpEff databases are hosted on Azure blob storage which was down
Fix: Pivot to Ensembl VEP as alternative functional annotation tool

ExpansionHunter: Different image formats

Failed: quay.io/biocontainers/expansionhunter:5.0.0--hd03093a_1 — manifest not found. Also mgibio/expansionhunter:latest — not found
Fix: Use weisburd/expansionhunter:latest — binary at /ExpansionHunter/bin/ExpansionHunter, variant catalogs at /pathogenic_repeats/GRCh38/
Note: This is ExpansionHunter v2.5.5 which uses --repeat-specs (directory), not --variant-catalog (single JSON)

ExpansionHunter: Missing required --log parameter

Failed: Container exits immediately with "the option '--log' is required but missing"
Fix: Always include --log /output/sample_eh.log in the command

StellarPGx: Empty Docker image

Failed: twesigomwedavid/stellarpgx:latest — image exists but contains no StellarPGx binaries
Status: UNSOLVED. No working Docker image found for StellarPGx as of Mar 2026.

Tool-Specific Issues

TelomereHunter: Permission denied

Failed: OSError: [Errno 13] Permission denied: '/output/sergio' when writing output (sample name in error will vary)
Fix: Add --user root flag to docker run

TelomereHunter: pip install on host fails

Failed: pip install telomerehunter on the host gives UnicodeDecodeError — Python environment issues
Fix: Use Docker image lgalarno/telomerehunter:latest instead of native install

HLA-LA: Graph not serialized (3 failures)

Failed attempt 1: zlskidmore/hla-la:latest — graph at src/additionalReferences/PRG_MHC_GRCh38_withIMGT/ exists but is NOT serialized. HLA-LA exits silently.
Failed attempt 2: Copied graph files to graphs/PRG_MHC_GRCh38_withIMGT/ — "graph not complete"
Failed attempt 3: Ran --action prepareGraph in detached container — container exited during prep, graph still not serialized
Root cause: HLA-LA requires a pre-serialized graph (~40GB RAM to prepare, takes hours). No standard Docker image includes it.
Fix: Use jiachenzdocker/hla-la:latest (27.5GB image with pre-built graph) OR switch to T1K

HLA-LA: Binary crash with pre-built graph image

Failed: jiachenzdocker/hla-la:latest — read extraction succeeds but HLA-LA C++ binary crashes during graph alignment even with 32GB RAM and 8 threads. Error: "HLA-LA execution not successful."
Root cause: Likely an incompatibility between the pre-built binary and the BAM data format, or an unmet memory requirement (the graph deserialization may need >32GB)
Status: UNSOLVED. HLA-LA from WGS BAMs is unreliable in Docker. Alternative: use Sanitas clinical HLA typing results, or use arcas-hla or T1K with partial coordinates

T1K: Coordinate file with wrong values

Failed: t1k-build.pl -d hla.dat -g reference.fasta.fai produced coordinate file with chr19 -1 -1 + for all HLA genes
Root cause: The -g parameter expects the actual reference FASTA (3.1GB), not the FAI index (158KB). The AddGeneCoord.pl script needs to align HLA sequences against the genome to find coordinates.
Fix: Use -g Homo_sapiens_assembly38.fasta (the full FASTA, not the .fai)

T1K: BAM extraction produces 0-byte FASTQ

Failed: bam-extractor runs but produces empty _candidate_1.fq and _candidate_2.fq
Root cause: Coordinate file had -1 -1 coordinates (see above), so no genomic region was extracted
Fix: Regenerate coordinate file with proper reference FASTA

Cyrius CYP2D6: Inconclusive on short-read WGS

Result: Cyrius returned None for CYP2D6 star alleles
Root cause: CYP2D6 has extensive pseudogene homology (CYP2D7, CYP2D8) making short-read WGS unreliable
Mitigation: Rely on Sanitas lab calls for CYP2D6; consider long-read sequencing in future

bcftools/htslib Issues

bgzip/tabix not in bcftools image PATH

Failed: staphb/bcftools:1.21 does not include bgzip or tabix in $PATH
Fix: Use bcftools view -Oz -o output.vcf.gz (native bgzip output) and bcftools index -t (native tabix index) instead of piping to bgzip/tabix
Alternative: Use quay.io/biocontainers/samtools:1.21 which includes all htslib tools

MIS VCF conversion: 0-byte output files

Failed: All 22 chr files in mis_ready/ were 0 bytes
Root cause: Script used bgzip -c > output.vcf.gz which failed silently because bgzip wasn't available
Fix: See above — use bcftools view -Oz -o

VEP Cache Issues

VEP INSTALL.pl: Permission denied on temp directory

Failed: Cannot open Local file /opt/vep/.vep/tmp/homo_sapiens_vep_112_GRCh38.tar.gz
Fix: Run with --user root and pre-create the temp dir: mkdir -p /opt/vep/.vep/tmp && chmod 777 /opt/vep/.vep/tmp

VEP INSTALL.pl: Silent download failure

Failed: Container exits after "downloading..." with no cache files extracted
Root cause: The 17GB download timed out or was interrupted; INSTALL.pl doesn't retry
Fix: Download manually with wget -c (supports resume), then extract with tar xzf, then run VEP with --cache --dir_cache /path/to/cache

General Docker Tips

Always use resource limits

docker run --cpus 4 --memory 8g ...

Without limits, tools like DeepVariant or minimap2 will consume ALL available RAM and crash the host.

Build for amd64 from macOS

docker build --platform linux/amd64 ...

macOS is arm64; most Linux servers are amd64. Images built on Mac without --platform won't run on amd64 servers.

Use --rm for one-shot containers

Always use --rm for analysis containers to avoid accumulating stopped containers. Use -d (detached) for long-running jobs.

Always use --user root for write access

Most bioinformatics containers run as non-root users. If writing to bind-mounted volumes, add --user root to avoid permission issues.

CI / Workflow Issues

ShellCheck warnings still fail the GitHub Action

Observed: ludeeus/action-shellcheck@master exits non-zero even when configured with severity: warning
Impact: Once the check is required on main, "warning-only" findings still block merges
Fix: Clear ShellCheck warnings before enabling required status checks, or explicitly relax the workflow instead of assuming warnings are advisory only

Protect `main` only after CI is green

Observed: Required status checks become a trap if you enable branch protection while the default branch or active PR branch is still red
Fix: Get Lint and Smoke Tests green first, then enable required checks, block force-pushes, and block deletion

MToolBox Issues

MToolBox: No working Docker image exists

Failed: robertopreste/mtoolbox:latest — "repository does not exist or may require docker login"
Also checked: No image on quay.io/biocontainers, ghcr.io, or Docker Hub
Root cause: MToolBox was never officially containerized. GitHub issue #107 (Mar 2022) confirms: "Not at the moment."
Fix: Use GATK Mutect2 in mitochondrial mode instead (broadinstitute/gatk:latest). Mutect2 handles mitochondrial heteroplasmy detection natively and is well-maintained.

CNVnator Issues

CNVnator: Biocontainer tag with wrong build hash

Failed: quay.io/biocontainers/cnvnator:0.4.1--py312hc02a2a2_7 — manifest not found
Fix: Use quay.io/biocontainers/cnvnator:0.4.1--py312h99c8fb2_11. Biocontainer hashes encode the conda build hash and change between builds. Always verify at quay.io/repository/biocontainers/cnvnator.

Delly Issues

Delly: Biocontainer tag doesn't exist

Failed: quay.io/biocontainers/delly:1.2.9--ha41ced6_0 — manifest not found
Fix: Use quay.io/biocontainers/delly:1.7.3--hd6466ae_0 (latest as of Mar 2026). Biocontainer tags are version-specific and change frequently — always verify at quay.io/repository/biocontainers/delly.

Delly: SV annotation phase takes 2-3 hours

Observed: Delly's "SV annotation" step runs for 2-3 hours at 100% CPU on a 30X WGS genome. No new log output appears during this time, which can look like the process is stuck.
This is normal. Delly genotypes every candidate SV site against the reference, which is CPU-intensive. Total runtime for 30X WGS: ~3-4 hours.
Tip: Use docker stats to confirm the container is still using CPU. If CPU is at 0%, the process may actually be stuck.

Delly: Output is BCF format, not VCF

Gotcha: Delly writes BCF (binary VCF), not VCF. The output file has a .bcf extension.
Fix: Convert with bcftools view input.bcf -Oz -o output.vcf.gz and index with bcftools index -t output.vcf.gz. The pipeline script handles this automatically.

CNVnator: ROOT file appears empty (266 bytes) during tree extraction

Observed: During the -tree step, the .root file stays at 266 bytes (just the ROOT header) until the entire BAM is parsed.
This is normal. For a 30X WGS (~80GB BAM), the tree step takes ~5-10 minutes. The ROOT file grows to ~900MB-1.2GB only at the very end when the tree is flushed to disk.
If the container exits and the file is still 266 bytes: Check if a corrupt .root file from a previous failed run is blocking it. Delete and retry.

CPSR/PCGR Issues

CPSR: --pcgr_dir path confusion (PCGR 1.x, historical)

Failed: cpsr --pcgr_dir /genome/pcgr_data/data → "Data directory (/genome/pcgr_data/data/data) does not exist"
Root cause: CPSR 1.x internally appends /data to whatever --pcgr_dir you pass. If you point to the data/ directory inside the extracted bundle, it looks for data/data/.
Fix (1.x): Point --pcgr_dir to the parent of the data/ directory: --pcgr_dir /genome/pcgr_data (not /genome/pcgr_data/data)
Superseded by PCGR 2.x — the --pcgr_dir flag no longer exists. See migration notes below.

CPSR: Docker image is inside PCGR

Failed: sigven/cpsr:2.0.0 does not exist on Docker Hub
Fix: Use sigven/pcgr:2.2.5 which bundles both pcgr and cpsr binaries at /usr/local/bin/

PCGR 2.x Migration (1.4.1 to 2.2.5)

CLI completely changed: The --pcgr_dir flag is gone. Replaced by --refdata_dir (for the ref data bundle) and --vep_dir (for the VEP cache). These are separate mount points inside the container.
Docker volumes changed: PCGR 1.x used a single -v ${GENOME_DIR}:/genome mount. PCGR 2.x requires four separate mounts: VEP cache (:/mnt/.vep), ref data bundle (:/mnt/bundle), input VCFs (:/mnt/inputs), and outputs (:/mnt/outputs).
Data bundle is smaller and different: The old monolithic ~21 GB bundle (pcgr.databundle.grch38.20220203.tgz) that included VEP cache is replaced by a smaller ~5 GB ref data bundle (pcgr_ref_data.20250314.grch38.tgz). VEP cache is now provided separately (reuse the same cache from step 13).
Bundle extraction requires extra step: After tar xzf, the extracted data/ directory must be moved into a version-stamped directory: mkdir -p 20250314 && mv data/ 20250314/. The --refdata_dir mount points to this version directory.
Docker tag: sigven/pcgr:1.4.1 → sigven/pcgr:2.2.5. The image still bundles both pcgr and cpsr binaries.
Old data bundle is incompatible: If you have the 1.x bundle, you must download the 2.x bundle fresh. The directory structure and expected paths are completely different.

Michigan Imputation Server Notes

Minimum 20 samples per job

MIS is designed for genotyping array data from cohort studies
Single-sample WGS submissions may be rejected (20-sample minimum)
For individual WGS: imputation adds minimal value (you already have 90%+ variant coverage)
Main benefit for WGS would be phasing, not imputation

Registration required

Must create account at imputationserver.sph.umich.edu
API tokens expire after 30 days
Results auto-deleted after 7 days

TOPMed panel is best for Europeans

TOPMed Freeze 8 (r2): 132K samples, 705M variants, GRCh38 native
Uses chr prefix (which GRCh38 BAMs already have)
HRC r1.1 (32K samples) is European-centric but hg19 only

Alternative Variant Caller Issues

FreeBayes 1.3.7: SIGILL crash (exit code 132)

Failed: quay.io/biocontainers/freebayes:1.3.7--h1870644_0 — freebayes --version works, but actual variant calling triggers SIGILL (illegal instruction, exit code 132)
Tested on: Intel i5-14500
Root cause: Likely a build-time CPU optimization mismatch in the 1.3.7 biocontainer binary
Fix: Use quay.io/biocontainers/freebayes:1.3.6--hbfe0e7f_2 which works correctly

FreeBayes: Memory grows to ~13 GB on full genome

Observed: FreeBayes memory usage grows unpredictably during full-genome runs: 463MB at 30 min, 6.4GB at 60 min, 12.7GB at 90 min, then stabilizes ~12GB
Original limit: --memory 16g was too tight — would have OOM-killed at 80% usage
Fix: Use --memory 32g for full-genome runs. Peak observed was 12.8GB but growth is non-linear and region-dependent.

FreeBayes: Single-threaded, no parallelism

Observed: FreeBayes has no -t or --threads flag. Full 30X WGS takes 8-12 hours.
Workaround: Use --region chr22 (or INTERVALS=chr22) for quick testing (~20-40 min)
For production: Consider GNU parallel with per-chromosome regions, then merge VCFs

GATK HaplotypeCaller: bcftools index fails on existing .tbi

Failed: bcftools index -t fails with "index file exists" after GATK already creates its own .tbi
Fix: Use bcftools index -ft (with -f force flag) to overwrite the GATK-generated index

GATK HaplotypeCaller: Requires .dict file

Failed: GATK HaplotypeCaller fails if Homo_sapiens_assembly38.dict is missing
Fix: Generate once with gatk CreateSequenceDictionary -R /genome/reference/Homo_sapiens_assembly38.fasta

bcftools isec: -R vs -r for region strings

Failed: bcftools isec -R chr22 treats -R (uppercase) as a BED file path, fails with "file not found"
Fix: Use -r chr22 (lowercase) for region strings. -R expects a file.

TIDDIT >=3.9: Requires BWA index for local assembly

Failed: tiddit --sv exits with "The reference must be indexed using bwa index; run bwa index, or skip local assembly (--skip_assembly)"
Root cause: TIDDIT 3.9+ uses local assembly for breakpoint refinement, which requires BWA index files alongside the reference
Fix: Use --skip_assembly when using minimap2 alignments (no BWA index available). If using BWA-MEM2 alignment, the index files are compatible.

TIDDIT: Image tag 3.7.0 doesn't exist on quay.io

Failed: quay.io/biocontainers/tiddit:3.7.0--py312h24f4cff_1 — manifest unknown
Fix: Use quay.io/biocontainers/tiddit:3.9.5--py312h6e8b409_0. Always verify tags at quay.io/repository/biocontainers/tiddit.

Strelka2: --callRegions needs bgzipped + tabixed BED

Failed: --callRegions reference.fasta.fai → "Can't find expected call-regions bed index file"
Fix: Create a proper bgzipped BED file with tabix index. Use GATK container for bgzip/tabix (not in bcftools or samtools staphb images).

bgzip/tabix not in staphb/samtools or staphb/bcftools images

Observed: Neither staphb/samtools:1.20 nor staphb/bcftools:1.21 include bgzip or tabix in PATH
Fix: Use broadinstitute/gatk:4.6.1.0 which has both at /usr/bin/bgzip and /usr/bin/tabix. Or use bcftools view -Oz as a bgzip alternative.

FreeBayes chr22 variant count (3x more than DeepVariant)

Observed: FreeBayes calls ~247K variants on chr22 vs DeepVariant ~91K and GATK ~69K
Interpretation: The ~200K FreeBayes-unique variants are mostly false positives. FreeBayes maximizes sensitivity at the cost of precision.
Recommendation: Always quality-filter FreeBayes output with bcftools filter or vcffilter before use.

Chip Data Conversion (Genotyping Arrays → VCF)

plink silently corrupts single-sample homozygous ALT genotypes

What failed: plink --23file (1.9) to import + plink2 --ref-from-fa force to fix REF/ALT
Why: For single-sample data, ALL homozygous positions are monomorphic. plink's .bim stores only one allele for these. --ref-from-fa cannot create a proper ALT because there's no second allele slot. Homozygous ALT genotypes silently become homozygous REF.
Verified: rs9939609 (FTO), genotype=AA, REF=T. plink: REF=A, ALT=., GT=0/0 (WRONG). bcftools: REF=T, ALT=A, GT=1/1 (CORRECT). ~66K positions (11%) corrupted.
Fix: Use bcftools convert --tsv2vcf -f <reference.fa>. Single command, no intermediate binary format.

plink 1.9 --23file quirks

--allow-extra-chr cannot be used with --23file
Female samples with Y calls (MyHeritage GSA PAR region) error with sex=F
Sex inference defaults to male unless explicitly set

MyHeritage CSV must be converted to TSV

Quoted CSV with "RSID","CHROMOSOME","POSITION","RESULT" columns
Strip ## comments, header, quotes; convert commas to tabs

bcftools hg19 VCF needs chr prefix before liftover

hg19 reference uses numeric chromosomes; chain file expects chr prefix
bcftools annotate --rename-chrs between conversion and liftover

PharmCAT chip vs WGS results (MyHeritage GSA, verified 2026-03-31)

Correct: CYP2B6 (*1/*6), CYP4F2 (*1/*6), DPYD (*5/*5), NUDT15 (*1/*2)
Missed: CYP2C19 (25 missing), VKORC1 (1 missing)
Wrong: CYP3A5 *1/*1 (should be *3/*3, 4 missing positions)
Total: 888 missing PGx positions from the GSA chip

ROH and PRS need special flags for chip data

ROH: -G30 required (no FORMAT/PL in chip VCF)
PRS: no-mean-imputation required (single sample lacks allele frequencies)
PRS matching: chip ~12% of large scoring files vs WGS ~28%

v0.3.0 Tool Additions (Apr 2026)

ExpansionHunter v5.0.0: Completely different CLI from v2.5.5

Old (v2.5.5): --bam, --ref-fasta, --repeat-specs (directory), --vcf, --json, --log (all required)
New (v5.0.0): --reads, --reference, --variant-catalog (single JSON), --output-prefix, --threads
The --log flag is gone in v5.0.0. Do NOT pass it or the command will fail.
Biocontainer image: quay.io/biocontainers/expansionhunter:5.0.0--hc26b3af_5. Binary is ExpansionHunter (on PATH, not at /ExpansionHunter/bin/).
Bundled catalog: /usr/local/share/ExpansionHunter/variant_catalog/grch38/variant_catalog.json (31 loci). No need to download separately.

GRIDSS: Requires BWA index (not minimap2)

Failed: GRIDSS exits with "BWA index not found" when using default minimap2 alignment
Fix: Either (a) align with BWA-MEM2 first (02a-alignment-bwamem2.sh), or (b) generate BWA index files separately. The 04b-gridss.sh script validates this and prints instructions.
Note: GRIDSS outputs ALL SVs as BND (breakend) notation. Standard DEL/DUP/INV types require post-processing conversion for SURVIVOR merge compatibility.

GRIDSS: 32 GB memory requirement

Observed: GRIDSS assembly-based SV calling needs ~28 GB JVM heap for 30X WGS
Fix: Container runs with --memory 32g and -Xmx28g JVM argument. Will fail silently on machines with < 32 GB RAM.

GRIDSS: ENCODE blacklist download

Observed: GRIDSS benefits from an ENCODE blacklist to suppress known artifact regions
Fix: Script auto-downloads ENCFF356LFX.bed.gz (hg38 blacklist) from ENCODE on first run. If download fails (offline), GRIDSS runs without it (lower precision but still functional).

fastp: Maximum 16 threads despite --workers flag

Observed: fastp accepts -w (workers) up to 16. Values above 16 are clamped to 16. The -w flag controls I/O worker threads; actual adapter detection is single-threaded.
Recommendation: Use -w matching THREADS up to 16. For most WGS runs, -w 4 is sufficient.

fastp: BGI/MGI adapter auto-detection

Observed: fastp's --detect_adapter_for_pe works for Illumina, BGI, and MGI adapters without specifying adapter sequences. BGI/MGI adapters are compiled into fastp's knownadapters.h.
No action needed: The --detect_adapter_for_pe flag handles all common sequencing platforms.

mosdepth: --fast-mode skips per-base output

Observed: --fast-mode uses a simpler, faster counting method and does NOT write the per-base .per-base.bed.gz file. This saves ~2 GB of output and cuts runtime by ~40%.
Fix: Always use --fast-mode unless per-base resolution is specifically needed.

MultiQC: Auto-discovers fastp JSON by content, not filename

Observed: MultiQC identifies fastp output by looking for "before_filtering": { in JSON files, not by filename pattern. Files must end in .json.
Tip: Ensure fastp's -j output uses .json extension.

Octopus: No issues observed

Docker image dancooke/octopus:0.7.4 works out of the box for germline calling
Supports --threads for parallelism (unlike FreeBayes)
Typical memory: 8-12 GB for 30X WGS (much less than FreeBayes peak of 13 GB)

PharmCAT 3.x Migration (2.15.5 to 3.2.0)

Preprocessor script renamed (no .py extension)

Old (2.15.5): python3 /pharmcat/pharmcat_vcf_preprocessor.py
New (3.2.0): python3 /pharmcat/pharmcat_vcf_preprocessor
Impact: Step 7 preprocessor command must drop the .py suffix or the container exits with "No such file"

Reporter flags: must be explicit for both formats

Old (2.15.5): -reporterJson produced JSON; HTML was always generated by default
New (3.2.0): If ANY format flag is specified, ONLY those formats are produced. To get both HTML and JSON, you must pass -reporterJson -reporterHtml
Impact: Step 7 now passes both flags explicitly. Without -reporterHtml, the HTML report (used for manual review) would silently stop being generated.

JSON property rename: wildtypeAllele to referenceAllele

Old (2.15.5): wildtypeAllele property in gene result JSON objects
New (3.2.0): Renamed to referenceAllele
Impact: Step 27's JSON parser does not use this property directly, so no code change was needed. Any downstream scripts or notebooks that parse wildtypeAllele must be updated.

New features in 3.2.0

NAT2 calling: PharmCAT 3.x includes improved NAT2 acetylator status calling
BCF support: Preprocessor now accepts BCF input files directly (no conversion needed)
Single-gene calling: New -g flag allows running PharmCAT on a single gene (useful for targeted re-analysis)

FilesExpand file tree

lessons-learned.md

Latest commit

History