Extracts the small subset of clinically interesting variants from your VEP-annotated VCF. Instead of manually searching through 4-5 million variants, this step produces a focused list of ~200-500 variants that are rare AND functionally impactful.
The biggest challenge after running VEP annotation is: "I have millions of variants, what do I look at?" This step solves that by applying conservative filters to surface the variants most likely to be medically relevant.
bcftools with its `+split-vep` plugin (parses VEP CSQ fields structurally, no grep)
staphb/bcftools:1.21
- VEP-annotated VCF from step 13: `${GENOME_DIR}/${SAMPLE}/vep/${SAMPLE}_vep.vcf` (or `.vcf.gz`)
./scripts/23-clinical-filter.sh your_name

The script produces up to three variant sets (depending on which VEP annotations are available), which are then merged:
- Stop-gained (premature stop codon — breaks the protein)
- Frameshift insertions/deletions (shifts reading frame — breaks the protein)
- Splice donor/acceptor (disrupts splicing — breaks the protein)
- Start-lost (no translation initiation)
Expected count: 100-200 per genome. Uses bcftools +split-vep -s worst to select the most severe consequence per variant.
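The HIGH-impact selection described above might look roughly like the following. This is a minimal sketch, not the script's actual code: `filter_high_impact` is a hypothetical helper name, and it assumes `bcftools` is on the PATH (in this pipeline it would run inside the `staphb/bcftools:1.21` container).

```shell
# Sketch only: keep PASS variants whose most severe consequence is HIGH impact.
# filter_high_impact is a hypothetical name, not a function from the script.
filter_high_impact() {
  local in_vcf="$1" out_vcf="$2"
  # -c IMPACT  : extract the IMPACT subfield of CSQ into INFO/IMPACT
  # -s worst   : use only the most severe consequence per variant
  # -i '...'   : keep records that are PASS and HIGH impact
  bcftools +split-vep "$in_vcf" \
    -c IMPACT \
    -s worst \
    -i 'FILTER="PASS" && IMPACT="HIGH"' \
    -Oz -o "$out_vcf"
}

# Usage (paths are illustrative):
# filter_high_impact "${SAMPLE}_vep.vcf.gz" "${SAMPLE}_high_impact.vcf.gz"
```

Filtering on `IMPACT` rather than on individual consequence terms keeps the expression short, since VEP assigns HIGH to all the consequence types listed above.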
- Missense variants (amino acid change) with gnomAD allele frequency < 1%
- In-frame insertions/deletions with gnomAD AF < 1%
Expected count: 200-400 per genome after frequency filtering. If the VEP output lacks gnomAD frequencies (VEP was not run with --af_gnomade), all MODERATE variants are included without frequency filtering.
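The frequency fallback can be sketched as a filter expression that changes depending on whether gnomAD frequencies are present. The helper names below are hypothetical, and two details are assumptions rather than facts from the script: that the CSQ subfield is called `gnomADe_AF` (the name depends on the VEP version), and that variants with no gnomAD entry are kept.

```shell
# Hypothetical helper: print the bcftools expression for the MODERATE set.
# $1 is "yes" when the CSQ header contains gnomADe_AF, anything else otherwise.
moderate_expr() {
  if [ "$1" = "yes" ]; then
    # gnomADe_AF="." keeps variants with no gnomAD frequency (assumption).
    printf '%s' 'FILTER="PASS" && IMPACT="MODERATE" && (gnomADe_AF<0.01 || gnomADe_AF=".")'
  else
    # No frequency data: every PASS MODERATE variant is kept.
    printf '%s' 'FILTER="PASS" && IMPACT="MODERATE"'
  fi
}

# Hypothetical helper: apply the expression with bcftools +split-vep.
filter_rare_moderate() {
  local in_vcf="$1" out_vcf="$2" has_af="$3"
  local cols="IMPACT"
  if [ "$has_af" = "yes" ]; then
    # :Float makes the extracted frequency usable in numeric comparisons.
    cols="IMPACT,gnomADe_AF:Float"
  fi
  bcftools +split-vep "$in_vcf" -c "$cols" -s worst \
    -i "$(moderate_expr "$has_af")" -Oz -o "$out_vcf"
}
```

Extracting the frequency as `Float` matters here: without the type suffix the subfield is treated as a string and `<0.01` would not behave as a numeric comparison.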
- Variants with `CLIN_SIG` containing "pathogenic" (covers both pathogenic and likely_pathogenic)
- Only runs if VEP was run with `--everything` or `--check_existing` (which populates the CLIN_SIG field)
- Does not use `-s worst`: a variant is included if ANY transcript annotation has a pathogenic ClinVar entry
Expected count: 10-50 per genome (depends on ClinVar version).
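A ClinVar selection along these lines could be sketched as below. `filter_clinvar_pathogenic` is a hypothetical name and the exact expression used by the script may differ. One thing worth checking in your own output: a plain substring match on "pathogenic" also matches the VEP value `conflicting_interpretations_of_pathogenicity`, so such records may appear in this set.

```shell
# Sketch only: keep PASS variants where any transcript's CLIN_SIG mentions
# "pathogenic". filter_clinvar_pathogenic is a hypothetical helper name.
filter_clinvar_pathogenic() {
  local in_vcf="$1" out_vcf="$2"
  # No -s worst here, so CLIN_SIG values from all transcripts are considered.
  # The ~ operator is a case-sensitive regular-expression match.
  bcftools +split-vep "$in_vcf" \
    -c CLIN_SIG \
    -i 'FILTER="PASS" && CLIN_SIG~"pathogenic"' \
    -Oz -o "$out_vcf"
}

# Usage (paths are illustrative):
# filter_clinvar_pathogenic "${SAMPLE}_vep.vcf.gz" "${SAMPLE}_clinvar_pathogenic.vcf.gz"
```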
| File | Contents | Size |
|---|---|---|
| `${SAMPLE}_clinical.vcf.gz` | Combined clinically interesting VCF | < 5 MB |
| `${SAMPLE}_clinical_summary.tsv` | Human-readable tab-delimited table | < 1 MB |
| `${SAMPLE}_high_impact.vcf.gz` | HIGH impact variants only | < 2 MB |
| `${SAMPLE}_rare_moderate.vcf.gz` | Rare MODERATE variants only | < 3 MB |
| `${SAMPLE}_clinvar_pathogenic.vcf.gz` | ClinVar P/LP only (if CLIN_SIG available) | < 1 MB |
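To compare a run against the expected counts given for each set, you can count the records in each output file. The helper below is a small sketch (`count_records` is a hypothetical name); it relies only on `gzip` and `awk`, since bgzip-compressed VCFs are readable by `gzip -dc`.

```shell
# Hypothetical helper: count non-header records in a compressed VCF.
count_records() {
  gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}

# Example loop over the per-set outputs (uncomment to use):
# for set in high_impact rare_moderate clinvar_pathogenic clinical; do
#   f="${GENOME_DIR}/${SAMPLE}/clinical/${SAMPLE}_${set}.vcf.gz"
#   if [ -f "$f" ]; then
#     printf '%s\t%s\n' "$set" "$(count_records "$f")"
#   fi
# done
```

Counts far outside the expected ranges (for example, tens of thousands of MODERATE variants) usually suggest that a filter was skipped, such as the gnomAD frequency filter.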
~5-10 minutes (I/O-bound, reading the large VEP VCF)
# View the most important variants
column -t ${GENOME_DIR}/${SAMPLE}/clinical/${SAMPLE}_clinical_summary.tsv | head -20

# Find which clinical variants are also in ClinVar
docker run --rm -v "${GENOME_DIR}:/genome" staphb/bcftools:1.21 \
  bcftools isec -n=2 -w1 \
  /genome/${SAMPLE}/clinical/${SAMPLE}_clinical.vcf.gz \
  /genome/clinvar/clinvar.vcf.gz \
  -Oz -o /genome/${SAMPLE}/clinical/${SAMPLE}_clinical_clinvar.vcf.gz

The `_clinical.vcf.gz` file is small enough to load in IGV Web or gene.iobio for visual inspection.
- This is a computational filter, not a clinical interpretation
- Some pathogenic variants are annotated as LOW or MODIFIER impact (e.g., synonymous variants affecting splicing, regulatory variants) and will be missed by this filter
- gnomAD frequency filtering depends on VEP having annotated the gnomAD fields correctly
- Always cross-reference findings with ClinVar and consult a professional for clinical decisions
- No additional Docker images required — uses the same bcftools image as other steps
- Uses `bcftools +split-vep` to parse VEP's pipe-delimited CSQ annotation structurally (not grep)
- The VEP VCF is compressed and indexed automatically if needed
- PASS filter is applied to exclude low-quality variant calls
- Available CSQ subfields are auto-detected — ClinVar and gnomAD filters are skipped gracefully if not present
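Subfield auto-detection of this kind can be built on `bcftools +split-vep -l`, which lists the CSQ subfields declared in the VCF header. The sketch below is an illustration, not the script's code: `has_csq_field` is a hypothetical helper, and it assumes the listing puts the subfield name in the last column of each line.

```shell
# Hypothetical helper: succeed if the named subfield appears in a
# `bcftools +split-vep -l` listing read from stdin.
has_csq_field() {
  awk -v want="$1" '$NF == want { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage (vep_vcf is an illustrative variable name):
# if bcftools +split-vep -l "$vep_vcf" | has_csq_field CLIN_SIG; then
#   echo "CLIN_SIG present: ClinVar filter will run"
# else
#   echo "CLIN_SIG missing: ClinVar filter skipped"
# fi
```

Matching on the exact field name avoids false positives from subfields that merely contain the name as a substring.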