Step 1b: fastp QC + Adapter Trimming

Pre-alignment quality control: removes adapter sequences, trims low-quality bases, filters short reads, and removes polyG tails. Produces trimmed FASTQs plus JSON and HTML QC reports.


What It Does

  1. Adapter removal — auto-detects and trims Illumina TruSeq, Nextera, BGI/MGI, and 170+ other adapter sequences using both sequence matching and PE overlap analysis
  2. Quality trimming — slides a 4bp window from both read ends, trimming bases with mean quality below Q20
  3. PolyG tail removal — strips artificial polyG tails generated by NovaSeq/NextSeq two-color chemistry
  4. Length filtering — discards reads shorter than 36bp after trimming
  5. QC reporting — generates per-base quality curves, adapter content, GC distribution, duplication estimates, and insert size distribution (before + after filtering)
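The tail-side quality trim in step 2 can be illustrated with a simplified sketch. It assumes Phred+33 qualities and tests the plain mean of each window; the function name `tail_trim_len` is hypothetical, and fastp's own implementation handles further edge cases.

```shell
# Simplified sketch of tail-side sliding-window trimming (not the fastp source).
# Usage: tail_trim_len <phred33_quality_string> [window] [min_mean_quality]
# Prints how many bases would be kept, counted from the 5-prime end.
tail_trim_len() {
  printf '%s\n' "$1" | awk -v w="${2:-4}" -v t="${3:-20}" '
    BEGIN { for (i = 33; i <= 126; i++) phred[sprintf("%c", i)] = i - 33 }
    {
      kept = 0
      # Slide the window inward from the 3-prime end until its mean passes.
      for (end = length($0); end >= w; end--) {
        sum = 0
        for (i = end - w + 1; i <= end; i++) sum += phred[substr($0, i, 1)]
        if (sum / w >= t) { kept = end; break }
      }
      print kept
    }'
}
```

For example, `tail_trim_len 'IIIIIIII####'` prints 10: the first window whose mean reaches Q20 covers two Q40 and two Q2 bases, so in this sketch the cut lands at the end of that window. A read whose kept length falls below 36 would then be dropped by the length filter in step 4.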

Why

Raw FASTQ reads contain adapter contamination and low-quality bases that can cause:

  • False variant calls — adapter sequences misalign to the reference and look like mutations
  • Reduced mapping rates — reads with adapter tails may not align or align incorrectly
  • Inflated duplication — adapter-contaminated reads cluster together artificially

This is particularly important for BGI/MGI (DNBseq) and Nebula Genomics data, where adapter contamination rates are higher than typical Illumina libraries. Aligners like minimap2 and BWA-MEM2 do not detect or remove adapters — they only soft-clip unmapped portions as a side effect of local alignment.

DeepVariant uses base quality as one of its 6 input channels. Adapter bases are sequenced like any other bases and often carry high quality scores, so DeepVariant cannot distinguish them from real sequence by quality alone.

Tool

fastp v1.3.1 — all-in-one FASTQ preprocessor with SIMD acceleration and parallel compression.

Docker Image

quay.io/biocontainers/fastp:1.3.1--h43da1c4_0
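To try the image outside the pipeline scripts, something like the following may help. It is a dry-run sketch: the helper name `fastp_docker_cmd` and the `/data` mount point are assumptions, and the function only prints the command so it can be reviewed before running.

```shell
# Hypothetical helper: print a docker command that runs fastp from the pinned image.
# Assumes the host data directory is mounted at /data inside the container.
FASTP_IMAGE="quay.io/biocontainers/fastp:1.3.1--h43da1c4_0"

fastp_docker_cmd() {
  # $1 = host data directory; remaining arguments are passed to fastp unchanged
  data_dir=$1; shift
  printf 'docker run --rm -v %s:/data -w /data %s fastp %s\n' \
    "$data_dir" "$FASTP_IMAGE" "$*"
}

# Print the command for a version check; paste it into a terminal to run it.
fastp_docker_cmd "${GENOME_DIR:-/path/to/data}" --version
```

Paths containing spaces would need quoting before the printed command is executed.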

Command

export GENOME_DIR=/path/to/data
./scripts/01b-fastp-qc.sh <sample_name>

Skip trimming

SKIP_TRIM=true ./scripts/run-all.sh <sample_name> <sex>

When skipped, the alignment step reads raw FASTQs directly (existing behavior).
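How the alignment step picks its input is not shown here; a minimal sketch of the idea, assuming raw reads live in a `fastq/` directory (a hypothetical name) next to `fastq_trimmed/`, might look like this:

```shell
# Hypothetical sketch of input selection in the alignment step.
# fastq_trimmed/ is written by this step; the raw "fastq/" name is an assumption.
alignment_input_dir() {
  genome_dir=$1
  if [ "${SKIP_TRIM:-false}" = "true" ]; then
    echo "$genome_dir/fastq"            # raw reads, trimming skipped
  else
    echo "$genome_dir/fastq_trimmed"    # output of the fastp step
  fi
}
```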

What Happens Inside

# Flags are collected in a bash array so each one can carry an inline comment.
fastp_args=(
  -i input_R1.fastq.gz -I input_R2.fastq.gz
  -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz
  --detect_adapter_for_pe        # Also auto-detect adapter sequences for PE (overlap trimming is always on)
  --qualified_quality_phred 20   # Bases below Q20 are "unqualified"
  --cut_front --cut_tail         # Sliding window trim from both ends
  --cut_mean_quality 20          # Window mean quality threshold
  --length_required 36           # Discard reads < 36bp
  -g                             # PolyG tail trimming (NovaSeq/NextSeq)
  -R "sample_name"               # Report title (shown in the HTML report)
  -j report.json -h report.html  # QC reports
  -w 8                           # Worker threads
)
fastp "${fastp_args[@]}"

Adapter detection

For paired-end data, fastp detects adapters by overlap analysis: when the insert is shorter than the read length, R1 and the reverse complement of R2 overlap across the whole insert, and any bases that extend past the overlap are adapter sequence. This works for any adapter type without needing to specify sequences.
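The overlap idea can be sketched in a few lines of shell and awk. This is a simplified illustration, not fastp's algorithm: it requires an exact match and equal-length reads, whereas fastp tolerates mismatches, and the function name `insert_len` is hypothetical.

```shell
# Simplified overlap analysis (exact matching only).
# Usage: insert_len <R1> <R2> [min_overlap]; reads are assumed to be equal length.
# Prints the inferred insert length, or nothing if no overlap is found.
insert_len() {
  printf '%s\t%s\n' "$1" "$2" | awk -F '\t' -v min="${3:-30}" '
    function revcomp(s,    i, out) {
      for (i = length(s); i >= 1; i--) out = out comp[substr(s, i, 1)]
      return out
    }
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A"; comp["N"] = "N" }
    {
      r1 = $1; rc2 = revcomp($2); n = length(r1)
      # With an insert of length len < n, R1 starts with the insert and
      # revcomp(R2) ends with it; bases after position len in R1 are adapter.
      for (len = n - 1; len >= min; len--)
        if (substr(r1, 1, len) == substr(rc2, length(rc2) - len + 1)) { print len; exit }
    }'
}
```

With toy reads built from a 10bp insert plus a 5bp adapter tail, `insert_len ACGTTGCAAGAAGTC CTTGCAACGTAAGTC 8` prints 10, so the last 5 bases of each read would be trimmed as adapter.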

Additionally, fastp has 170+ built-in adapter sequences including both Illumina TruSeq and BGI/MGI adapters in its known adapters database. No platform-specific flags are needed.

BGI/MGI note

fastp's known adapters include the MGI/BGI forward (AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA) and reverse (AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG) sequences. Auto-detection handles DNBseq data without additional configuration.
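To get a rough sense of contamination in raw DNBseq reads before trimming, one option is to count exact occurrences of the start of the forward adapter. The helper below is a sketch with a hypothetical name; an exact substring search undercounts adapters that are truncated at the read end or contain sequencing errors.

```shell
# Count reads (among the first max_reads) that contain a given adapter prefix.
# Usage: count_adapter_reads <fastq.gz> <adapter_prefix> [max_reads]
count_adapter_reads() {
  gzip -dc "$1" | head -n "$(( ${3:-100000} * 4 ))" |
    awk -v a="$2" 'NR % 4 == 2 && index($0, a) > 0 { n++ } END { print n + 0 }'
}

# First 16 bases of the MGI/BGI forward adapter quoted above.
MGI_FWD_PREFIX="AAGTCGGAGGCCAAGC"
```

Usage: `count_adapter_reads raw_R1.fastq.gz "$MGI_FWD_PREFIX"`, where `raw_R1.fastq.gz` stands in for your own file.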

Output

| File | Location | Description |
| --- | --- | --- |
| Trimmed R1 | fastq_trimmed/<sample>_R1.fastq.gz | Quality-filtered, adapter-trimmed read 1 |
| Trimmed R2 | fastq_trimmed/<sample>_R2.fastq.gz | Quality-filtered, adapter-trimmed read 2 |
| JSON report | fastq_trimmed/<sample>_fastp.json | Machine-readable QC (consumed by MultiQC) |
| HTML report | fastq_trimmed/<sample>_fastp.html | Visual QC report (open in browser) |
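A quick post-run check that all four outputs were written can catch a silently failed run. The function below is a sketch with a hypothetical name; it only tests that each file exists and is non-empty.

```shell
# Verify that the four expected outputs exist and are non-empty.
# Usage: check_fastp_outputs <fastq_trimmed_dir> <sample>
# Returns non-zero if any file is missing or empty.
check_fastp_outputs() {
  dir=$1; sample=$2; status=0
  for f in "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz" \
           "${sample}_fastp.json" "${sample}_fastp.html"; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}
```

For a stricter check, `gzip -t` on each trimmed FASTQ also confirms that the compressed stream is intact.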

Runtime

| Dataset | Threads | Time | Memory |
| --- | --- | --- | --- |
| 30X WGS (~60 GB FASTQ) | 8 | ~10-20 min | < 2 GB |
| chr22 test data | 4 | < 1 min | < 1 GB |

fastp is I/O-bound, not CPU-bound. Threads beyond 8-16 provide minimal improvement.
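Given that, a wrapper might simply cap the thread count instead of passing every available core. A small sketch, assuming `nproc` (GNU coreutils) is available when no CPU count is supplied; the function name is hypothetical:

```shell
# Pick a worker-thread count: available CPUs, capped (default cap 8).
# Usage: fastp_threads [available_cpus] [cap]
fastp_threads() {
  avail=${1:-$(nproc)}
  cap=${2:-8}
  if [ "$avail" -lt "$cap" ]; then echo "$avail"; else echo "$cap"; fi
}
```

The result can be passed straight to fastp, for example `-w "$(fastp_threads)"`.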

Interpreting the Reports

HTML report

Open <sample>_fastp.html in a browser to see:

  • Before/after filtering summary — total reads, bases, Q20/Q30 rates
  • Quality score distribution — per-base quality across read positions
  • Base content — A/T/G/C/N proportions (should be ~25% each, flat across positions)
  • Adapter content — fraction of reads with detected adapters
  • Insert size distribution — fragment length peak (typically 300-500bp for WGS)
  • Duplication rate — an approximate, hash-based estimate rather than an exact count

What to look for

  • Adapter rate > 5%: Normal for some library preps, but indicates trimming is important
  • Q30 rate < 80% before filtering: Below-average sequencing quality
  • PolyG spike at read ends: Common on NovaSeq — fastp removes these automatically
  • Uneven base content at read starts: First 10-15bp often show bias from library preparation, for example random hexamer priming or enzymatic fragmentation (normal, not actionable)
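The first two checks can be automated from the JSON report. The sketch below assumes `jq` is installed and that the report uses the field names shown (`adapter_cutting.adapter_trimmed_reads`, `summary.before_filtering.total_reads`, `summary.before_filtering.q30_rate`); verify them against your own report. Both function names are hypothetical.

```shell
# Extract the adapter-trimmed read fraction and the pre-filter Q30 rate (requires jq).
# Prints the two values separated by a tab.
fastp_rates() {
  jq -r '[(.adapter_cutting.adapter_trimmed_reads // 0) / .summary.before_filtering.total_reads,
          .summary.before_filtering.q30_rate] | @tsv' "$1"
}

# Apply the thresholds from the list above to two fractions in [0, 1].
# Usage: flag_qc <adapter_rate> <q30_rate>
flag_qc() {
  awk -v a="$1" -v q="$2" 'BEGIN {
    if (a > 0.05) print "adapter rate above 5%: trimming matters for this library"
    if (q < 0.80) print "Q30 below 80% before filtering: below-average run quality"
  }'
}
```

Usage: `flag_qc $(fastp_rates "fastq_trimmed/${sample}_fastp.json")`, left unquoted so the two tab-separated values become two arguments. No output means neither threshold was crossed.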

MultiQC Integration

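MultiQC detects fastp's JSON report automatically; the file must contain "before_filtering": { to be recognized. By default, MultiQC takes the sample name from the input FASTQ filename recorded in the report's command line, not from the -R report title, so name the input files after the sample.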
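If a report does not show up in MultiQC, checking for that marker string is a quick first step. A minimal sketch with a hypothetical function name:

```shell
# Check for the marker string MultiQC looks for in fastp JSON reports.
# Usage: has_fastp_marker <report.json>; exit status 0 if the marker is present.
has_fastp_marker() {
  grep -qF '"before_filtering": {' "$1"
}
```

Usage: `has_fastp_marker "fastq_trimmed/${sample}_fastp.json" && multiqc fastq_trimmed/`.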

Notes

  • fastp outputs gzip-compressed FASTQs when the output filename ends in .gz
  • For PE data, fastp always trims adapters by overlap analysis, but sequence-based adapter auto-detection is disabled by default. This script enables it explicitly with --detect_adapter_for_pe for more thorough adapter removal
  • If you need to re-run fastp, delete the fastq_trimmed/ directory first
  • fastp is designed for short reads (Illumina, BGI/MGI). For long reads (ONT/PacBio), use platform-specific QC tools instead