Pre-alignment quality control: removes adapter sequences, trims low-quality bases, filters short reads, and removes polyG tails. Produces trimmed FASTQs plus JSON and HTML QC reports.
- Adapter removal — auto-detects and trims Illumina TruSeq, Nextera, BGI/MGI, and 170+ other adapter sequences using both sequence matching and PE overlap analysis
- Quality trimming — slides a 4bp window from both read ends, trimming bases with mean quality below Q20
- PolyG tail removal — strips artificial polyG tails generated by NovaSeq/NextSeq two-color chemistry
- Length filtering — discards reads shorter than 36bp after trimming
- QC reporting — generates per-base quality curves, adapter content, GC distribution, duplication estimates, and insert size distribution (before + after filtering)
Raw FASTQ reads contain adapter contamination and low-quality bases that can cause:
- False variant calls — adapter sequences misalign to the reference and look like mutations
- Reduced mapping rates — reads with adapter tails may not align or align incorrectly
- Inflated duplication — adapter-contaminated reads cluster together artificially
This is particularly important for BGI/MGI (DNBseq) and Nebula Genomics data, where adapter contamination rates are higher than typical Illumina libraries. Aligners like minimap2 and BWA-MEM2 do not detect or remove adapters — they only soft-clip unmapped portions as a side effect of local alignment.
DeepVariant uses base quality as one of its 6 input channels. Adapter bases are typically called with the same high quality scores as genomic bases, so DeepVariant cannot use quality to distinguish them from real sequence.
fastp v1.3.1 — all-in-one FASTQ preprocessor with SIMD acceleration and parallel compression.
- Paper: Chen et al., Bioinformatics 2018 (doi:10.1093/bioinformatics/bty560)
- Source: github.com/OpenGene/fastp
- Container image: quay.io/biocontainers/fastp:1.3.1--h43da1c4_0
export GENOME_DIR=/path/to/data
./scripts/01b-fastp-qc.sh <sample_name>

To skip trimming when running the full pipeline:

SKIP_TRIM=true ./scripts/run-all.sh <sample_name> <sex>

When skipped, the alignment step reads raw FASTQs directly (existing behavior).
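The skip switch could be honored by the alignment step along the following lines. This is only a sketch: the helper name pick_fastq_dir and the raw-read directory name fastq are assumptions made for illustration, and the actual pipeline scripts may be organized differently. Only SKIP_TRIM, GENOME_DIR, and fastq_trimmed come from this document.

```bash
# Hypothetical helper: choose which FASTQ directory the aligner should read.
# "fastq" as the raw-read directory is an assumption; "fastq_trimmed" is the
# fastp output directory described in this document.
# $1: base data directory
pick_fastq_dir() {
  if [ "${SKIP_TRIM:-false}" = "true" ]; then
    printf '%s\n' "$1/fastq"          # trimming skipped: use raw reads
  else
    printf '%s\n' "$1/fastq_trimmed"  # default: use fastp output
  fi
}

# Prints the trimmed directory unless SKIP_TRIM=true is set in the environment.
pick_fastq_dir "${GENOME_DIR:-/path/to/data}"
```

Defaulting to the trimmed reads keeps the QC step on unless it is explicitly turned off.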
fastp_args=(
  -i input_R1.fastq.gz -I input_R2.fastq.gz
  -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz
  --detect_adapter_for_pe          # PE overlap-based adapter detection
  --qualified_quality_phred 20     # Bases below Q20 are "unqualified"
  --cut_front --cut_tail           # Sliding window trim from both ends
  --cut_mean_quality 20            # Window mean quality threshold
  --length_required 36             # Discard reads < 36bp
  -g                               # PolyG tail trimming (NovaSeq/NextSeq)
  -R "sample_name"                 # Report title (used by MultiQC)
  -j report.json -h report.html    # QC reports
  -w 8                             # Worker threads
)
fastp "${fastp_args[@]}"
For paired-end data, fastp detects adapters by overlap analysis: when the insert is shorter than the read length, R1 and the reverse complement of R2 overlap across the whole insert, and any bases extending past that overlap are adapter sequence. This works for any adapter type without needing to specify sequences.
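The idea behind overlap analysis can be illustrated with a toy sketch. It is not fastp's implementation: it requires an exact match, whereas fastp tolerates a limited number of mismatches, and the function names revcomp and insert_len_from_overlap, the example insert, and the 10bp adapter fragment are all made up for illustration.

```bash
# Reverse-complement a DNA sequence (non-ACGT characters become N).
revcomp() {
  printf '%s\n' "$1" | awk '
    BEGIN { c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" }
    {
      out = ""
      for (i = length($0); i >= 1; i--) {
        b = substr($0, i, 1)
        out = out ((b in c) ? c[b] : "N")
      }
      print out
    }'
}

# Print the inferred insert length when both reads run through the insert into
# adapter; print nothing and return non-zero otherwise.
# $1: R1 sequence, $2: R2 sequence, $3: minimum overlap to trust (default 30)
insert_len_from_overlap() {
  printf '%s %s\n' "$1" "$(revcomp "$2")" | awk -v min="${3:-30}" '
    {
      r1 = $1; r2rc = $2
      n = (length(r1) < length(r2rc)) ? length(r1) : length(r2rc)
      # Longest candidate first: the first l bases of R1 must equal the
      # last l bases of the reverse-complemented R2.
      for (l = n - 1; l >= min; l--) {
        if (substr(r1, 1, l) == substr(r2rc, length(r2rc) - l + 1)) {
          print l
          found = 1
          exit
        }
      }
    }
    END { exit (found ? 0 : 1) }'
}

# Toy read pair: a short insert followed by 10 adapter bases on each read.
insert=ACGTTGCAAGGCTTAACCGGATCGATCGTAGCTAGCTAACG
r1="${insert}AGATCGGAAG"
r2="$(revcomp "$insert")AGATCGGAAG"
if len=$(insert_len_from_overlap "$r1" "$r2"); then
  printf 'insert length: %s\n' "$len"
  printf 'R1 after trimming: %s\n' "$(printf '%s' "$r1" | cut -c "1-${len}")"
fi
```

Everything past the inferred insert length is trimmed from both reads, which is why no adapter sequence needs to be known in advance.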
fastp also ships a database of 170+ known adapter sequences, including Illumina TruSeq and BGI/MGI adapters, so no platform-specific flags are needed.
fastp's known adapters include the MGI/BGI forward (AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA) and reverse (AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG) sequences. Auto-detection handles DNBseq data without additional configuration.
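For a rough spot check of how much adapter a raw file carries, the sequences above can be searched for directly. The helper below is a hypothetical sketch, not part of the pipeline; it counts only reads that contain the full search string, so reads ending in a partial adapter are missed and the result is a lower bound.

```bash
# Count reads containing a given adapter sequence among the first N reads of a
# gzip-compressed FASTQ.
# $1: FASTQ path, $2: adapter sequence, $3: number of reads to scan (default 100000)
count_adapter_reads() {
  gzip -cd "$1" |
    awk -v adapter="$2" -v max="${3:-100000}" '
      NR % 4 == 2 {                       # sequence lines only
        if (index($0, adapter) > 0) hits++
        if (++seen >= max) exit           # stop after max reads
      }
      END { print hits + 0 }'
}

# MGI/BGI forward adapter as quoted above. Example call (input file name is
# the placeholder used elsewhere in this document):
MGI_FWD=AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA
# count_adapter_reads input_R1.fastq.gz "$MGI_FWD" 100000
```

Searching for a shorter prefix of the adapter catches more partial hits at the cost of more chance matches.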
| File | Location | Description |
|---|---|---|
| Trimmed R1 | fastq_trimmed/<sample>_R1.fastq.gz | Quality-filtered, adapter-trimmed read 1 |
| Trimmed R2 | fastq_trimmed/<sample>_R2.fastq.gz | Quality-filtered, adapter-trimmed read 2 |
| JSON report | fastq_trimmed/<sample>_fastp.json | Machine-readable QC (consumed by MultiQC) |
| HTML report | fastq_trimmed/<sample>_fastp.html | Visual QC report (open in browser) |
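Before starting alignment, it can be worth confirming that all four files in the table exist and are non-empty. The check below is a sketch; the function name check_fastp_outputs is hypothetical, and the paths are assumed to be relative to the current directory unless a directory is passed.

```bash
# Verify that the four expected fastp outputs exist and are non-empty.
# The body runs in a subshell so its variables do not leak into the caller.
# $1: sample name, $2: output directory (default: fastq_trimmed)
check_fastp_outputs() (
  sample=$1
  out_dir=${2:-fastq_trimmed}
  status=0
  for suffix in _R1.fastq.gz _R2.fastq.gz _fastp.json _fastp.html; do
    path="${out_dir}/${sample}${suffix}"
    if [ ! -s "$path" ]; then
      printf 'missing or empty: %s\n' "$path" >&2
      status=1
    fi
  done
  exit "$status"
)

# Example call:
# check_fastp_outputs <sample> && echo "fastp outputs look complete"
```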
| Dataset | Threads | Time | Memory |
|---|---|---|---|
| 30X WGS (~60 GB FASTQ) | 8 | ~10-20 min | < 2 GB |
| chr22 test data | 4 | < 1 min | < 1 GB |
fastp is I/O-bound, not CPU-bound. Threads beyond 8-16 provide minimal improvement.
Open <sample>_fastp.html in a browser to see:
- Before/after filtering summary — total reads, bases, Q20/Q30 rates
- Quality score distribution — per-base quality across read positions
- Base content — A/T/G/C/N proportions (should be ~25% each, flat across positions)
- Adapter content — fraction of reads with detected adapters
- Insert size distribution — fragment length peak (typically 300-500bp for WGS)
- Duplication rate — estimated from the first 1M reads
- Adapter rate > 5%: Normal for some library preps, but indicates trimming is important
- Q30 rate < 80% before filtering: Below-average sequencing quality
- PolyG spike at read ends: Common on NovaSeq — fastp removes these automatically
- Uneven base content at read starts: the first 10-15bp often show bias from random hexamer priming or enzymatic fragmentation (normal, not actionable)
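The two numeric rules of thumb above (adapter rate and Q30 rate) can be applied automatically. The sketch below takes the values as arguments so it does not depend on any JSON tool; the function name qc_flags is hypothetical, and the jq field paths in the comments are assumed from the fastp JSON layout and should be verified against an actual report.

```bash
# Print a warning for each rule of thumb that a sample trips.
# $1: Q30 rate before filtering (0-1), $2: adapter-trimmed reads, $3: total reads
qc_flags() {
  awk -v q30="$1" -v trimmed="$2" -v total="$3" 'BEGIN {
    if (q30 < 0.80)
      printf "Q30 rate %.1f%% is below 80%%\n", q30 * 100
    if (total > 0 && trimmed / total > 0.05)
      printf "adapter rate %.1f%% is above 5%%\n", 100 * trimmed / total
  }'
}

# With jq installed, the inputs could be read from the report like this
# (field paths are assumptions, check them against your JSON file):
#   q30=$(jq '.summary.before_filtering.q30_rate' <sample>_fastp.json)
#   trimmed=$(jq '.adapter_cutting.adapter_trimmed_reads' <sample>_fastp.json)
#   total=$(jq '.summary.before_filtering.total_reads' <sample>_fastp.json)

# Example with made-up numbers that trip both rules:
qc_flags 0.78 120000 1000000
```

A sample that passes both checks produces no output, which makes the function easy to use in a loop over many reports.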
fastp's JSON report is automatically detected by MultiQC. The -R flag sets the sample name shown in the aggregated report. The JSON file must contain "before_filtering": { to be recognized.
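If a report does not show up in MultiQC, a quick way to rule out the marker requirement is to search for the literal string. This is a minimal sketch with a hypothetical function name, has_multiqc_marker; it checks only for the marker string and nothing else about the file.

```bash
# Return success if the file contains the literal marker string named above.
# $1: path to a fastp JSON report
has_multiqc_marker() {
  grep -q -F '"before_filtering": {' "$1"
}

# Example call:
# has_multiqc_marker fastq_trimmed/<sample>_fastp.json || echo "marker not found"
```

Because the match is literal, re-serializing the JSON in compact form (no space after the colon) removes the marker, so reports are best passed to MultiQC exactly as fastp wrote them.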
- fastp outputs gzip-compressed FASTQs when the output filename ends in .gz
- The --detect_adapter_for_pe flag is disabled by default in fastp for PE data; this script enables it explicitly for thorough adapter removal
- If you need to re-run fastp, delete the fastq_trimmed/ directory first
- fastp is designed for short reads (Illumina, BGI/MGI). For long reads (ONT/PacBio), use platform-specific QC tools instead