I. What is FASTQ?
FASTQ is a plain-text format in which each read takes 4 lines: read ID, sequence, separator (+), and quality scores. For paired-end sequencing, each sample has an R1 and an R2 file, and the reads appear in the same order in both, so the n-th record of R1 pairs with the n-th record of R2.
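To make the 4-line layout concrete, here is a minimal parsing sketch. The names `FastqRecord`, `parse_fastq`, and `mates_match` are hypothetical, and the code assumes uncompressed, unwrapped records (exactly four lines each).

```python
from itertools import islice
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per 4-line block: @ID, sequence, +, quality."""
    it = iter(lines)
    while True:
        block = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not any(block):
            return  # end of input, or only trailing blank lines
        if len(block) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, sep, qual = block
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Keep only the first whitespace-delimited token of the header as the ID.
        read_id = (header[1:].split() or [""])[0]
        yield FastqRecord(read_id, seq, qual)


def mates_match(r1: FastqRecord, r2: FastqRecord) -> bool:
    """Paired reads share an ID once an optional trailing /1 or /2 is dropped."""
    def strip_suffix(name: str) -> str:
        return name[:-2] if name.endswith(("/1", "/2")) else name
    return strip_suffix(r1.read_id) == strip_suffix(r2.read_id)
```

For gzipped input, passing the handle from `gzip.open(path, "rt")` as `lines` should work, since the parser only needs an iterable of text lines.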
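II. Phred score converter
A Phred score Q encodes the estimated probability P that a base call is wrong: P = 10^(-Q/10). Q20 therefore means 1 error in 100, Q30 means 1 in 1,000, and Q40 means 1 in 10,000. In the Phred+33 encoding, the score is stored in the FASTQ quality line as the ASCII character with code Q + 33. Q30 is the quality threshold most commonly quoted in practice.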
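The mapping between Phred scores, error rates, and ASCII characters can be sketched in a few lines. The function names are hypothetical, and the default offset of 33 assumes Phred+33 encoding.

```python
def phred_to_error_prob(q: int) -> float:
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_to_ascii(q: int, offset: int = 33) -> str:
    """Character that stores score q in a FASTQ quality line."""
    return chr(q + offset)


def ascii_to_phred(ch: str, offset: int = 33) -> int:
    """Phred score stored in a single quality character."""
    return ord(ch) - offset
```

With these definitions, Q30 is stored as `?` and Q40 as `I`.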
III. Main FastQC checks
FastQC is a standalone Java tool that produces an HTML report organized into 11 analysis modules. The six that matter most for WGS/WES projects are:
Per base sequence quality
Box plots of the quality scores at each position along the read. Illumina reads typically start with high quality and degrade toward the 3' end, and R2 is usually worse than R1. If many positions fall into the red zone (<Q20), the reads need trimming.
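The per-position view can be approximated by averaging the scores at each cycle. This is a simplified sketch (FastQC draws full box plots, not just means); `mean_quality_by_position` is a hypothetical helper and Phred+33 encoding is assumed.

```python
from typing import Iterable, List


def mean_quality_by_position(quality_lines: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position; reads may differ in length."""
    totals: List[int] = []
    counts: List[int] = []
    for qual in quality_lines:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

Positions whose mean drops below 20 correspond roughly to the red zone described above.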
Per base sequence content
Proportion of A/T/G/C at each position. The lines should be roughly flat (A≈T, G≈C). Imbalance in the first 9–13 bp is common and usually harmless (random hexamer priming bias, or fragmentation bias in transposase-based libraries); large fluctuations elsewhere may indicate adapters or contamination.
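A minimal sketch of how per-position base composition can be tallied. `base_fractions_by_position` is a hypothetical name, and N calls are counted in the denominator, which is a simplifying assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List


def base_fractions_by_position(sequences: Iterable[str]) -> List[Dict[str, float]]:
    """Fraction of A, C, G, T observed at each read position."""
    per_position: List[Counter] = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(per_position):
                per_position.append(Counter())
            per_position[i][base] += 1
    return [
        {base: counts[base] / sum(counts.values()) for base in "ACGT"}
        for counts in per_position
    ]
```

In a balanced library the four fractions at each position stay close to the genome-wide composition; a position where one base dominates is worth a closer look.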
Duplication levels
WGS should not show heavy duplication (5–25% is normal); WES is typically higher (20–60%) because capture concentrates reads on the targets. Very high duplication (>80%) suggests PCR over-amplification or low library complexity.
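As a rough illustration of what a duplication percentage measures, the sketch below counts exact-match duplicates. It is not FastQC's estimator, which works on a sample of reads and truncates long ones; `duplication_rate` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def duplication_rate(sequences: Iterable[str]) -> float:
    """Percent of reads that are exact copies of another read already seen."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total
```

For example, four reads with two distinct sequences give 50%, because two of the four reads add no new sequence.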
Adapter content
Adapter read-through at the 3' end is common when the insert is shorter than the read length. FastQC screens for a short built-in list of adapters (Illumina Universal, Illumina Small RNA 3' and 5', Nextera Transposase, and SOLiD Small RNA). Any noticeable adapter signal should be trimmed.
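Read-through can be pictured as the adapter, or a prefix of it, appearing at the 3' end of the read. The sketch below uses exact matching only (real trimmers tolerate mismatches). The helper names are hypothetical, and the sequence `AGATCGGAAGAGC` is the commonly used Illumina universal adapter prefix, taken here as an assumption; substitute the adapter of your own kit.

```python
ILLUMINA_UNIVERSAL = "AGATCGGAAGAGC"  # assumption: replace with your kit's adapter


def find_adapter(read: str, adapter: str = ILLUMINA_UNIVERSAL, min_overlap: int = 3) -> int:
    """Return the index where the adapter starts in the read, or -1 if absent."""
    pos = read.find(adapter)
    if pos != -1:
        return pos  # full adapter inside the read
    # Otherwise look for a partial adapter hanging off the 3' end.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return len(read) - k
    return -1


def trim_adapter(read: str, adapter: str = ILLUMINA_UNIVERSAL) -> str:
    """Cut the read at the adapter start; return it unchanged if none is found."""
    pos = find_adapter(read, adapter)
    return read if pos == -1 else read[:pos]
```

The `min_overlap` guard keeps very short chance matches at the read end from being trimmed.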
Per sequence GC content
Distribution of GC% per read. For human samples the curve should peak at about 41% and look roughly normal (bell-shaped). A double peak or a shifted peak suggests cross-species contamination (e.g. bacteria, mycoplasma) or adapter dimers.
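The distribution in this module is built from one GC% value per read. A minimal sketch, with hypothetical names `gc_percent` and `gc_histogram`, and with N calls excluded from the denominator as an assumption:

```python
from collections import Counter
from typing import Iterable


def gc_percent(sequence: str) -> float:
    """GC% of one read, ignoring bases other than A/C/G/T."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences: Iterable[str]) -> Counter:
    """Number of reads at each GC% value, rounded to the nearest integer."""
    return Counter(round(gc_percent(seq)) for seq in sequences)
```

Plotting the histogram for a clean human sample should give a single peak near 41; a second peak is the signal discussed above.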
Overrepresented sequences
Lists individual sequences that make up ≥0.1% of all reads. Common sources are adapters, ribosomal RNA (e.g. RNA-seq carryover), and PCR primer dimers. FastQC compares each sequence against its built-in list of common contaminants to suggest a source; it does not run BLAST, so sequences with no hit have to be identified manually (for example with BLAST).
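The ≥0.1% rule can be sketched as a simple frequency filter. This is a simplified illustration (FastQC tracks only a sample of reads); `overrepresented` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def overrepresented(sequences: Iterable[str], min_fraction: float = 0.001) -> List[Tuple[str, float]]:
    """Sequences whose share of all reads is at least min_fraction, most common first."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [
        (seq, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]
```

Each returned pair holds the sequence and its fraction of the library, which can then be checked against known adapter or rRNA sequences.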
IV. QC + trimming tool comparison
The classic flow is FastQC → Trimmomatic / Cutadapt → FastQC again: three tools and three I/O passes. fastp (released in 2018, updated to 1.0 in 2025) does all of this in a single pass over the data, runs 2–5× faster than the legacy combination, and produces an HTML report with roughly the same information as FastQC.
| Feature | FastQC | Trimmomatic | Cutadapt | fastp |
|---|---|---|---|---|
| QC report | ✅ HTML | ❌ | ❌ | ✅ HTML + JSON |
| Adapter trimming | ❌ | ✅ | ✅ | ✅ auto-detect |
| Quality trimming | ❌ | ✅ sliding window | ⚠️ basic | ✅ sliding window |
| UMI handling | ❌ | ❌ | ⚠️ | ✅ |
| polyG trimming (needed for NextSeq/NovaSeq) | ❌ | ❌ | ⚠️ `--nextseq-trim` | ✅ |
| Speed | 🔵 moderate | 🐢 slow | 🐢 slow | ⚡⚡⚡ fast |
| Multithreading | ⚠️ one file per thread | ✅ | ✅ | ✅ native |
| Still maintained | ✅ | ❌ stopped after 2019 | ✅ | ✅ |
V. Hands-on examples
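```bash
# Run FastQC on paired-end FASTQ files (produces one HTML + one zip per input file).
# FastQC does not create the output directory, so make it first.
mkdir -p qc_out
fastqc -o qc_out/ -t 8 sample_R1.fastq.gz sample_R2.fastq.gz

# Results:
#   qc_out/sample_R1_fastqc.html   <- visual report
#   qc_out/sample_R1_fastqc.zip    <- raw data (for MultiQC aggregation)
```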
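```bash
# fastp: QC, adapter trimming, quality trimming, and polyG trimming in one command.
# Comments are kept above the command because a comment after a line-continuation
# backslash would break it.
#   --detect_adapter_for_pe            auto-detect adapters for paired-end data
#   --cut_right (window 4, mean Q20)   sliding-window quality trimming
#   --length_required 50               keep a read only if it is still >= 50 bp after trimming
#   --trim_poly_g                      remove polyG artifacts (NovaSeq/NextSeq)
fastp \
  -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
  -o sample_R1.trim.fastq.gz -O sample_R2.trim.fastq.gz \
  --detect_adapter_for_pe \
  --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 \
  --length_required 50 \
  --trim_poly_g \
  --thread 8 \
  --html sample.fastp.html --json sample.fastp.json
# Done in one pass. The HTML report includes before/after comparison plots.
```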
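```bash
# Trimmomatic (legacy; still usable, but no longer actively updated).
#   ILLUMINACLIP:<fasta>:2:30:10   adapter matching stringency
#                                  (seed mismatches : palindrome clip threshold : simple clip threshold)
#   LEADING:3 TRAILING:3           trim bases with Q < 3 from both ends
#   SLIDINGWINDOW:4:20             4-bp window; cut when the window's mean quality falls below 20
#   MINLEN:50                      drop reads shorter than 50 bp after trimming
trimmomatic PE -phred33 -threads 8 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1.trim.fastq.gz sample_R1.unpaired.fastq.gz \
  sample_R2.trim.fastq.gz sample_R2.unpaired.fastq.gz \
  ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 \
  SLIDINGWINDOW:4:20 \
  MINLEN:50
```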
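The sliding-window rule used by both tools (a 4-bp window with a mean quality threshold of 20) can be sketched as follows. This is a simplified illustration, not either tool's exact implementation: it scans from the 5' end and cuts at the start of the first window whose mean falls below the threshold. `sliding_window_trim` is a hypothetical name and Phred+33 encoding is assumed.

```python
from typing import Tuple


def sliding_window_trim(
    seq: str, qual: str, window: int = 4, min_mean_q: float = 20, offset: int = 33
) -> Tuple[str, str]:
    """Cut the read at the first window whose mean quality is below min_mean_q."""
    scores = [ord(ch) - offset for ch in qual]
    # Reads shorter than the window are returned unchanged.
    for start in range(max(len(scores) - window + 1, 0)):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual
```

Because the cut lands at the window start, a good base just before a low-quality stretch can be removed with it; a length filter such as the 50 bp minimum in the commands above is then applied to what remains.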
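```bash
# MultiQC: aggregate the fastp/FastQC reports of all samples into a single report.
multiqc qc_out/ fastp_out/ -o multiqc_report/
# Scans the given directories for output from supported tools and writes
# multiqc_report/multiqc_report.html.
# Supports 100+ tools, including FastQC, fastp, Trimmomatic, Picard, samtools, and GATK.
```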
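For a quick per-sample check without opening the HTML, the JSON written by fastp's `--json` option can be read directly. The sketch below assumes the report has a `summary` object with `before_filtering` and `after_filtering` entries that carry `total_reads` and `q30_rate`; verify these keys against the output of your fastp version. `summarize_fastp` is a hypothetical helper.

```python
import json


def summarize_fastp(report: dict) -> dict:
    """Pull read counts and Q30 rates from a parsed fastp JSON report.

    Assumes the keys summary -> before_filtering / after_filtering ->
    total_reads, q30_rate (check against your fastp version).
    """
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    return {
        "reads_before": before["total_reads"],
        "reads_after": after["total_reads"],
        "reads_kept_pct": 100.0 * after["total_reads"] / before["total_reads"],
        "q30_before": before["q30_rate"],
        "q30_after": after["q30_rate"],
    }
```

A typical call would load the file with `json.load(open("sample.fastp.json"))` and pass the result to `summarize_fastp`.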
VI. FAQ and troubleshooting
A FastQC "fail" does not always mean the data is bad. The pass/warn/fail flags use generic default thresholds that assume a random, diverse library, so many "fails" are acceptable for WGS and WES. High Sequence Duplication Levels in WES are expected because capture concentrates reads on the targets, and Per base sequence content imbalance in the first 13 bp reflects random hexamer bias and needs no fix. Read the trend, not the traffic light.
BWA-MEM tolerates mild adapter contamination and low-quality read ends (it soft-clips them). Handling adapters before alignment is still recommended (GATK's own preprocessing, for example, marks them with Picard MarkIlluminaAdapters) because: (1) adapter contamination creates many soft-clipped reads in the BAM, which affects downstream BQSR statistics; (2) short-fragment samples (e.g. cfDNA) with severe adapter read-through must be cleaned. Quality trimming is optional, since most modern callers already take base quality into account.
A double peak in the Per sequence GC content plot has three common causes: (1) Cross-species contamination (bacterial GC% differs from human, e.g. mycoplasma at about 32%); screen with Kraken2 or FastQ Screen. (2) Adapter dimers, which create a single peak at an extreme GC value; fastp can remove them. (3) Accidental mixing of samples from different species. When you see a double peak, the most important next step is to quantify the contamination source, not to proceed straight to alignment.
Illumina 2-color chemistry (NovaSeq, NextSeq) encodes a G base as "no signal." When the fragment is shorter than the read length, the sequencer keeps reading past the end of the fragment and produces long "no-signal" stretches that are called as polyG. These must be removed at the trimming stage or alignment quality suffers severely. fastp turns polyG trimming on automatically when it recognizes NextSeq/NovaSeq data, and --trim_poly_g forces it on.
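The effect of polyG trimming can be sketched as removing a long run of G at the 3' end. This is a simplified illustration with exact matching; `trim_poly_g` is a hypothetical helper, and the minimum run length of 10 is an assumption chosen for the example.

```python
from typing import Tuple


def trim_poly_g(seq: str, qual: str, min_len: int = 10) -> Tuple[str, str]:
    """Remove a trailing run of G if it is at least min_len bases long."""
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    if run_length >= min_len:
        return stripped, qual[:len(stripped)]
    return seq, qual  # short G runs are likely real sequence
```

The length guard matters: genuine reads often end in a few G bases, and only long runs are characteristic of the no-signal artifact.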
📝 Self-check
1. Your FastQC report shows 50% duplicates under Sequence Duplication Levels, flagged red. The sample is WES. You should:
2. Reads from the NovaSeq platform often end in polyG sequences. This is because:
3. A base in a read is labeled Phred Q40. This means: