Step 2: FASTQ QC — WGS/WES Tutorial

File format

1. What is FASTQ?

FASTQ is a plain-text format in which each read occupies four lines: read ID, sequence, separator (+), and quality scores. For paired-end sequencing, each sample has an R1 and an R2 file, and the reads appear in the same order in both, so the Nth record of R1 is the mate of the Nth record of R2.

@A00123:45:HXXX:1:1101:1234:5678 1:N:0:ATCACG   ← read ID + run/flowcell/coords + R1/R2 flag
GATTACAGATTACAGATTACAGATTACAGATTACAGATTAC...   ← sequence (A/C/G/T/N), 150 bp typical
+   ← separator
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF...   ← Phred quality (ASCII-encoded, Phred+33)
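The four-line layout and the matching R1/R2 order described above can be checked with a few lines of Python. This is a minimal sketch, not a replacement for a real parser: it assumes uncompressed text streams and strictly four lines per record, and the names `FastqRecord`, `read_fastq`, and `pairs_in_sync` are illustrative.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str    # header line without the leading "@"
    sequence: str
    quality: str    # Phred+33 ASCII string, same length as sequence


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ text stream, four lines at a time."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def pairs_in_sync(r1: TextIO, r2: TextIO) -> bool:
    """Return True if R1 and R2 list the same read names in the same order."""
    reads1, reads2 = read_fastq(r1), read_fastq(r2)
    for rec1 in reads1:
        rec2 = next(reads2, None)
        # Compare only the name before the first space; the part after it
        # carries the 1:/2: mate flag and legitimately differs.
        if rec2 is None or rec1.read_id.split()[0] != rec2.read_id.split()[0]:
            return False
    return next(reads2, None) is None  # R2 must not contain extra reads


# Tiny in-memory example (hypothetical reads)
R1 = "@read1 1:N:0:ATCACG\nGATTACA\n+\nFFFFFFF\n"
R2 = "@read1 2:N:0:ATCACG\nTGTAATC\n+\nFFFFFFF\n"
print(pairs_in_sync(io.StringIO(R1), io.StringIO(R2)))
```

For gzipped files, the same functions should work on handles opened with `gzip.open(path, "rt")`.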

🔑

Phred score Q = -10·log₁₀(P), where P is the probability that the base call is wrong. Q30 corresponds to a 1/1000 error rate (99.9% accuracy). Illumina run specifications typically call for roughly 80% or more of bases at Q30 or above, with the exact figure depending on instrument and read length.

Interactive demo

2. Phred score converter

In the interactive version of this page, a slider shows how each Phred score maps to an error rate and an ASCII character. Q30 is the widely used quality cutoff.
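The same mapping can be reproduced offline. The sketch below applies Q = -10·log₁₀(P) and the Phred+33 ASCII encoding mentioned earlier; the function names are illustrative rather than part of any tool.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)


def phred_to_ascii(q: int, offset: int = 33) -> str:
    """FASTQ quality character for a Phred score (Phred+33 by default)."""
    return chr(q + offset)


def ascii_to_phred(char: str, offset: int = 33) -> int:
    """Phred score encoded by a single FASTQ quality character."""
    return ord(char) - offset


for q in (10, 20, 30, 40):
    print(f"Q{q}: error rate {phred_to_error(q):.4%}, ASCII {phred_to_ascii(q)!r}")
```

With the Phred+33 offset, Q30 is written as `?` and the `F` characters in the example record above decode to Q37.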


QC tools

3. Main FastQC checks

FastQC is a Java-based tool that produces an HTML report covering 11 analysis modules. The six that matter most for WGS/WES projects are:

📈

Per base sequence quality

Box plots of base quality at each cycle along the read. Illumina reads usually start strong and degrade toward the 3' end, and R2 is usually worse than R1. If a large share of positions drops into the red zone (<Q20), the reads need trimming.
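A rough version of this per-cycle view can be computed directly from the quality strings. The sketch below is an approximation under stated assumptions: it reports only the mean and median per position (FastQC draws full box plots), it holds all scores in memory, and it treats a median below Q20 as the point where trimming would start.

```python
from statistics import mean, median


def per_position_quality(quality_strings, offset=33):
    """Return a list of (mean, median) Phred scores, one tuple per read position."""
    columns = []
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            if pos == len(columns):
                columns.append([])  # first read long enough to reach this cycle
            columns[pos].append(ord(char) - offset)
    return [(mean(col), median(col)) for col in columns]


def first_low_quality_position(quality_strings, threshold=20, offset=33):
    """0-based index of the first position whose median falls below threshold."""
    for pos, (_, med) in enumerate(per_position_quality(quality_strings, offset)):
        if med < threshold:
            return pos
    return None  # no position dropped below the threshold


# Hypothetical quality strings: good until the last cycle
quals = ["FFF5", "FF:#", "FFF#"]
print(per_position_quality(quals))
print(first_low_quality_position(quals))
```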

🧬

Per base sequence content

Proportion of A/T/G/C at each position. The lines should be roughly flat and parallel (A≈T, G≈C). Imbalance in the first 9–13 bp is common and harmless: it comes from random hexamer priming in RNA-seq libraries and from sequence preferences of enzymatic or transposase fragmentation in DNA libraries. Sharp fluctuations at other positions may indicate adapters or contamination.
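To see how such a plot is built, the sketch below tallies base fractions per position and flags positions where A and T, or G and C, differ by more than a chosen amount. The 10% default is an assumption modeled on FastQC's documented warning level; check it against the FastQC version in use.

```python
from collections import Counter


def base_composition(sequences):
    """Return one dict per read position mapping base -> fraction of reads."""
    counts = []
    for seq in sequences:
        for pos, base in enumerate(seq.upper()):
            if pos == len(counts):
                counts.append(Counter())
            counts[pos][base] += 1
    return [{b: n / sum(c.values()) for b, n in c.items()} for c in counts]


def skewed_positions(sequences, max_diff=0.10):
    """0-based positions where |A - T| or |G - C| exceeds max_diff (as fractions)."""
    flagged = []
    for pos, frac in enumerate(base_composition(sequences)):
        at_diff = abs(frac.get("A", 0) - frac.get("T", 0))
        gc_diff = abs(frac.get("G", 0) - frac.get("C", 0))
        if max(at_diff, gc_diff) > max_diff:
            flagged.append(pos)
    return flagged


# Hypothetical reads: position 0 is always G, the rest is balanced
reads = ["GACG", "GTGC", "GACG", "GTGC"]
print(skewed_positions(reads))
```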

🔁

Duplication levels

WGS should not show heavy duplication (5–25% is normal). WES is typically higher (20–60%) because capture concentrates reads on the target regions. Very high duplication (>80%) suggests PCR over-amplification or low library complexity.
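A FASTQ-level duplication estimate only needs exact sequence matching. The sketch below counts reads whose first 50 bases repeat an earlier read; the 50-base prefix is an assumption that loosely mirrors how FastQC truncates long reads, and unlike FastQC this version does not subsample the file.

```python
from collections import Counter


def duplication_rate(sequences, prefix_len=50):
    """Fraction of reads that duplicate an earlier read (exact prefix match)."""
    counts = Counter(seq[:prefix_len] for seq in sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence contributes one "original"; the rest are duplicates.
    return 1 - len(counts) / total


# Hypothetical reads: one of four is a repeat
print(duplication_rate(["AAAA", "AAAA", "CCCC", "GGGG"]))
```

Sequence-based estimates like this differ from BAM-level duplicate marking, which uses alignment coordinates of both mates rather than read sequence.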

🔌

Adapter content

Adapter read-through at the 3' end is common whenever the insert is shorter than the read length. By default FastQC screens for five adapter sequences: Illumina Universal, Illumina Small RNA 3', Illumina Small RNA 5', Nextera Transposase, and SOLiD Small RNA (newer versions also screen for polyA and polyG). Any noticeable adapter signal should be trimmed.
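Read-through can be illustrated with a plain substring search. The sketch below looks for the 13-base start of the Illumina universal adapter (`AGATCGGAAGAGC`) and cuts the read there. It is deliberately simplified: exact matches only, no mismatches, and no detection of partial adapters at the very end of a read, all of which real trimmers handle.

```python
ILLUMINA_UNIVERSAL = "AGATCGGAAGAGC"  # start of the Illumina universal adapter


def find_adapter(seq, adapter=ILLUMINA_UNIVERSAL):
    """Return the 0-based start of the first exact adapter match, or -1."""
    return seq.upper().find(adapter)


def adapter_fraction(sequences, adapter=ILLUMINA_UNIVERSAL):
    """Fraction of reads containing the adapter anywhere in the read."""
    sequences = list(sequences)
    if not sequences:
        return 0.0
    hits = sum(1 for seq in sequences if find_adapter(seq, adapter) >= 0)
    return hits / len(sequences)


def trim_adapter(seq, qual, adapter=ILLUMINA_UNIVERSAL):
    """Cut sequence and quality string at the adapter start, if present."""
    start = find_adapter(seq, adapter)
    if start < 0:
        return seq, qual
    return seq[:start], qual[:start]


# Hypothetical read: 8 bp insert followed by adapter read-through
read = "ACGTACGT" + ILLUMINA_UNIVERSAL + "TTTT"
print(trim_adapter(read, "F" * len(read)))
```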

📊

Per sequence GC content

Distribution of GC% per read. For human samples the curve should peak around 41% and look roughly normal. A second peak or a shifted peak suggests cross-species contamination (for example bacteria or mycoplasma) or adapter dimers.
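The distribution is simple to compute per read. The sketch below bins each read's GC% into whole-percent bins and reports the most common bin; ignoring N bases and the bin width are choices made here for illustration.

```python
from collections import Counter


def gc_percent(seq):
    """GC% of one read, ignoring N and other ambiguous bases; None if no called bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences):
    """Counter mapping whole-percent GC bins (0..100) to read counts."""
    histogram = Counter()
    for seq in sequences:
        gc = gc_percent(seq)
        if gc is not None:
            histogram[int(gc)] += 1
    return histogram


def modal_gc(sequences):
    """GC% bin holding the most reads, or None for empty input."""
    histogram = gc_histogram(sequences)
    return histogram.most_common(1)[0][0] if histogram else None


# Hypothetical reads
print(gc_histogram(["ATGC", "AATT", "GGCC", "ATGC"]))
```

A second cluster of bins far from the expected peak is the numerical counterpart of the double peak described above.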

🔁

Overrepresented sequences

Lists individual sequences that make up ≥0.1% of all reads. Common sources are adapters, ribosomal RNA (for example RNA-seq carryover), and PCR primer dimers. FastQC does not run BLAST; it compares each sequence against a built-in list of common contaminants and reports the best match, so unmatched sequences need a manual BLAST search.
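The ≥0.1% rule is a counting exercise followed by a lookup. The sketch below applies the threshold and labels hits against a tiny contaminant table. The table holds a single hypothetical entry for illustration, whereas FastQC ships a much longer list, and the 50-base prefix is an assumption.

```python
from collections import Counter

# Illustrative contaminant table; FastQC's built-in list is far longer.
KNOWN_CONTAMINANTS = {"Illumina Universal Adapter": "AGATCGGAAGAGC"}


def overrepresented(sequences, min_fraction=0.001, prefix_len=50):
    """Return (sequence, fraction) pairs at or above min_fraction, most common first."""
    counts = Counter(seq[:prefix_len] for seq in sequences)
    total = sum(counts.values())
    return [
        (seq, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


def possible_sources(seq, known=KNOWN_CONTAMINANTS):
    """Names of known contaminants whose sequence occurs inside seq."""
    return [name for name, contaminant in known.items() if contaminant in seq]


# Hypothetical reads: one sequence dominates, one appears once in 2000
reads = ["ACGTAGATCGGAAGAGCAC"] * 1999 + ["TTTTGGGGCCCCAAAA"]
for seq, fraction in overrepresented(reads):
    print(seq, f"{fraction:.2%}", possible_sources(seq) or "no hit")
```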

Tool comparison

4. QC + trimming tools compared

The classic flow is FastQC → Trimmomatic / Cutadapt → FastQC again: three tools and three I/O passes. fastp (released 2018, updated to 1.0 in 2025) folds all of this into a single scan. It is reported to run roughly 2–5× faster than the legacy combination and produces an HTML report with information comparable to FastQC's.

Feature	FastQC	Trimmomatic	Cutadapt	fastp
QC report	✅ HTML	❌	❌	✅ HTML + JSON
Adapter trimming	❌	✅	✅	✅ auto-detect
Quality trimming	❌	✅ sliding window	⚠️ basic	✅ sliding window
UMI handling	❌	❌	⚠️	✅
polyG trimming (needed for NextSeq/NovaSeq)	❌	❌	⚠️ via --nextseq-trim	✅
Speed	🔵	🐢	🐢	⚡⚡⚡
Multithreading	⚠️	✅	✅	✅ native
Actively maintained	✅	❌ largely inactive since 2019	✅	✅

🌳 Which tool should you use?

New projects from 2024 onward → fastp (QC + trimming in one pass).

Need to stay consistent with an existing pipeline → FastQC + Trimmomatic is still fine.

NovaSeq / NextSeq platform → use a tool with polyG trimming (fastp turns it on automatically when the read headers identify NextSeq/NovaSeq data).

Large batches (>100 samples) → fastp + MultiQC to consolidate all reports.

Code

5. Worked examples

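# Run FastQC on paired-end FASTQ files (produces HTML + zip)
mkdir -p qc_out   # FastQC does not create the output directory itself
fastqc -o qc_out/ -t 8 sample_R1.fastq.gz sample_R2.fastq.gz

# Outputs (R2 gets the matching pair of files):
#   qc_out/sample_R1_fastqc.html  ← visual report
#   qc_out/sample_R1_fastqc.zip   ← raw data (consumed by MultiQC)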

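# fastp: QC, adapter trimming, quality trimming, and polyG trimming in one command.
# Option notes sit above the command because a "#" comment after a trailing
# backslash breaks the shell line continuation.
#   --detect_adapter_for_pe   auto-detect adapters for paired-end data
#   --cut_right ...           sliding window: 4 bp wide, cut once mean quality < Q20
#   --length_required 50      keep only reads still ≥50 bp after trimming
#   --trim_poly_g             remove polyG artifacts from NovaSeq/NextSeq
fastp \
  -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
  -o sample_R1.trim.fastq.gz -O sample_R2.trim.fastq.gz \
  --detect_adapter_for_pe \
  --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 \
  --length_required 50 \
  --trim_poly_g \
  --thread 8 \
  --html sample.fastp.html --json sample.fastp.json

# Done in one pass. The HTML report includes before/after comparison plots.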

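# Trimmomatic (legacy; still usable but no longer actively updated).
# Step notes sit above the command because a "#" comment after a trailing
# backslash breaks the shell line continuation.
#   ILLUMINACLIP:<fasta>:2:30:10   adapter matching stringency
#                                  (seed mismatches : palindrome threshold : simple threshold)
#   LEADING:3 TRAILING:3           trim bases below Q3 from both ends
#   SLIDINGWINDOW:4:20             4 bp window, cut when mean quality < Q20
#   MINLEN:50                      drop reads shorter than 50 bp after trimming
trimmomatic PE -phred33 -threads 8 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1.trim.fastq.gz sample_R1.unpaired.fastq.gz \
  sample_R2.trim.fastq.gz sample_R2.unpaired.fastq.gz \
  ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 \
  SLIDINGWINDOW:4:20 \
  MINLEN:50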

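# MultiQC: merge the fastp/FastQC reports of all samples into a single report
# (assumes the fastp JSON reports were written to fastp_out/)
multiqc qc_out/ fastp_out/ -o multiqc_report/

# Scans the given directories for output from any supported tool and writes multiqc_report.html
# Supports 100+ tools, including FastQC, fastp, Trimmomatic, Picard, samtools, and GATK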
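The JSON report written by fastp can also be checked programmatically, for example to gate a pipeline on Q30. The sketch below is hedged: the key names (`summary`, `before_filtering`, `after_filtering`, `total_reads`, `q30_rate`) are what fastp reports are assumed to contain and should be verified against a real `sample.fastp.json`, and the numbers in `EXAMPLE_REPORT` are made up for illustration.

```python
import json

# Made-up numbers for illustration only; a real fastp report has many more fields.
EXAMPLE_REPORT = json.loads("""
{
  "summary": {
    "before_filtering": {"total_reads": 1000000, "q30_rate": 0.91},
    "after_filtering":  {"total_reads": 960000,  "q30_rate": 0.94}
  }
}
""")


def summarize_fastp(report: dict) -> dict:
    """Extract read counts and Q30 rate from a parsed fastp JSON report."""
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    return {
        "reads_before": before["total_reads"],
        "reads_after": after["total_reads"],
        "fraction_kept": after["total_reads"] / before["total_reads"],
        "q30_after": after["q30_rate"],
    }


def passes_q30(report: dict, minimum: float = 0.80) -> bool:
    """True if the post-filtering Q30 rate meets the chosen minimum."""
    return summarize_fastp(report)["q30_after"] >= minimum


print(summarize_fastp(EXAMPLE_REPORT))
print(passes_q30(EXAMPLE_REPORT))
```

For a real run, the report would be loaded with `json.load(open("sample.fastp.json"))` in place of `EXAMPLE_REPORT`.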

FAQ

6. Common questions and troubleshooting

A FastQC warn or fail flag does not necessarily mean the data is unusable. FastQC's pass/warn/fail thresholds are generic defaults that assume a random, diverse library, so several "fails" are acceptable for WGS/WES. High Sequence Duplication Levels in WES is expected because capture concentrates reads on the targets, and Per base sequence content imbalance in the first 13 bp reflects priming or fragmentation bias and needs no fix. Read the trend, not the traffic light.

On whether trimming is needed before alignment: BWA-MEM tolerates mild adapter contamination and low-quality read ends by soft-clipping them. Handling adapters before alignment is still advisable, whether by trimming them or by marking them as GATK's uBAM-based workflow does with Picard MarkIlluminaAdapters, because: (1) adapter contamination produces many soft-clipped reads in the BAM, which can distort downstream statistics such as those used in BQSR; (2) short-fragment samples (for example cfDNA) with severe adapter read-through must be cleaned. Quality trimming is optional, since most modern callers already take base quality into account.

A double peak in Per sequence GC content has three common causes: (1) Cross-species contamination, since bacterial GC% differs from human (mycoplasma is AT-rich, around 32% GC in some species) → screen with Kraken2 or FastQ Screen. (2) Adapter dimers, which form a single peak at an extreme GC value → fastp can remove them. (3) Samples from different species mixed by accident. When you see a double peak, the most important next step is to identify and quantify the contamination source rather than proceed straight to alignment.

On polyG tails: Illumina 2-color chemistry (NovaSeq, NextSeq) encodes G as "no signal." When the fragment is shorter than the read length, the sequencer keeps reading past the end of the template, and the resulting long "no-signal" stretch is called as polyG. These tails must be removed at the trimming stage or alignment quality suffers badly. fastp enables polyG trimming automatically when the read headers identify NextSeq/NovaSeq data, and --trim_poly_g forces it on for any input.
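The effect of polyG trimming can be shown with a few lines of Python. This sketch removes a trailing run of G only when it is at least `min_len` bases long; the default of 10 is an assumption modeled on fastp's minimum polyG length, and unlike fastp this version tolerates no mismatches inside the run.

```python
def trim_poly_g(seq, qual, min_len=10):
    """Strip a trailing run of G (and the matching quality characters)
    when the run is at least min_len bases long."""
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    if run_length >= min_len:
        return stripped, qual[:len(stripped)]
    return seq, qual  # run too short to be a likely artifact; keep the read as is


# Hypothetical read: 6 bp of real sequence followed by 12 no-signal cycles
read = "ACGTAC" + "G" * 12
print(trim_poly_g(read, "F" * len(read)))
```

A genuine G immediately before the artifact is removed along with it, which is one reason the trimmed reads are usually passed through a minimum-length filter afterwards.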

📝 Self-check

1. Your FastQC report shows 50% duplicates under Sequence Duplication Levels, flagged red. The sample is WES. You should:

A. Discard the sample and re-sequence immediately

B. Force-remove all duplicates with a dedup tool before alignment

C. Not worry: high WES duplication is normal because capture concentrates reads on the targets; let MarkDuplicates handle it at the BAM stage

D. Change the trimming parameters

2. Reads from the NovaSeq platform often end in polyG sequences. This is because:

A. The sample DNA is genuinely G-rich

B. 2-color chemistry reads "no signal" as G, producing false polyG when the fragment is shorter than the read length

C. Poor PCR primer design

D. The reference genome contains polyG repeats

3. A base in a read is labeled Phred Q40. This means:

A. The base has a 1/10000 error probability (99.99% accuracy)

B. The base is at position 40 in the reference

C. The average quality of the whole read is 40

D. Coverage depth at that position is 40×