CHAPTER 10 / 14

NGS QC: FastQC + MultiQC in Practice

Run FastQC in batch, aggregate the reports with MultiQC, and interpret the common metrics: use Linux to see the QC status of 100 samples at a glance.

FastQC, fastp, MultiQC: how the three tools divide the work

🔬

FastQC

Produces one HTML report per FASTQ file: base quality, sequence content, GC content, duplication, adapter content, overrepresented sequences…

✂️

fastp

QC + adapter trimming + quality filtering in one pass. Fast, with clean reports. Recommended as a replacement for Trimmomatic in modern NGS projects.

📊

MultiQC

Scans a folder for logs from FastQC, STAR, salmon, samtools, … and aggregates them into one interactive multi-sample report.

Standard QC workflow (copy as-is)

Create the directories

mkdir -p results/{fastqc_pre,fastp,fastqc_post,multiqc} logs
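
Before running the rest of the workflow, it can help to confirm that the three tools are actually on your PATH. The following is a minimal sketch; `require_tools` is a hypothetical helper name, not part of FastQC, fastp, or MultiQC.

```bash
#!/usr/bin/env bash
# Sketch: fail early if a required command is not installed.
# require_tools is a hypothetical helper introduced for illustration.

# require_tools NAME... -> returns 0 if every command exists, 1 otherwise
require_tools() {
  local t missing=0
  for t in "$@"; do
    if ! command -v "$t" >/dev/null 2>&1; then
      printf 'missing tool: %s\n' "$t" >&2   # report on stderr
      missing=1
    fi
  done
  return "$missing"
}

# Usage (not run here):
#   require_tools fastqc fastp multiqc || exit 1
```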

Run FastQC on the raw FASTQ files

fastqc -t 8 -o results/fastqc_pre raw_data/*.fastq.gz \
  > logs/fastqc_pre.log 2>&1

Trim adapters and filter low-quality reads with fastp

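# Paired-end trimming; sample names are listed by hand here.
# fastp writes its run log to stderr, so 2> captures it per sample.
for s in sampleA sampleB; do
  fastp \
    -i "raw_data/${s}_R1.fastq.gz" -I "raw_data/${s}_R2.fastq.gz" \
    -o "results/fastp/${s}_R1.fq.gz" -O "results/fastp/${s}_R2.fq.gz" \
    -h "results/fastp/${s}.html" -j "results/fastp/${s}.json" \
    --thread 8 \
    2> "logs/fastp_${s}.log"
done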
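
With 100 samples, typing the names into the loop gets tedious. One possible sketch is to derive them from the R1 file names. It assumes every sample follows the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` pattern used above; `list_samples` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: build the sample list from the files on disk.
# Assumes the naming pattern <sample>_R1.fastq.gz (as in the loop above).

# list_samples DIR -> prints one sample name per line
list_samples() {
  local dir=$1 f
  for f in "$dir"/*_R1.fastq.gz; do
    [ -e "$f" ] || continue              # glob matched nothing
    f=${f##*/}                           # strip the directory part
    printf '%s\n' "${f%_R1.fastq.gz}"    # strip the R1 suffix
  done
}

# Usage (not run here):
#   list_samples raw_data | while read -r s; do
#     echo "would run fastp for $s"
#   done
```

Replacing the `echo` with the fastp call from the loop gives the same behavior for any number of samples.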

Run FastQC again on the trimmed FASTQ files

fastqc -t 8 -o results/fastqc_post results/fastp/*.fq.gz \
  > logs/fastqc_post.log 2>&1
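
When many files go through FastQC in one call, a quick check that every input got a report can catch silent failures. This sketch relies on FastQC naming its report `<name>_fastqc.html` after stripping `.gz` and `.fastq` / `.fq` from the input name; `missing_fastqc` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: list FASTQ inputs that have no FastQC HTML report yet.
# Assumes reports are named <name>_fastqc.html in the output directory.

# missing_fastqc OUT_DIR FASTQ... -> prints inputs without a report
missing_fastqc() {
  local out_dir=$1 f base
  shift
  for f in "$@"; do
    base=${f##*/}          # file name only
    base=${base%.gz}       # drop compression suffix
    base=${base%.fastq}    # drop .fastq ...
    base=${base%.fq}       # ... or .fq
    [ -e "$out_dir/${base}_fastqc.html" ] || printf '%s\n' "$f"
  done
}

# Usage (not run here):
#   missing_fastqc results/fastqc_post results/fastp/*.fq.gz
```

An empty output means every file has a report. If the glob matches nothing, the literal pattern is printed as "missing", which is also a useful signal.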

Aggregate everything with MultiQC in one command

multiqc results/ -o results/multiqc \
  --title "RNAseq QC report" \
  --filename multiqc_report.html

Interpreting the key FastQC metrics

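Module | What it measures | Likely cause of a red flag
Per base sequence quality | Phred score at each read position | A drop toward the end of the read calls for trimming; a drop mid-read is abnormal
Per sequence GC content | GC content | A bimodal distribution suggests contamination
Sequence duplication | Fraction of duplicated sequences | Normal in RNA-seq; high in DNA-seq points to PCR bias
Adapter content | Residual adapter sequence | High means trimming is required
Overrepresented sequences | Overrepresented sequences | rRNA / adapter / contamination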
💡

Reading rule: don't panic at the first red flag. Many red flags are expected for a given library type (for example, sequence duplication in amplicon libraries). Always interpret the report in light of the library protocol and the application.
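
To see which modules were flagged without opening every HTML report, one option is to read the `summary.txt` that FastQC stores inside each `_fastqc.zip`. That file is tab-separated: status (PASS / WARN / FAIL), module name, file name. The sketch below only filters those rows; `flagged_modules` is a hypothetical helper, and the decision about whether a flag matters still depends on the library type.

```bash
#!/usr/bin/env bash
# Sketch: print the non-PASS rows of FastQC summary.txt files.
# Input columns (tab-separated): status, module, file name.

# flagged_modules [FILE...] -> reads the files, or stdin if none are given
flagged_modules() {
  awk -F '\t' '$1 == "FAIL" || $1 == "WARN"' "$@"
}

# Usage (needs unzip; not run here):
#   for z in results/fastqc_pre/*_fastqc.zip; do
#     unzip -p "$z" '*/summary.txt'
#   done | flagged_modules
```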

MultiQC is the NGS engineer's strongest ally

After any NGS pipeline, the one-liner multiqc results/ scans every log it recognizes and produces:

  • A per-sample summary that can be filtered and sorted, with interactive plots
  • Metrics from each tool (FastQC, STAR, salmon, samtools, bcftools, Picard, …)
  • A multiqc_data/ folder of tab-delimited files for downstream analysis in R / Python
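multiqc results/
Scan the results directory
multiqc results/ -o results/multiqc -n run1
Set the output directory and the report file name
multiqc results/ --interactive
Force interactive plots
multiqc results/ --sample-names samples.tsv
Offer alternative sample names from a TSV (adds rename buttons to the report)
multiqc results/ -m fastqc -m fastp -m star
Run only the listed modules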
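
The tab-delimited files in the MultiQC data folder can also be inspected straight from the shell before moving to R or Python. The sketch below is generic: it numbers the header fields of any tab-delimited file and extracts chosen columns. The file name in the usage comment (`multiqc_general_stats.txt`) and the data folder path are assumptions; check them against your own output, since they depend on the MultiQC version and on the `--filename` you passed.

```bash
#!/usr/bin/env bash
# Sketch: peek at a tab-delimited table such as the ones MultiQC writes.
# list_columns and pick_columns are hypothetical helper names.

# list_columns FILE -> prints "index<TAB>column name" for the header row
list_columns() {
  head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# pick_columns FILE LIST -> prints the chosen columns, e.g. LIST=1,3
pick_columns() {
  cut -f "$2" "$1"
}

# Usage (paths are assumptions; not run here):
#   stats=results/multiqc/multiqc_data/multiqc_general_stats.txt
#   list_columns "$stats"
#   pick_columns "$stats" 1,3
```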

Interactive: QC threshold sliders

The interactive demo simulates 16 samples, each with a mean quality (Phred) and an adapter %. Dragging the two thresholds shows which samples pass QC.

Green = Pass | Red = Fail; the numbers are mean Phred / adapter %.
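
The same pass/fail logic can be sketched on the command line. The example assumes a hypothetical tab-delimited file with three columns and no header (sample, mean Phred, adapter %). The thresholds in the usage comment (30 and 5) are illustrative values, not recommendations from the source.

```bash
#!/usr/bin/env bash
# Sketch: label each sample PASS or FAIL against two thresholds.
# Input (assumed layout): sample <TAB> mean Phred <TAB> adapter %

# qc_pass_fail FILE MIN_PHRED MAX_ADAPTER_PCT
qc_pass_fail() {
  awk -F '\t' -v q="$2" -v a="$3" '
    {
      # pass only if quality is high enough AND adapter content low enough
      status = ($2 >= q && $3 <= a) ? "PASS" : "FAIL"
      printf "%s\t%s\n", $1, status
    }' "$1"
}

# Usage (file name and thresholds are assumptions; not run here):
#   qc_pass_fail qc_metrics.tsv 30 5
```

Raising the Phred threshold or lowering the adapter threshold moves more samples to FAIL, which is what dragging the sliders demonstrates.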

📝 Self-check

1. What is MultiQC's most important capability?

A. Re-running QC itself
B. Reading BAMs directly to compute the mapping rate
C. Automatically scanning a folder and aggregating logs and reports from many tools
D. Replacing FastQC

2. FastQC shows a red flag for Adapter Content. What is the next step?

A. Go straight to alignment; FastQC is unreliable
B. Trim adapters with fastp or Trim Galore
C. Resequence
D. Add the adapters to the reference

3. You want to run fastp on 100 samples in batch. What is the most Linux-y way?

A. Copy-paste the command one sample at a time
B. Run the samples one by one in a GUI
C. Write a for loop or use Snakemake/Nextflow
D. Throw everything into Excel

4. An RNA-seq sample shows 50% sequence duplication. What does this mean?

A. The sample must be discarded
B. It is an equipment problem
C. All reads are PCR duplicates
D. Heavy duplication from highly expressed genes is normal in RNA-seq; interpret it in light of the application