FastQC, fastp, MultiQC: who does what
FastQC
Produces one HTML report per FASTQ file: per-base quality, sequence and GC content, duplication levels, adapter content, overrepresented sequences, and other modules.
fastp
Runs QC, adapter trimming, and quality filtering in a single pass. It is fast and its reports are easy to read; for current NGS projects it is the recommended replacement for Trimmomatic.
MultiQC
Scans a directory for logs from FastQC, STAR, salmon, samtools, and other tools, then aggregates them into a single interactive multi-sample report.
Standard QC workflow (copy as-is)
    # Create output directories
    mkdir -p results/{fastqc_pre,fastp,fastqc_post,multiqc} logs

    # Run FastQC on the raw FASTQ files
    fastqc -t 8 -o results/fastqc_pre raw_data/*.fastq.gz \
      > logs/fastqc_pre.log 2>&1

    # fastp: trim adapters and filter low-quality reads
    for s in sampleA sampleB; do
      fastp \
        -i "raw_data/${s}_R1.fastq.gz" -I "raw_data/${s}_R2.fastq.gz" \
        -o "results/fastp/${s}_R1.fq.gz" -O "results/fastp/${s}_R2.fq.gz" \
        -h "results/fastp/${s}.html" -j "results/fastp/${s}.json" \
        --thread 8 \
        2> "logs/fastp_${s}.log"
    done

    # Run FastQC again on the trimmed FASTQ files
    fastqc -t 8 -o results/fastqc_post results/fastp/*.fq.gz \
      > logs/fastqc_post.log 2>&1

    # Aggregate everything with MultiQC
    multiqc results/ -o results/multiqc \
      --title "RNAseq QC report" \
      --filename multiqc_report.html
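The fastp loop above hardcodes `sampleA sampleB`. For a larger batch, the sample IDs can be derived from the file names instead. The following is a minimal sketch, not part of the original workflow: `list_samples` is a hypothetical helper, and it assumes every pair is named `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` as in the loop above.

```shell
# Sketch: derive sample IDs from R1 file names instead of hardcoding them.
# Assumption: paired files are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
list_samples() {
    # $1 = directory holding the raw FASTQ files
    for r1 in "$1"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue      # no match: the glob stays literal, skip it
        base=${r1##*/}                # strip the directory part
        printf '%s\n' "${base%_R1.fastq.gz}"
    done
}

# Usage (commented out because it needs fastp and real data):
# list_samples raw_data | while read -r s; do
#     fastp \
#         -i "raw_data/${s}_R1.fastq.gz" -I "raw_data/${s}_R2.fastq.gz" \
#         -o "results/fastp/${s}_R1.fq.gz" -O "results/fastp/${s}_R2.fq.gz" \
#         -h "results/fastp/${s}.html" -j "results/fastp/${s}.json" \
#         --thread 8 2> "logs/fastp_${s}.log"
# done
```

Printing one ID per line keeps the helper composable: the same list can feed a `while read` loop, `xargs`, or a job scheduler.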
Interpreting key FastQC modules
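| Module | What it shows | Possible cause of a red flag |
|---|---|---|
| Per base sequence quality | Phred score at each read position | A drop toward the end of the read calls for trimming; a drop mid-read is abnormal |
| Per sequence GC content | GC content | A bimodal distribution suggests contamination |
| Sequence duplication | Proportion of duplicated sequences | Normal for RNA-seq; high in DNA-seq suggests PCR bias |
| Adapter content | Residual adapter sequence | High means trimming is required |
| Overrepresented sequences | Sequences that occur far more often than expected | rRNA, adapters, or contamination |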
Rule of thumb: a red flag is not automatically a problem. Many are expected for a given library type (for example, high sequence duplication in amplicon libraries). Always interpret the results in light of the library protocol and the application.
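The rule above can be applied mechanically: list the modules that failed, but skip one that is expected for the library type. The sketch below assumes FastQC's `summary.txt` layout (tab-separated status, module name, file name, found inside each `*_fastqc.zip`); `unexpected_fails` is a hypothetical helper, and tolerating exactly one module is a simplification.

```shell
# Sketch: print FastQC modules that FAIL, ignoring one module that is
# expected to fail for this library type.
# Assumption: summary.txt is tab-separated as STATUS, module, file name.
unexpected_fails() {
    # $1 = path to summary.txt ("-" reads standard input)
    # $2 = module name to tolerate, e.g. "Sequence Duplication Levels"
    awk -F '\t' -v ok="$2" '$1 == "FAIL" && $2 != ok { print $2 }' "$1"
}

# Usage (commented out; the zip name below is an assumed example):
# unzip -p results/fastqc_pre/sampleA_R1_fastqc.zip '*/summary.txt' \
#     | unexpected_fails - "Sequence Duplication Levels"
```

For an RNA-seq library, a failed duplication module would be filtered out here, while a failed adapter module would still be reported.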
MultiQC: the NGS engineer's strongest ally
After any NGS pipeline, a single command, multiqc results/, scans all logs and produces:
- One card per sample, with filtering, sorting, and interactive plots
- Aggregated metrics from each tool (FastQC, STAR, salmon, samtools, bcftools, Picard, …)
- A multiqc_data/ folder containing TSV files for further analysis in R or Python
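The TSV files in `multiqc_data/` can also be inspected straight from the shell before moving to R or Python. The sketch below pulls the sample name plus one named column out of a tab-separated file. `tsv_column` is a hypothetical helper, and the real column headers depend on the MultiQC version and the modules that ran, so check them first with `head -n 1`.

```shell
# Sketch: print the first column (sample name) plus one column chosen by
# its header from a tab-separated file.
# Assumption: the first line is a header row and column 1 holds the sample name.
tsv_column() {
    # $1 = TSV file, $2 = exact header of the wanted column
    awk -F '\t' -v want="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == want) col = i
            if (!col) { print "column not found: " want > "/dev/stderr"; exit 1 }
            next
        }
        { print $1 "\t" $col }
    ' "$1"
}

# Usage (commented out; the file name is an assumed example of a table
# in multiqc_data/, and the header is a placeholder):
# head -n 1 results/multiqc/multiqc_data/multiqc_general_stats.txt
# tsv_column results/multiqc/multiqc_data/multiqc_general_stats.txt "<header>"
```

Selecting by header rather than by position keeps the command valid when the column order changes between runs.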
Interactive: QC threshold sliders
Below is a simulated set of 16 samples, each with a mean quality (Phred) and an adapter %. Drag the threshold sliders to see which samples pass QC.
Green = Pass | Red = Fail; the numbers are mean Phred / adapter %
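The pass/fail rule behind the sliders can be sketched as a two-threshold filter: a sample passes only if its mean Phred is at or above the minimum and its adapter % is at or below the maximum. `qc_filter` is a hypothetical helper, and the values in the usage example are made up for illustration; they are neither the page's simulated data nor recommended cutoffs.

```shell
# Sketch: label each sample PASS or FAIL against two thresholds.
# Input on stdin: sample<TAB>mean_phred<TAB>adapter_pct (one sample per line).
qc_filter() {
    # $1 = minimum mean Phred, $2 = maximum adapter %
    awk -F '\t' -v minq="$1" -v maxa="$2" '
        {
            # "+ 0" forces numeric comparison
            status = ($2 + 0 >= minq + 0 && $3 + 0 <= maxa + 0) ? "PASS" : "FAIL"
            print $1 "\t" status
        }'
}

# Usage with made-up values (thresholds: Phred >= 30, adapter <= 5 %):
# printf 's1\t35.1\t0.4\ns2\t27.9\t0.2\ns3\t36.0\t6.3\n' | qc_filter 30 5
```

Because both conditions must hold, a sample with excellent base quality still fails when its adapter content is above the limit.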
📝 Self-check
1. What is MultiQC's most important capability?
2. FastQC flags Adapter Content in red. What should the next step be?
3. You want to run fastp on 100 samples in a batch. What is the most Linux-y way to do it?
4. An RNA-seq sample shows 50% sequence duplication. What does this mean?