Step 4: Post-Alignment Processing

為什麼

一、為什麼還沒能直接 call variants？

剛產生的 sorted BAM 檔還帶著兩個系統性偏誤，會直接污染下游 variant calling：

PCR duplicates：建庫時 PCR 擴增會把單一 DNA 分子複製成多條 reads。如果不標記，HaplotypeCaller 會以為「同一個鹼基被觀察到 8 次」而高估 variant evidence。
系統性鹼基品質偏誤：定序機回報的 Phred Q 並非完美——它會因 cycle、read group、context (前後鹼基) 而系統性偏高或偏低。BQSR (Base Quality Score Recalibration) 用已知 variants 當錨點，學出每個 (cycle, context, RG) 組合的實際錯誤率，再校正每個鹼基的 Q 值。

這兩步合稱「Data Pre-processing」的尾段，是 GATK Best Practices 中 variant calling 之前的最後關卡。

The freshly-sorted BAM still carries two systematic biases that would directly contaminate downstream variant calling:

PCR duplicates: library-prep PCR amplifies a single DNA molecule into many reads. Without marking, HaplotypeCaller will interpret "the same base observed 8 times" as inflated variant evidence.
Systematic base-quality bias: the Phred Q reported by the sequencer is imperfect — it is systematically biased per cycle, read group, and sequence context. BQSR (Base Quality Score Recalibration) uses known variants as anchors to learn the real error rate per (cycle, context, RG) combination, and recalibrates each base's Q value.

Together these form the tail of "Data Pre-processing" — the final gate before variant calling in GATK Best Practices.

技術

二、MarkDuplicates 怎麼判斷 duplicate？

判斷規則出乎意料地簡單：兩條 paired reads 如果 5' 端 mapping 座標完全一樣（兩端都一樣，且方向一樣），就視為來自同一個原始 DNA 分子。當一群 reads 都符合這條件時，工具會挑「總和 base quality 最高」那條當代表，其餘加上 SAM flag 0x400 (1024) 標記為 duplicate。

注意：duplicate 不會被刪除，只是標記。HaplotypeCaller、Mutect2 等工具預設會略過帶 0x400 flag 的 reads。要實際移除可加 --REMOVE_DUPLICATES true，但通常沒必要——保留可追溯性更重要。

The rule is surprisingly simple: two paired reads are duplicates if both 5' mapping coordinates match and orientations agree. Among all reads sharing this signature, the tool picks the one with highest summed base quality as the representative; the rest get SAM flag 0x400 (1024) marking them as duplicates.

Note: duplicates are marked, not deleted. HaplotypeCaller, Mutect2, etc. skip reads with flag 0x400 by default. To physically remove them, add --REMOVE_DUPLICATES true — but usually unnecessary; traceability matters more.

📌

Picard MarkDuplicates

GATK Best Practices 的官方工具。準確、輸出 metrics 詳細（含 LIBRARY-level duplication rate），但對大 BAM 較慢、記憶體吃重。

The official GATK Best Practices tool. Accurate, detailed metrics (incl. LIBRARY-level duplication rate), but slower and memory-hungry on large BAMs.

⚡

samtools markdup

速度快數倍、記憶體少，結果與 Picard 高度一致。但需要先跑 samtools fixmate -m 補上 MS/MC tag。近年已被許多 pipeline 採用作為 Picard 替代品。

Several times faster with lower memory; results highly consistent with Picard. Requires samtools fixmate -m first to add MS/MC tags. Increasingly used as a Picard replacement in modern pipelines.

🔧

MarkDuplicatesSpark

GATK 的 Spark 版本，支援多執行緒，可同時做 sort + markdup。對單機多核 CPU 友善，但仍是 beta 狀態。

GATK's Spark variant — multi-threaded, can sort and mark in one pass. Friendly to multi-core single nodes but still flagged beta.

⚠️

UMI-aware tools

若建庫使用 UMI (Unique Molecular Identifier)，傳統座標比對會誤殺真正的不同分子。改用 fgbio GroupReadsByUmi + CallMolecularConsensusReads 或 Picard UmiAwareMarkDuplicatesWithMateCigar。

If the library uses UMIs (Unique Molecular Identifiers), coordinate-only marking incorrectly collapses distinct molecules. Use fgbio GroupReadsByUmi + CallMolecularConsensusReads, or Picard UmiAwareMarkDuplicatesWithMateCigar.

技術

三、BQSR 兩階段流程

BQSR 的核心假設是：已知 variant sites 之外的所有「mismatch」都是定序錯誤。所以給工具一份「known sites VCF」（dbSNP + Mills indels + 1000G indels），就能用「BAM 中所有非 known-sites 的 mismatch」當作錯誤的 ground truth。

BQSR's core assumption: any mismatch outside known variant sites is a sequencing error. Given a "known sites VCF" (dbSNP + Mills indels + 1000G indels), the tool treats every non-known-site mismatch as ground-truth error.

階段	工具	輸入	輸出	做什麼
1	`BaseRecalibrator`	dedup BAM + known-sites VCF	`recal.table`	掃過所有 reads，建立 (cycle × context × RG × QualScore) 的錯誤率表格
2	`ApplyBQSR`	原 BAM + recal.table	recalibrated BAM	逐個 base 查表，將原 Q 替換為校正後 Q
3 (選)	`AnalyzeCovariates`	校正前後兩份 recal.table	PDF	畫圖驗證 BQSR 是否成功收斂

什麼時候可以略過 BQSR？

DRAGEN 與 DeepVariant 內建自己的鹼基品質模型，不需要也不應該再跑 BQSR。對於 non-model organism（沒有 known-sites VCF）也應略過——強行跑會把真實 variants 當成錯誤校正掉。GATK HaplotypeCaller 在 non-model 物種上的官方建議是：跳過 BQSR，改用 hard filtering。

DRAGEN and DeepVariant ship their own base-quality models — BQSR should be skipped. For non-model organisms (no known-sites VCF), also skip — forcing it would calibrate real variants away as "errors". GATK's official recommendation for HaplotypeCaller on non-model species: skip BQSR, use hard filtering instead.

互動模擬

四、互動模擬：BQSR 校正前後品質分布

移動滑桿來模擬不同 sequencing cycle 下的品質校正效果。原始 Q 通常在 read 的後段被高估（機器仍報 Q35，但真實錯誤率對應 Q25），BQSR 會把它拉回來——校正後的 Q 與真實錯誤率對應更準。

Move the slider to simulate quality recalibration at different sequencing cycles. The reported Q is typically inflated late in a read (machine reports Q35, but real error rate corresponds to Q25); BQSR pulls it back so the recalibrated Q tracks the true error rate.

Sequencing cycle (read 中的位置, bp): 75

藍 = 機器原報 Q（Reported Q），橘 = BQSR 校正後 Q（Empirical Q），灰 = 該 cycle 實測錯誤率對應的真實 Q。校正後的橘線應更貼近灰線。

Blue = original reported Q. Orange = BQSR-recalibrated Q. Gray = the true Q corresponding to the empirical error rate at that cycle. The orange line should track the gray line more closely after recalibration.

技術

五、Known-sites 三劍客

BQSR 的 --known-sites 參數可以重複指定。GATK 官方對 GRCh38 human 的標準三件組：

BQSR's --known-sites argument can be repeated. GATK's standard trio for GRCh38 human:

dbSNP

Homo_sapiens_assembly38.dbsnp138.vcf — 最大宗的 SNP 集合，~7 億條 entries。提供大部分 SNV 校正錨點。

Homo_sapiens_assembly38.dbsnp138.vcf — the largest SNP set (~700 M entries). Provides most of the SNV calibration anchors.

Mills & 1000G Gold Indels

Mills_and_1000G_gold_standard.indels.hg38.vcf.gz — 高品質 indel 集合，是 indel 校正的主力。

Mills_and_1000G_gold_standard.indels.hg38.vcf.gz — high-confidence indel set, the workhorse for indel calibration.

1000G Phase 1 Indels

Homo_sapiens_assembly38.known_indels.vcf.gz — 補充 indel 集合，覆蓋 Mills 之外的常見變異。

Homo_sapiens_assembly38.known_indels.vcf.gz — supplementary indel set, covering common variants beyond Mills.

實務指令

六、Pipeline 指令範例

# Picard / GATK MarkDuplicates
gatk MarkDuplicates \
    -I sorted.bam \
    -O dedup.bam \
    -M dedup_metrics.txt \
    --CREATE_INDEX true

# 讀 metrics 看 duplication rate
grep -A2 "^LIBRARY" dedup_metrics.txt
# PERCENT_DUPLICATION 通常 WGS < 15%, WES 可能 20-30%

# samtools markdup (需要先 fixmate)
samtools sort -n -@ 8 sorted.bam -o nsort.bam
samtools fixmate -m -@ 8 nsort.bam fix.bam
samtools sort -@ 8 fix.bam -o resort.bam
samtools markdup -@ 8 -s resort.bam dedup.bam 2> markdup_stats.txt
samtools index dedup.bam
# -s 輸出統計到 stderr

# Stage 1: build recalibration table
gatk BaseRecalibrator \
    -I dedup.bam \
    -R reference.fasta \
    --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
    --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz \
    -O recal.table

# Stage 2: apply
gatk ApplyBQSR \
    -I dedup.bam \
    -R reference.fasta \
    --bqsr-recal-file recal.table \
    -O analysis_ready.bam

# (Optional) Stage 3: visualize
gatk BaseRecalibrator -I analysis_ready.bam -R reference.fasta \
    --known-sites ... -O recal_post.table
gatk AnalyzeCovariates -before recal.table -after recal_post.table -plots BQSR.pdf

# Streaming 範例：BWA-MEM2 → sort → markdup → BQSR
THREADS=16
SAMPLE=NA12878
REF=Homo_sapiens_assembly38.fasta

bwa-mem2 mem -t $THREADS -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tLB:lib1\tPL:ILLUMINA" \
    $REF ${SAMPLE}_R1.fq.gz ${SAMPLE}_R2.fq.gz \
  | samtools sort -@ $THREADS -o ${SAMPLE}.sorted.bam -
samtools index ${SAMPLE}.sorted.bam

gatk MarkDuplicates -I ${SAMPLE}.sorted.bam -O ${SAMPLE}.dedup.bam -M ${SAMPLE}.metrics.txt --CREATE_INDEX true

gatk BaseRecalibrator -I ${SAMPLE}.dedup.bam -R $REF \
    --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
    --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    -O ${SAMPLE}.recal.table

gatk ApplyBQSR -I ${SAMPLE}.dedup.bam -R $REF \
    --bqsr-recal-file ${SAMPLE}.recal.table \
    -O ${SAMPLE}.analysis_ready.bam

常見問題

七、FAQ

duplication rate 多少算正常？

30× WGS 一般 5-15%；WES 因 library complexity 較低、PCR cycle 較多，常見 15-30%。超過 40% 通常代表 library 起始 input 太少，需要重做。Amplicon panel 可能 > 50% 屬正常。

30× WGS: typically 5–15%. WES: 15–30% (lower library complexity, more PCR cycles). Over 40% usually means too little starting material — re-prep. Amplicon panels can exceed 50% normally.

BQSR 跑完後 Q 值會變怎樣？

通常高 Q 值會被往下拉（從 Q40 → Q35），低 Q 值微調，整體分布變得更窄、更貼近真實錯誤率。AnalyzeCovariates 的 PDF 會顯示「校正前 reported vs empirical 偏差很大、校正後幾乎重疊」。

High Q's are typically pulled down (Q40 → Q35), low Q's lightly adjusted, distribution narrows toward truth. The AnalyzeCovariates PDF should show large reported-vs-empirical gap before, near-overlap after.

可以先 BQSR 再 MarkDuplicates 嗎？

不行。順序必須是 sort → markdup → BQSR。BQSR 必須略過 duplicate reads（避免同一個錯誤被計算多次），所以前提是 duplicate flag 已經被打上。

No. Order must be sort → markdup → BQSR. BQSR must skip duplicate reads (so the same error isn't counted multiple times), which requires the dup flag to already be set.

FASTQ-to-BAM 全程要多久？

30× WGS、16 cores：BWA-MEM2 約 4-6 小時，sort+markdup 約 1-2 小時，BQSR 約 2-4 小時，合計 8-12 小時。DRAGEN FPGA 板可在 30 分鐘內完成全流程。

30× WGS on 16 cores: BWA-MEM2 ~4–6 h, sort+markdup ~1–2 h, BQSR ~2–4 h — total 8–12 h. DRAGEN FPGA cards complete the full flow in under 30 minutes.

自我檢測

八、小測驗

Q1. 為什麼 MarkDuplicates 之後不直接刪除 duplicates？

Q2. BQSR 用「known-sites VCF」的目的是什麼？

Q3. 為什麼用 DeepVariant 或 DRAGEN 不需要 BQSR？