1. Somatic vs germline: fundamentally different
Germline calling assumes a diploid sample: each site has one of three genotypes (0/0, 0/1, 1/1), so the expected VAF clusters near 0, 0.5, or 1. Tumor samples break that assumption in several ways:
- Impure samples: tumor purity is typically 30–90%; the rest is admixed normal cells, stromal cells, and infiltrating immune cells.
- Clonal heterogeneity: a tumor is itself a mixture. Early clonal mutations are present in all tumor cells, while subclonal mutations are present in only a fraction.
- VAF is a continuum: about 0.5 for a clonal heterozygous mutation in a pure tumor, 0.4 at 80% purity, 0.15 at 30% purity, 0.05 for a subclonal mutation, and as low as 0.01 in deep ctDNA sequencing.
- Copy number may be abnormal: in CNV regions the VAF also depends on the local copy number and on how many copies carry the mutation, so it no longer follows the diploid expectation.
So a somatic caller is not a genotype caller. It makes a binary decision against background noise: is the ALT allele at this site a real mutation, or a sequencing error, mapping artifact, or contamination?
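The purity and copy-number effects listed above can be summarized in a single expression. This is the commonly used mixture model, stated here as an assumption rather than something the text derives. The symbols are introduced for illustration only: $\rho$ is tumor purity, $\mathrm{CCF}$ is the fraction of tumor cells carrying the mutation, $m$ is the number of mutated copies per carrying cell, and $C_T$ is the total copy number in tumor cells at that locus; normal cells are assumed diploid.

```latex
\mathrm{VAF}_{\text{expected}}
  = \frac{\rho \cdot \mathrm{CCF} \cdot m}{\rho \, C_T + 2\,(1 - \rho)}
```

For a clonal heterozygous mutation in a copy-neutral region ($\mathrm{CCF}=1$, $m=1$, $C_T=2$) this reduces to $\rho/2$: 0.4 at 80% purity and 0.15 at 30% purity. With the same mutation in a four-copy region ($C_T=4$, $\rho=0.8$) the expected VAF drops to $0.8/3.6 \approx 0.22$.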
2. Mutect2: tumor-only vs tumor-vs-normal
Tumor-vs-normal (recommended)
Use normal tissue (usually blood) from the same patient as a control. Any variant also present in the normal is treated as germline and filtered out; what remains is somatic. This is the cleanest and most reliable mode.
Tumor-only
Only the tumor sample is available (no normal can be obtained, e.g., archived FFPE). Filtering of germline variants and artifacts relies entirely on the PoN plus gnomAD allele frequencies. The false-positive rate is higher, so stricter downstream filtering is needed.
Panel of Normals (PoN)
Take 30+ normal samples from healthy tissue (same sequencing center, platform, and capture kit), run Mutect2 in tumor-only mode on each, then merge the results into a single VCF with CreateSomaticPanelOfNormals. Any site that recurs across multiple normals is treated as a sequencing/alignment artifact and filtered from the tumor results. The PoN is critical for Mutect2 tumor-only mode: without it, real calls are drowned in artifacts.
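The "recurs across normals" idea can be illustrated with a toy sketch. This is not what CreateSomaticPanelOfNormals does internally; it only shows the counting logic. The function name `recurrent_sites` and the one-site-per-line `CHROM:POS:REF:ALT` input files are hypothetical.

```shell
#!/usr/bin/env bash
# Toy illustration only: print sites seen in at least MIN_SAMPLES of the
# given per-normal site lists (one CHROM:POS:REF:ALT key per line).
# Usage: recurrent_sites MIN_SAMPLES normal1.sites normal2.sites ...
recurrent_sites() {
    local min_samples="$1"
    shift
    # Deduplicate within each file so a site counts once per normal,
    # then count how many normals each site appears in.
    for f in "$@"; do
        sort -u "$f"
    done | sort | uniq -c | awk -v min="$min_samples" '$1 >= min { print $2 }'
}
```

A site that appears in only one normal is more likely a private germline variant of that individual, which is why a recurrence threshold is used rather than a simple union.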
3. Interactive simulation: VAF distribution within a tumor
Adjust tumor purity and subclonal fraction to see how 1,000 simulated somatic mutations distribute across the VAF histogram. Clonal mutations have a median VAF ≈ purity/2 (heterozygous and present in all tumor cells, assuming a diploid, copy-neutral region); subclonal mutations form a lower-VAF tail.
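For readers without the interactive widget, the same behavior can be approximated offline. The following is a minimal sketch under stated assumptions, not the page's own simulator: every mutation is heterozygous in a diploid region, subclonal cancer cell fractions are drawn uniformly from 0.1 to 0.9, and read counts are binomial at a fixed depth. The function name `simulate_vaf` and its parameters are hypothetical.

```shell
#!/usr/bin/env bash
# simulate_vaf PURITY SUBCLONAL_FRACTION [N] [DEPTH] [SEED]
# Prints one line per simulated mutation: <clonal|subclonal> TAB <observed VAF>
simulate_vaf() {
    local purity="$1" sub_frac="$2" n="${3:-1000}" depth="${4:-100}" seed="${5:-42}"
    awk -v purity="$purity" -v sub_frac="$sub_frac" -v n="$n" \
        -v depth="$depth" -v seed="$seed" '
    BEGIN {
        srand(seed)
        for (i = 1; i <= n; i++) {
            if (rand() < sub_frac) {
                # Assumption: subclonal CCF is uniform on [0.1, 0.9)
                ccf = 0.1 + 0.8 * rand(); label = "subclonal"
            } else {
                ccf = 1.0; label = "clonal"
            }
            p = purity * ccf / 2            # expected VAF, diploid heterozygous
            alt = 0
            for (r = 0; r < depth; r++)     # binomial sampling of ALT reads
                if (rand() < p) alt++
            printf "%s\t%.3f\n", label, alt / depth
        }
    }'
}

# Example: text histogram in 0.05-wide VAF bins for 60% purity, 30% subclonal
# simulate_vaf 0.6 0.3 | awk -F'\t' '{ c[int($2 * 20)]++ }
#     END { for (b = 0; b <= 20; b++) printf "%.2f\t%d\n", b / 20, c[b] }'
```

With these assumptions the clonal peak sits near purity/2 and the subclonal mutations spread out below it; the exact numbers depend on the awk implementation's random number generator.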
4. The complete Mutect2 pipeline
| Step | Tool | Purpose |
|---|---|---|
| 1 | Mutect2 | Call candidate somatic variants (tumor + normal + PoN + germline AF) |
| 2 | GetPileupSummaries | Tabulate allele counts at common variant sites in tumor and normal |
| 3 | CalculateContamination | Estimate the cross-sample contamination fraction |
| 4 | LearnReadOrientationModel | Learn the read-orientation bias of strand-specific artifacts (FFPE deamination, 8-oxoG oxidation) |
| 5 | FilterMutectCalls | Combine all of the above; mark each call PASS or annotate the specific filter reason |
| 6 (optional) | FilterAlignmentArtifacts | Realign supporting reads with BWA-MEM against an ALT-aware reference to remove mapping artifacts |
FFPE (formalin-fixed paraffin-embedded) samples are a common source from pathology archives, but formalin fixation damages DNA. Its most characteristic effect is deamination of cytosine to uracil, which is read as T and produces systematic C>T / G>A artifacts. Oxidation of G to 8-oxoG is a separate damage mode, usually introduced during library preparation, that is also read as T and produces G>T / C>A artifacts. Both occur on a single strand before sequencing, so they show a read-orientation bias. LearnReadOrientationModel learns that orientation-specific pattern, and FilterMutectCalls applies it via --ob-priors, which greatly reduces false positives. These artifacts are much less of a concern in fresh blood or frozen tissue, but for FFPE this step is essential.
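A quick way to see whether such artifacts are present is to tally substitution classes before and after filtering. The sketch below is an illustrative helper, not part of GATK: the function name `snv_spectrum` is hypothetical, it expects an uncompressed VCF on stdin, and it collapses strand-complementary pairs onto the pyrimidine reference base (so G>A is counted as C>T and G>T as C>A).

```shell
#!/usr/bin/env bash
# Tally the six SNV substitution classes from an uncompressed VCF on stdin.
snv_spectrum() {
    awk -F'\t' '
    BEGIN {
        comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A"
    }
    /^#/ { next }                                   # skip header lines
    # Keep single-base REF/ALT records only (no indels, MNPs, or multiallelics)
    length($4) == 1 && length($5) == 1 && ($4 in comp) && ($5 in comp) {
        ref = $4; alt = $5
        # Report on the pyrimidine strand: complement purine-reference records
        if (ref == "A" || ref == "G") { ref = comp[ref]; alt = comp[alt] }
        count[ref ">" alt]++
    }
    END {
        n = split("C>A C>G C>T T>A T>C T>G", classes, " ")
        for (i = 1; i <= n; i++)
            printf "%s\t%d\n", classes[i], count[classes[i]]
    }'
}

# Example (assumes bcftools is installed):
# bcftools view tumor.unfiltered.vcf.gz | snv_spectrum
# bcftools view -f PASS tumor.filtered.vcf.gz | snv_spectrum
```

A marked excess of C>T or C>A among unfiltered calls that shrinks after filtering is a hint that orientation-bias artifacts were present, though true mutational processes can produce the same classes, so this is a sanity check rather than a diagnosis.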
5. Example pipeline commands
# Mutect2: tumor + matched normal
# Important: sample names must exactly match the SM tag in each BAM's @RG header.
# As of GATK 4.1, -tumor is no longer required; only -normal must be given.
gatk Mutect2 \
    -R Homo_sapiens_assembly38.fasta \
    -I tumor.bam \
    -I normal.bam \
    -tumor TUMOR_SAMPLE_ID \
    -normal NORMAL_SAMPLE_ID \
    --germline-resource af-only-gnomad.hg38.vcf.gz \
    --panel-of-normals 1000g_pon.hg38.vcf.gz \
    --f1r2-tar-gz tumor.f1r2.tar.gz \
    -O tumor.unfiltered.vcf.gz

# Mutect2: tumor-only mode
gatk Mutect2 \
    -R Homo_sapiens_assembly38.fasta \
    -I tumor.bam \
    -tumor TUMOR_SAMPLE_ID \
    --germline-resource af-only-gnomad.hg38.vcf.gz \
    --panel-of-normals 1000g_pon.hg38.vcf.gz \
    --f1r2-tar-gz tumor.f1r2.tar.gz \
    -O tumor.unfiltered.vcf.gz
# Note: tumor-only mode depends heavily on the PoN. If you have no in-house PoN,
# Broad provides a downloadable 1000G PoN (gs://gatk-best-practices/somatic-hg38/)

# Build your own PoN (30+ normals from the same platform/capture kit as the tumor are recommended)
# 1. Run Mutect2 in tumor-only mode on each normal
#    ("..." is a placeholder: list the remaining sample names explicitly)
mkdir -p pon
for normal in normal_001 normal_002 ... normal_030; do
    gatk Mutect2 -R ref.fa -I "${normal}.bam" -tumor "${normal}" \
        --max-mnp-distance 0 \
        -O "pon/${normal}.vcf.gz"
done
# 2. GenomicsDBImport (one -V argument per normal VCF)
gatk GenomicsDBImport \
    -R ref.fa -L wgs_calling_regions.hg38.interval_list \
    --genomicsdb-workspace-path pon_db \
    $(printf -- '-V %s ' pon/*.vcf.gz)
# 3. Merge into a PoN VCF
gatk CreateSomaticPanelOfNormals \
    -R ref.fa \
    --germline-resource af-only-gnomad.hg38.vcf.gz \
    -V gendb://pon_db \
    -O my_panel_of_normals.vcf.gz

# Step 1: contamination
gatk GetPileupSummaries -I tumor.bam \
    -V small_exac_common_3.hg38.vcf.gz \
    -L small_exac_common_3.hg38.vcf.gz \
    -O tumor.pileups.table
gatk GetPileupSummaries -I normal.bam \
    -V small_exac_common_3.hg38.vcf.gz \
    -L small_exac_common_3.hg38.vcf.gz \
    -O normal.pileups.table
gatk CalculateContamination \
    -I tumor.pileups.table \
    -matched normal.pileups.table \
    -O tumor.contamination.table \
    --tumor-segmentation tumor.segments.table
# Step 2: orientation bias model (essential for FFPE)
gatk LearnReadOrientationModel \
    -I tumor.f1r2.tar.gz \
    -O tumor.read-orientation-model.tar.gz
# Step 3: combined filtering
gatk FilterMutectCalls \
    -R ref.fa \
    -V tumor.unfiltered.vcf.gz \
    --contamination-table tumor.contamination.table \
    --tumor-segmentation tumor.segments.table \
    --ob-priors tumor.read-orientation-model.tar.gz \
    -O tumor.filtered.vcf.gz
# Keep PASS calls only
bcftools view -f PASS tumor.filtered.vcf.gz -Oz -o tumor.somatic.vcf.gz

6. TMB and downstream applications
The filtered somatic VCF is the starting point for cancer genome analysis. Common downstream uses:
- TMB (Tumor Mutation Burden): the number of nonsynonymous somatic mutations per Mb of sequenced territory (exact counting rules vary by assay). TMB-high (typically ≥10 mut/Mb) is positively associated with response to immunotherapy (anti-PD-1) and is an FDA-approved biomarker.
- Mutational signatures (COSMIC SBS): deconvolute the counts over the 96 trinucleotide contexts to identify the underlying mutagenic processes (UV, tobacco, APOBEC, MMR deficiency). SBS4 = tobacco, SBS7 = UV.
- Driver gene identification: cross-reference the COSMIC Cancer Gene Census and OncoKB; use dN/dS, MutSig, or IntOGen to identify driver mutations.
- Actionability annotation: cross-reference OncoKB, CIViC, and JAX-CKB to find FDA-approved drug targets, classified into AMP Tiers I–IV.
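The TMB definition above reduces to a count divided by a territory size. The following is a minimal sketch under two loudly stated assumptions: the VCF on stdin has already been restricted to the variant classes your assay counts (for example nonsynonymous coding variants selected by an annotator), and the callable territory in base pairs is known. The function name `tmb` and its argument are hypothetical.

```shell
#!/usr/bin/env bash
# tmb CALLABLE_BP < variants.vcf
# Counts PASS records in an uncompressed VCF on stdin and reports
# mutations per megabase of callable territory.
tmb() {
    local callable_bp="$1"
    awk -F'\t' -v bp="$callable_bp" '
    /^#/ { next }                 # skip header lines
    $7 == "PASS" { n++ }          # column 7 is FILTER
    END {
        if (bp <= 0) {
            print "callable_bp must be > 0" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", n / (bp / 1e6)
    }'
}

# Example (the 35 Mb territory is an illustrative value, not a recommendation):
# bcftools view tumor.somatic.nonsyn.vcf.gz | tmb 35000000
```

The denominator matters as much as the numerator: using the full capture size instead of the territory with adequate coverage will underestimate TMB.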
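7. FAQ
How much tumor coverage is needed?
How much contamination counts as serious?
How does Mutect2 compare with Strelka2 / VarDict / DeepVariant?
What if there is no matched normal?
8. Quiz
Q1. Why is a somatic caller not a genotype caller?
Q2. What does the PoN mainly filter out?
Q3. What is the characteristic artifact of FFPE samples?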