ADVANCED

Somatic Variant Calling

The heart of cancer analysis: find tumor-only variants against a normal background. VAF is no longer fixed at 0.5, and the challenges come from subclonality, purity, and low-VAF detection.

1. Somatic vs Germline: fundamentally different

Germline calling assumes a diploid sample: each site has one of three genotypes (0/0, 0/1, 1/1), so VAF sits near 0, 0.5, or 1. Tumor samples break these assumptions:

  • Impure samples: tumor purity is typically 30–90%; the rest is admixed normal cells, stromal cells, and infiltrating immune cells
  • Clonal heterogeneity: tumors are internally mixed. Early clonal mutations appear in all tumor cells, subclonal mutations only in some
  • VAF is a continuum: it could be 0.4 (clonal heterozygous at about 80% purity), 0.15 (clonal at low purity), 0.05 (subclonal), or even 0.01 (ctDNA under deep sequencing)
  • Copy number may be abnormal: in CNV regions VAF also depends on the local copy number, so the diploid expectations no longer apply

So a somatic caller is not a genotype caller. It makes a binary "signal in noise" decision: is the ALT at this site a real mutation, or a sequencing error, mapping artifact, or contamination?

2. Mutect2: tumor-only vs tumor-vs-normal

👫

Tumor-vs-Normal (recommended)

Use normal tissue (usually blood) from the same patient as a control. A variant that is also present in the normal is germline and is filtered out; what remains is somatic. This is the cleanest and most reliable design.

🚶

Tumor-only

Only the tumor sample is available (the normal is unobtainable, e.g., archived FFPE). Germline variants and artifacts are filtered using only the PoN and gnomAD AF. The FP rate is higher, so stricter downstream filtering is needed.

What is a Panel of Normals (PoN)?

Take 30+ normal samples from healthy individuals (same sequencing center, platform, and capture kit), run Mutect2 in tumor-only mode on each, then combine the calls into a single VCF with CreateSomaticPanelOfNormals. Any site that recurs across multiple normals is treated as a sequencing/alignment artifact and filtered from tumor results. The PoN is critical for Mutect2 tumor-only mode; without it, calls are swamped by artifacts.
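The recurrence idea can be sketched with plain awk, assuming each normal's calls are available as uncompressed VCF text. The helper name `recurrent_sites` and the threshold passed to it are illustrative assumptions, not GATK behavior; the real CreateSomaticPanelOfNormals reads a GenomicsDB workspace and applies additional criteria.

```shell
# Sketch: list sites (CHROM, POS) that appear in at least MIN_SAMPLES normal VCFs.
# recurrent_sites is a hypothetical helper, not part of GATK.
# Usage: recurrent_sites MIN_SAMPLES normal1.vcf normal2.vcf ...
recurrent_sites() {
  min_samples="$1"; shift
  awk -v min="$min_samples" '
    /^#/ { next }                        # skip VCF header lines
    {
      key = $1 "\t" $2
      if (!((FILENAME, key) in seen)) {  # count each site once per file
        seen[FILENAME, key] = 1
        count[key]++
      }
    }
    END {
      for (k in count) if (count[k] >= min) print k
    }' "$@" | sort -k1,1 -k2,2n
}
```

A site reported by `recurrent_sites 2 pon/*.vcf` would be a candidate artifact because two or more unrelated normals carry it.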

3. Interactive simulation: VAF distribution within a tumor

Adjust tumor purity and subclonal fraction to see how 1,000 simulated somatic mutations distribute across the VAF histogram. Clonal mutations have a median VAF ≈ purity/2 (heterozygous in all tumor cells, assuming a copy-neutral diploid locus); subclonal mutations form a lower-VAF tail.

[Interactive VAF histogram; slider values shown: 70%, 30%]
Blue = clonal mutations (present in all tumor cells); orange = subclonal mutations. Low-VAF mutations are harder to detect and require high coverage (≥ 100×).
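The purity/2 rule is a special case of a more general expectation. As a rough sketch, assuming normal cells are diploid and ignoring sampling noise, the expected VAF is (purity × CCF × multiplicity) / (purity × tumor copy number + (1 − purity) × 2), where CCF is the fraction of tumor cells carrying the mutation. The function name `expected_vaf` and its argument order are illustrative, not part of any tool.

```shell
# Sketch: expected VAF from purity, cancer cell fraction (CCF), mutation
# multiplicity, and tumor copy number. Assumes diploid normal cells.
# Usage: expected_vaf PURITY CCF [MULTIPLICITY] [TUMOR_CN]
expected_vaf() {
  awk -v p="$1" -v ccf="$2" -v m="${3:-1}" -v cn="${4:-2}" 'BEGIN {
    mutant = p * ccf * m              # mutant allele copies per average cell
    total  = p * cn + (1 - p) * 2     # all allele copies per average cell
    printf "%.3f\n", mutant / total
  }'
}

expected_vaf 0.7 1.0    # clonal het at 70% purity: prints 0.350 (= purity/2)
expected_vaf 0.7 0.3    # mutation in 30% of tumor cells: prints 0.105
```

With the default multiplicity of 1 and tumor copy number of 2, a clonal mutation (CCF = 1) reduces to purity/2, which matches the statement above.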

4. The complete Mutect2 pipeline

Step          Tool                        Purpose
1             Mutect2                     Call candidate somatic variants (tumor + normal + PoN + germline AF)
2             GetPileupSummaries          Tabulate allele counts at common variant sites in tumor and normal
3             CalculateContamination      Estimate the cross-sample contamination fraction
4             LearnReadOrientationModel   Learn the read orientation bias of FFPE / 8-oxoG artifacts
5             FilterMutectCalls           Combine all of the above and mark each call PASS or with a specific filter reason
6 (optional)  FilterAlignmentArtifacts    Realign supporting reads with BWA-MEM against an ALT-aware reference to remove mapping artifacts

FFPE and 8-oxoG artifacts (must read)

FFPE (formalin-fixed paraffin-embedded) samples are a common source from pathology archives, but they carry systematic, orientation-specific damage. Formalin fixation deaminates cytosine, producing C>T / G>A artifacts, while oxidation of guanine to 8-oxoG (read as T during sequencing) produces G>T / C>A artifacts. LearnReadOrientationModel learns the read orientation pattern of this bias, and FilterMutectCalls applies it via --ob-priors, which greatly reduces FPs. Fresh blood or frozen tissue is much less affected, but for FFPE this step is mandatory.
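One quick sanity check for this kind of damage is the substitution spectrum of the calls: an excess of C>A (which includes G>T) or C>T (which includes G>A), especially at low VAF, is a warning sign. The sketch below tallies the six pyrimidine-based substitution classes from uncompressed VCF text on stdin. The helper name `snv_spectrum` is hypothetical, and this check only complements the orientation bias model; it does not replace it.

```shell
# Sketch: count SNVs per substitution class, folding purine-reference changes
# onto the opposite strand (G>T is reported as C>A, G>A as C>T).
# Output columns: class, count, fraction of all SNVs.
snv_spectrum() {
  awk '
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    /^#/ { next }                                   # skip header lines
    length($4) == 1 && length($5) == 1 && ($4 in comp) && ($5 in comp) {
      ref = $4; alt = $5
      if (ref == "A" || ref == "G") { ref = comp[ref]; alt = comp[alt] }
      n[ref ">" alt]++; total++
    }
    END {
      split("C>A C>G C>T T>A T>C T>G", classes, " ")
      for (i = 1; i <= 6; i++)
        printf "%s\t%d\t%.3f\n", classes[i], n[classes[i]], (total ? n[classes[i]] / total : 0)
    }'
}
```

For a bgzipped file, something like `bcftools view tumor.unfiltered.vcf.gz | snv_spectrum` before and after filtering shows whether the filter removed the suspicious classes. Indels and multi-allelic records are skipped by the length checks.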

5. Pipeline command examples

# Mutect2 — tumor + matched normal
gatk Mutect2 \
    -R Homo_sapiens_assembly38.fasta \
    -I tumor.bam \
    -I normal.bam \
    -tumor TUMOR_SAMPLE_ID \
    -normal NORMAL_SAMPLE_ID \
    --germline-resource af-only-gnomad.hg38.vcf.gz \
    --panel-of-normals 1000g_pon.hg38.vcf.gz \
    --f1r2-tar-gz tumor.f1r2.tar.gz \
    -O tumor.unfiltered.vcf.gz

# Important: the tumor sample name must exactly match the SM tag in the BAM @RG header

# Mutect2: tumor-only mode
gatk Mutect2 \
    -R Homo_sapiens_assembly38.fasta \
    -I tumor.bam \
    -tumor TUMOR_SAMPLE_ID \
    --germline-resource af-only-gnomad.hg38.vcf.gz \
    --panel-of-normals 1000g_pon.hg38.vcf.gz \
    --f1r2-tar-gz tumor.f1r2.tar.gz \
    -O tumor.unfiltered.vcf.gz

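# Note: tumor-only mode depends heavily on the PoN; if you have no in-house PoN,
# Broad provides a downloadable 1000G PoN (gs://gatk-best-practices/somatic-hg38/)

# Build an in-house PoN (30+ normals on the same platform/capture kit as the tumor are recommended)

# 1. Run Mutect2 in tumor-only mode on each normal
# ("..." is a placeholder: list every normal sample name here)
mkdir -p pon
for normal in normal_001 normal_002 ... normal_030; do
  gatk Mutect2 -R ref.fa -I "${normal}.bam" -tumor "${normal}" \
      --max-mnp-distance 0 \
      -O "pon/${normal}.vcf.gz"
done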

# 2. GenomicsDBImport
gatk GenomicsDBImport \
    -R ref.fa -L wgs_calling_regions.hg38.interval_list \
    --genomicsdb-workspace-path pon_db \
    $(for v in pon/*.vcf.gz; do echo -n "-V $v "; done)

# 3. Combine into a PoN VCF
gatk CreateSomaticPanelOfNormals \
    -R ref.fa \
    --germline-resource af-only-gnomad.hg38.vcf.gz \
    -V gendb://pon_db \
    -O my_panel_of_normals.vcf.gz

# Step 1: contamination
gatk GetPileupSummaries -I tumor.bam \
    -V small_exac_common_3.hg38.vcf.gz \
    -L small_exac_common_3.hg38.vcf.gz \
    -O tumor.pileups.table

gatk GetPileupSummaries -I normal.bam \
    -V small_exac_common_3.hg38.vcf.gz \
    -L small_exac_common_3.hg38.vcf.gz \
    -O normal.pileups.table

gatk CalculateContamination \
    -I tumor.pileups.table \
    -matched normal.pileups.table \
    -O tumor.contamination.table \
    --tumor-segmentation tumor.segments.table

# Step 2: orientation bias model (if FFPE)
gatk LearnReadOrientationModel \
    -I tumor.f1r2.tar.gz \
    -O tumor.read-orientation-model.tar.gz

# Step 3: apply all filters together
gatk FilterMutectCalls \
    -R ref.fa \
    -V tumor.unfiltered.vcf.gz \
    --contamination-table tumor.contamination.table \
    --tumor-segmentation tumor.segments.table \
    --ob-priors tumor.read-orientation-model.tar.gz \
    -O tumor.filtered.vcf.gz

# Keep PASS calls only
bcftools view -f PASS tumor.filtered.vcf.gz -Oz -o tumor.somatic.vcf.gz

6. TMB and downstream applications

The filtered somatic VCF is the starting point for cancer genome analysis. Common downstream uses:

  • TMB (Tumor Mutation Burden): the number of non-synonymous somatic mutations per Mb. TMB-high (typically ≥ 10 mut/Mb) correlates positively with response to immunotherapy (anti-PD-1) and is an FDA-approved biomarker.
  • Mutational signatures (COSMIC SBS): deconvolute counts over the 96 trinucleotide contexts to identify the underlying mutagenic processes (UV, tobacco, APOBEC, MMR deficiency). SBS4 = tobacco, SBS7 = UV.
  • Driver gene identification: cross-reference COSMIC Cancer Gene Census and OncoKB; use dN/dS, MutSig, or IntOGen to identify driver mutations.
  • Actionability annotation: cross-reference OncoKB, CIViC, and JAX-CKB to find FDA-approved drug targets, classified into AMP Tiers I–IV.
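The TMB definition above is a simple ratio, sketched here under the assumption that the non-synonymous mutation count and the number of callable bases (the panel or exome territory with adequate coverage) have already been computed elsewhere. The function name `tmb` and its arguments are illustrative.

```shell
# Sketch: TMB = non-synonymous somatic mutations per megabase of callable sequence.
# Usage: tmb NONSYN_COUNT CALLABLE_BASES
tmb() {
  awk -v n="$1" -v bases="$2" 'BEGIN {
    if (bases <= 0) { print "callable bases must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", n / (bases / 1e6)
  }'
}

tmb 350 35000000   # 350 mutations over 35 Mb: prints 10.00
```

The denominator matters as much as the numerator: dividing by the nominal target size instead of the callable territory changes the value, so the two should be reported together.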

7. FAQ

How much tumor coverage do I need?
It depends on the target VAF: for clonal mutations (VAF ≥ 0.2), 30–50× suffices; for subclonal mutations (VAF 0.05–0.1), 100–200× is needed; for ctDNA / liquid biopsy (VAF < 0.01), ≥ 1000× ultra-deep sequencing plus UMIs to remove PCR errors is needed. For the normal, 30× is usually enough.
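The depth guidance follows from simple sampling: at depth N and true VAF f, the number of ALT reads is roughly binomial. The sketch below computes the chance of seeing at least k ALT reads, assuming 0 < f < 1, independent reads, and no sequencing error. The function name `detect_prob` and the choice of k are illustrative assumptions; real callers use richer models.

```shell
# Sketch: P(at least MIN_ALT_READS ALT reads) under Binomial(DEPTH, VAF).
# Usage: detect_prob DEPTH VAF MIN_ALT_READS
detect_prob() {
  awk -v n="$1" -v f="$2" -v k="$3" 'BEGIN {
    term = (1 - f) ^ n            # P(X = 0)
    below = 0
    for (i = 0; i < k; i++) {
      below += term               # accumulate P(X = i) for i < k
      term *= (n - i) / (i + 1) * f / (1 - f)   # step to P(X = i + 1)
    }
    printf "%.3f\n", 1 - below
  }'
}

detect_prob 30 0.05 3     # subclonal variant at 30x
detect_prob 100 0.05 3    # same variant at 100x
```

Requiring 3 ALT reads, a VAF 0.05 variant is caught in roughly 19% of cases at 30× but roughly 88% at 100×, which is why subclonal detection calls for the deeper range.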
How much contamination is serious?
A CalculateContamination output fraction < 0.01 is usually negligible; 0.01–0.05 is acceptable but warrants review; > 0.05 is significant and calls for investigating cross-sample contamination. FilterMutectCalls uses this value to adjust its filter thresholds dynamically.
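These thresholds are easy to apply automatically in a QC step. The sketch below assumes the contamination table is tab-separated with one header line followed by one data row whose second column is the contamination fraction; check this against your GATK version's output before relying on it. The helper name `classify_contamination` and the labels are illustrative.

```shell
# Sketch: label a contamination table using the thresholds from the text.
# Assumed layout: header line, then "sample<TAB>contamination<TAB>error".
# Usage: classify_contamination tumor.contamination.table
classify_contamination() {
  awk -F '\t' 'NR == 2 {
    c = $2 + 0
    if (c < 0.01)       label = "negligible"
    else if (c <= 0.05) label = "review"
    else                label = "significant"
    printf "%s\t%s\t%s\n", $1, $2, label
  }' "$1"
}
```

Running it on each sample's table gives a one-line summary that can be collected into a batch QC report.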
How does Mutect2 compare with Strelka2 / VarDict / DeepVariant?
Mutect2 is the official GATK tool, with the most complete documentation and the best PoN and contamination integration. Strelka2 is faster and performs well on indels. VarDict stands out at low-VAF detection (< 0.05). DeepSomatic (Google 2024) is the somatic counterpart of DeepVariant. In practice, two callers are often run and intersected (e.g., Mutect2 ∩ Strelka2) to reduce FPs.
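The intersection step can be sketched with awk, assuming both callers' outputs are uncompressed, already restricted to PASS calls, and normalized the same way (split multi-allelics, left-aligned indels). The helper name `consensus_calls` is hypothetical; a dedicated tool such as bcftools isec is the more usual choice on bgzipped, indexed VCFs.

```shell
# Sketch: print CHROM, POS, REF, ALT for variants present in both VCF files.
# Usage: consensus_calls mutect2.pass.vcf strelka2.pass.vcf
consensus_calls() {
  awk -F '\t' '
    /^#/ { next }                               # skip header lines
    { key = $1 FS $2 FS $4 FS $5 }              # site plus alleles
    NR == FNR { first[key] = 1; next }          # remember calls from file 1
    key in first { print key }                  # emit file-2 calls seen in file 1
  ' "$1" "$2"
}
```

Matching on REF and ALT as well as position avoids counting two different alleles at the same site as agreement.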
What if there is no matched normal?
Use tumor-only mode plus strict post-filtering: (1) treat gnomAD AF > 0.0001 as likely germline; (2) filter artifacts with an in-house PoN; (3) cross-check against COSMIC and ClinVar before reporting. The FP rate is still higher than tumor-vs-normal, so flag this in clinical reports. In the long term, build an archived normal collection or collect buccal swabs as a substitute normal.
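Rule (1) can be sketched as a stream filter over uncompressed VCF text. The INFO key `GNOMAD_AF` is a placeholder assumption for whatever population frequency tag your annotation step writes, and the helper name `drop_likely_germline` is hypothetical. Records without the tag are kept, on the assumption that absence means the variant is not in gnomAD.

```shell
# Sketch: drop records whose population AF exceeds MAX_AF.
# GNOMAD_AF is a placeholder INFO key; substitute your annotator's tag.
# Usage: drop_likely_germline MAX_AF < annotated.vcf > filtered.vcf
drop_likely_germline() {
  awk -F '\t' -v max_af="$1" -v key="GNOMAD_AF" '
    /^#/ { print; next }                  # pass header lines through
    {
      af = 0                              # missing tag: treat as not in gnomAD
      n = split($8, info, ";")
      for (i = 1; i <= n; i++) {
        split(info[i], kv, "=")
        if (kv[1] == key) af = kv[2] + 0
      }
      if (af <= max_af) print
    }'
}
```

For the threshold in the text this would be invoked as `drop_likely_germline 0.0001`. Multi-allelic records carry comma-separated AF values, so they should be split before a filter like this is applied.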

8. Quiz

Q1. Why is a somatic caller not a genotype caller?

Q2. What does the PoN mainly filter out?

Q3. What is the characteristic artifact of FFPE samples?