STEP 5 / 9

變異呼叫 (Variant Calling)

從 BAM 找出每個位點的 SNV 與 indel——HaplotypeCaller、DeepVariant、DRAGEN 三大主流的選擇。

Identify SNVs and indels at every site from a BAM — HaplotypeCaller, DeepVariant, DRAGEN: choosing among the three.

一、Variant Calling 在做什麼?

給定一份 analysis-ready BAM 與參考基因組,variant caller 在每個位點問三個問題:

  1. 這個位點與 reference 不同嗎? 比對所有 overlap 此位點的 reads,看 ALT allele 出現比例。
  2. 差異是真實 variant 還是 noise? 整合 base quality、mapping quality、strand bias、context 等多重證據,計算 likelihood。
  3. 真實的話,genotype 是什麼? 0/0 (homozygous ref)、0/1 (heterozygous)、還是 1/1 (homozygous alt)?

輸出標準格式為 VCF (Variant Call Format),每行一個位點變異,欄位包含 CHROM、POS、ID、REF、ALT、QUAL、FILTER、INFO 與 per-sample FORMAT 欄位(GT、AD、DP、GQ、PL)。

Given an analysis-ready BAM and reference, the caller asks three questions at every site:

  1. Does this site differ from reference? Inspect all reads overlapping the site and the ALT allele frequency.
  2. Is the difference a real variant or noise? Integrate base quality, mapping quality, strand bias, and context to compute likelihoods.
  3. If real, what's the genotype? 0/0 (hom ref), 0/1 (het), or 1/1 (hom alt)?

The standard output is VCF (Variant Call Format) — one variant per line, with CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, and per-sample FORMAT fields (GT, AD, DP, GQ, PL).

二、VCF 欄位剖析

#CHROM chr7 # 染色體
POS 117559590 # 位置 (1-based)
ID rs113993960 # dbSNP ID 或 .
REF CTT # 參考鹼基
ALT C # 變異鹼基 (這是 2bp deletion: ΔF508)
QUAL 2845.6 # Phred-scaled call quality
FILTER PASS # PASS 或 filter 名稱
INFO AC=1;AF=0.5;AN=2;DP=42;MQ=60 # 位點層級註解
FORMAT GT:AD:DP:GQ:PL # 樣本欄位順序
SAMPLE1 0/1:23,19:42:99:560,0,720 # heterozygous, AD=23 ref/19 alt

GT

genotype。0/0=homo ref, 0/1=het, 1/1=homo alt, ./.=missing。

Genotype. 0/0=hom ref, 0/1=het, 1/1=hom alt, ./.=missing.

AD

Allelic Depth。每個 allele 支持的 read 數,順序為 ref,alt。

Allelic Depth. Read counts supporting each allele in order ref,alt.

DP

總 depth (此位點實際使用的 reads 數,可能小於 AD 總和因為 caller 過濾)。

Total depth at this site (after caller's own filtering — may be less than sum(AD)).

GQ

Genotype Quality (Phred scale, 0-99)。GQ > 20 通常算可信。

Genotype Quality (Phred, 0-99). GQ > 20 is typically considered confident.

PL

Phred-scaled Likelihoods,順序 0/0, 0/1, 1/1。最 likely genotype 的 PL = 0,其他越大代表越不可能。

Phred-scaled Likelihoods for 0/0, 0/1, 1/1. Most likely genotype has PL = 0; higher values are less likely.

VAF / AF

Variant Allele Frequency = AD[alt] / DP。germline het 應接近 0.5;somatic 可能任何值。

Variant Allele Frequency = AD[alt] / DP. Germline het should be near 0.5; somatic can be anything.

三、四大 caller 比較

Caller演算法速度SNV F1Indel F1特點
HaplotypeCaller (GATK)局部 de novo assembly + HMM★★☆☆☆~99.6%~99.4%Best Practices 黃金標準;GVCF 模式支援聯合呼叫;CPU 慢但廣泛驗證
DeepVariant (Google)CNN 對 pileup image 分類★★★☆☆~99.7%~99.5%SNV 準確度業界第一;GPU 加速;不需 BQSR;模型可換 ONT/PacBio
DRAGEN (Illumina)FPGA-加速圖式 caller★★★★★~99.6%~99.6%全 pipeline ~30 分鐘;商業 FPGA 卡;indel 表現最佳
Strelka2 (Illumina)混合模型 + ML 訓練★★★★☆~99.5%~99.3%速度與準確度兼顧;germline + somatic 雙模式;無 GVCF

F1 數據來自 PrecisionFDA Truth Challenge V2 與 GIAB HG002 benchmark;實際表現會因樣本、coverage、reference build 而異。

F1 numbers from PrecisionFDA Truth Challenge V2 and GIAB HG002 benchmarks; real-world performance varies with sample, coverage, and reference build.

四、為什麼要先輸出 GVCF?

傳統 single-sample VCF 只記錄「有 variant」的位點。但要做群體聯合呼叫(下一章 Joint Genotyping)必須知道:「樣本 B 在樣本 A 的 variant 位點,是真的 0/0 還是因為 coverage 不夠看不清楚?」

GVCF (Genomic VCF) 就是答案:每個樣本獨立呼叫時,所有位點都記錄——variant 位點記詳細,non-variant 區段以 <NON_REF> 區塊壓縮表示,附帶該區段的最小 GQ。如此後續 GenotypeGVCFs 就能對 N 個樣本做正確的 joint calling。

用 HaplotypeCaller 加上 -ERC GVCF 即可輸出。檔案稍大但壓縮後仍可接受(30× WGS 約 1-2 GB)。

A traditional single-sample VCF records only sites with variants. But for cohort joint calling (next chapter, Joint Genotyping), you need to know: "At sample A's variant site, is sample B truly 0/0, or just under-covered so we can't tell?"

GVCF (Genomic VCF) answers this: every site is recorded — variant sites in detail, non-variant stretches as compressed <NON_REF> blocks with the block's minimum GQ. Downstream GenotypeGVCFs can then correctly joint-call N samples.

Just add -ERC GVCF to HaplotypeCaller. Files are larger but still manageable after compression (~1–2 GB for 30× WGS).

# Variant site:
chr1 10583 . G A,<NON_REF> ... 0/1:5,4,0:9:99:124,0,142,140,154,294
# Non-variant block (positions 10584-10592):
chr1 10584 . T <NON_REF> . . END=10592 GT:DP:GQ:MIN_DP:PL 0/0:8:24:7:0,21,315

五、互動模擬:Genotype 判定

調整 ALT-supporting reads 數量與總 coverage,看 caller 如何根據 PL 給出 genotype。VAF (variant allele frequency) 接近 0 通常為 0/0,接近 0.5 為 0/1,接近 1 為 1/1——但低 depth 會讓所有判斷都不確定。

Adjust ALT-supporting reads and total coverage to see how the caller assigns a genotype via PL. VAF near 0 usually means 0/0, near 0.5 means 0/1, near 1 means 1/1 — but low depth makes everything uncertain.

30
15
PL 為 Phred-scaled likelihood,最 likely 的 genotype = 0,其他越大代表越不可能。GQ = 第二低 PL,反映判斷的「自信度」。
PL is the Phred-scaled likelihood; the most likely genotype has PL=0, higher = less likely. GQ = second-lowest PL, reflecting confidence in the call.

六、Pipeline 指令範例

# GATK HaplotypeCaller — GVCF 模式 (推薦)
gatk HaplotypeCaller \
    -I analysis_ready.bam \
    -R Homo_sapiens_assembly38.fasta \
    -O sample.g.vcf.gz \
    -ERC GVCF \
    --tmp-dir /tmp \
    -L wgs_calling_regions.hg38.interval_list

# 加速:分染色體並行 (scatter-gather)
for c in chr{1..22} chrX chrY; do
  gatk HaplotypeCaller -I dedup.bam -R ref.fa \
    -O sample.$c.g.vcf.gz -ERC GVCF -L $c &
done; wait

# 合併
gatk MergeVcfs -I sample.chr1.g.vcf.gz -I sample.chr2.g.vcf.gz ... -O sample.g.vcf.gz
# DeepVariant via Docker (CPU)
BIN_VERSION=1.6.0
INPUT_DIR=/path/to/data
OUTPUT_DIR=/path/to/output

docker run \
  -v "${INPUT_DIR}":"/input" \
  -v "${OUTPUT_DIR}":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref=/input/ref.fa \
    --reads=/input/sample.bam \
    --output_vcf=/output/sample.vcf.gz \
    --output_gvcf=/output/sample.g.vcf.gz \
    --num_shards=16

# WES 換 --model_type=WES,PacBio 用 --model_type=PACBIO
# Strelka2 germline workflow
configureStrelkaGermlineWorkflow.py \
    --bam analysis_ready.bam \
    --referenceFasta ref.fa \
    --runDir strelka_germline

# 執行
strelka_germline/runWorkflow.py -m local -j 16

# 結果在 strelka_germline/results/variants/variants.vcf.gz
# 看 header
bcftools view -h sample.vcf.gz | less

# 統計
bcftools stats sample.vcf.gz | grep "^SN"
# number of SNPs / indels / multiallelic / Ti/Tv ratio

# 篩 PASS only
bcftools view -f PASS sample.vcf.gz -Oz -o sample.pass.vcf.gz

# 看某基因區段
bcftools view -r chr7:117559590-117559700 sample.vcf.gz

# Ti/Tv ratio (健康指標, WGS 應該 ~2.0-2.1, WES ~3.0-3.3)
bcftools stats sample.vcf.gz | grep "tstv"

七、FAQ

只有一個樣本,還需要 GVCF 嗎?
不一定,但建議仍跑 GVCF 模式。理由:(1) 未來可能加入 cohort 重做 joint calling;(2) GVCF 保留 reference confidence,下游可區分「真 0/0」與「coverage 不足」。直接出 single-sample VCF 可加 -ERC NONE
Not strictly, but recommended. Reasons: (1) future cohorts can re-do joint calling; (2) GVCF preserves reference confidence, distinguishing true 0/0 from low coverage. Skip with -ERC NONE for direct single-sample VCF.
Ti/Tv ratio 是什麼?多少算正常?
Transition (A↔G, C↔T) 與 Transversion (其餘) 的比例。基於生化機制,真實 SNV 的 Ti/Tv 在人類 WGS 約 2.0-2.1,WES 偏高約 3.0-3.3 (因 exon 富 CpG)。明顯低於這個範圍 (如 1.5) 通常代表 false positive 太多。
Ratio of transitions (A↔G, C↔T) to transversions (the rest). Real SNVs have Ti/Tv ~2.0–2.1 in human WGS, ~3.0–3.3 in WES (exons are CpG-rich). Markedly lower (e.g., 1.5) usually means too many false positives.
HaplotypeCaller vs DeepVariant 該選哪個?
大型 cohort joint calling、需要 VQSR 與 GVCF 流程選 HaplotypeCaller;single-sample 重視準確度、有 GPU、處理 ONT/PacBio 選 DeepVariant。臨床診斷常用 DeepVariant,研究 cohort 常用 HaplotypeCaller。
Large cohort joint calling that needs VQSR and GVCF workflow → HaplotypeCaller. Single-sample, accuracy-focused, GPU-equipped, or ONT/PacBio → DeepVariant. Clinical diagnostics often use DeepVariant; research cohorts often use HaplotypeCaller.
需要 GATK best-practices interval_list 嗎?
強烈建議。WGS 用 wgs_calling_regions.hg38.interval_list 排除 N-region 和 centromere;WES 用 capture kit 廠商提供的 BED + 100bp padding。沒指定會浪費時間呼叫無 reads 的區段。
Strongly recommended. WGS: wgs_calling_regions.hg38.interval_list excludes N-regions and centromeres. WES: use the kit vendor's BED + 100bp padding. Without intervals, time is wasted calling regions with no reads.

八、小測驗

Q1. 一個位點 GT=0/1, AD=22,18, DP=40, GQ=99,最可能的解釋?

Q2. 為什麼 cohort 分析要用 GVCF 而不是直接合併 single-sample VCF?

Q3. WGS 的 Ti/Tv ratio 1.4 代表什麼?