Step 5: Variant Calling — WGS/WES Tutorial

為什麼

一、Variant Calling 在做什麼？

給定一份 analysis-ready BAM 與參考基因組，variant caller 在每個位點問三個問題：

這個位點與 reference 不同嗎？ 比對所有 overlap 此位點的 reads，看 ALT allele 出現比例。
差異是真實 variant 還是 noise？ 整合 base quality、mapping quality、strand bias、context 等多重證據，計算 likelihood。
真實的話，genotype 是什麼？ 0/0 (homozygous ref)、0/1 (heterozygous)、還是 1/1 (homozygous alt)？

輸出標準格式為 VCF (Variant Call Format)，每行一個位點變異，欄位包含 CHROM、POS、ID、REF、ALT、QUAL、FILTER、INFO 與 per-sample FORMAT 欄位（GT、AD、DP、GQ、PL）。

Given an analysis-ready BAM and reference, the caller asks three questions at every site:

Does this site differ from reference? Inspect all reads overlapping the site and the ALT allele frequency.
Is the difference a real variant or noise? Integrate base quality, mapping quality, strand bias, and context to compute likelihoods.
If real, what's the genotype? 0/0 (hom ref), 0/1 (het), or 1/1 (hom alt)?

The standard output is VCF (Variant Call Format) — one variant per line, with CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, and per-sample FORMAT fields (GT, AD, DP, GQ, PL).

格式

二、VCF 欄位剖析

#CHROM chr7 # 染色體

POS 117559590 # 位置 (1-based)

ID rs113993960 # dbSNP ID 或 .

REF CTT # 參考鹼基

ALT C # 變異鹼基 (這是 2bp deletion: ΔF508)

QUAL 2845.6 # Phred-scaled call quality

FILTER PASS # PASS 或 filter 名稱

INFO AC=1;AF=0.5;AN=2;DP=42;MQ=60 # 位點層級註解

FORMAT GT:AD:DP:GQ:PL # 樣本欄位順序

SAMPLE1 0/1:23,19:42:99:560,0,720 # heterozygous, AD=23 ref/19 alt

GT

genotype。0/0=homo ref, 0/1=het, 1/1=homo alt, ./.=missing。

Genotype. 0/0=hom ref, 0/1=het, 1/1=hom alt, ./.=missing.

Allelic Depth。每個 allele 支持的 read 數，順序為 ref,alt。

Allelic Depth. Read counts supporting each allele in order ref,alt.

DP

總 depth (此位點實際使用的 reads 數，可能小於 AD 總和因為 caller 過濾)。

Total depth at this site (after caller's own filtering — may be less than sum(AD)).

GQ

Genotype Quality (Phred scale, 0-99)。GQ > 20 通常算可信。

Genotype Quality (Phred, 0-99). GQ > 20 is typically considered confident.

PL

Phred-scaled Likelihoods，順序 0/0, 0/1, 1/1。最 likely genotype 的 PL = 0，其他越大代表越不可能。

Phred-scaled Likelihoods for 0/0, 0/1, 1/1. Most likely genotype has PL = 0; higher values are less likely.

VAF / AF

Variant Allele Frequency = AD[alt] / DP。germline het 應接近 0.5；somatic 可能任何值。

Variant Allele Frequency = AD[alt] / DP. Germline het should be near 0.5; somatic can be anything.

技術

三、四大 caller 比較

Caller	演算法	速度	SNV F1	Indel F1	特點
HaplotypeCaller (GATK)	局部 de novo assembly + HMM	★★☆☆☆	~99.6%	~99.4%	Best Practices 黃金標準；GVCF 模式支援聯合呼叫；CPU 慢但廣泛驗證
DeepVariant (Google)	CNN 對 pileup image 分類	★★★☆☆	~99.7%	~99.5%	SNV 準確度業界第一；GPU 加速；不需 BQSR；模型可換 ONT/PacBio
DRAGEN (Illumina)	FPGA-加速圖式 caller	★★★★★	~99.6%	~99.6%	全 pipeline ~30 分鐘；商業 FPGA 卡；indel 表現最佳
Strelka2 (Illumina)	混合模型 + ML 訓練	★★★★☆	~99.5%	~99.3%	速度與準確度兼顧；germline + somatic 雙模式；無 GVCF

F1 數據來自 PrecisionFDA Truth Challenge V2 與 GIAB HG002 benchmark；實際表現會因樣本、coverage、reference build 而異。

F1 numbers from PrecisionFDA Truth Challenge V2 and GIAB HG002 benchmarks; real-world performance varies with sample, coverage, and reference build.

GVCF 模式

四、為什麼要先輸出 GVCF？

傳統 single-sample VCF 只記錄「有 variant」的位點。但要做群體聯合呼叫（下一章 Joint Genotyping）必須知道：「樣本 B 在樣本 A 的 variant 位點，是真的 0/0 還是因為 coverage 不夠看不清楚？」

GVCF (Genomic VCF) 就是答案：每個樣本獨立呼叫時，所有位點都記錄——variant 位點記詳細，non-variant 區段以 <NON_REF> 區塊壓縮表示，附帶該區段的最小 GQ。如此後續 GenotypeGVCFs 就能對 N 個樣本做正確的 joint calling。

用 HaplotypeCaller 加上 -ERC GVCF 即可輸出。檔案稍大但壓縮後仍可接受（30× WGS 約 1-2 GB）。

A traditional single-sample VCF records only sites with variants. But for cohort joint calling (next chapter, Joint Genotyping), you need to know: "At sample A's variant site, is sample B truly 0/0, or just under-covered so we can't tell?"

GVCF (Genomic VCF) answers this: every site is recorded — variant sites in detail, non-variant stretches as compressed <NON_REF> blocks with the block's minimum GQ. Downstream GenotypeGVCFs can then correctly joint-call N samples.

Just add -ERC GVCF to HaplotypeCaller. Files are larger but still manageable after compression (~1–2 GB for 30× WGS).

# Variant site:

chr1 10583 . G A,<NON_REF> ... 0/1:5,4,0:9:99:124,0,142,140,154,294

# Non-variant block (positions 10584-10592):

chr1 10584 . T <NON_REF> . . END=10592 GT:DP:GQ:MIN_DP:PL 0/0:8:24:7:0,21,315

互動模擬

五、互動模擬：Genotype 判定

調整 ALT-supporting reads 數量與總 coverage，看 caller 如何根據 PL 給出 genotype。VAF (variant allele frequency) 接近 0 通常為 0/0，接近 0.5 為 0/1，接近 1 為 1/1——但低 depth 會讓所有判斷都不確定。

Adjust ALT-supporting reads and total coverage to see how the caller assigns a genotype via PL. VAF near 0 usually means 0/0, near 0.5 means 0/1, near 1 means 1/1 — but low depth makes everything uncertain.

Total depth (DP): 30

ALT-supporting reads: 15

PL 為 Phred-scaled likelihood，最 likely 的 genotype = 0，其他越大代表越不可能。GQ = 第二低 PL，反映判斷的「自信度」。

PL is the Phred-scaled likelihood; the most likely genotype has PL=0, higher = less likely. GQ = second-lowest PL, reflecting confidence in the call.

實務指令

六、Pipeline 指令範例

# GATK HaplotypeCaller — GVCF 模式 (推薦)
gatk HaplotypeCaller \
    -I analysis_ready.bam \
    -R Homo_sapiens_assembly38.fasta \
    -O sample.g.vcf.gz \
    -ERC GVCF \
    --tmp-dir /tmp \
    -L wgs_calling_regions.hg38.interval_list

# 加速：分染色體並行 (scatter-gather)
for c in chr{1..22} chrX chrY; do
  gatk HaplotypeCaller -I dedup.bam -R ref.fa \
    -O sample.$c.g.vcf.gz -ERC GVCF -L $c &
done; wait

# 合併
gatk MergeVcfs -I sample.chr1.g.vcf.gz -I sample.chr2.g.vcf.gz ... -O sample.g.vcf.gz

# DeepVariant via Docker (CPU)
BIN_VERSION=1.6.0
INPUT_DIR=/path/to/data
OUTPUT_DIR=/path/to/output

docker run \
  -v "${INPUT_DIR}":"/input" \
  -v "${OUTPUT_DIR}":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref=/input/ref.fa \
    --reads=/input/sample.bam \
    --output_vcf=/output/sample.vcf.gz \
    --output_gvcf=/output/sample.g.vcf.gz \
    --num_shards=16

# WES 換 --model_type=WES，PacBio 用 --model_type=PACBIO

# Strelka2 germline workflow
configureStrelkaGermlineWorkflow.py \
    --bam analysis_ready.bam \
    --referenceFasta ref.fa \
    --runDir strelka_germline

# 執行
strelka_germline/runWorkflow.py -m local -j 16

# 結果在 strelka_germline/results/variants/variants.vcf.gz

# 看 header
bcftools view -h sample.vcf.gz | less

# 統計
bcftools stats sample.vcf.gz | grep "^SN"
# number of SNPs / indels / multiallelic / Ti/Tv ratio

# 篩 PASS only
bcftools view -f PASS sample.vcf.gz -Oz -o sample.pass.vcf.gz

# 看某基因區段
bcftools view -r chr7:117559590-117559700 sample.vcf.gz

# Ti/Tv ratio (健康指標, WGS 應該 ~2.0-2.1, WES ~3.0-3.3)
bcftools stats sample.vcf.gz | grep "tstv"

常見問題

七、FAQ

只有一個樣本，還需要 GVCF 嗎？

不一定，但建議仍跑 GVCF 模式。理由：(1) 未來可能加入 cohort 重做 joint calling；(2) GVCF 保留 reference confidence，下游可區分「真 0/0」與「coverage 不足」。直接出 single-sample VCF 可加 -ERC NONE。

Not strictly, but recommended. Reasons: (1) future cohorts can re-do joint calling; (2) GVCF preserves reference confidence, distinguishing true 0/0 from low coverage. Skip with -ERC NONE for direct single-sample VCF.

Ti/Tv ratio 是什麼？多少算正常？

Transition (A↔G, C↔T) 與 Transversion (其餘) 的比例。基於生化機制，真實 SNV 的 Ti/Tv 在人類 WGS 約 2.0-2.1，WES 偏高約 3.0-3.3 (因 exon 富 CpG)。明顯低於這個範圍 (如 1.5) 通常代表 false positive 太多。

Ratio of transitions (A↔G, C↔T) to transversions (the rest). Real SNVs have Ti/Tv ~2.0–2.1 in human WGS, ~3.0–3.3 in WES (exons are CpG-rich). Markedly lower (e.g., 1.5) usually means too many false positives.

HaplotypeCaller vs DeepVariant 該選哪個？

大型 cohort joint calling、需要 VQSR 與 GVCF 流程選 HaplotypeCaller；single-sample 重視準確度、有 GPU、處理 ONT/PacBio 選 DeepVariant。臨床診斷常用 DeepVariant，研究 cohort 常用 HaplotypeCaller。

Large cohort joint calling that needs VQSR and GVCF workflow → HaplotypeCaller. Single-sample, accuracy-focused, GPU-equipped, or ONT/PacBio → DeepVariant. Clinical diagnostics often use DeepVariant; research cohorts often use HaplotypeCaller.

需要 GATK best-practices interval_list 嗎？

強烈建議。WGS 用 wgs_calling_regions.hg38.interval_list 排除 N-region 和 centromere；WES 用 capture kit 廠商提供的 BED + 100bp padding。沒指定會浪費時間呼叫無 reads 的區段。

Strongly recommended. WGS: wgs_calling_regions.hg38.interval_list excludes N-regions and centromeres. WES: use the kit vendor's BED + 100bp padding. Without intervals, time is wasted calling regions with no reads.

自我檢測

八、小測驗

Q1. 一個位點 GT=0/1, AD=22,18, DP=40, GQ=99，最可能的解釋？

Q2. 為什麼 cohort 分析要用 GVCF 而不是直接合併 single-sample VCF？

Q3. WGS 的 Ti/Tv ratio 1.4 代表什麼？