一、Variant Calling 在做什麼?
給定一份 analysis-ready BAM 與參考基因組,variant caller 在每個位點問三個問題:
- 這個位點與 reference 不同嗎? 比對所有 overlap 此位點的 reads,看 ALT allele 出現比例。
- 差異是真實 variant 還是 noise? 整合 base quality、mapping quality、strand bias、context 等多重證據,計算 likelihood。
- 真實的話,genotype 是什麼? 0/0 (homozygous ref)、0/1 (heterozygous)、還是 1/1 (homozygous alt)?
輸出標準格式為 VCF (Variant Call Format),每行一個位點變異,欄位包含 CHROM、POS、ID、REF、ALT、QUAL、FILTER、INFO 與 per-sample FORMAT 欄位(GT、AD、DP、GQ、PL)。
Given an analysis-ready BAM and reference, the caller asks three questions at every site:
- Does this site differ from reference? Inspect all reads overlapping the site and the ALT allele frequency.
- Is the difference a real variant or noise? Integrate base quality, mapping quality, strand bias, and context to compute likelihoods.
- If real, what's the genotype? 0/0 (hom ref), 0/1 (het), or 1/1 (hom alt)?
The standard output is VCF (Variant Call Format) — one variant per line, with CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, and per-sample FORMAT fields (GT, AD, DP, GQ, PL).
二、VCF 欄位剖析
GT
genotype。0/0=homo ref, 0/1=het, 1/1=homo alt, ./.=missing。
Genotype. 0/0=hom ref, 0/1=het, 1/1=hom alt, ./.=missing.
AD
Allelic Depth。每個 allele 支持的 read 數,順序為 ref,alt。
Allelic Depth. Read counts supporting each allele in order ref,alt.
DP
總 depth (此位點實際使用的 reads 數,可能小於 AD 總和因為 caller 過濾)。
Total depth at this site (after caller's own filtering — may be less than sum(AD)).
GQ
Genotype Quality (Phred scale, 0-99)。GQ > 20 通常算可信。
Genotype Quality (Phred, 0-99). GQ > 20 is typically considered confident.
PL
Phred-scaled Likelihoods,順序 0/0, 0/1, 1/1。最 likely genotype 的 PL = 0,其他越大代表越不可能。
Phred-scaled Likelihoods for 0/0, 0/1, 1/1. Most likely genotype has PL = 0; higher values are less likely.
VAF / AF
Variant Allele Frequency = AD[alt] / DP。germline het 應接近 0.5;somatic 可能任何值。
Variant Allele Frequency = AD[alt] / DP. Germline het should be near 0.5; somatic can be anything.
三、四大 caller 比較
| Caller | 演算法 | 速度 | SNV F1 | Indel F1 | 特點 |
|---|---|---|---|---|---|
| HaplotypeCaller (GATK) | 局部 de novo assembly + HMM | ★★☆☆☆ | ~99.6% | ~99.4% | Best Practices 黃金標準;GVCF 模式支援聯合呼叫;CPU 慢但廣泛驗證 |
| DeepVariant (Google) | CNN 對 pileup image 分類 | ★★★☆☆ | ~99.7% | ~99.5% | SNV 準確度業界第一;GPU 加速;不需 BQSR;模型可換 ONT/PacBio |
| DRAGEN (Illumina) | FPGA-加速圖式 caller | ★★★★★ | ~99.6% | ~99.6% | 全 pipeline ~30 分鐘;商業 FPGA 卡;indel 表現最佳 |
| Strelka2 (Illumina) | 混合模型 + ML 訓練 | ★★★★☆ | ~99.5% | ~99.3% | 速度與準確度兼顧;germline + somatic 雙模式;無 GVCF |
F1 數據來自 PrecisionFDA Truth Challenge V2 與 GIAB HG002 benchmark;實際表現會因樣本、coverage、reference build 而異。
F1 numbers from PrecisionFDA Truth Challenge V2 and GIAB HG002 benchmarks; real-world performance varies with sample, coverage, and reference build.
四、為什麼要先輸出 GVCF?
傳統 single-sample VCF 只記錄「有 variant」的位點。但要做群體聯合呼叫(下一章 Joint Genotyping)必須知道:「樣本 B 在樣本 A 的 variant 位點,是真的 0/0 還是因為 coverage 不夠看不清楚?」
GVCF (Genomic VCF) 就是答案:每個樣本獨立呼叫時,所有位點都記錄——variant 位點記詳細,non-variant 區段以 <NON_REF> 區塊壓縮表示,附帶該區段的最小 GQ。如此後續 GenotypeGVCFs 就能對 N 個樣本做正確的 joint calling。
用 HaplotypeCaller 加上 -ERC GVCF 即可輸出。檔案稍大但壓縮後仍可接受(30× WGS 約 1-2 GB)。
A traditional single-sample VCF records only sites with variants. But for cohort joint calling (next chapter, Joint Genotyping), you need to know: "At sample A's variant site, is sample B truly 0/0, or just under-covered so we can't tell?"
GVCF (Genomic VCF) answers this: every site is recorded — variant sites in detail, non-variant stretches as compressed <NON_REF> blocks with the block's minimum GQ. Downstream GenotypeGVCFs can then correctly joint-call N samples.
Just add -ERC GVCF to HaplotypeCaller. Files are larger but still manageable after compression (~1–2 GB for 30× WGS).
五、互動模擬:Genotype 判定
調整 ALT-supporting reads 數量與總 coverage,看 caller 如何根據 PL 給出 genotype。VAF (variant allele frequency) 接近 0 通常為 0/0,接近 0.5 為 0/1,接近 1 為 1/1——但低 depth 會讓所有判斷都不確定。
Adjust ALT-supporting reads and total coverage to see how the caller assigns a genotype via PL. VAF near 0 usually means 0/0, near 0.5 means 0/1, near 1 means 1/1 — but low depth makes everything uncertain.
六、Pipeline 指令範例
# GATK HaplotypeCaller — GVCF 模式 (推薦)
gatk HaplotypeCaller \
-I analysis_ready.bam \
-R Homo_sapiens_assembly38.fasta \
-O sample.g.vcf.gz \
-ERC GVCF \
--tmp-dir /tmp \
-L wgs_calling_regions.hg38.interval_list
# 加速:分染色體並行 (scatter-gather)
for c in chr{1..22} chrX chrY; do
gatk HaplotypeCaller -I dedup.bam -R ref.fa \
-O sample.$c.g.vcf.gz -ERC GVCF -L $c &
done; wait
# 合併
gatk MergeVcfs -I sample.chr1.g.vcf.gz -I sample.chr2.g.vcf.gz ... -O sample.g.vcf.gz# DeepVariant via Docker (CPU)
BIN_VERSION=1.6.0
INPUT_DIR=/path/to/data
OUTPUT_DIR=/path/to/output
docker run \
-v "${INPUT_DIR}":"/input" \
-v "${OUTPUT_DIR}":"/output" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WGS \
--ref=/input/ref.fa \
--reads=/input/sample.bam \
--output_vcf=/output/sample.vcf.gz \
--output_gvcf=/output/sample.g.vcf.gz \
--num_shards=16
# WES 換 --model_type=WES,PacBio 用 --model_type=PACBIO# Strelka2 germline workflow
configureStrelkaGermlineWorkflow.py \
--bam analysis_ready.bam \
--referenceFasta ref.fa \
--runDir strelka_germline
# 執行
strelka_germline/runWorkflow.py -m local -j 16
# 結果在 strelka_germline/results/variants/variants.vcf.gz# 看 header
bcftools view -h sample.vcf.gz | less
# 統計
bcftools stats sample.vcf.gz | grep "^SN"
# number of SNPs / indels / multiallelic / Ti/Tv ratio
# 篩 PASS only
bcftools view -f PASS sample.vcf.gz -Oz -o sample.pass.vcf.gz
# 看某基因區段
bcftools view -r chr7:117559590-117559700 sample.vcf.gz
# Ti/Tv ratio (健康指標, WGS 應該 ~2.0-2.1, WES ~3.0-3.3)
bcftools stats sample.vcf.gz | grep "tstv"七、FAQ
只有一個樣本,還需要 GVCF 嗎?
-ERC NONE。-ERC NONE for direct single-sample VCF.Ti/Tv ratio 是什麼?多少算正常?
HaplotypeCaller vs DeepVariant 該選哪個?
需要 GATK best-practices interval_list 嗎?
wgs_calling_regions.hg38.interval_list 排除 N-region 和 centromere;WES 用 capture kit 廠商提供的 BED + 100bp padding。沒指定會浪費時間呼叫無 reads 的區段。wgs_calling_regions.hg38.interval_list excludes N-regions and centromeres. WES: use the kit vendor's BED + 100bp padding. Without intervals, time is wasted calling regions with no reads.八、小測驗
Q1. 一個位點 GT=0/1, AD=22,18, DP=40, GQ=99,最可能的解釋?
Q2. 為什麼 cohort 分析要用 GVCF 而不是直接合併 single-sample VCF?
Q3. WGS 的 Ti/Tv ratio 1.4 代表什麼?