一、為什麼還要過濾?
剛 joint-call 完的 raw VCF 中,每個變異都有 caller 給的 QUAL 分數,但這只是「位點層級的信心」,不能直接區分真假變異。實際上:
- 30× human WGS 一個樣本:raw call 約 4-5 百萬 variants,其中 ~10-15% 是 false positives(多在重複區、低 mapability 區、或 strand bias 嚴重處)
- 不過濾的後果:下游關聯研究與臨床診斷會被 FP 淹沒,p-value inflation、診斷假陽性都會發生
GATK 提供兩條路:VQSR (Variant Quality Score Recalibration) 用機器學習在 cohort 內訓練;Hard Filtering 用人工設定的閾值。兩者各有適用場景。
The raw VCF straight from joint calling has a per-site QUAL from the caller, but that's site-level confidence — it can't separate true from false directly. Reality:
- 30× human WGS, single sample: ~4–5 M raw calls, of which ~10–15% are false positives (mostly in repetitive regions, low-mappability tracts, or strand-biased sites)
- Without filtering: downstream association studies and clinical diagnostics get drowned in FPs — p-value inflation and diagnostic false positives ensue
GATK offers two paths: VQSR (Variant Quality Score Recalibration) trains ML on the cohort; Hard Filtering applies hand-tuned thresholds. Each has its niche.
二、VQSR 怎麼運作?
VQSR 把每個變異想成「多維特徵空間的一個點」:QD、FS、MQ、MQRankSum、ReadPosRankSum、SOR、DP(與選用的 InbreedingCoeff、HaplotypeScore)共 7-8 維。然後:
- VariantRecalibrator:用「已知為真」的 truth set(HapMap、Omni、1000G)作正樣本訓練 GMM (Gaussian Mixture Model),學出「真 variants 在特徵空間中的分布」
- 計算 VQSLOD:每個變異對 GMM 的 log-odds (log-likelihood of being a real variant)
- ApplyVQSR:依使用者指定的 sensitivity tranche (e.g. 99.7% 找回所有 truth set 變異) 切閾值,pass 標 PASS、fail 標 VQSRTrancheSNP99.90to100.00 等
SNV 與 indel 必須分開訓練(兩者特徵分布不同)。VQSR 對 cohort 規模有要求——SNV 至少 30 樣本,indel 至少 30 樣本(exome)/ 1 樣本(WGS,因 WGS variant 數量大)。
VQSR treats each variant as a point in multi-dimensional feature space: QD, FS, MQ, MQRankSum, ReadPosRankSum, SOR, DP (plus optional InbreedingCoeff, HaplotypeScore) — 7-8 dimensions. Then:
- VariantRecalibrator: trains a GMM (Gaussian Mixture Model) on a "known-true" truth set (HapMap, Omni, 1000G) to learn the distribution of real variants in feature space
- Compute VQSLOD: log-odds (log-likelihood of being real) of each variant against the GMM
- ApplyVQSR: cuts at a user-specified sensitivity tranche (e.g., 99.7% to recover all truth-set variants); passing variants get PASS, failing get e.g. VQSRTrancheSNP99.90to100.00
SNVs and indels must be trained separately (different feature distributions). VQSR has cohort-size requirements: SNVs need ≥30 samples; indels need ≥30 (exome) or 1 (WGS, since WGS yields enough indels).
三、Hard Filtering 推薦門檻
當 cohort 太小(< 30 樣本)、non-model 物種,或無 truth set 時,GATK 官方建議直接套用以下門檻:
When the cohort is too small (<30 samples), the organism is non-model, or no truth set exists, GATK officially recommends the thresholds below:
| 標籤 | 意義 | SNV | Indel | 說明 |
|---|---|---|---|---|
QD | QualByDepth | < 2.0 | < 2.0 | QUAL / DP,過低代表單位深度的證據弱 |
FS | FisherStrand | > 60.0 | > 200.0 | strand bias 的 Phred 值,過高代表 ALT 集中在某一 strand |
SOR | StrandOddsRatio | > 3.0 | > 10.0 | 另一種 strand bias 量度,與 FS 互補 |
MQ | RMSMappingQuality | < 40.0 | — | 支持此 variant 的 reads 平均 MAPQ |
MQRankSum | MQ ranksum test | < -12.5 | — | REF vs ALT reads 的 MAPQ 是否系統性偏差 |
ReadPosRankSum | Read pos ranksum | < -8.0 | < -20.0 | 變異是否總出現在 read 兩端 (容易出錯處) |
符合任一條件即標記為對應 filter(如 FS_filter),FILTER 欄不再是 PASS。下游分析通常 bcftools view -f PASS 過掉。
Matching any condition flags the variant with that filter (e.g., FS_filter), and FILTER is no longer PASS. Downstream analyses typically use bcftools view -f PASS to drop them.
四、互動模擬:QD 與 FS 雙閾值
調整 QD 與 FS 的閾值,看 5,000 個模擬 SNV 中 TP/FP 的分布如何被切。GATK 預設值(QD≥2、FS≤60)取得不錯平衡——太寬會放進 FP、太嚴會殺掉 TP。
Adjust QD and FS thresholds to see how 5,000 simulated SNVs (TP/FP mixture) get partitioned. GATK defaults (QD≥2, FS≤60) strike a good balance — too loose admits FPs, too strict kills TPs.
五、VQSR vs Hard Filter,怎麼選?
VQSR 訓練 GMM 需要充足變異數,cohort 太小會 fail。
無 truth set 無法訓練 VQSR;可考慮 bootstrap 自建 truth set。
GATK Best Practices 首選;SNV 與 indel 分開訓練。
六、Pipeline 指令範例
# SNV VQSR — Step A: 訓練 model
gatk VariantRecalibrator \
-R Homo_sapiens_assembly38.fasta \
-V cohort.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
--resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
-mode SNP \
--max-gaussians 6 \
-O snp.recal --tranches-file snp.tranches
# Step B: 套用
gatk ApplyVQSR \
-R Homo_sapiens_assembly38.fasta \
-V cohort.vcf.gz \
--recal-file snp.recal \
--tranches-file snp.tranches \
--truth-sensitivity-filter-level 99.7 \
-mode SNP \
-O cohort.snp_filt.vcf.gz# Indel VQSR — Step A
gatk VariantRecalibrator \
-R Homo_sapiens_assembly38.fasta \
-V cohort.snp_filt.vcf.gz \
--resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--resource:axiomPoly,known=false,training=true,truth=false,prior=10.0 Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf \
-an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
-mode INDEL \
--max-gaussians 4 \
-O indel.recal --tranches-file indel.tranches
# Step B
gatk ApplyVQSR \
-V cohort.snp_filt.vcf.gz \
--recal-file indel.recal --tranches-file indel.tranches \
--truth-sensitivity-filter-level 99.0 \
-mode INDEL \
-O cohort.filt.vcf.gz# 1. 拆 SNV 與 indel
gatk SelectVariants -V cohort.vcf.gz --select-type-to-include SNP -O snp.vcf.gz
gatk SelectVariants -V cohort.vcf.gz --select-type-to-include INDEL -O indel.vcf.gz
# 2. SNV hard filter
gatk VariantFiltration -V snp.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
--filter-expression "SOR > 3.0" --filter-name "SOR3" \
-O snp.filt.vcf.gz
# 3. Indel hard filter
gatk VariantFiltration -V indel.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
--filter-expression "SOR > 10.0" --filter-name "SOR10" \
-O indel.filt.vcf.gz
# 4. 合併
gatk MergeVcfs -I snp.filt.vcf.gz -I indel.filt.vcf.gz -O cohort.hardfilt.vcf.gz
# 5. 只留 PASS
bcftools view -f PASS cohort.hardfilt.vcf.gz -Oz -o cohort.pass.vcf.gz七、FAQ
truth sensitivity tranche 該設多少?
VQSR fail 的 variants 是不是就一定錯?
--max-gaussians 該設多少?
DeepVariant / DRAGEN 的 VCF 也要做 VQSR 嗎?
八、小測驗
Q1. VQSR 的核心想法是?
Q2. 你只有 5 個 mouse exome 樣本,想做 filtering,最佳方法?
Q3. FS=350 的 SNV 通常代表什麼?