STEP 7 / 9

變異過濾 (Variant Filtering)

從 raw cohort VCF 篩出可信的 variants——VQSR 機器學習 vs Hard Filtering 啟發式門檻。

Filter trustworthy variants from a raw cohort VCF — VQSR (machine learning) vs hard filtering (heuristic thresholds).

一、為什麼還要過濾?

剛 joint-call 完的 raw VCF 中,每個變異都有 caller 給的 QUAL 分數,但這只是「位點層級的信心」,不能直接區分真假變異。實際上:

  • 30× human WGS 一個樣本:raw call 約 4-5 百萬 variants,其中 ~10-15% 是 false positives(多在重複區、低 mapability 區、或 strand bias 嚴重處)
  • 不過濾的後果:下游關聯研究與臨床診斷會被 FP 淹沒,p-value inflation、診斷假陽性都會發生

GATK 提供兩條路:VQSR (Variant Quality Score Recalibration) 用機器學習在 cohort 內訓練;Hard Filtering 用人工設定的閾值。兩者各有適用場景。

The raw VCF straight from joint calling has a per-site QUAL from the caller, but that's site-level confidence — it can't separate true from false directly. Reality:

  • 30× human WGS, single sample: ~4–5 M raw calls, of which ~10–15% are false positives (mostly in repetitive regions, low-mappability tracts, or strand-biased sites)
  • Without filtering: downstream association studies and clinical diagnostics get drowned in FPs — p-value inflation and diagnostic false positives ensue

GATK offers two paths: VQSR (Variant Quality Score Recalibration) trains ML on the cohort; Hard Filtering applies hand-tuned thresholds. Each has its niche.

二、VQSR 怎麼運作?

VQSR 把每個變異想成「多維特徵空間的一個點」:QD、FS、MQ、MQRankSum、ReadPosRankSum、SOR、DP(與選用的 InbreedingCoeff、HaplotypeScore)共 7-8 維。然後:

  1. VariantRecalibrator:用「已知為真」的 truth set(HapMap、Omni、1000G)作正樣本訓練 GMM (Gaussian Mixture Model),學出「真 variants 在特徵空間中的分布」
  2. 計算 VQSLOD:每個變異對 GMM 的 log-odds (log-likelihood of being a real variant)
  3. ApplyVQSR:依使用者指定的 sensitivity tranche (e.g. 99.7% 找回所有 truth set 變異) 切閾值,pass 標 PASS、fail 標 VQSRTrancheSNP99.90to100.00 等

SNV 與 indel 必須分開訓練(兩者特徵分布不同)。VQSR 對 cohort 規模有要求——SNV 至少 30 樣本,indel 至少 30 樣本(exome)/ 1 樣本(WGS,因 WGS variant 數量大)。

VQSR treats each variant as a point in multi-dimensional feature space: QD, FS, MQ, MQRankSum, ReadPosRankSum, SOR, DP (plus optional InbreedingCoeff, HaplotypeScore) — 7-8 dimensions. Then:

  1. VariantRecalibrator: trains a GMM (Gaussian Mixture Model) on a "known-true" truth set (HapMap, Omni, 1000G) to learn the distribution of real variants in feature space
  2. Compute VQSLOD: log-odds (log-likelihood of being real) of each variant against the GMM
  3. ApplyVQSR: cuts at a user-specified sensitivity tranche (e.g., 99.7% to recover all truth-set variants); passing variants get PASS, failing get e.g. VQSRTrancheSNP99.90to100.00

SNVs and indels must be trained separately (different feature distributions). VQSR has cohort-size requirements: SNVs need ≥30 samples; indels need ≥30 (exome) or 1 (WGS, since WGS yields enough indels).

三、Hard Filtering 推薦門檻

當 cohort 太小(< 30 樣本)、non-model 物種,或無 truth set 時,GATK 官方建議直接套用以下門檻:

When the cohort is too small (<30 samples), the organism is non-model, or no truth set exists, GATK officially recommends the thresholds below:

標籤意義SNVIndel說明
QDQualByDepth< 2.0< 2.0QUAL / DP,過低代表單位深度的證據弱
FSFisherStrand> 60.0> 200.0strand bias 的 Phred 值,過高代表 ALT 集中在某一 strand
SORStrandOddsRatio> 3.0> 10.0另一種 strand bias 量度,與 FS 互補
MQRMSMappingQuality< 40.0支持此 variant 的 reads 平均 MAPQ
MQRankSumMQ ranksum test< -12.5REF vs ALT reads 的 MAPQ 是否系統性偏差
ReadPosRankSumRead pos ranksum< -8.0< -20.0變異是否總出現在 read 兩端 (容易出錯處)

符合任一條件即標記為對應 filter(如 FS_filter),FILTER 欄不再是 PASS。下游分析通常 bcftools view -f PASS 過掉。

Matching any condition flags the variant with that filter (e.g., FS_filter), and FILTER is no longer PASS. Downstream analyses typically use bcftools view -f PASS to drop them.

四、互動模擬:QD 與 FS 雙閾值

調整 QD 與 FS 的閾值,看 5,000 個模擬 SNV 中 TP/FP 的分布如何被切。GATK 預設值(QD≥2、FS≤60)取得不錯平衡——太寬會放進 FP、太嚴會殺掉 TP。

Adjust QD and FS thresholds to see how 5,000 simulated SNVs (TP/FP mixture) get partitioned. GATK defaults (QD≥2, FS≤60) strike a good balance — too loose admits FPs, too strict kills TPs.

2.0
60
綠 = True positives (kept), 紅 = False positives (kept), 灰 = filtered out。理想情況:左下塞滿 FP(被過濾掉)、右下與右上塞滿 TP(保留)。
Green = TPs kept; Red = FPs kept; Gray = filtered out. Ideal: lower-left full of FPs (filtered), right side full of TPs (kept).

五、VQSR vs Hard Filter,怎麼選?

Q1: 樣本數 ≥ 30?
→ Hard Filtering

VQSR 訓練 GMM 需要充足變異數,cohort 太小會 fail。

Q2: 是 human 且有 truth set?
否 (non-model)
→ Hard Filtering

無 truth set 無法訓練 VQSR;可考慮 bootstrap 自建 truth set。

→ VQSR

GATK Best Practices 首選;SNV 與 indel 分開訓練。

六、Pipeline 指令範例

# SNV VQSR — Step A: 訓練 model
gatk VariantRecalibrator \
    -R Homo_sapiens_assembly38.fasta \
    -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
    --resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
    -mode SNP \
    --max-gaussians 6 \
    -O snp.recal --tranches-file snp.tranches

# Step B: 套用
gatk ApplyVQSR \
    -R Homo_sapiens_assembly38.fasta \
    -V cohort.vcf.gz \
    --recal-file snp.recal \
    --tranches-file snp.tranches \
    --truth-sensitivity-filter-level 99.7 \
    -mode SNP \
    -O cohort.snp_filt.vcf.gz
# Indel VQSR — Step A
gatk VariantRecalibrator \
    -R Homo_sapiens_assembly38.fasta \
    -V cohort.snp_filt.vcf.gz \
    --resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --resource:axiomPoly,known=false,training=true,truth=false,prior=10.0 Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf \
    -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
    -mode INDEL \
    --max-gaussians 4 \
    -O indel.recal --tranches-file indel.tranches

# Step B
gatk ApplyVQSR \
    -V cohort.snp_filt.vcf.gz \
    --recal-file indel.recal --tranches-file indel.tranches \
    --truth-sensitivity-filter-level 99.0 \
    -mode INDEL \
    -O cohort.filt.vcf.gz
# 1. 拆 SNV 與 indel
gatk SelectVariants -V cohort.vcf.gz --select-type-to-include SNP -O snp.vcf.gz
gatk SelectVariants -V cohort.vcf.gz --select-type-to-include INDEL -O indel.vcf.gz

# 2. SNV hard filter
gatk VariantFiltration -V snp.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    --filter-expression "SOR > 3.0" --filter-name "SOR3" \
    -O snp.filt.vcf.gz

# 3. Indel hard filter
gatk VariantFiltration -V indel.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 200.0" --filter-name "FS200" \
    --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
    --filter-expression "SOR > 10.0" --filter-name "SOR10" \
    -O indel.filt.vcf.gz

# 4. 合併
gatk MergeVcfs -I snp.filt.vcf.gz -I indel.filt.vcf.gz -O cohort.hardfilt.vcf.gz

# 5. 只留 PASS
bcftools view -f PASS cohort.hardfilt.vcf.gz -Oz -o cohort.pass.vcf.gz

七、FAQ

truth sensitivity tranche 該設多少?
SNV 常用 99.7%(保留 99.7% truth set 變異),Indel 用 99.0%(indel 較嘈雜)。臨床診斷可用更嚴格的 99.0% / 95.0%;族群研究偏好 99.9% / 99.5%。tranches 表格會列各 cutoff 的 TP/FP/Ti/Tv,可依需求權衡。
SNV: typically 99.7% (retains 99.7% of truth-set variants). Indel: 99.0% (indels are noisier). Clinical diagnostics may use stricter 99.0%/95.0%; population studies prefer 99.9%/99.5%. The tranches table lists TP/FP/Ti/Tv at each cutoff for trade-off review.
VQSR fail 的 variants 是不是就一定錯?
不是。被 fail 的 variants 富集 false positives,但仍含 ~1-5% true positives(特別在 truth set 不完整的區域,如 HLA、segmental duplication)。臨床診斷若臨床懷疑高的 variant 被 VQSR fail,常會手動 IGV review。
No. Failed variants are enriched in false positives but still contain ~1-5% true positives (particularly in regions where truth sets are incomplete, e.g., HLA, segmental duplications). In clinical diagnostics, a clinically-suspect variant flagged by VQSR is often re-reviewed manually in IGV.
--max-gaussians 該設多少?
SNV 預設 8(充足樣本時),少樣本可降到 4-6。Indel 通常 4。如果出錯訊息「unable to retrieve enough variants for 8 Gaussians」就調低,這代表 cohort 太小。
SNV: default 8 (when samples are abundant); reduce to 4–6 for fewer samples. Indel: typically 4. If you get "unable to retrieve enough variants for 8 Gaussians," reduce — your cohort is too small.
DeepVariant / DRAGEN 的 VCF 也要做 VQSR 嗎?
不需要。兩者已經內建 ML-based filtering,輸出的 VCF 已經 PASS/Refcall 標好。再跑 VQSR 反而可能 over-filter。CNN-driven callers 就用其原生 quality score 即可。
No. Both include ML-based filtering and output VCFs already labeled PASS/Refcall. Running VQSR on top may over-filter. For CNN-driven callers, just use their native quality scores.

八、小測驗

Q1. VQSR 的核心想法是?

Q2. 你只有 5 個 mouse exome 樣本,想做 filtering,最佳方法?

Q3. FS=350 的 SNV 通常代表什麼?