Step 7: Variant Filtering — WGS/WES Tutorial

為什麼

一、為什麼還要過濾？

剛 joint-call 完的 raw VCF 中，每個變異都有 caller 給的 QUAL 分數，但這只是「位點層級的信心」，不能直接區分真假變異。實際上：

30× human WGS 一個樣本：raw call 約 4-5 百萬 variants，其中 ~10-15% 是 false positives（多在重複區、低 mapability 區、或 strand bias 嚴重處）
不過濾的後果：下游關聯研究與臨床診斷會被 FP 淹沒，p-value inflation、診斷假陽性都會發生

GATK 提供兩條路：VQSR (Variant Quality Score Recalibration) 用機器學習在 cohort 內訓練；Hard Filtering 用人工設定的閾值。兩者各有適用場景。

The raw VCF straight from joint calling has a per-site QUAL from the caller, but that's site-level confidence — it can't separate true from false directly. Reality:

30× human WGS, single sample: ~4–5 M raw calls, of which ~10–15% are false positives (mostly in repetitive regions, low-mappability tracts, or strand-biased sites)
Without filtering: downstream association studies and clinical diagnostics get drowned in FPs — p-value inflation and diagnostic false positives ensue

GATK offers two paths: VQSR (Variant Quality Score Recalibration) trains ML on the cohort; Hard Filtering applies hand-tuned thresholds. Each has its niche.

技術

二、VQSR 怎麼運作？

VQSR 把每個變異想成「多維特徵空間的一個點」：QD、FS、MQ、MQRankSum、ReadPosRankSum、SOR、DP（與選用的 InbreedingCoeff、HaplotypeScore）共 7-8 維。然後：

VariantRecalibrator：用「已知為真」的 truth set（HapMap、Omni、1000G）作正樣本訓練 GMM (Gaussian Mixture Model)，學出「真 variants 在特徵空間中的分布」
計算 VQSLOD：每個變異對 GMM 的 log-odds (log-likelihood of being a real variant)
ApplyVQSR：依使用者指定的 sensitivity tranche (e.g. 99.7% 找回所有 truth set 變異) 切閾值，pass 標 PASS、fail 標 VQSRTrancheSNP99.90to100.00 等

SNV 與 indel 必須分開訓練（兩者特徵分布不同）。VQSR 對 cohort 規模有要求——SNV 至少 30 樣本，indel 至少 30 樣本（exome）/ 1 樣本（WGS，因 WGS variant 數量大）。

VQSR treats each variant as a point in multi-dimensional feature space: QD, FS, MQ, MQRankSum, ReadPosRankSum, SOR, DP (plus optional InbreedingCoeff, HaplotypeScore) — 7-8 dimensions. Then:

VariantRecalibrator: trains a GMM (Gaussian Mixture Model) on a "known-true" truth set (HapMap, Omni, 1000G) to learn the distribution of real variants in feature space
Compute VQSLOD: log-odds (log-likelihood of being real) of each variant against the GMM
ApplyVQSR: cuts at a user-specified sensitivity tranche (e.g., 99.7% to recover all truth-set variants); passing variants get PASS, failing get e.g. VQSRTrancheSNP99.90to100.00

SNVs and indels must be trained separately (different feature distributions). VQSR has cohort-size requirements: SNVs need ≥30 samples; indels need ≥30 (exome) or 1 (WGS, since WGS yields enough indels).

技術

三、Hard Filtering 推薦門檻

當 cohort 太小（< 30 樣本）、non-model 物種，或無 truth set 時，GATK 官方建議直接套用以下門檻：

When the cohort is too small (<30 samples), the organism is non-model, or no truth set exists, GATK officially recommends the thresholds below:

標籤	意義	SNV	Indel	說明
`QD`	QualByDepth	< 2.0	< 2.0	QUAL / DP，過低代表單位深度的證據弱
`FS`	FisherStrand	> 60.0	> 200.0	strand bias 的 Phred 值，過高代表 ALT 集中在某一 strand
`SOR`	StrandOddsRatio	> 3.0	> 10.0	另一種 strand bias 量度，與 FS 互補
`MQ`	RMSMappingQuality	< 40.0	—	支持此 variant 的 reads 平均 MAPQ
`MQRankSum`	MQ ranksum test	< -12.5	—	REF vs ALT reads 的 MAPQ 是否系統性偏差
`ReadPosRankSum`	Read pos ranksum	< -8.0	< -20.0	變異是否總出現在 read 兩端 (容易出錯處)

符合任一條件即標記為對應 filter（如 FS_filter），FILTER 欄不再是 PASS。下游分析通常 bcftools view -f PASS 過掉。

Matching any condition flags the variant with that filter (e.g., FS_filter), and FILTER is no longer PASS. Downstream analyses typically use bcftools view -f PASS to drop them.

互動模擬

四、互動模擬：QD 與 FS 雙閾值

調整 QD 與 FS 的閾值，看 5,000 個模擬 SNV 中 TP/FP 的分布如何被切。GATK 預設值（QD≥2、FS≤60）取得不錯平衡——太寬會放進 FP、太嚴會殺掉 TP。

Adjust QD and FS thresholds to see how 5,000 simulated SNVs (TP/FP mixture) get partitioned. GATK defaults (QD≥2, FS≤60) strike a good balance — too loose admits FPs, too strict kills TPs.

QD ≥ 2.0

FS ≤ 60

綠 = True positives (kept), 紅 = False positives (kept), 灰 = filtered out。理想情況：左下塞滿 FP（被過濾掉）、右下與右上塞滿 TP（保留）。

Green = TPs kept; Red = FPs kept; Gray = filtered out. Ideal: lower-left full of FPs (filtered), right side full of TPs (kept).

決策

五、VQSR vs Hard Filter，怎麼選？

Q1: 樣本數 ≥ 30？

否

→ Hard Filtering

VQSR 訓練 GMM 需要充足變異數，cohort 太小會 fail。

是

Q2: 是 human 且有 truth set？

否 (non-model)

→ Hard Filtering

無 truth set 無法訓練 VQSR；可考慮 bootstrap 自建 truth set。

是

→ VQSR

GATK Best Practices 首選；SNV 與 indel 分開訓練。

實務指令

六、Pipeline 指令範例

# SNV VQSR — Step A: 訓練 model
gatk VariantRecalibrator \
    -R Homo_sapiens_assembly38.fasta \
    -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
    --resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
    -mode SNP \
    --max-gaussians 6 \
    -O snp.recal --tranches-file snp.tranches

# Step B: 套用
gatk ApplyVQSR \
    -R Homo_sapiens_assembly38.fasta \
    -V cohort.vcf.gz \
    --recal-file snp.recal \
    --tranches-file snp.tranches \
    --truth-sensitivity-filter-level 99.7 \
    -mode SNP \
    -O cohort.snp_filt.vcf.gz

# Indel VQSR — Step A
gatk VariantRecalibrator \
    -R Homo_sapiens_assembly38.fasta \
    -V cohort.snp_filt.vcf.gz \
    --resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --resource:axiomPoly,known=false,training=true,truth=false,prior=10.0 Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf \
    -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
    -mode INDEL \
    --max-gaussians 4 \
    -O indel.recal --tranches-file indel.tranches

# Step B
gatk ApplyVQSR \
    -V cohort.snp_filt.vcf.gz \
    --recal-file indel.recal --tranches-file indel.tranches \
    --truth-sensitivity-filter-level 99.0 \
    -mode INDEL \
    -O cohort.filt.vcf.gz

# 1. 拆 SNV 與 indel
gatk SelectVariants -V cohort.vcf.gz --select-type-to-include SNP -O snp.vcf.gz
gatk SelectVariants -V cohort.vcf.gz --select-type-to-include INDEL -O indel.vcf.gz

# 2. SNV hard filter
gatk VariantFiltration -V snp.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    --filter-expression "SOR > 3.0" --filter-name "SOR3" \
    -O snp.filt.vcf.gz

# 3. Indel hard filter
gatk VariantFiltration -V indel.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 200.0" --filter-name "FS200" \
    --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
    --filter-expression "SOR > 10.0" --filter-name "SOR10" \
    -O indel.filt.vcf.gz

# 4. 合併
gatk MergeVcfs -I snp.filt.vcf.gz -I indel.filt.vcf.gz -O cohort.hardfilt.vcf.gz

# 5. 只留 PASS
bcftools view -f PASS cohort.hardfilt.vcf.gz -Oz -o cohort.pass.vcf.gz

常見問題

七、FAQ

truth sensitivity tranche 該設多少？

SNV 常用 99.7%（保留 99.7% truth set 變異），Indel 用 99.0%（indel 較嘈雜）。臨床診斷可用更嚴格的 99.0% / 95.0%；族群研究偏好 99.9% / 99.5%。tranches 表格會列各 cutoff 的 TP/FP/Ti/Tv，可依需求權衡。

SNV: typically 99.7% (retains 99.7% of truth-set variants). Indel: 99.0% (indels are noisier). Clinical diagnostics may use stricter 99.0%/95.0%; population studies prefer 99.9%/99.5%. The tranches table lists TP/FP/Ti/Tv at each cutoff for trade-off review.

VQSR fail 的 variants 是不是就一定錯？

不是。被 fail 的 variants 富集 false positives，但仍含 ~1-5% true positives（特別在 truth set 不完整的區域，如 HLA、segmental duplication）。臨床診斷若臨床懷疑高的 variant 被 VQSR fail，常會手動 IGV review。

No. Failed variants are enriched in false positives but still contain ~1-5% true positives (particularly in regions where truth sets are incomplete, e.g., HLA, segmental duplications). In clinical diagnostics, a clinically-suspect variant flagged by VQSR is often re-reviewed manually in IGV.

--max-gaussians 該設多少？

SNV 預設 8（充足樣本時），少樣本可降到 4-6。Indel 通常 4。如果出錯訊息「unable to retrieve enough variants for 8 Gaussians」就調低，這代表 cohort 太小。

SNV: default 8 (when samples are abundant); reduce to 4–6 for fewer samples. Indel: typically 4. If you get "unable to retrieve enough variants for 8 Gaussians," reduce — your cohort is too small.

DeepVariant / DRAGEN 的 VCF 也要做 VQSR 嗎？

不需要。兩者已經內建 ML-based filtering，輸出的 VCF 已經 PASS/Refcall 標好。再跑 VQSR 反而可能 over-filter。CNN-driven callers 就用其原生 quality score 即可。

No. Both include ML-based filtering and output VCFs already labeled PASS/Refcall. Running VQSR on top may over-filter. For CNN-driven callers, just use their native quality scores.

自我檢測

八、小測驗

Q1. VQSR 的核心想法是？

Q2. 你只有 5 個 mouse exome 樣本，想做 filtering，最佳方法？

Q3. FS=350 的 SNV 通常代表什麼？