ADVANCED

WES-Specific Considerations (Capture, QC, Blind Spots)

Capture kit selection, the target enrichment workflow, CollectHsMetrics QC, and the perennial blind spots of WES: the key details for running a WES pipeline well.

I. Target enrichment workflow

The fundamental wet-lab difference between WES and WGS is that WES adds a target capture step after library prep, in which hundreds of thousands of biotinylated probes hybridize to exon fragments and "fish" them out:

  1. Genomic DNA fragmentation (sonication or enzymatic, ~200–300 bp)
  2. Library prep (end repair, A-tailing, adapter ligation, PCR amplification)
  3. Hybridization: the library is mixed with biotinylated capture probes in solution for 16–24 hours
  4. Streptavidin bead pulldown: magnetic beads bind the biotin, and unbound off-target fragments are washed away
  5. Post-capture PCR (10–12 cycles) to restore library concentration, then sequencing on Illumina

This capture step is the root of WES's characteristic problems: uneven coverage, capture bias, GC bias, and PCR duplicates all trace back to it.

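II. Comparison of four major exome capture kits

Kit | Vendor | Target size | Uniformity | Strengths / weaknesses
Twist Exome 2.0 | Twist Bioscience | ~36 Mb | ★★★★★ (best) | Newer generation (post-2018); double-stranded DNA probes; FOLD_80 ≈ 1.5; lowest GC bias; the current mainstream choice
Agilent SureSelect | Agilent | ~50 Mb (V8) | ★★★★ (good) | Long-established; RNA probes; targets include UTRs and microRNAs; FOLD_80 ≈ 2.0; weaker in GC-rich regions
IDT xGen Exome | IDT | ~39 Mb | ★★★★ (good) | DNA probes; performs well at low input (10 ng); often used for cfDNA / FFPE / single-cell
Illumina TruSeq Exome | Illumina | ~45 Mb | ★★★ (fair) | Workflow integrated with Illumina sequencers; infrequently updated; gradually being replaced

Note: FOLD_80 is shorthand for Picard's FOLD_80_BASE_PENALTY: mean coverage divided by the coverage at the 20th percentile, i.e. the depth that 80% of target bases meet or exceed. The closer to 1, the more uniform the coverage.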
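
As a rough illustration of the note above, the sketch below computes a FOLD_80-style ratio from a plain list of per-base depths (one integer per line on stdin). The function name fold80 and the input format are assumptions made for this example. Picard itself works from a coverage histogram restricted to targets with non-zero coverage, so its reported value can differ.

```shell
# Minimal sketch: FOLD_80 = mean depth / depth at the 20th percentile.
# Input: one per-base depth per line on stdin (hypothetical format).
fold80() {
    sort -n | awk '
        { d[NR] = $1 + 0; sum += $1 }
        END {
            if (NR == 0) { print "NA"; exit }
            k = int((8 * NR + 9) / 10)   # ceil(0.8 * NR) in integer arithmetic
            p20 = d[NR - k + 1]          # depth met or exceeded by >= 80% of bases
            if (p20 == 0) { print "NA"; exit }
            printf "%.2f\n", (sum / NR) / p20
        }'
}

# Five bases with depths 10..50: mean = 30, 20th percentile = 20
printf '%s\n' 10 20 30 40 50 | fold80   # prints 1.50
```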
What are padding intervals?

Capture probes are designed against exons only, but hybridization also pulls down ~50–100 bp of flanking sequence around each probe. To make sure variant calling covers splice sites (the ±10 bp around each exon-intron boundary is the critical splice donor/acceptor region), downstream analysis should use "padded intervals": the original target intervals extended by 50–100 bp on each side. GATK tools apply no padding unless asked (--interval-padding defaults to 0); 100 bp is a common choice.
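
To make the idea concrete, here is a minimal awk sketch that pads a BED file by a fixed number of bases. The function name pad_bed is hypothetical. It clamps starts at 0 but does not clamp to contig ends or merge intervals that overlap after padding, which the GATK/Picard interval tools handle for you.

```shell
# Minimal sketch: extend each BED interval by PAD bp on both sides.
# Usage: pad_bed PAD < targets.bed > targets.padded.bed
pad_bed() {
    awk -F'\t' -v OFS='\t' -v pad="$1" '
        /^(track|browser|#)/ { next }            # skip BED header lines
        {
            start = $2 - pad
            if (start < 0) start = 0             # BED starts cannot be negative
            print $1, start, $3 + pad
        }'
}

printf 'chr1\t50\t200\nchr1\t1000\t1100\n' | pad_bed 100
```

With 100 bp padding, the first interval becomes chr1 0–300 (start clamped at 0) and the second chr1 900–1200.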

III. CollectHsMetrics: the king of WES QC

Picard's CollectHsMetrics ("Hs" = Hybrid Selection) is purpose-built for evaluating capture quality, and every WES sample should be run through it. Key metrics:

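Metric | Meaning | Acceptable range
PCT_SELECTED_BASES | Fraction of aligned bases on or near a bait (capture efficiency) | > 60% (Twist often > 80%)
PCT_OFF_BAIT | Fraction of aligned bases that fall away from any bait (wasted sequencing) | < 30%
MEAN_TARGET_COVERAGE | Mean coverage over the target regions | ≥ 80× (germline), ≥ 200× (somatic)
PCT_TARGET_BASES_30X | Fraction of target bases covered at 30× or more (critical for medical exomes) | > 90%
FOLD_80_BASE_PENALTY | Uniformity: fold over-sequencing needed to bring 80% of target bases up to the mean coverage; lower is more uniform | < 2.5 (Twist often 1.5)
AT_DROPOUT / GC_DROPOUT | Coverage lost in AT-rich (≤ 50% GC) or GC-rich (≥ 50% GC) targets relative to the mean | each < 5%; > 10% is a warning sign
PCT_PF_UQ_READS_ALIGNED | Fraction of pass-filter unique reads that align | > 95%
HS_LIBRARY_SIZE | Estimated library complexity (unique molecules) | Higher is better; low values indicate heavy PCR duplication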
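
The ranges in the table can be turned into a simple pass/fail gate. The sketch below is one way to do it, assuming a standard Picard metrics file (a "## METRICS CLASS" line followed by a header row and one data row) and assuming the PCT_ columns are written as fractions between 0 and 1. The function name hs_qc_gate is made up for this example, and the thresholds are simply the ones listed above.

```shell
# Minimal sketch: flag CollectHsMetrics values that fall outside the table's ranges.
# Usage: hs_qc_gate sample.hs_metrics.txt
hs_qc_gate() {
    grep -A2 '^## METRICS CLASS' "$1" | tail -n 2 | awk -F'\t' '
        # Compare one named metric against a limit and print PASS/FAIL.
        function check(name, op, limit,    v, ok) {
            if (!(name in col)) { print name ": missing"; return }
            v = $(col[name]) + 0
            ok = (op == ">") ? (v > limit) : ((op == ">=") ? (v >= limit) : (v < limit))
            printf "%s=%s %s\n", name, $(col[name]), (ok ? "PASS" : "FAIL")
        }
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header row: name -> column
        {
            check("PCT_SELECTED_BASES", ">", 0.60)
            check("PCT_OFF_BAIT", "<", 0.30)
            check("MEAN_TARGET_COVERAGE", ">=", 80)    # germline threshold
            check("PCT_TARGET_BASES_30X", ">", 0.90)
            check("FOLD_80_BASE_PENALTY", "<", 2.5)
        }'
}
```

Calling hs_qc_gate on a metrics file prints one NAME=value PASS/FAIL line per metric. For somatic work the MEAN_TARGET_COVERAGE limit would need raising to 200.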

IV. Interactive simulation: capture uniformity (FOLD_80)

Adjust FOLD_80_BASE_PENALTY to see how the distribution of target-base coverage changes at a mean coverage of 100×. A lower FOLD_80 (more uniform) gives a tighter distribution; a higher one spreads it out, so lifting the worst-covered 20% of targets above 30× costs a higher mean coverage.

Red line = 30×, the minimum commonly used for medical diagnostics. Yellow = the 20th–80th percentile range. FOLD_80 = 1.5 is typical of Twist; > 3.0 usually indicates a failed capture.
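
The cost of poor uniformity can be estimated with one multiplication. Since FOLD_80 is mean coverage divided by the depth that 80% of target bases reach, getting 80% of bases to a depth D needs a mean of roughly D × FOLD_80. The sketch below applies that, assuming the fold penalty stays about the same as sequencing depth scales; the function name is hypothetical.

```shell
# Minimal sketch: mean coverage needed so that ~80% of target bases reach DEPTH.
# Usage: required_mean_coverage DEPTH FOLD_80
required_mean_coverage() {
    awk -v depth="$1" -v fold80="$2" 'BEGIN { printf "%.0f\n", depth * fold80 }'
}

required_mean_coverage 30 1.5   # prints 45
required_mean_coverage 30 3.0   # prints 90
```

A FOLD_80 of 3.0 therefore doubles the sequencing needed for the same 30× guarantee compared with a FOLD_80 of 1.5. Pushing the remaining 20% of bases over the line takes more still.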

V. The perennial blind spots of WES

Even with the best capture kit and high coverage, WES has structural blind spots. If these regions matter to your study, switch to WGS or add targeted sequencing to compensate:

GC-rich genes (e.g. GBA, HBA, ABO)

PCR amplification is severely biased in regions with GC > 70%. GBA (Gaucher disease) is harder still: the adjacent, highly homologous pseudogene GBAP1 creates mapping ambiguity. Clinical practice is to supplement with long-range PCR + Sanger, or with long reads.

Pseudogene interference

Many important genes have high-identity pseudogenes (PMS2 / PMS2CL, CYP2D6 / CYP2D7P, NCF1 / NCF1B). Short reads cannot tell them apart, so variant calls are unreliable. Resolving them takes a gene-specific NGS panel combined with long-range PCR.

Deep intronic variants

Cryptic splice variants more than 100 bp from a splice site are invisible to WES. Examples: CFTR, DMD, and NF1 all have known deep intronic pathogenic variants. Splicing assays or RNA-seq can recover them.

Regulatory regions (promoter / enhancer / 5'UTR)

Most WES kits do not cover promoters or enhancers, yet variants there can cause severe expression dysregulation (e.g., the TBX5 enhancer, the TERT promoter). WGS is required.

Tandem repeats / VNTR

For repeat expansions such as HTT (Huntington), FMR1 (Fragile X), and C9orf72 (ALS), both short-read WES and WGS struggle to size the repeat accurately; long reads are preferred.

CNV and SV (see the SV chapter)

The discontinuous capture of WES makes read-depth signals unreliable, and intronic / intergenic breakpoints are invisible. CNV detection performs far worse than with WGS.
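
To see which of the blind spots above actually affect a given sample, one option is to list under-covered targets from the per-target coverage file that CollectHsMetrics writes via --PER_TARGET_COVERAGE. The sketch below assumes that file is tab-separated with a header row containing columns named chrom, start, end, name, and mean_coverage; check those names against your Picard version. The function name low_coverage_targets is hypothetical.

```shell
# Minimal sketch: list targets whose mean coverage is below a cutoff (default 30).
# Usage: low_coverage_targets sample.per_target.txt [MIN_COVERAGE]
low_coverage_targets() {
    awk -F'\t' -v OFS='\t' -v min="${2:-30}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i      # header row: name -> column
            if (!("mean_coverage" in col)) {
                print "no mean_coverage column found" > "/dev/stderr"
                exit 1
            }
            next
        }
        $(col["mean_coverage"]) + 0 < min {
            print $(col["chrom"]) ":" $(col["start"]) "-" $(col["end"]), $(col["name"]), $(col["mean_coverage"])
        }' "$1"
}
```

Each output line gives the region, the target name, and its mean coverage, which makes it easy to check whether genes of interest fall into the under-covered set.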

VI. Example pipeline commands

# Convert the BED supplied with the capture kit to Picard interval_list format
gatk BedToIntervalList \
    -I twist_exome_v2_targets.bed \
    -O twist_exome_v2_targets.interval_list \
    -SD Homo_sapiens_assembly38.dict

# Probes (baits) and targets usually differ: baits are the physical probe positions,
# targets are the genomic regions you want to call. CollectHsMetrics needs both.
gatk BedToIntervalList \
    -I twist_exome_v2_baits.bed \
    -O twist_exome_v2_baits.interval_list \
    -SD Homo_sapiens_assembly38.dict

# Add padding for variant calling (100 bp per side)
gatk IntervalListTools \
    -I twist_exome_v2_targets.interval_list \
    -O twist_exome_v2_targets_padded100.interval_list \
    --PADDING 100

# Run CollectHsMetrics; do this for every WES sample
gatk CollectHsMetrics \
    -I sample.dedup.bam \
    -O sample.hs_metrics.txt \
    -BI twist_exome_v2_baits.interval_list \
    -TI twist_exome_v2_targets.interval_list \
    -R Homo_sapiens_assembly38.fasta \
    --PER_TARGET_COVERAGE sample.per_target.txt \
    --PER_BASE_COVERAGE sample.per_base.txt

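# Inspect key metrics, looked up by column name because column
# positions differ between Picard versions
grep -A2 "^## METRICS CLASS" sample.hs_metrics.txt | tail -n 2 | \
    awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) col[$i]=i; next}
        {n = split("PCT_SELECTED_BASES MEAN_TARGET_COVERAGE PCT_TARGET_BASES_30X FOLD_80_BASE_PENALTY GC_DROPOUT", m, " ")
         for (k=1; k<=n; k++) print m[k] ":", $(col[m[k]])}'

# Visualize the whole batch with MultiQC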
multiqc . -n wes_qc_report

# WES variant calling: use the padded intervals
gatk HaplotypeCaller \
    -R Homo_sapiens_assembly38.fasta \
    -I sample.dedup.bqsr.bam \
    -L twist_exome_v2_targets_padded100.interval_list \
    -ERC GVCF \
    -O sample.g.vcf.gz \
    --tmp-dir /scratch

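# Note: cohorts of fewer than 30 WES samples are usually a poor fit for VQSR
# (the statistical model needs more data). Use GATK hard filters instead,
# or DeepVariant, evaluated with RTG vcfeval.

# DeepVariant on WES (recommended for small cohorts)
# --regions takes a BED file, so convert the padded interval_list first
gatk IntervalListToBed \
    -I twist_exome_v2_targets_padded100.interval_list \
    -O twist_exome_v2_targets_padded100.bed
docker run -v "$(pwd)":/data google/deepvariant:1.6.0 \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WES \
    --ref=/data/Homo_sapiens_assembly38.fasta \
    --reads=/data/sample.dedup.bam \
    --regions=/data/twist_exome_v2_targets_padded100.bed \
    --output_vcf=/data/sample.dv.vcf.gz

# WES CNV: needs a large cohort for normalization (≥ 30 normals from the same kit/run)

# Method 1: GATK gCNV (germline cohort mode)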
gatk PreprocessIntervals \
    -R ref.fa \
    -L twist_exome_v2_targets.interval_list \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O preprocessed.interval_list

# Collect read counts for each sample
gatk CollectReadCounts \
    -I sample.dedup.bam \
    -L preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sample.counts.hdf5

# Cohort-mode CNV calling (fit across all normals).
# --contig-ploidy-calls expects the output of a prior DetermineGermlineContigPloidy run.
gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L preprocessed.interval_list \
    -I sample1.counts.hdf5 -I sample2.counts.hdf5 ... \
    --contig-ploidy-calls ploidy-calls/ \
    --output cohort_cnv \
    --output-prefix cohort

# Method 2: ExomeDepth (R package, simpler to set up)
# Method 3: CNVkit (has batch and reference subcommands)
# Method 4: ClinCNV / ExomeAI

VII. FAQ

When should you choose WGS over WES?
Clear indications for WGS: (1) a prior WES came back negative (roughly 50–60% of WES cases remain undiagnosed) and deep intronic / regulatory variants need to be searched; (2) SV/CNV detection is the focus (e.g., 22q11.2 microdeletion); (3) the genes of interest are GC-rich or confounded by pseudogenes (GBA, PMS2); (4) mitochondrial variants and repeat expansions are needed in a single assay; (5) a large-cohort study has the budget (30× WGS now costs about $300–500 USD/sample, close to WES).
Why is VQSR a poor fit for small WES cohorts?
VQSR needs enough variants to train its Gaussian mixture model: it typically becomes stable above about 30 samples for SNPs and about 100 for indels. With too few WES samples, the model overfits or fails to converge. GATK's official troubleshooting guidance notes that single-sample WES is more reliable with hard filters.
Is a medical / clinical exome different from a research exome?
Yes. "Medical exome" usually refers to a high-coverage enrichment panel targeting ~5,000–7,000 known disease genes (e.g., Twist Comprehensive Exome, Agilent ClearSeq Inherited Disease). The target is smaller (~10–15 Mb), coverage is deeper (≥ 100×), and key intronic and promoter regions are included. Research exomes aim for breadth (~50 Mb); medical exomes aim for depth and consistency in the genes that matter.
Can samples be analyzed together after switching capture kit versions?
Not recommended. Target intervals and uniformity differ greatly across kits, and even across versions of the same kit, so joint genotyping of mixed kits introduces batch effects (especially for CNVs and borderline variants). At a minimum: (1) restrict the analysis to the intersection of all kits' targets; (2) run PCA / sample-distance checks for batch separation; (3) avoid mixing kits in cohort-scale studies. Joint cohort analysis should always use a single kit.
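
For step (1), the intersection of two kits' target BED files can be sketched in plain awk as below. The function name bed_intersect is hypothetical, and the nested loop is quadratic per chromosome, so it only suits small inputs. For real exome-sized BEDs a dedicated tool such as bedtools intersect is the practical choice. The sketch assumes headerless, tab-separated BED files.

```shell
# Minimal sketch: print the overlapping portion of every pair of intervals
# from two BED files. Usage: bed_intersect kitA.bed kitB.bed
bed_intersect() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                                   # first file: store intervals per chromosome
            n[$1]++
            s[$1, n[$1]] = $2 + 0
            e[$1, n[$1]] = $3 + 0
            next
        }
        {                                             # second file: compare against stored intervals
            for (i = 1; i <= n[$1]; i++) {
                lo = ($2 + 0 > s[$1, i]) ? $2 + 0 : s[$1, i]
                hi = ($3 + 0 < e[$1, i]) ? $3 + 0 : e[$1, i]
                if (lo < hi) print $1, lo, hi         # non-empty overlap
            }
        }' "$1" "$2"
}
```

The output is ordered by the second file and is not merged, so it may need sorting and merging before being used as an interval list.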

VIII. Quiz

Q1. Which capture is more uniform: FOLD_80_BASE_PENALTY = 1.5 or 3.0?

Q2. Why is WES unreliable for genes such as GBA / PMS2?

Q3. Why should WES intervals be padded (~100 bp)?