1. Target enrichment workflow
The fundamental wet-lab difference between WES and WGS is that WES adds a target capture step after library prep, in which hundreds of thousands of biotinylated probes hybridize to exonic fragments and pull them out of the library:
- Genomic DNA fragmentation (sonication or enzymatic, ~200–300 bp)
- Library prep (end repair, A-tailing, adapter ligation, PCR amplification)
- Hybridization: the library is mixed with biotinylated capture probes in solution for 16–24 hours
- Streptavidin bead pulldown: magnetic beads bind the biotin, and unbound off-target fragments are washed away
- Post-capture PCR (10–12 cycles) restores yield before the library is loaded onto an Illumina sequencer
Most of WES's characteristic problems (uneven coverage, capture bias, GC bias, and elevated PCR duplicates) trace back to this capture step and the extra amplification it requires.
2. Comparison of four major exome capture kits
| Kit | Vendor | Target size | Uniformity | Strengths / weaknesses |
|---|---|---|---|---|
| Twist Exome 2.0 | Twist Bioscience | ~36 Mb | ★★★★★ (best) | Newer generation (post-2018); double-stranded DNA probes; FOLD_80 ≈ 1.5; lowest GC bias; the current mainstream choice |
| Agilent SureSelect | Agilent | ~35 Mb (V8); ~50 Mb in earlier versions | ★★★★ (good) | Long-established kit; RNA probes; some designs include UTRs and microRNAs; FOLD_80 ≈ 2.0; weaker in GC-rich regions |
| IDT xGen Exome | IDT | ~39 Mb | ★★★★ (good) | DNA probes; performs well at low input (10 ng); often used for cfDNA / FFPE / single-cell |
| Illumina TruSeq Exome | Illumina | ~45 Mb | ★★★ (average) | Workflow integrated with Illumina sequencers; rarely updated; gradually being replaced |
Capture probes are designed against exons only, but hybridization also pulls down roughly 50–100 bp of flanking sequence on each side of a probe, because captured fragments extend past the probe. To keep splice sites in scope for variant calling (the donor and acceptor signals lie within about ±10 bp of each exon-intron boundary), run downstream analysis on "padded intervals": the original target intervals extended by 50–100 bp on each side. GATK tools do not pad by default (--interval-padding is 0 unless set), so request the padding explicitly; 100 bp is a common choice.
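The padding step can be sketched directly on a BED file. This is a minimal illustration, not a replacement for a proper interval tool: the function name pad_bed and the toy coordinates are hypothetical, and the sketch neither clamps interval ends to contig lengths nor merges intervals that overlap after padding.

```shell
# Pad each BED interval by a fixed number of bases on both sides.
# BED is 0-based and half-open, so the padded start is clamped at 0.
# Assumption: input is tab-separated with chrom, start, end in columns 1-3.
pad_bed() {
  pad="$1"
  awk -v pad="$pad" 'BEGIN { FS = OFS = "\t" }
    /^(#|track|browser)/ { next }            # skip BED header lines
    {
      start = $2 - pad
      if (start < 0) start = 0               # never emit a negative start
      print $1, start, $3 + pad
    }'
}

# Toy targets (hypothetical coordinates, not taken from any real kit)
printf 'chr1\t40\t200\nchr1\t1000\t1150\n' | pad_bed 100
```

The first toy interval starts closer than 100 bp to the contig start, so its padded start is clamped to 0 rather than going negative.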
3. CollectHsMetrics, the king of WES QC
Picard's CollectHsMetrics ("Hs" stands for Hybrid Selection) is purpose-built for assessing capture quality and should be run on every WES sample. Key metrics:
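| Metric | Meaning | Acceptable range |
|---|---|---|
| PCT_SELECTED_BASES | Fraction of aligned bases on or near a bait (capture efficiency) | > 60% (Twist often > 80%) |
| PCT_OFF_BAIT | Fraction of aligned bases that fall away from any bait (wasted sequencing); the complement of PCT_SELECTED_BASES | < 40%, ideally < 30% |
| MEAN_TARGET_COVERAGE | Mean coverage across target regions | ≥ 80× (germline), ≥ 200× (somatic) |
| PCT_TARGET_BASES_30X | Fraction of target bases covered at 30× or more (critical for medical exomes) | > 90% |
| FOLD_80_BASE_PENALTY | Uniformity: fold over-sequencing needed to raise 80% of target bases to the mean coverage; lower is more uniform | < 2.5 (Twist often 1.5) |
| AT_DROPOUT / GC_DROPOUT | Coverage lost in AT-rich (≤ 50% GC) or GC-rich (≥ 50% GC) targets relative to the mean | each < 5%; > 10% is a warning sign |
| PCT_PF_UQ_READS_ALIGNED | Fraction of pass-filter unique reads that align to the reference | > 95% |
| HS_LIBRARY_SIZE | Estimated library complexity (number of unique molecules) | Higher is better; a low value indicates heavy PCR duplication |
Picard writes the PCT_ fields as fractions between 0 and 1, so 60% appears as 0.60 in the output file.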
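The thresholds in the table can be turned into a simple pass/fail gate. The sketch below is one possible approach: the function name check_hs_metrics is hypothetical, the metrics file is a mock containing only a few columns, and only three of the thresholds are checked. It looks columns up by name and compares against fractions, since Picard writes PCT_ values on a 0 to 1 scale.

```shell
# Gate a sample on three CollectHsMetrics values, looked up by column name.
# Thresholds follow the table above, expressed as fractions (60% -> 0.60).
check_hs_metrics() {
  awk -F'\t' '
    function check(name, op, limit,    v, ok) {
      if (!(name in col)) { print name, "missing"; failed = 1; return }
      v = $(col[name]) + 0
      ok = (op == ">") ? (v > limit) : (v < limit)
      print name, v, (ok ? "ok" : "FAIL")
      if (!ok) failed = 1
    }
    /^## METRICS CLASS/ {
      getline; for (i = 1; i <= NF; i++) col[$i] = i   # header row
      getline                                           # value row
      check("PCT_SELECTED_BASES", ">", 0.60)
      check("PCT_TARGET_BASES_30X", ">", 0.90)
      check("FOLD_80_BASE_PENALTY", "<", 2.5)
      print (failed ? "FAIL" : "PASS")
      exit
    }' "$1"
}

# Mock metrics file (hypothetical values; a real file has many more columns)
tmp="$(mktemp)"
printf '## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n' > "$tmp"
printf 'BAIT_SET\tPCT_SELECTED_BASES\tMEAN_TARGET_COVERAGE\tPCT_TARGET_BASES_30X\tFOLD_80_BASE_PENALTY\n' >> "$tmp"
printf 'toy_kit\t0.82\t105.3\t0.95\t1.6\n' >> "$tmp"
check_hs_metrics "$tmp"
rm -f "$tmp"
```

The last line of output is the overall verdict, so a batch script could keep only that line per sample and list the failures.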
4. Capture uniformity (FOLD_80)
FOLD_80_BASE_PENALTY describes how widely target-base coverage spreads around the mean. At a fixed mean coverage of 100×, a lower FOLD_80 (more uniform capture) gives a tighter coverage distribution, and a higher FOLD_80 gives a wider one. The practical cost of poor uniformity is that bringing the worst-covered 20% of target bases above 30× requires a higher mean coverage.
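To make the relationship concrete, FOLD_80 can be approximated as mean coverage divided by the 20th-percentile coverage. The sketch below is a simplified illustration, not Picard's implementation: the function name fold80 and the depth values are hypothetical, the percentile uses a plain nearest-rank rule, and the "mean needed for 30×" figure assumes the coverage distribution keeps its shape as sequencing depth scales.

```shell
# Read one depth value per line; report mean, 20th percentile, FOLD_80, and the
# mean coverage at which the 20th percentile would reach 30x (30 * FOLD_80).
fold80() {
  sort -n | awk '
    { d[NR] = $1; sum += $1 }
    END {
      if (NR == 0) { print "no input" > "/dev/stderr"; exit 1 }
      mean = sum / NR
      p20 = d[int((NR + 4) / 5)]      # nearest-rank 20th percentile: ceil(NR / 5)
      if (p20 <= 0) { print "20th percentile is 0, FOLD_80 undefined" > "/dev/stderr"; exit 1 }
      fold = mean / p20
      printf "mean=%.1f p20=%.1f fold80=%.2f mean_for_30x=%.1f\n", mean, p20, fold, 30 * fold
    }'
}

# Two toy samples with the same 100x mean but different uniformity
printf '%s\n' 70 80 90 95 100 100 105 110 120 130 | fold80   # tight distribution
printf '%s\n' 20 40 60 80 100 100 120 140 160 180 | fold80   # wide distribution
```

Both toy samples have a 100× mean, but the wider one has a FOLD_80 twice as large and would need twice the mean coverage to bring its 20th percentile up to 30×.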
5. The permanent blind spots of WES
Even with the best capture kit and high coverage, WES has structural blind spots. If any of these regions matter to your study, switch to WGS or supplement with targeted sequencing:
GC-rich genes (e.g. GBA, HBA, ABO)
PCR amplification is severely biased in regions with GC content above 70%. GBA (Gaucher disease) has an additional problem: the adjacent, highly homologous pseudogene GBAP1 creates mapping ambiguity. Clinical practice supplements WES with long-range PCR plus Sanger sequencing, or with long reads.
Pseudogene interference
Many clinically important genes have high-identity pseudogenes (PMS2 / PMS2CL, CYP2D6 / CYP2D7P, NCF1 / NCF1B). Short reads often cannot be assigned unambiguously to the gene or its pseudogene, so variant calls there are unreliable. Resolving them requires a gene-specific NGS panel combined with long-range PCR.
Deep intronic variants
Cryptic splice variants more than 100 bp from a splice site fall outside the captured region and are effectively invisible to WES. CFTR, DMD, and NF1 all have known deep intronic pathogenic variants. Splicing assays or RNA-seq can detect their effects.
Regulatory regions (promoter / enhancer / 5'UTR)
Most WES kits do not cover promoters or enhancers, yet variants there can cause severe expression dysregulation (e.g. the TBX5 enhancer, the TERT promoter). WGS is needed to cover them.
Tandem repeats / VNTR
Repeat expansions such as those in HTT (Huntington disease), FMR1 (Fragile X syndrome), and C9orf72 (ALS) are hard to size accurately with short reads, whether WES or WGS; long reads are the preferred approach.
CNV and SV (see the SV chapter)
Discontinuous capture makes WES read-depth signals unreliable, and breakpoints in introns or intergenic regions are not sequenced at all. CNV detection performs far worse than with WGS.
6. Pipeline command examples
# Convert the BED supplied with the capture kit to Picard interval_list format
gatk BedToIntervalList \
-I twist_exome_v2_targets.bed \
-O twist_exome_v2_targets.interval_list \
-SD Homo_sapiens_assembly38.dict
# Baits (probes) and targets usually differ: baits are the physical probe positions,
# targets are the genomic regions you want to assess. CollectHsMetrics needs both.
gatk BedToIntervalList \
-I twist_exome_v2_baits.bed \
-O twist_exome_v2_baits.interval_list \
-SD Homo_sapiens_assembly38.dict
# Add padding for variant calling (100 bp per side)
gatk IntervalListTools \
-I twist_exome_v2_targets.interval_list \
-O twist_exome_v2_targets_padded100.interval_list \
--PADDING 100

# Run CollectHsMetrics on every WES sample
gatk CollectHsMetrics \
-I sample.dedup.bam \
-O sample.hs_metrics.txt \
-BI twist_exome_v2_baits.interval_list \
-TI twist_exome_v2_targets.interval_list \
-R Homo_sapiens_assembly38.fasta \
--PER_TARGET_COVERAGE sample.per_target.txt \
--PER_BASE_COVERAGE sample.per_base.txt
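# Print key metrics by column name rather than by position
awk -F'\t' '/^## METRICS CLASS/ {
getline; for (i = 1; i <= NF; i++) col[$i] = i   # header row
getline                                           # value row
print "PCT_SELECTED_BASES:", $(col["PCT_SELECTED_BASES"])
print "MEAN_TARGET_COVERAGE:", $(col["MEAN_TARGET_COVERAGE"])
print "PCT_TARGET_BASES_30X:", $(col["PCT_TARGET_BASES_30X"])
print "FOLD_80_BASE_PENALTY:", $(col["FOLD_80_BASE_PENALTY"])
print "GC_DROPOUT:", $(col["GC_DROPOUT"])
exit }' sample.hs_metrics.txt
# Visualize the whole batch with MultiQC
multiqc . -n wes_qc_report

# WES variant calling: use padded intervals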
gatk HaplotypeCaller \
-R Homo_sapiens_assembly38.fasta \
-I sample.dedup.bqsr.bam \
-L twist_exome_v2_targets_padded100.interval_list \
-ERC GVCF \
-O sample.g.vcf.gz \
--tmp-dir /scratch
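# Note: cohorts of fewer than ~30 WES samples are usually unsuitable for VQSR (its statistical model needs more data)
# Use GATK hard filters instead, or DeepVariant, with RTG vcfeval for benchmarking
# DeepVariant on WES (recommended for small cohorts)
# --regions takes a BED file, so convert the padded interval_list first
gatk IntervalListToBed \
-I twist_exome_v2_targets_padded100.interval_list \
-O twist_exome_v2_targets_padded100.bed
docker run -v "$(pwd)":/data google/deepvariant:1.6.0 \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WES \
--ref=/data/Homo_sapiens_assembly38.fasta \
--reads=/data/sample.dedup.bam \
--regions=/data/twist_exome_v2_targets_padded100.bed \
--output_vcf=/data/sample.dv.vcf.gz

# WES CNV: needs a large cohort for normalization (≥ 30 normals from the same kit/run)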
# Method 1: GATK gCNV (germline cohort mode)
gatk PreprocessIntervals \
-R ref.fa \
-L twist_exome_v2_targets.interval_list \
--bin-length 0 \
--interval-merging-rule OVERLAPPING_ONLY \
-O preprocessed.interval_list
# Collect read counts for each sample
gatk CollectReadCounts \
-I sample.dedup.bam \
-L preprocessed.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
-O sample.counts.hdf5
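# Cohort-mode CNV calling (model fitted across all normals)
# ploidy-calls/ is the output of a prior DetermineGermlineContigPloidy run (not shown)
gatk GermlineCNVCaller \
--run-mode COHORT \
-L preprocessed.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
-I sample1.counts.hdf5 -I sample2.counts.hdf5 ... \
--contig-ploidy-calls ploidy-calls/ \
--output cohort_cnv \
--output-prefix cohort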
# Method 2: ExomeDepth (R package, simpler to set up)
# Method 3: CNVkit (provides batch and reference subcommands)
# Method 4: other tools such as ClinCNV / ExomeAI

7. FAQ
When should you choose WGS over WES?
Why is VQSR unsuitable for small WES cohorts?
How do a medical / clinical exome and a research exome differ?
Can samples be analyzed together after a capture kit version change?
8. Quiz
Q1. Which capture is more uniform, FOLD_80_BASE_PENALTY = 1.5 or 3.0?
Q2. Why is WES unreliable for genes such as GBA and PMS2?
Q3. Why should WES intervals be padded (~100 bp)?