Advanced: WES Considerations — WGS/WES Tutorial

Basics

1. Target enrichment workflow

The fundamental wet-lab difference between WES and WGS is that WES adds a target capture step after library prep, in which hundreds of thousands of biotinylated probes hybridize to exon-containing fragments and "fish" them out of the library:

1. Genomic DNA fragmentation (sonication or enzymatic, ~200–300 bp)
2. Library prep (end repair, A-tailing, adapter ligation, PCR amplification)
3. Hybridization: the library is mixed with biotinylated capture probes in solution for 16–24 hours
4. Streptavidin bead pulldown: magnetic beads bind the biotin, and unbound non-target fragments are washed away
5. Post-capture PCR (10–12 cycles) to restore library yield, followed by Illumina sequencing

This capture step is the root of most of WES's characteristic problems: uneven coverage, capture bias, GC bias, and PCR duplicates either originate here or are made worse by it.

Comparison

2. Four major exome capture kits compared

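Kit	Vendor	Target size	Uniformity	Strengths / weaknesses
Twist Exome 2.0	Twist Bioscience	~36 Mb	★★★★★ (best)	Newer generation (post-2018); double-stranded DNA probes; FOLD_80 ≈ 1.5; lowest GC bias; currently a mainstream choice
Agilent SureSelect	Agilent	~50 Mb (V8)	★★★★ (good)	Long-established; RNA probes; target includes UTRs and microRNAs; FOLD_80 ≈ 2.0; weaker in GC-rich regions
IDT xGen Exome	IDT	~39 Mb	★★★★ (good)	DNA probes; performs well at low input (10 ng); often used for cfDNA / FFPE / single-cell
Illumina TruSeq Exome	Illumina	~45 Mb	★★★ (average)	Workflow integrated with Illumina sequencers; rarely updated; gradually being replaced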

Note: FOLD_80 (Picard's FOLD_80_BASE_PENALTY) = mean target coverage / the depth that 80% of target bases meet or exceed (the 20th percentile of per-base coverage). A value closer to 1 means more uniform coverage.
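As a rough illustration of Picard's FOLD_80_BASE_PENALTY (mean target coverage divided by the depth that 80% of target bases meet or exceed), the sketch below computes the ratio from a list of per-base depths. The function name `fold80` and the toy depths are made up for this example, and the nearest-rank percentile is a simplification of what Picard does internally.

```shell
# fold80: read one depth value per line on stdin and print mean / P20,
# where P20 is the largest depth that at least 80% of bases meet or exceed.
# Illustrative only; not Picard's exact implementation.
fold80() {
  sort -n | awk '
    { d[NR] = $1 + 0; sum += $1 }
    END {
      if (NR == 0) { print "NA"; exit }
      k = int(NR / 5) + 1          # rank of the 20th-percentile depth
      p20 = d[k]
      if (p20 == 0) { print "inf"; exit }
      printf "%.2f\n", (sum / NR) / p20
    }'
}

# Toy example: 10 target bases with a mean depth of 100x.
# 8 of the 10 bases are at >= 80x, so this prints 1.25.
printf '%s\n' 40 60 80 90 100 100 110 120 140 160 | fold80
```

A perfectly even capture would print 1.00, since every base would sit at the mean.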

What are padding intervals?

Capture probes are designed against exons only, but hybridization also pulls down roughly 50–100 bp of flanking sequence around each probe. To keep splice sites in variant calling (the ±10 bp around each exon-intron boundary is the critical splice donor/acceptor region), downstream analysis should use "padded intervals": the original target intervals extended by 50–100 bp on each side. GATK itself applies no padding by default (`--interval-padding` is 0), so the padding has to be requested explicitly; 100 bp is a common choice for exomes.
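To make the idea of padded intervals concrete, here is a minimal sketch that extends each interval of a 3-column BED file by a fixed number of bases. The function name `pad_bed` and the toy coordinates are hypothetical; the sketch clamps the start at 0 but, as a stated simplification, neither clips at the contig end nor merges intervals that overlap after padding.

```shell
# pad_bed PAD: read 3-column BED (0-based, half-open) on stdin and extend
# every interval by PAD bp on both sides, clamping the start at 0.
pad_bed() {
  awk -v pad="$1" 'BEGIN { OFS = "\t" }
    { s = $2 - pad; if (s < 0) s = 0; print $1, s, $3 + pad }'
}

# Toy example: the first interval is clamped to start at 0 (chr1 0 300),
# the second becomes chr1 900 1250.
printf 'chr1\t50\t200\nchr1\t1000\t1150\n' | pad_bed 100
```

In a real pipeline a dedicated interval tool is preferable, because it also handles contig boundaries and overlapping neighbours.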

3. CollectHsMetrics: the key QC tool for WES

Picard's CollectHsMetrics ("Hs" = Hybrid Selection) is purpose-built for capture QC, and every WES sample should be run through it. Key metrics:

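Metric	Meaning	Acceptable range
`PCT_SELECTED_BASES`	Fraction of aligned bases on or near a bait (capture efficiency)	> 60% (Twist often > 80%)
`PCT_OFF_BAIT`	Fraction of aligned bases mapping away from any bait (wasted sequencing)	< 30%
`MEAN_TARGET_COVERAGE`	Mean coverage over the target regions	≥ 80× (germline), ≥ 200× (somatic)
`PCT_TARGET_BASES_30X`	Fraction of target bases covered at 30× or more (key for medical exomes)	> 90%
`FOLD_80_BASE_PENALTY`	Uniformity: fold over-sequencing needed to raise 80% of target bases to the mean coverage; lower is more uniform	< 2.5 (Twist often ~1.5)
`AT_DROPOUT` / `GC_DROPOUT`	Coverage lost in AT-rich (GC ≤ 50%) or GC-rich (GC ≥ 50%) regions relative to expectation	each < 5%; > 10% is a warning sign
`PCT_PF_UQ_READS_ALIGNED`	Fraction of pass-filter unique reads that align to the reference	> 95%
`HS_LIBRARY_SIZE`	Estimated library complexity (number of unique molecules)	higher is better; a low value indicates heavy PCR duplication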
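The thresholds in the table can be turned into a simple automated gate. The sketch below is one possible way to do it, not an official tool: the function name `hs_qc_flags` and the example values are invented, and it expects only the header row and value row of the metrics table (without the `##` comment lines). Note that Picard reports the `PCT_*` columns as fractions between 0 and 1, so "> 60%" is checked as `> 0.60`.

```shell
# hs_qc_flags: read a tab-separated header line plus one value line on stdin
# and print "<metric> <value> PASS|WARN" for a few thresholds from the table.
hs_qc_flags() {
  awk -F'\t' '
    function check(name, op, limit,    v, ok) {
      if (!(name in col)) { print name "\tMISSING"; return }
      v = $(col[name]) + 0
      ok = (op == ">" && v > limit) || (op == ">=" && v >= limit) || (op == "<" && v < limit)
      print name "\t" v "\t" (ok ? "PASS" : "WARN")
    }
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map names to columns
    NR == 2 {
      check("PCT_SELECTED_BASES",   ">",  0.60)   # PCT_* are fractions (0-1)
      check("MEAN_TARGET_COVERAGE", ">=", 80)     # germline threshold
      check("PCT_TARGET_BASES_30X", ">",  0.90)
      check("FOLD_80_BASE_PENALTY", "<",  2.5)
    }'
}

# Hypothetical sample: good enrichment and depth, but poor uniformity,
# so only FOLD_80_BASE_PENALTY is flagged WARN.
printf 'PCT_SELECTED_BASES\tMEAN_TARGET_COVERAGE\tPCT_TARGET_BASES_30X\tFOLD_80_BASE_PENALTY\n0.82\t95.4\t0.93\t2.8\n' | hs_qc_flags
```

Looking columns up by name rather than by position keeps the check working if a Picard version adds or reorders columns.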

Interactive simulation

4. Interactive simulation: capture uniformity (FOLD_80)

Adjust FOLD_80_BASE_PENALTY and watch how the distribution of target-base coverage changes at a mean coverage of 100×. A lower FOLD_80 (more uniform capture) gives a tighter distribution; a higher one spreads it out, so lifting the worst-covered 20% of target bases above 30× costs a higher mean coverage.

[Interactive plot: distribution of target-base coverage. Sliders: FOLD_80 penalty (shown at 1.8) and mean coverage (shown at 100×).]

Red line = 30×, a commonly used minimum for clinical diagnostics. Yellow band = 20th–80th percentile range. FOLD_80 = 1.5 is typical of Twist; a value above 3.0 usually indicates a failed capture.
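The cost of poor uniformity can be put into numbers. Since FOLD_80 is the mean coverage divided by the depth that 80% of target bases reach, that depth is roughly mean / FOLD_80 (about 56× at a 100× mean with FOLD_80 = 1.8). Turning this around gives the mean coverage needed for 80% of bases to hit a chosen depth. The sketch below assumes the coverage profile scales proportionally with sequencing depth, which is a simplification, and the function name `required_mean` is made up for the example.

```shell
# required_mean TARGET_DEPTH FOLD80: mean coverage needed so that about 80%
# of target bases reach TARGET_DEPTH, assuming P20 = mean / FOLD_80.
required_mean() {
  awk -v d="$1" -v f="$2" 'BEGIN { printf "%.0f\n", d * f }'
}

# 30x in 80% of bases needs ~45x mean at FOLD_80 = 1.5, ~54x at 1.8,
# and ~90x at 3.0.
for f in 1.5 1.8 3.0; do
  printf 'FOLD_80=%s -> mean >= %sx\n' "$f" "$(required_mean 30 "$f")"
done
```

Getting 90% or more of bases above 30× depends on the shape of the low-coverage tail, which FOLD_80 alone does not describe.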

Blind spots

5. Persistent blind spots of WES

Even with the best capture kit and high coverage, WES has structural blind spots. If these regions matter to your study, switch to WGS or supplement with targeted sequencing:

GC-rich genes (e.g. GBA, HBA, ABO)

PCR amplification is severely biased in regions with GC > 70%. GBA (Gaucher disease) has an additional problem: the adjacent, highly homologous GBAP1 pseudogene creates mapping ambiguity. Clinical practice supplements WES with long-range PCR plus Sanger sequencing, or with long reads.

Pseudogene interference

Many important genes have high-identity pseudogenes (PMS2 / PMS2CL, CYP2D6 / CYP2D7P, NCF1 / NCF1B). Short reads cannot tell gene from pseudogene, so variant calls there are unreliable. Resolving them requires a gene-specific NGS panel combined with long-range PCR.

Deep intronic variants

Cryptic splice variants more than 100 bp from a splice site are invisible to WES. CFTR, DMD, and NF1 all have known deep intronic pathogenic variants. Splicing assays or RNA-seq can help recover them.

Regulatory regions (promoter / enhancer / 5'UTR)

Most WES kits do not cover promoters or enhancers, yet variants there can severely disrupt expression (e.g., the TBX5 enhancer, the TERT promoter). These require WGS.

Tandem repeats / VNTR

Repeat expansions such as those in HTT (Huntington disease), FMR1 (fragile X syndrome), and C9orf72 (ALS) are hard to size accurately with short-read WES or WGS; long reads are the preferred approach.

CNV and SV (see the SV chapter)

The discontinuous capture of WES makes read-depth signals unreliable, and intronic or intergenic breakpoints are invisible. CNV detection performs far worse than with WGS.

Practical commands

6. Example pipeline commands

# Convert the BED file supplied with the capture kit to Picard interval_list format
gatk BedToIntervalList \
    -I twist_exome_v2_targets.bed \
    -O twist_exome_v2_targets.interval_list \
    -SD Homo_sapiens_assembly38.dict

# Baits (probes) and targets usually differ: baits are the physical probe positions,
# targets are the genomic regions you want to assay. CollectHsMetrics needs both.
gatk BedToIntervalList \
    -I twist_exome_v2_baits.bed \
    -O twist_exome_v2_baits.interval_list \
    -SD Homo_sapiens_assembly38.dict

# Add padding for variant calling (100 bp on each side)
gatk IntervalListTools \
    -I twist_exome_v2_targets.interval_list \
    -O twist_exome_v2_targets_padded100.interval_list \
    --PADDING 100

# Run CollectHsMetrics (every WES sample should get this)
gatk CollectHsMetrics \
    -I sample.dedup.bam \
    -O sample.hs_metrics.txt \
    -BI twist_exome_v2_baits.interval_list \
    -TI twist_exome_v2_targets.interval_list \
    -R Homo_sapiens_assembly38.fasta \
    --PER_TARGET_COVERAGE sample.per_target.txt \
    --PER_BASE_COVERAGE sample.per_base.txt

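# Print key metrics, looked up by column name. The header row is the line after
# "## METRICS CLASS" and the values are on the line after that.
awk -F'\t' '/^## METRICS CLASS/ {
        getline; for (i = 1; i <= NF; i++) col[$i] = i   # header row
        getline                                          # value row
        n = split("PCT_SELECTED_BASES MEAN_TARGET_COVERAGE PCT_TARGET_BASES_30X FOLD_80_BASE_PENALTY GC_DROPOUT", want, " ")
        for (k = 1; k <= n; k++) print want[k] ":", $(col[want[k]])
        exit
    }' sample.hs_metrics.txt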

# Visualize the whole batch with MultiQC
multiqc . -n wes_qc_report

# WES variant calling with the padded intervals
gatk HaplotypeCaller \
    -R Homo_sapiens_assembly38.fasta \
    -I sample.dedup.bqsr.bam \
    -L twist_exome_v2_targets_padded100.interval_list \
    -ERC GVCF \
    -O sample.g.vcf.gz \
    --tmp-dir /scratch

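# Note: cohorts of fewer than ~30 WES samples are usually unsuitable for VQSR
# (the statistical model needs enough variants to train on).
# Use GATK hard filters instead, or DeepVariant, evaluated with RTG vcfeval.

# DeepVariant on WES (recommended for small cohorts).
# --regions takes a BED file, so convert the padded interval_list first.
gatk IntervalListToBed \
    -I twist_exome_v2_targets_padded100.interval_list \
    -O twist_exome_v2_targets_padded100.bed

docker run -v "$(pwd)":/data google/deepvariant:1.6.0 \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WES \
    --ref=/data/Homo_sapiens_assembly38.fasta \
    --reads=/data/sample.dedup.bam \
    --regions=/data/twist_exome_v2_targets_padded100.bed \
    --output_vcf=/data/sample.dv.vcf.gz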

# WES CNV calling needs a large cohort for normalization (≥ 30 normals from the same kit/run)

# Option 1: GATK gCNV (germline cohort mode)
gatk PreprocessIntervals \
    -R ref.fa \
    -L twist_exome_v2_targets.interval_list \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O preprocessed.interval_list

# Collect read counts for each sample
gatk CollectReadCounts \
    -I sample.dedup.bam \
    -L preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sample.counts.hdf5

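# Cohort-mode CNV calling (model fitted on all normals).
# ploidy-calls/ is the calls directory written by a prior
# DetermineGermlineContigPloidy run on the same count files.
gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I sample1.counts.hdf5 -I sample2.counts.hdf5 ... \
    --contig-ploidy-calls ploidy-calls/ \
    --output cohort_cnv \
    --output-prefix cohort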

# Option 2: ExomeDepth (R package, simpler to set up)
# Option 3: CNVkit (provides batch and reference subcommands)
# Option 4: other tools such as ClinCNV / ExomeAI

Common questions

7. FAQ

When should I choose WGS over WES?

Clear indications for WGS: (1) a negative WES result (roughly 50–60% of WES cases remain undiagnosed) where deep intronic or regulatory variants need to be searched; (2) a focus on SV/CNV detection (e.g., the 22q11.2 microdeletion); (3) genes confounded by GC-rich sequence or pseudogenes (GBA, PMS2); (4) mitochondrial and repeat-expansion analysis needed from the same assay; (5) large-cohort studies with sufficient budget (30× WGS now costs roughly $300–500 USD per sample, close to WES).

Why is VQSR unsuitable for small WES cohorts?

VQSR needs enough variants to train its Gaussian mixture model; it typically becomes stable only with more than 30 samples for SNPs and more than 100 for indels. With too few WES samples the model overfits or fails to converge. GATK's troubleshooting guidance notes that for single-sample WES, hard filters are the more reliable choice.
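To show what a hard filter does mechanically, the sketch below applies GATK's generic SNP hard-filtering thresholds (QD < 2.0, FS > 60.0, MQ < 40.0, SOR > 3.0) to the INFO column of VCF records. It is an illustration only: production pipelines would use `gatk VariantFiltration` rather than awk, the function name `hard_filter_snps` is invented, the filter labels are illustrative, and indels (which use different thresholds) are not handled.

```shell
# hard_filter_snps: read VCF lines on stdin and rewrite the FILTER column (7)
# from annotations in the INFO column (8). Header lines pass through unchanged.
hard_filter_snps() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    /^#/ { print; next }
    {
      split("", info)                       # clear annotations from previous record
      n = split($8, kv, ";")
      for (i = 1; i <= n; i++) {
        m = split(kv[i], p, "=")
        if (m == 2) info[p[1]] = p[2]
      }
      f = ""
      if (("QD"  in info) && info["QD"]  + 0 < 2.0)  f = f ";QD2"
      if (("FS"  in info) && info["FS"]  + 0 > 60.0) f = f ";FS60"
      if (("MQ"  in info) && info["MQ"]  + 0 < 40.0) f = f ";MQ40"
      if (("SOR" in info) && info["SOR"] + 0 > 3.0)  f = f ";SOR3"
      $7 = (f == "") ? "PASS" : substr(f, 2)
      print
    }'
}

# Toy records: the first fails on QD, the second passes every threshold.
printf 'chr1\t100\t.\tA\tG\t50\t.\tQD=1.5;FS=2.0;MQ=60.0;SOR=0.7\nchr1\t200\t.\tC\tT\t900\t.\tQD=25.0;FS=1.0;MQ=60.0;SOR=0.9\n' | hard_filter_snps | cut -f1,2,7
```

Unlike VQSR, these fixed cutoffs need no training data, which is why they stay usable for a single sample.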

Are medical / clinical exomes different from research exomes?

Yes. "Medical exome" usually refers to a high-coverage enrichment panel targeting ~5,000–7,000 known disease genes (e.g., Twist Comprehensive Exome, Agilent ClearSeq Inherited Disease). The target is smaller (~10–15 Mb), coverage is deeper (≥ 100×), and key intronic and promoter regions are included. Research exomes aim for breadth (~50 Mb); medical exomes aim for depth and consistency in the genes that matter clinically.

Can samples from different capture kit versions be analyzed together?

Not recommended. Target intervals and uniformity differ substantially between kits, and even between versions of the same kit, so joint genotyping across mixed kits introduces batch effects (especially for CNVs and borderline variants). At a minimum: (1) restrict analysis to the intersection of all kits' targets; (2) run PCA or a sample-distance analysis to check for batch separation; (3) avoid mixing kits in cohort-scale studies. Joint cohort analysis should always use a single kit.
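Taking the target intersection across kits amounts to keeping only the bases that both designs cover. The sketch below shows the idea on two small BED files. It is a deliberately simple pairwise comparison suitable for toy inputs; on real kit files a dedicated tool such as `bedtools intersect` is the usual choice. The function name `bed_intersect` and the coordinates are hypothetical.

```shell
# bed_intersect A.bed B.bed: print the overlapping part of every pair of
# intervals (one from each file) on the same chromosome.
# BED is 0-based half-open, so intervals that merely touch do not overlap.
bed_intersect() {
  awk 'BEGIN { OFS = "\t" }
    NR == FNR { n++; c[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; next }   # load file A
    {
      for (i = 1; i <= n; i++) {
        if (c[i] != $1) continue
        lo = (s[i] > $2 + 0) ? s[i] : $2 + 0
        hi = (e[i] < $3 + 0) ? e[i] : $3 + 0
        if (lo < hi) print $1, lo, hi
      }
    }' "$1" "$2"
}

# Toy example with two hypothetical kits: only chr1:400-500 is shared.
a=$(mktemp); b=$(mktemp)
printf 'chr1\t100\t500\nchr2\t0\t300\n' > "$a"
printf 'chr1\t400\t800\nchr2\t300\t600\nchr3\t0\t50\n' > "$b"
bed_intersect "$a" "$b"
rm -f "$a" "$b"
```

The shared region is usually noticeably smaller than either kit's target, which is part of the cost of mixing kits.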

Self-check

8. Quiz

Q1. Which capture is more uniform: FOLD_80_BASE_PENALTY = 1.5 or 3.0?

Q2. Why is WES unreliable for genes such as GBA and PMS2?

Q3. Why should WES intervals be padded (~100 bp)?