Step 1: Intro — WGS vs WES Tutorial

Overview

Why is this choice the starting point of the whole pipeline?

WGS (Whole Genome Sequencing) attempts to read the entire genome; WES (Whole Exome Sequencing) reads only the protein-coding exons. Both share most of the upstream analysis pipeline (both use BWA and GATK), but the choice shapes every downstream decision: coverage targets, QC thresholds, and whether you can detect CNVs and non-coding variants.

The human genome is ~3 × 10⁹ bp, but only ~1–2% of it is exonic (~60 Mb in total, covering ~180,000 exons across ~20,000 genes). Yet an estimated 85% of known pathogenic Mendelian variants lie within exons, which is why WES has long been the workhorse of clinical diagnostics.

💡

Core principle: WES delivers deeper coverage of exonic regions at lower cost (typically 100× versus 30× for WGS), but it cannot see introns, promoters, or enhancers. WGS is shallower at each position, yet its coverage is far more uniform and it gives a genome-wide picture in which SVs and CNVs can be detected. The question is never "which is better" but "what is your research question?"
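The depth figures above translate into very different sequencing yields. The sketch below estimates the raw bases needed as depth × target size ÷ on-target fraction. The function name `required_gb` and the 70% on-target rate for WES are illustrative assumptions, not values from this tutorial, and the estimate ignores duplicates and low-quality reads.

```shell
# Raw sequencing yield (in Gb) needed to reach a mean depth over a target.
# Usage: required_gb DEPTH TARGET_BP ON_TARGET_FRACTION
# ON_TARGET_FRACTION is the share of sequenced bases that land on the target
# (1.0 for WGS; for WES it depends on the capture kit, 0.7 here is assumed).
required_gb() {
  awk -v d="$1" -v t="$2" -v f="$3" 'BEGIN { printf "%.1f\n", d * t / f / 1e9 }'
}

required_gb 30 3000000000 1.0    # WGS: 30x over ~3 Gb
required_gb 100 60000000 0.7     # WES: 100x over ~60 Mb, assumed 70% on target
```

Under these assumptions the two calls print 90.0 and 8.6, roughly a tenfold difference in raw data per sample.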

Technical comparison

1. Technical specifications compared

Feature	WGS	WES
Coverage scope (portion of genome sequenced)	~98% (whole genome, ~3 Gb)	~1–2% (exons, ~60 Mb)
Typical coverage	30–60×	100–200×
Library preparation	PCR-free direct sequencing	Target capture (probe hybridization)
Coverage uniformity	⭐⭐⭐⭐⭐ very uniform	⭐⭐⭐ affected by capture bias
Non-coding variants	✅ detectable	❌ essentially invisible
SV / CNV	✅ high-resolution breakpoints (down to single bp)	⚠️ exon-only, imprecise breakpoints
GC-rich regions	Covered well by PCR-free WGS	Significant capture bias
Data volume / sample	~90–120 GB FASTQ	~10–15 GB FASTQ
Relative cost	3–5×	1×
Primary uses	Research, rare disease, population genetics, SV analysis	Clinical Mendelian diagnostics, large cohort screening

Interactive simulation

2. Coverage distribution simulator

This simulator shows the read-depth distribution across the same genomic region under WGS and WES. Toggle between the two to compare how each strategy handles exons (dark bands) and introns (light bands). Note that WES coverage is deep but uneven within exons, while WGS covers the whole region uniformly at a shallower depth.

X-axis: genomic position | Y-axis: read depth | Gray background: exon regions
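The "deep but uneven" versus "shallow but uniform" contrast can also be checked numerically on real data. The following is a minimal sketch that summarises per-base depth from three-column input (chrom, position, depth), which is the layout `samtools depth` prints for a single BAM. The function name `depth_summary` and the toy depth values are made up for illustration.

```shell
# Summarise per-base depth read from stdin (columns: chrom, pos, depth).
# Prints the mean depth and the percentage of positions at or above MIN_DEPTH.
# Usage: depth_summary MIN_DEPTH < depth.tsv
depth_summary() {
  awk -v min="$1" '
    { sum += $3; n++; if ($3 >= min) ok++ }
    END {
      if (n == 0) { print "no positions"; exit 1 }
      printf "mean=%.1f pct_ge_%d=%.1f\n", sum / n, min, 100 * ok / n
    }'
}

# Toy WES-like profile: high mean, but half the positions fall below 20x.
printf 'chr1\t100\t150\nchr1\t101\t120\nchr1\t102\t10\nchr1\t103\t0\n' | depth_summary 20

# Toy WGS-like profile: lower mean, every position at or above 20x.
printf 'chr1\t100\t31\nchr1\t101\t29\nchr1\t102\t30\nchr1\t103\t30\n' | depth_summary 20
```

When feeding real data, run `samtools depth` with `-a` so that zero-depth positions are included; otherwise dropouts are silently left out of the denominator.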

Decision tree

3. Should I choose WGS or WES?

🌳 Decision path

Q1: Do you care only about protein-coding variants, and is your budget limited? → Yes → WES (e.g. large-cohort Mendelian disease gene screening).

Q2: Do you need to detect deep-intronic, promoter/enhancer, or non-canonical splicing variants? → Yes → WGS (WES captures little beyond exons and their immediate flanks).

Q3: Do you need reliable detection of CNVs, SVs, or translocations? → Yes → WGS (capture bias in WES makes CNV calls unreliable).

Q4: Is the cohort large (>1000 samples) and do you need fast turnaround? → Yes → WES (much lower cost, storage, and compute).

Q5: Are you studying GC-rich genes (such as GBA, HBA) or pseudogene families? → Yes → PCR-free WGS (WES severely under-covers GC-rich regions).
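The five questions can be folded into a small function. This is only a sketch: the tutorial does not say which answer wins when questions conflict, so the ordering here (capability needs from Q2, Q3, and Q5 override the cost arguments of Q1 and Q4) is an assumption, as is the function name `recommend_strategy`.

```shell
# Usage: recommend_strategy NEEDS_NONCODING NEEDS_SV_CNV GC_RICH_OR_PSEUDOGENES
# Each argument is "yes" or "no".
# Assumed priority: a WGS-only requirement always beats cost considerations,
# so WES is returned only when none of Q2, Q3, or Q5 applies.
recommend_strategy() {
  if [ "$3" = yes ]; then
    echo "PCR-free WGS"      # Q5: GC-rich genes or pseudogene families
  elif [ "$1" = yes ] || [ "$2" = yes ]; then
    echo "WGS"               # Q2 / Q3: non-coding variants or SV/CNV
  else
    echo "WES"               # Q1 / Q4: coding-only, budget or cohort size
  fi
}

recommend_strategy no no no     # coding variants only
recommend_strategy yes no no    # deep-intronic or regulatory variants
recommend_strategy no no yes    # GC-rich genes
```

A project with a large cohort that also needs SV calls would get "WGS" here; whether that trade-off is acceptable is a budget decision the function does not model.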

⚠️

Trend to watch: the cost of PCR-free WGS has dropped sharply in recent years, and its coverage completeness in coding regions now exceeds that of WES. Many clinical labs are moving from "WES first → WGS reflex" toward a single-stage "WGS upfront" strategy.

Pipeline preview

4. Bird's-eye view of the analysis pipeline

Whether you run WGS or WES, the core path from FASTQ to annotated VCF is largely the same. WES only adds --intervals to confine variant calling and coverage assessment to the target regions. The backbone of the next 8 chapters is:

FASTQ QC — FastQC, fastp

Check Phred scores, adapter content, and duplication; trim low-quality bases.

Alignment — BWA-MEM2

Align to GRCh38; output BAM.

Post-alignment — MarkDuplicates, BQSR

Mark PCR duplicates and recalibrate base quality scores.

Variant calling — HaplotypeCaller / DeepVariant

Produce a per-sample GVCF or a single-sample VCF.

Joint genotyping — GenomicsDBImport, GenotypeGVCFs

Merge per-sample calls across samples and genotype them jointly.

Filtering — VQSR / Hard filters

Remove false-positive SNV/indel calls.

Annotation — VEP, ANNOVAR, SnpEff

Add gene, consequence, population frequency, and clinical database annotations.

Interpretation — ACMG/AMP

Apply the 28 evidence criteria and classify each variant into one of five tiers, from Pathogenic to Benign.
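The one WGS/WES difference in the calling step, the interval argument, can be sketched as a command builder. The function below only prints the command rather than running it; `hc_command` and all file names are hypothetical, while `-R`, `-I`, `-O`, `-ERC GVCF`, and `-L` are standard HaplotypeCaller arguments.

```shell
# Print (do not run) a HaplotypeCaller command for one sample.
# Usage: hc_command REF BAM OUT_GVCF [INTERVALS]
# Pass INTERVALS for WES; omit it for WGS to call across the whole genome.
hc_command() {
  cmd="gatk HaplotypeCaller -R $1 -I $2 -O $3 -ERC GVCF"
  if [ -n "${4:-}" ]; then
    cmd="$cmd -L $4"
  fi
  echo "$cmd"
}

# WGS: no interval restriction
hc_command Homo_sapiens_assembly38.fasta sample.bam sample.g.vcf.gz

# WES: restrict calling to the padded capture targets
hc_command Homo_sapiens_assembly38.fasta sample.bam sample.g.vcf.gz exome_v2_targets.padded.interval_list
```

Printing the command first makes it easy to review the exact arguments before submitting the job to a scheduler or workflow engine.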

Reference genome

5. Choosing a reference genome

GRCh38 / hg38

The mainstream reference since its 2013 release. It includes ALT contigs and fixes many errors in hg19, and it is the default in GATK Best Practices. New projects should use hg38.

GRCh37 / hg19

The 2009 build. It is still found in legacy cohort databases (such as 1000 Genomes phase 3 and ExAC v1). Avoid it for new projects unless you need to integrate with legacy data.

T2T-CHM13

The 2022 telomere-to-telomere complete assembly (including centromeres and acrocentric arms). It has high research value, but tool support across the ecosystem is still catching up.

Code

6. First step: obtaining the reference files

# Download the GRCh38 reference from the GATK resource bundle
# (check the bucket path against the current bundle listing before running)
gsutil cp gs://genomics-public-data/references/GRCh38/Homo_sapiens_assembly38.fasta .
gsutil cp gs://genomics-public-data/references/GRCh38/Homo_sapiens_assembly38.fasta.fai .
gsutil cp gs://genomics-public-data/references/GRCh38/Homo_sapiens_assembly38.dict .

# BWA-MEM2 needs its own index
bwa-mem2 index Homo_sapiens_assembly38.fasta

# Known variant sites (used by BQSR / VQSR)
# GATK also expects an index next to each VCF; fetch it from the bundle
# or build one yourself (e.g. with gatk IndexFeatureFile)
gsutil cp gs://genomics-public-data/references/GRCh38/dbsnp_146.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/references/GRCh38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz .

# WES: download the capture kit's BED file (supplied by the vendor)
# e.g. Twist Comprehensive Exome, Agilent SureSelect, IDT xGen
# The URL below is a placeholder; use the link your vendor provides
wget https://example-vendor.com/exome_v2_targets.bed

# Add 100 bp of padding to cover splice regions and capture edges
gatk PreprocessIntervals \
  -L exome_v2_targets.bed \
  -R Homo_sapiens_assembly38.fasta \
  --bin-length 0 \
  --padding 100 \
  --interval-merging-rule OVERLAPPING_ONLY \
  -O exome_v2_targets.padded.interval_list

# Later, pass this file to HaplotypeCaller with -L to restrict calling to the padded targets
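To see what the 100 bp padding does to each interval, here is a rough awk illustration. It is not a reimplementation of `gatk PreprocessIntervals`: it does not merge or trim overlapping intervals and does not clamp the end coordinate to the contig length. The function name `pad_bed` and the two toy intervals are hypothetical.

```shell
# Pad every BED interval by PAD bases on both sides, clamping the start at 0.
# BED is 0-based and half-open; columns are chrom, start, end.
# Usage: pad_bed PAD < targets.bed
pad_bed() {
  awk -v pad="$1" 'BEGIN { OFS = "\t" }
    {
      start = $2 - pad
      if (start < 0) start = 0      # an interval cannot begin before the contig
      print $1, start, $3 + pad
    }'
}

# The first interval starts at 50, so its padded start is clamped to 0.
printf 'chr1\t50\t200\nchr1\t1000\t1200\n' | pad_bed 100
```

For the toy input this prints `chr1 0 300` and `chr1 900 1300` (tab-separated). For production use, prefer the GATK command above, which also validates the intervals against the reference dictionary.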

📋

WES essential: always use the capture kit BED file supplied by the sequencing vendor. Do not piece together exon coordinates from GENCODE/Ensembl yourself. The region that is actually sequenced is the region covered by the capture probes, which is not the same as the annotated exons.
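A quick sanity check on a vendor BED file is to total its footprint and compare it with the kit's advertised target size. The sketch below assumes the intervals do not overlap (overlaps would be double-counted) and skips `track`, `browser`, and `#` header lines that some BED files carry. The function name `bed_total_bp` is hypothetical.

```shell
# Sum the interval lengths in a BED file read from stdin.
# Length is end - start because BED coordinates are 0-based, half-open.
# Usage: bed_total_bp < exome_v2_targets.bed
bed_total_bp() {
  awk '!/^(track|browser|#)/ { total += $3 - $2 }
       END { printf "%d\n", total }'
}

# Toy input: one header line, then intervals of 100 bp and 30 bp.
printf 'track name=toy\nchr1\t0\t100\nchr2\t50\t80\n' | bed_total_bp
```

The toy input prints 130. A real capture kit should give a total in the tens of megabases; a figure far from the vendor's stated size suggests the wrong file or the wrong genome build.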

📝 Self-check

1. Your team wants to identify the pathogenic variant in an unsolved Mendelian family; a prior round of WES found nothing. What is the most reasonable next step?

A. Repeat WES at higher coverage (200×)

B. Switch WES vendors to use a different capture kit

C. Switch to WGS, because the cause may be a deep intronic, regulatory, or structural variant

D. Drop sequencing and do Sanger resequencing of candidate genes

2. Which statement about WES coverage is correct?

A. WES coverage is uniform across every exon, just like WGS

B. WES capture is significantly biased against GC-rich regions; some exons may drop out completely

C. WES does not need target intervals; HaplotypeCaller detects them automatically

D. WES and WGS produce the same data volume

3. A PI says: "WGS is currently more expensive, but PCR-free WGS actually covers exons more completely than WES." This statement is:

A. Largely correct: the literature reports that PCR-free WGS gives more complete coding-region coverage than WES

B. Completely wrong; WES always has better exon coverage

C. True only on short-read platforms

D. Dependent on the reference genome version