Why is this choice the starting point of the whole workflow?
WGS (Whole Genome Sequencing) attempts to read the entire genome; WES (Whole Exome Sequencing) reads only the protein-coding exons. Both share most of the upstream analysis pipeline (BWA for alignment, GATK for processing and calling), but the choice shapes every downstream decision: coverage targets, QC thresholds, and whether CNVs and non-coding variants can be detected at all.
The human genome is ~3 × 10⁹ bp, but only ~1–2% of it is exonic (~60 Mb, covering ~180,000 exons across ~20,000 genes). Yet an estimated 85% of known pathogenic Mendelian variants lie within exons, which is why WES has long been the workhorse of clinical diagnostics.
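The proportions quoted above follow from simple arithmetic. Here is a minimal sketch using awk from the shell; the helper names `exome_percent` and `exons_per_gene` are made up for illustration and are not part of any toolkit.

```shell
#!/usr/bin/env bash
# Back-of-envelope arithmetic for the genome and exome figures quoted above.
# Function names are illustrative only.

# Percentage of the genome covered by a target of the given size (bp).
exome_percent() {
    local target_bp=$1 genome_bp=$2
    awk -v t="$target_bp" -v g="$genome_bp" 'BEGIN { printf "%.1f\n", 100 * t / g }'
}

# Average number of exons per gene.
exons_per_gene() {
    local exons=$1 genes=$2
    awk -v e="$exons" -v n="$genes" 'BEGIN { printf "%.0f\n", e / n }'
}

exome_percent 60000000 3000000000   # ~60 Mb of ~3 Gb -> 2.0
exons_per_gene 180000 20000         # -> 9
```

A 60 Mb target sits at the upper end of the ~1–2% range; a narrower, coding-only target definition would land closer to the lower end.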
I. Technical specifications: a full comparison
| Feature | WGS | WES |
|---|---|---|
| Breadth (portion of genome sequenced) | ~98% (whole genome, ~3 Gb) | ~1–2% (exons, ~60 Mb) |
| Typical coverage | 30–60× | 100–200× |
| Library preparation | PCR-free direct sequencing | Target capture (probe hybridization) |
| Coverage uniformity | ⭐⭐⭐⭐⭐ Very uniform | ⭐⭐⭐ Affected by capture bias |
| Non-coding variants | ✅ Detectable | ❌ Essentially invisible |
| SV / CNV | ✅ High-resolution, single-bp breakpoints | ⚠️ Exon-only, imprecise breakpoints |
| GC-rich regions | Well covered by PCR-free WGS | Significant capture bias |
| Data volume per sample | ~90–120 GB FASTQ | ~10–15 GB FASTQ |
| Relative cost | 3–5× | 1× |
| Primary uses | Research, rare disease, population genetics, SV analysis | Clinical Mendelian diagnostics, large cohort screening |
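The coverage and data-volume rows are linked: the bases you need to sequence are roughly mean depth × target size, divided by the fraction of reads that land on target. Here is a rough sketch of that calculation. The helper name `required_gbases` and the 70% on-target fraction for WES are illustrative assumptions, not vendor figures.

```shell
#!/usr/bin/env bash
# Rough sequencing yield (in gigabases) needed to reach a mean depth.
# required_gbases is an illustrative helper; on_target defaults to 1 (WGS).

required_gbases() {
    local depth=$1 target_bp=$2 on_target=${3:-1}
    awk -v d="$depth" -v t="$target_bp" -v f="$on_target" \
        'BEGIN { printf "%.1f\n", d * t / f / 1e9 }'
}

required_gbases 30 3000000000        # WGS at 30x            -> 90.0
required_gbases 100 60000000 0.7     # WES at 100x, assuming 70% on-target -> 8.6
```

Gigabases are not the same as FASTQ gigabytes, because FASTQ files also store quality strings and read names and are usually compressed. The order-of-magnitude gap between the two assays is still the same one the table shows.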
II. Coverage distribution simulator
This simulator shows the read-depth distribution across the same genomic region under WGS and WES. Toggle between the two to compare how each strategy handles exons (dark bands) and introns (light bands). Note that WES is deep but uneven within exons, while WGS is shallower but uniform across the whole region.
X-axis: genomic position | Y-axis: read depth | Grey background: exon regions
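The same "deep but uneven" versus "shallower but uniform" contrast can be put into numbers from real data. The sketch below summarises per-base depth in the three-column, tab-separated format written by `samtools depth` (contig, position, depth). The helper name `depth_summary`, the 20× threshold, and the toy input values are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Summarise per-base depth: mean depth and fraction of positions at or above
# a minimum depth. Expects samtools-depth-style input on stdin.
# depth_summary and the default 20x threshold are illustrative choices.

depth_summary() {
    local min_depth=${1:-20}
    awk -v m="$min_depth" '
        { n++; sum += $3; if ($3 >= m) ok++ }
        END {
            if (n == 0) { print "no positions"; exit 1 }
            printf "mean=%.1f frac_ge_%d=%.2f\n", sum / n, m, ok / n
        }'
}

# Toy input: a uniform WGS-like profile, then an uneven WES-like profile.
printf 'chr1\t%d\t30\n' 1 2 3 4 | depth_summary 20
printf 'chr1\t1\t180\nchr1\t2\t95\nchr1\t3\t12\nchr1\t4\t113\n' | depth_summary 20
```

On a real BAM, the input could come from something like `samtools depth -a -b targets.bed sample.bam | depth_summary 20`. In the toy data the WES-like profile has the higher mean, but one of its four positions still falls below the threshold.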
III. Should I choose WGS or WES?
🌳 Decision path
IV. Bird's-eye view of the analysis pipeline
Whether you run WGS or WES, the core path from FASTQ to annotated VCF is largely shared. WES only adds --intervals to confine variant calling and coverage assessment to the target regions. Here is the spine of the next 8 chapters:
FASTQ QC — FastQC, fastp
Check Phred scores, adapter content, and duplication levels; trim low-quality bases.
Alignment — BWA-MEM2
Align reads to GRCh38; output BAM.
Post-alignment — MarkDuplicates, BQSR
Mark PCR duplicates and recalibrate base quality scores.
Variant calling — HaplotypeCaller / DeepVariant
Produce a per-sample GVCF (for joint genotyping) or a single-sample VCF.
Joint genotyping — GenomicsDBImport, GenotypeGVCFs
Merge GVCFs across samples and genotype them jointly.
Filtering — VQSR / Hard filters
Remove false-positive SNV/indel calls.
Annotation — VEP, ANNOVAR, SnpEff
Add gene, predicted consequence, population frequency, and clinical database annotations.
Interpretation — ACMG/AMP
Apply the 28 evidence criteria and classify each variant into one of five tiers: Pathogenic, Likely pathogenic, Uncertain significance, Likely benign, Benign.
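The per-sample part of the list above, from FASTQ QC through variant calling, can be sketched as a dry-run script. The file naming scheme, thread count, read-group fields, and the `run` and `pipeline` helpers are all placeholders chosen for illustration. The tools and flags are standard fastp, bwa-mem2, samtools, and GATK4 options, but check them against the versions you have installed. The script only prints commands by default; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the per-sample steps (QC -> alignment -> dedup/BQSR -> GVCF).
# All file names and the run/pipeline helpers are illustrative placeholders.
set -eu

DRY_RUN=${DRY_RUN:-1}
REF=Homo_sapiens_assembly38.fasta

# Print the command in dry-run mode; otherwise execute it with pipefail on,
# so a failure on the left side of a pipe is not masked.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        bash -o pipefail -c "$*"
    fi
}

# pipeline SAMPLE [INTERVALS]; pass an interval list for WES, omit it for WGS.
pipeline() {
    local s=$1 intervals=${2:-} limit=""
    if [ -n "$intervals" ]; then limit="-L $intervals"; fi

    run "fastp -i ${s}_R1.fastq.gz -I ${s}_R2.fastq.gz -o ${s}_R1.trim.fastq.gz -O ${s}_R2.trim.fastq.gz"
    run "bwa-mem2 mem -t 8 -R '@RG\tID:${s}\tSM:${s}\tPL:ILLUMINA' $REF ${s}_R1.trim.fastq.gz ${s}_R2.trim.fastq.gz | samtools sort -o ${s}.sorted.bam -"
    run "gatk MarkDuplicates -I ${s}.sorted.bam -O ${s}.dedup.bam -M ${s}.dup_metrics.txt"
    run "gatk BaseRecalibrator -R $REF -I ${s}.dedup.bam --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -O ${s}.recal.table"
    run "gatk ApplyBQSR -R $REF -I ${s}.dedup.bam --bqsr-recal-file ${s}.recal.table -O ${s}.recal.bam"
    run "gatk HaplotypeCaller -R $REF -I ${s}.recal.bam -ERC GVCF $limit -O ${s}.g.vcf.gz"
}

# WES example: calling is restricted to the padded capture targets.
pipeline sample01 exome_v2_targets.padded.interval_list
```

Calling `pipeline sample01` without a second argument gives the WGS variant of the same commands. The only difference is the missing `-L` restriction on HaplotypeCaller.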
V. Choosing a reference genome
GRCh38 / hg38
The mainstream reference since its release in 2013. It includes ALT contigs, fixes many errors present in hg19, and is the default in GATK Best Practices. New projects should use hg38.
GRCh37 / hg19
The 2009 build. Still found in legacy cohort databases (e.g. 1000 Genomes phase 3, ExAC v1). Not recommended for new projects unless you need to integrate with legacy data.
T2T-CHM13
The 2022 telomere-to-telomere complete assembly, including centromeres and the short arms of the acrocentric chromosomes. High research value, but tool support across the ecosystem is still catching up.
VI. First step: obtain the reference files
    # Download the GRCh38 reference from the GATK resource bundle
    # (bucket paths change over time; check them against the current bundle listing)
    gsutil cp gs://genomics-public-data/references/GRCh38/Homo_sapiens_assembly38.fasta .
    gsutil cp gs://genomics-public-data/references/GRCh38/Homo_sapiens_assembly38.fasta.fai .
    gsutil cp gs://genomics-public-data/references/GRCh38/Homo_sapiens_assembly38.dict .

    # BWA-MEM2 needs its own index
    bwa-mem2 index Homo_sapiens_assembly38.fasta

    # Known variant sites (used by BQSR / VQSR)
    # Each VCF also needs its .tbi index alongside it
    gsutil cp gs://genomics-public-data/references/GRCh38/dbsnp_146.hg38.vcf.gz .
    gsutil cp gs://genomics-public-data/references/GRCh38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz .
    # WES: download the capture kit's target BED file (provided by the vendor)
    # e.g. Twist Comprehensive Exome, Agilent SureSelect, IDT xGen
    wget https://example-vendor.com/exome_v2_targets.bed

    # Add 100 bp of padding to cover splice regions and capture edges
    gatk PreprocessIntervals \
        -L exome_v2_targets.bed \
        -R Homo_sapiens_assembly38.fasta \
        --bin-length 0 \
        --padding 100 \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O exome_v2_targets.padded.interval_list

    # Later, pass -L exome_v2_targets.padded.interval_list to HaplotypeCaller
    # to restrict calling to the padded targets
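Before running the capture intervals through the pipeline, it can help to sanity-check how large the padded target territory is. The sketch below pads BED intervals, merges any that overlap after padding, and sums their lengths. The helper name `padded_territory` and the toy intervals are illustrative. Unlike GATK, this sketch does not clip intervals to chromosome ends.

```shell
#!/usr/bin/env bash
# Pad BED intervals and report the merged target territory in bp.
# BED is 0-based and half-open, so an interval's length is end - start.
# padded_territory is an illustrative helper; it clips at 0 but not at
# chromosome ends.

padded_territory() {
    local pad=${1:-100}
    awk -v p="$pad" 'BEGIN { OFS = "\t" }
        { s = $2 - p; if (s < 0) s = 0; print $1, s, $3 + p }' |
    sort -k1,1 -k2,2n |
    awk '
        # Extend the current interval while the next one overlaps or abuts it.
        $1 == chrom && $2 <= ce { if ($3 > ce) ce = $3; next }
        # Otherwise close the current interval and start a new one.
        { total += ce - cs; chrom = $1; cs = $2; ce = $3 }
        END { total += ce - cs; print total }'
}

# Toy targets: two chr1 exons that overlap once padded, one chr2 exon near 0.
printf 'chr1\t1000\t1200\nchr1\t1350\t1500\nchr2\t50\t150\n' | padded_territory 100
```

For the toy input the result is 950 bp: chr1 merges into 900–1600 (700 bp) and chr2 becomes 0–250 (250 bp). On a real kit, something like `padded_territory 100 < exome_v2_targets.bed` gives a quick figure to compare against the vendor's stated target size.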
📝 Self-check
1. Your team wants to identify the pathogenic variant in an unsolved Mendelian family; a prior round of WES found nothing. What is the most reasonable next step?
2. Which statement about WES coverage is correct?
3. A PI says: "WGS is more expensive, but PCR-free WGS actually covers exons more completely than WES." This statement is: