1. Why are SVs/CNVs a different world?
The previous 8 chapters covered SNVs and small indels (1–50 bp), which a short read (150 bp) can span directly, so they are visible in the alignment itself. Variants ≥50 bp are harder:
- No single read can span them: for a 1000-bp deletion, no 150-bp read covers both breakpoints in reference coordinates
- They must be inferred from indirect evidence: paired-end orientation, split reads, read depth, and local assembly provide complementary clues
- Repeat regions are hell: ~50% of the human genome is repeats; short reads cannot be placed unambiguously within them, and SV breakpoints often fall in these regions
As a result, SV calling sensitivity and precision lag far behind SNV calling: typical short-read pipelines reach only 60–80% SV recall. Long reads (PacBio HiFi, ONT R10.4) can reach 95%+, but still at higher cost.
2. SV types
DEL (Deletion)
A segment of the reference is missing in the sample. Discordant read pairs show a larger-than-expected insert size, and read depth across the region halves (heterozygous) or drops to zero (homozygous). Recorded in VCF as SVTYPE=DEL.
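The insert-size signal can be sketched as a simple threshold test. This is a minimal illustration, not how any particular caller works: the function name `deletion_hint` and the 3-SD cutoff are assumptions, and the library mean and SD would normally be estimated from the BAM itself.

```shell
# deletion_hint TLEN MEAN SD
# Prints the excess insert size (a rough deletion-length estimate) when the
# observed template length exceeds MEAN + 3*SD; otherwise prints 0.
# The 3-SD cutoff is an illustrative assumption.
deletion_hint() {
  awk -v tlen="$1" -v mean="$2" -v sd="$3" 'BEGIN {
    if (tlen < 0) tlen = -tlen          # TLEN is negative for the rightmost read
    if (tlen > mean + 3 * sd) print tlen - mean
    else print 0
  }'
}

deletion_hint 1450 450 50   # pair spanning a ~1000-bp deletion: prints 1000
deletion_hint 480 450 50    # ordinary pair: prints 0
```

A single discordant pair is weak evidence; callers generally require a cluster of such pairs pointing at the same interval.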
DUP (Duplication)
A segment is present in extra copies. Tandem duplications are easier to detect (discordant pairs plus increased depth); dispersed duplications are harder. Recorded as SVTYPE=DUP, with a copy number such as CN=3.
INV (Inversion)
A segment is flipped in orientation. Discordant pairs show an abnormal orientation (FF or RR instead of FR). Read depth is unaffected, so detection relies purely on paired-end and split-read evidence.
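The FF/RR versus FR distinction can be read off the SAM FLAG field. The sketch below assumes the standard SAM flag bits (16 = read on reverse strand, 32 = mate on reverse strand); the function name `pair_orientation` is hypothetical, and real callers also check mapping quality, insert size, and chromosome.

```shell
# pair_orientation FLAG POS MATE_POS
# Prints FR (normal), RF (everted), or FF / RR (same-strand, inversion-like).
pair_orientation() {
  awk -v flag="$1" -v pos="$2" -v mpos="$3" 'BEGIN {
    rev  = int(flag / 16) % 2           # bit 0x10: this read is reverse
    mrev = int(flag / 32) % 2           # bit 0x20: mate is reverse
    if (rev == mrev) {
      o = rev ? "RR" : "FF"             # both reads on the same strand
    } else {
      left_rev = (pos <= mpos) ? rev : mrev   # strand of the leftmost read
      o = left_rev ? "RF" : "FR"
    }
    print o
  }'
}

pair_orientation 99 100 300    # forward read, reverse mate to the right: FR
pair_orientation 65 100 300    # both forward: FF
pair_orientation 113 100 300   # both reverse: RR
```

In practice the three arguments would come from columns 2, 4, and 8 of `samtools view` output.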
INS (Insertion)
A novel sequence, absent from the reference, is inserted. Short insertions (<100 bp) can be reconstructed from split reads; long insertions (>500 bp) are nearly invisible to short reads and are a strength of long reads.
BND / Translocation
Junctions between different chromosomes (e.g., BCR-ABL, t(9;22)). VCF records them as BND (breakend) entries, one per breakend. A key category for cancer driver fusions.
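In a BND record the partner locus is encoded in the ALT field using the bracket notation of the VCF specification (for example `N[chr:pos[` or `]chr:pos]N`). A minimal sketch for pulling out the mate coordinate follows; the function name `bnd_mate` is hypothetical and the coordinates are made up for illustration, not real BCR-ABL breakpoints.

```shell
# bnd_mate ALT
# Prints the chr:pos found between the brackets of a breakend ALT string.
bnd_mate() {
  printf '%s\n' "$1" | sed -E 's/^[^][]*[][]([^][]+)[][].*$/\1/'
}

bnd_mate 'N[chr22:2000['   # prints chr22:2000
bnd_mate ']chr9:1000]N'    # prints chr9:1000
```

The bracket direction and the position of the reference base relative to the brackets additionally encode the orientation of the joined pieces, which this sketch ignores.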
CNV
Umbrella term for DEL + DUP, commonly used for large (up to chromosome-scale) variants and for cancer copy-number profiling. Typical tools: GATK gCNV, CNVkit, CNVnator.
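The depth changes described above (half depth for a heterozygous deletion, CN=3 for a duplication) suggest a back-of-envelope copy-number estimate from the ratio of regional to background depth. This sketch assumes a diploid autosomal baseline and uses a hypothetical function name, `copy_number`; real tools add GC correction, mappability filtering, and segmentation.

```shell
# copy_number REGION_DEPTH BACKGROUND_DEPTH
# Rounds 2 * region / background to the nearest integer copy number.
# Assumes a diploid (CN=2) background.
copy_number() {
  awk -v d="$1" -v bg="$2" 'BEGIN { printf "%d\n", int(2 * d / bg + 0.5) }'
}

copy_number 15 30   # heterozygous deletion: prints 1
copy_number 0 30    # homozygous deletion:   prints 0
copy_number 45 30   # one extra copy:        prints 3
```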
3. Signal types used by SV callers
Different callers rely on different evidence sources: three read-level signals plus local assembly. Understanding this allows rational tool selection and combination:
| Signal | Principle | Strength | Representative tools |
|---|---|---|---|
| Paired-end (discordant) | Insert size / orientation of paired reads deviates from expectation | Medium to large SVs, translocations | Delly, Lumpy |
| Split-read | Two parts of the same read align to different locations, pinpointing the breakpoint | Precise breakpoint resolution | Manta, GRIDSS, SvABA |
| Read-depth | Coverage of a region is significantly above or below background | Large CNVs, low-mappability regions | CNVnator, ERDS, GATK gCNV |
| Local assembly | Reads from a suspicious region are de novo assembled to recover non-reference sequence | Complex SVs, insertions, breakpoint sequence | Manta, GRIDSS, SvABA |
Large benchmarks (e.g., Cameron et al. 2019; the GIAB SV truth set) show that a single SV caller's recall is usually below 70%, but the false negatives of different callers overlap little. A common practice is to combine 4 callers (e.g., Manta + Delly + ERDS + CNVnator) and merge them with SURVIVOR / Jasmine, keeping calls by intersection (strict, favoring precision) or by majority vote (lenient, favoring recall). The combined call set can push recall to 90%+.
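The strict versus lenient choice can be applied after merging, because SURVIVOR records the number of supporting input files in the `SUPP` INFO key of each merged record. A minimal sketch of such a filter follows; the function name `filter_by_support` and the toy records are illustrative assumptions.

```shell
# filter_by_support MIN < merged.vcf
# Keeps header lines and records whose INFO field has SUPP >= MIN.
filter_by_support() {
  awk -F'\t' -v min="$1" '
    /^#/ { print; next }
    {
      n = split($8, kv, ";")
      for (i = 1; i <= n; i++)
        if (kv[i] ~ /^SUPP=/) {             # matches SUPP=, not SUPP_VEC=
          if (substr(kv[i], 6) + 0 >= min) print
          break
        }
    }'
}

# Toy input: one call seen by 3 of 4 callers, one seen by a single caller.
printf '##fileformat=VCFv4.2\nchr1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSUPP=3;SUPP_VEC=1110;SVTYPE=DEL\nchr1\t900\tsv2\tN\t<DUP>\t.\tPASS\tSUPP=1;SUPP_VEC=0001;SVTYPE=DUP\n' |
  filter_by_support 2
```

With 4 callers, a threshold of 4 corresponds to the intersection and a threshold of 3 to a majority vote.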
4. Interactive simulation: the three SV signals
Pick an SV type to see how strongly each of the three signals (paired-end discordant, split-read, read-depth) responds to it. Callers identify SV types from the combination of these three signals.
5. A special category: mobile element insertion (MEI)
About 17% of the human genome is LINE-1 (L1), ~11% is Alu, and ~0.2% is SVA. These retrotransposons occasionally jump to new locations, creating MEIs (mobile element insertions). Though not numerous (~100–200 polymorphic MEIs per person), they are known to cause several diseases (e.g., hemophilia, Duchenne MD, neurofibromatosis).
Generic SV callers are insensitive to MEIs because the inserted sequence is itself a repeat element: ordinary split reads map ambiguously to the tens of thousands of copies across the genome. Specialized tools:
- MELT (Mobile Element Locator Tool): targets the three major families (Alu / L1 / SVA); the standard tool of the 1000 Genomes Project
- Mobster: similar to MELT, with a graphical pipeline
- xTea: supports short reads and long reads; a newer tool (2021 onward)
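The ambiguity problem is visible in the alignments themselves: aligners such as BWA-MEM typically give MAPQ 0 to reads that fit several locations equally well. As a rough diagnostic, the fraction of MAPQ-0 reads in a region can be computed from SAM records. This is a sketch with a hypothetical function name, `mapq0_fraction`; it is not part of MELT, Mobster, or xTea.

```shell
# mapq0_fraction < SAM records on stdin
# Prints the fraction of alignments with MAPQ 0 (column 5), or NA if empty.
mapq0_fraction() {
  awk -F'\t' '
    /^@/ { next }                          # skip SAM header lines
    { total++; if ($5 == 0) zero++ }
    END {
      if (total) printf "%.2f\n", zero / total
      else print "NA"
    }'
}

# Toy input: two ambiguous reads (MAPQ 0) and two confidently placed reads.
printf 'r1\t0\tchr1\t100\t0\t50M\nr2\t0\tchr1\t200\t0\t50M\nr3\t0\tchr1\t300\t60\t50M\nr4\t0\tchr1\t400\t37\t50M\n' |
  mapq0_fraction    # prints 0.50
```

On real data the input would come from something like `samtools view dedup.bam <region>`.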
6. Pipeline command examples
# Manta: supports germline, somatic, and multi-sample calling
# Germline mode (single sample)
configManta.py \
  --bam dedup.bam \
  --referenceFasta ref.fa \
  --runDir manta_germline
manta_germline/runWorkflow.py -j 16
# Results are in manta_germline/results/variants/
# diploidSV.vcf.gz: germline SVs
# candidateSV.vcf.gz: candidates (including unfiltered small SVs)
# candidateSmallIndels.vcf.gz: candidate small indels

# Delly: paired-end + split-read combined
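# One SV type per run with -t, as in older Delly releases. Recent releases
# call all types in a single run and report translocations as BND rather
# than TRA; check `delly call --help` for your version.
for sv in DEL DUP INV TRA INS; do
  delly call -t "$sv" -g ref.fa -o "sample.${sv}.bcf" dedup.bam &
done; wait
# Merge: -a interleaves the per-type files in coordinate order and needs indexed inputs
for sv in DEL DUP INV TRA INS; do bcftools index -f "sample.${sv}.bcf"; done
bcftools concat -a sample.DEL.bcf sample.DUP.bcf sample.INV.bcf sample.TRA.bcf sample.INS.bcf \
  -Oz -o sample.delly.vcf.gz
tabix -p vcf sample.delly.vcf.gz
# Filter (Delly sets its own quality FILTER tags)
bcftools view -f PASS sample.delly.vcf.gz -Oz -o sample.delly.pass.vcf.gz

# CNVnator: read-depth based CNV calling (germline)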
# 1. Extract read mappings from the BAM
cnvnator -root sample.root -tree dedup.bam
# 2. Build the depth histogram at the chosen bin size (100 bp for WGS, 1000 bp for WES)
cnvnator -root sample.root -his 100 -d ref_chrs/
# 3. Compute statistics
cnvnator -root sample.root -stat 100
# 4. Partition into segments
cnvnator -root sample.root -partition 100
# 5. Call CNVs
cnvnator -root sample.root -call 100 > sample.cnv.txt
# Convert to VCF
cnvnator2VCF.pl sample.cnv.txt > sample.cnv.vcf

# Merge the results of multiple callers with SURVIVOR
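# List the per-caller VCFs (SURVIVOR expects uncompressed VCF files)
ls sample.manta.vcf sample.delly.vcf sample.lumpy.vcf sample.cnvnator.vcf > vcf_list.txt
# Merge criteria: breakpoints within 500 bp, at least 2 supporting callers,
# same SV type, same strand, minimum SV size 30 bp
SURVIVOR merge vcf_list.txt 500 2 1 1 0 30 sample.merged.vcf
# Annotate with AnnotSV (gene context, OMIM, dbVar)
AnnotSV \
  -SVinputFile sample.merged.vcf \
  -outputFile sample.annotsv.tsv \
  -genomeBuild GRCh38
# View the top-ranked candidates (assumes the ranking class is the last column)
awk -F'\t' '$NF == "5" || $NF == "4"' sample.annotsv.tsv | less
7. FAQ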
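Why is WES unsuitable for SV/CNV calling?
Can long reads really solve every SV problem?
Is there a separate ACMG guideline for interpreting SVs?
What versions of the GIAB SV truth set exist?
8. Quiz
Q1. Which signals can detect a pure inversion?
Q2. Why is combining multiple callers recommended?
Q3. What is the biggest obstacle to detecting CNVs from WES?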