Advanced: Structural Variants & CNV

Why

1. Why are SVs and CNVs a different world?

The previous eight chapters covered SNVs and small indels (under 50 bp). A 150-bp short read spans such a variant directly, so ordinary alignment is enough to see it. Variants of 50 bp or more are harder:

No single read aligns across them: with a 1,000-bp deletion, for example, no 150-bp read covers the whole deleted interval; at best a read crosses the junction and aligns as two split pieces
They must be inferred from indirect evidence: paired-end insert size and orientation, split reads, read depth, and local assembly are combined as complementary clues
Repeat regions are the worst case: about 50% of the human genome is repetitive, short reads cannot be placed uniquely inside a repeat, and SV breakpoints often fall in exactly these regions

As a result, SV calling lags far behind SNV calling in both sensitivity and precision: typical short-read pipelines reach only 60–80% SV recall. Long reads (PacBio HiFi, ONT R10.4) can reach 95%+, but at higher cost.

Technical

2. SV types

DEL (Deletion)

A segment of the reference is missing in the sample. Read pairs that span the deletion map farther apart than the library insert size (discordant pairs), and read depth across the region halves (heterozygous) or drops to zero (homozygous). Reported in VCF as SVTYPE=DEL.
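The depth signature described above can be turned into a rough copy-number estimate: against a diploid background, CN is approximately 2 × (region depth / background depth). The sketch below only illustrates that arithmetic; `estimate_cn` and the depth values are hypothetical, and real callers also correct for GC content and mappability.

```shell
# Rough copy-number estimate from read depth, assuming a diploid background.
# estimate_cn is a hypothetical helper, not part of any SV caller.
# Usage: estimate_cn REGION_DEPTH BACKGROUND_DEPTH
estimate_cn() {
    # CN ~= 2 * region / background, rounded to the nearest integer
    awk -v r="$1" -v b="$2" 'BEGIN { printf "%d\n", int(2 * r / b + 0.5) }'
}

estimate_cn 15 30   # half the background depth: heterozygous deletion
estimate_cn 0 30    # no coverage: homozygous deletion
estimate_cn 45 30   # 1.5x the background depth: one extra copy
```

With a 30x background, depths of 15x, 0x, and 45x correspond to copy numbers 1, 0, and 3.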

DUP (Duplication)

A segment is present in extra copies. Tandem duplications are the easier case (everted, RF-oriented discordant pairs plus increased depth); dispersed duplications are harder. Reported as SVTYPE=DUP, with a copy number such as CN=3.

INV (Inversion)

A segment is flipped in orientation. Discordant pairs show an abnormal orientation (FF or RR instead of FR). Read depth does not change, so detection relies purely on paired-end and split-read evidence.

INS (Insertion)

A novel sequence, absent from the reference, is inserted. Short insertions (<100 bp) can be reconstructed from split reads; long insertions (>500 bp) are nearly invisible to short reads and are a long-read specialty.

BND / Translocation

Junctions between different chromosomes (e.g., BCR-ABL, t(9;22)). VCF stores them as BND (breakend) records, one record per breakend, so a single junction is written as a pair of mate records. A key category for cancer driver fusions.

CNV

Umbrella term for DEL + DUP, commonly used for chromosome-scale variants and for cancer copy-number profiling. Typical tools: GATK gCNV, CNVkit, CNVnator.
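The paired-end signatures listed for each SV type above can be summarized as a small decision rule. This is a minimal sketch, not how any particular caller is implemented: `classify_pair` is a hypothetical helper, the orientation codes are assumed to be precomputed, and treating everted (RF) pairs as the tandem-duplication signature is an assumption added here rather than something the text states.

```shell
# classify_pair CHR1 CHR2 ORIENT INSERT_SIZE MAX_NORMAL_INSERT
# ORIENT is the strand pattern of the two mates: FR (normal), RF, FF or RR.
# Prints the SV type the pair is most consistent with.
classify_pair() {
    chr1=$1; chr2=$2; orient=$3; isize=$4; max_isize=$5
    if [ "$chr1" != "$chr2" ]; then
        echo "BND"          # mates on different chromosomes
    elif [ "$orient" = "FF" ] || [ "$orient" = "RR" ]; then
        echo "INV"          # both mates on the same strand
    elif [ "$orient" = "RF" ]; then
        echo "DUP"          # everted pair (assumed tandem-duplication signature)
    elif [ "$isize" -gt "$max_isize" ]; then
        echo "DEL"          # normal orientation, but mates too far apart
    else
        echo "concordant"
    fi
}

classify_pair chr1 chr1 FR 1400 600
classify_pair chr9 chr22 FR 0 600
```

A real caller would cluster many such pairs and require several supporting reads before emitting a call; a single discordant pair is weak evidence.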

Technical

3. The three main signal types used by SV callers

Different callers rely on different sources of evidence. Understanding this is what makes it possible to choose and combine tools rationally:

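Signal	Principle	Strengths	Representative tools
Paired-end (discordant)	Insert size or orientation of a read pair deviates from the expected value	Medium-to-large SVs, translocations	Delly, Lumpy
Split-read	Two parts of one read align to different positions, which pinpoints the breakpoint	Precise breakpoint resolution	Manta, GRIDSS, SvABA
Read-depth	Coverage over a region is significantly above or below background	Large CNVs, low-mappability regions	CNVnator, ERDS, GATK gCNV
Local assembly	Reads from a suspicious region are assembled de novo to recover non-reference sequence	Complex SVs, insertions, breakpoint sequence	Manta, GRIDSS, SvABA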

Why combine multiple callers?

Large benchmarks (e.g., Cameron et al. 2019; the GIAB SV truth set) show that a single SV caller usually reaches less than 70% recall, but the false negatives of different callers overlap little. A common practice is to run about four callers (e.g., Manta + Delly + ERDS + CNVnator) and merge the results with SURVIVOR or Jasmine. Keeping only calls shared by all callers (intersection) is the strict option and favors precision; keeping calls supported by a majority of callers is the lenient option and can push recall to 90%+.
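The merging idea can be sketched on toy data: calls of the same type on the same chromosome are grouped when both breakpoints lie within a distance threshold, and a group is kept only if enough distinct callers support it. This is a simplified illustration, not SURVIVOR's or Jasmine's actual algorithm; `consensus_calls`, the five-column input format, and the coordinates are all hypothetical.

```shell
# consensus_calls MIN_CALLERS MAX_DIST
# Reads lines "caller chrom start end svtype" on stdin and prints
# "chrom start end svtype n_callers" for each sufficiently supported group.
consensus_calls() {
    sort -k5,5 -k2,2 -k3,3n |
    awk -v min="$1" -v dist="$2" '
        function emit() {
            if (n >= min) print chrom, start, stop, type, n
        }
        function abs(x) { return x < 0 ? -x : x }
        {
            # compare each call with the first call (seed) of the current group
            same = (NR > 1 && $5 == type && $2 == chrom &&
                    abs($3 - start) <= dist && abs($4 - stop) <= dist)
            if (!same) {
                if (NR > 1) emit()
                chrom = $2; start = $3; stop = $4; type = $5; n = 0
                split("", seen)   # reset the set of callers seen in this group
            }
            if (!($1 in seen)) { seen[$1] = 1; n++ }
        }
        END { if (NR > 0) emit() }'
}

printf '%s\n' \
    'manta chr1 10000 12000 DEL' \
    'delly chr1 10050 11980 DEL' \
    'cnvnator chr1 10300 12200 DEL' \
    'delly chr2 5000 9000 DUP' |
    consensus_calls 2 500
```

Here the three chr1 deletions collapse into one call supported by three callers, while the chr2 duplication, seen by one caller only, is dropped. Raising MIN_CALLERS moves the result toward the strict intersection; lowering it moves toward the union.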

Interactive simulation

4. Interactive simulation: the three SV signals

Pick an SV type to see how strongly each of the three signals (paired-end discordant, split-read, read-depth) responds to it. Callers identify SV types from the combination of these three signals.

Scores from 0 to 10 are illustrative. Different SV types depend on different signals: an INV shows no depth change and is detected from paired-end and split-read evidence only, while a DEL is strong on all three. Caller algorithms work from combinations of these signals.

Technical

5. A special category: mobile element insertion (MEI)

About 17% of the human genome is LINE-1 (L1), about 11% is Alu, and about 0.2% is SVA. These retrotransposons occasionally jump to new locations, creating MEIs (mobile element insertions). Polymorphic MEIs are few compared with the total number of repeat copies (on the order of a thousand per person in 1000 Genomes Project estimates), but they are a known cause of several diseases (e.g., hemophilia, Duchenne muscular dystrophy, neurofibromatosis).

Generic SV callers are insensitive to MEIs because the inserted sequence is itself a repeat element: ordinary split reads map ambiguously to the tens of thousands of copies spread across the genome. Specialized tools:

MELT (Mobile Element Locator Tool): targets the Alu, L1, and SVA families; the standard tool of the 1000 Genomes Project
Mobster: similar to MELT, with a graphical pipeline
xTea: supports both short and long reads; a newer tool, published in 2021

Practical commands

6. Example pipeline commands

# Manta: supports germline, somatic, and joint multi-sample calling
# Germline mode (single sample)
configManta.py \
    --bam dedup.bam \
    --referenceFasta ref.fa \
    --runDir manta_germline

manta_germline/runWorkflow.py -j 16

# Results are written to manta_germline/results/variants/
#   diploidSV.vcf.gz             scored germline SVs
#   candidateSV.vcf.gz           unscored, unfiltered SV candidates
#   candidateSmallIndels.vcf.gz  small-indel candidates

# Delly: combines paired-end and split-read evidence
# Current releases call all SV types in one run; the per-type -t loop was only
# needed in old 0.7.x versions, and translocations are reported as BND, not TRA
delly call -g ref.fa -o sample.delly.bcf dedup.bam

# Convert the BCF to a bgzipped VCF and index it
bcftools view -Oz -o sample.delly.vcf.gz sample.delly.bcf
tabix -p vcf sample.delly.vcf.gz

# Keep only records that pass Delly's own FILTER
bcftools view -f PASS sample.delly.vcf.gz -Oz -o sample.delly.pass.vcf.gz

# CNVnator: read-depth-based CNV calling (germline, WGS)
# 1. Extract read mappings from the BAM into the ROOT file
cnvnator -root sample.root -tree dedup.bam

# 2. Build read-depth histograms; bin size depends on coverage (about 100 bp for ~30x WGS)
cnvnator -root sample.root -his 100 -d ref_chrs/

# 3. Compute statistics
cnvnator -root sample.root -stat 100

# 4. Partition (segment) the read-depth signal
cnvnator -root sample.root -partition 100

# 5. Call CNVs
cnvnator -root sample.root -call 100 > sample.cnv.txt

# Convert to VCF
cnvnator2VCF.pl sample.cnv.txt > sample.cnv.vcf
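Raw CNVnator calls are usually filtered before merging. The sketch below assumes the standard nine-column call table (type, coordinates, size, normalized read depth, e-val1 to e-val4, q0) and keeps calls that are large enough, statistically significant, and not dominated by multi-mapped reads. `filter_cnvnator` is a hypothetical helper and the three thresholds are illustrative assumptions, not recommended defaults.

```shell
# filter_cnvnator: reads a CNVnator call table on stdin, prints retained calls.
# Assumed columns: 1 type, 2 coordinates, 3 size, 4 normalized_RD,
#                  5-8 e-val1..e-val4, 9 q0 (fraction of MAPQ-0 reads)
filter_cnvnator() {
    awk -v min_size=1000 -v max_eval=0.05 -v max_q0=0.5 '
        $3 >= min_size && $5 < max_eval && $9 < max_q0'
}

# Toy input: the first call passes, the second is too small and not significant
printf 'deletion\tchr1:100001-105000\t5000\t0.48\t1.2e-10\t0\t3.1e-09\t0\t0.02\n' > toy.cnv.txt
printf 'duplication\tchr1:200001-200600\t600\t1.6\t0.3\t0.4\t0.5\t0.6\t0.9\n' >> toy.cnv.txt
filter_cnvnator < toy.cnv.txt
```

On real data the same function would be applied as `filter_cnvnator < sample.cnv.txt` before the VCF conversion step.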

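# Merge calls from several callers with SURVIVOR
# Assumes each caller's output has been saved as an uncompressed VCF under these names
ls sample.manta.vcf sample.delly.vcf sample.lumpy.vcf sample.cnv.vcf > vcf_list.txt

# Merge criteria: breakpoints within 500 bp, at least 2 supporting callers,
# same SV type (1), same strand (1), no distance estimation (0), minimum size 30 bp
SURVIVOR merge vcf_list.txt 500 2 1 1 0 30 sample.merged.vcf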

# Annotate with AnnotSV (gene context, OMIM, dbVar)
AnnotSV \
    -SVinputFile sample.merged.vcf \
    -outputFile sample.annotsv.tsv \
    -genomeBuild GRCh38

# Show the top-ranked candidates (class 4 or 5 in the last column)
awk -F'\t' '$NF == "5" || $NF == "4"' sample.annotsv.tsv | less
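The filter above depends on the class being the last column. A slightly more defensive variant looks the column up by name in the header. This sketch assumes the header is called `ACMG_class` and that the cell holds a plain class value; both should be checked against the header of the AnnotSV version in use. `select_by_acmg` is a hypothetical helper and the input rows are toy data.

```shell
# select_by_acmg CLASSES
# Reads an AnnotSV-style TSV on stdin; CLASSES is a comma-separated list such as 4,5.
# Prints the header plus every row whose ACMG_class is in the list.
select_by_acmg() {
    awk -F'\t' -v wanted="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "ACMG_class") col = i
            if (!col) { print "ACMG_class column not found" > "/dev/stderr"; exit 1 }
            print; next
        }
        index("," wanted ",", "," $col ",") > 0'
}

printf 'AnnotSV_ID\tSV_type\tACMG_class\n'  > toy.annotsv.tsv
printf '1_100_200_DEL\tDEL\t5\n'           >> toy.annotsv.tsv
printf '2_300_400_DUP\tDUP\t3\n'           >> toy.annotsv.tsv
select_by_acmg 4,5 < toy.annotsv.tsv
```

Looking the column up by name keeps the filter working if columns are added or reordered, and it fails loudly when the expected column is missing.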

FAQ

7. FAQ

Why is WES a poor fit for SV/CNV detection?

WES captures discontinuous exon islands, and read depth is heavily affected by capture efficiency (the same exon can differ 5–10× between samples). A read-depth signal alone therefore cannot separate a real CNV from capture bias, and intronic or intergenic breakpoints are invisible. CNV detection from WES requires specialized tools (EXCAVATOR2, CoNIFER, CLAMMS) and a large cohort to normalize the bias. For SV studies, PCR-free WGS remains preferable.
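The role of the cohort can be illustrated with a log2 ratio: an exon's depth in one sample is compared with the median depth of the same exon across the cohort, so that exon-specific capture efficiency cancels out. This is only a sketch of the idea, not the model used by any of the tools named above; `log2_ratio` is a hypothetical helper and the depths are assumed to be library-size-normalized already.

```shell
# log2_ratio SAMPLE_DEPTH COHORT_MEDIAN_DEPTH
# Hypothetical helper: both depths refer to the same exon.
log2_ratio() {
    awk -v s="$1" -v m="$2" 'BEGIN { printf "%.2f\n", log(s / m) / log(2) }'
}

log2_ratio 50 100    # about -1: consistent with a heterozygous deletion
log2_ratio 150 100   # about 0.58: consistent with one extra copy
log2_ratio 100 100   # 0: same as the cohort
```

With only a handful of samples the cohort median is itself noisy, which is why these methods need large cohorts.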

Do long reads really solve every SV problem?

They improve things greatly but are not a cure-all. PacBio HiFi (15–20 kb, Q30+) and ONT R10.4 (Q20+) reach 95%+ SV recall, see long insertions directly, and resolve complex rearrangements. But the cost is still 3–10× that of short reads, and some workflows (amplicons, UMI consensus) do not fit. Practice is shifting toward a hybrid model: short reads as the primary assay, with long reads added for key cases.

Is there a separate ACMG guideline for interpreting SVs?

Yes. The ACMG/ClinGen 2020 CNV interpretation guideline (Riggs et al., Genetics in Medicine) provides a 5-tier classification and dedicated evidence codes for copy-number gains and losses, complementing the 2015 ACMG guideline for SNVs. AnnotSV applies this guideline automatically and ranks each CNV from 1 to 5.

Which versions of the GIAB SV truth set exist?

GIAB (Genome in a Bottle) HG002 (the Ashkenazi son) is the de facto SV benchmark. v0.6 (2019) contains about 12,000 high-confidence SVs; v1.0 (2024) is based on long-read assembly and contains about 50,000 SVs, including MEIs, tandem repeats, and complex rearrangements. Newly published tools are expected to benchmark against it.

Self-check

8. Quiz

Q1. Which signals can detect a pure inversion?

Q2. Why is combining multiple callers recommended?

Q3. What is the biggest obstacle to detecting CNVs from WES?