Step 8: Variant Annotation — WGS/WES Tutorial

Why

1. Why is annotation needed?

A filtered VCF is still just a list of positions and genotypes. To answer whether a variant is likely pathogenic, how common it is in the population, and which codon of which transcript it affects, the VCF must be annotated with:

Gene context: which gene, transcript, exon, or intron does the variant fall in? Is it coding or UTR?
Functional consequence: synonymous, missense, nonsense, frameshift, splice site?
Population frequency: what is the allele frequency in gnomAD, TOPMed, and 1000 Genomes?
Clinical significance: is the variant already in ClinVar? Is it Pathogenic / Benign / VUS?
In silico predictions: pathogenicity scores from SIFT, PolyPhen-2, CADD, REVEL, AlphaMissense

Downstream ACMG/AMP variant classification (Step 9) is built almost entirely on annotation output.

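Technique

2. Comparing the three main annotation tools

Tool	Developer	Output	Strengths	Weaknesses
VEP	Ensembl / EBI	VCF (CSQ field) / TXT / JSON	Most active community; supports both Ensembl and RefSeq transcripts; plugin system (REVEL, SpliceAI, AlphaMissense, ...); standardized Sequence Ontology terms	More complex installation (needs a cache or the API); slower to run
ANNOVAR	USC	TXT / CSV (multi-anno table)	Fastest of the three; table output is easy to read; large database ecosystem (refGene, dbNSFP, ClinVar, gnomAD)	Not free software (commercial use requires a paid license); non-standardized annotation terms; native output is a table rather than VCF (VCF output only with -vcfinput)
SnpEff	Open-source community	VCF (ANN field)	Fully open source; simplest installation; ships prebuilt databases for a very large number of species (~70,000 genomes); powerful querying with SnpSift	Human databases are updated less promptly than VEP's; fewer plugins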

⚠ The three tools do not always agree

Published comparisons have found only about 65% concordance between tools on Loss-of-Function (LoF) calls (the widely cited figure comes from an ANNOVAR-versus-VEP comparison). The main sources of disagreement are: (1) different default transcripts (VEP defaults to Ensembl canonical, ANNOVAR is mostly used with RefSeq); (2) different splice-site boundary definitions; (3) imperfect mapping between Sequence Ontology terms and ANNOVAR's own terminology. Clinical workflows therefore often run two tools in parallel and review the calls on which they disagree.
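As a rough illustration of cross-validating tools, the sketch below compares sets of LoF calls and lists the variants that need manual review. The variant keys and call sets are made up, and defining concordance as shared calls divided by the union of calls (Jaccard index) is an assumption of this sketch; published comparisons may define it differently.

```python
from itertools import combinations

# Hypothetical LoF calls per tool (illustrative keys, not real data).
lof_calls = {
    "VEP": {"chr1:1000:A>T", "chr2:2000:G>A", "chr3:3000:delC"},
    "ANNOVAR": {"chr1:1000:A>T", "chr3:3000:delC", "chr4:4000:C>T"},
    "SnpEff": {"chr1:1000:A>T", "chr2:2000:G>A", "chr4:4000:C>T"},
}


def pairwise_concordance(calls):
    """Return {(tool_a, tool_b): shared / union} for every pair of tools."""
    result = {}
    for a, b in combinations(sorted(calls), 2):
        union = calls[a] | calls[b]
        shared = calls[a] & calls[b]
        # Two empty call sets are treated as fully concordant.
        result[(a, b)] = len(shared) / len(union) if union else 1.0
    return result


def discordant(calls):
    """Variants called LoF by at least one tool but not by all of them."""
    all_sets = list(calls.values())
    return set.union(*all_sets) - set.intersection(*all_sets)


concordance = pairwise_concordance(lof_calls)
needs_review = discordant(lof_calls)

for pair, value in concordance.items():
    print(pair, f"{value:.2f}")
print("review:", sorted(needs_review))
```

In practice the keys should be normalized (left-aligned, same assembly) before comparison, otherwise representation differences look like disagreements.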

Format

3. Structure of the VEP CSQ field

VEP packs all annotations into the CSQ key of the INFO column: one record per transcript (records separated by commas), with the fields inside each record separated by |. The ##INFO=<ID=CSQ,...> line in the header declares the field order.

# Header (declares the field order):

# Example INFO field:

Consequence

Sequence Ontology standard terms: missense_variant, stop_gained, frameshift_variant, splice_donor_variant, synonymous_variant, etc.

IMPACT

VEP's own severity tier: HIGH (LoF), MODERATE (missense), LOW (synonymous), MODIFIER (intron/UTR).

HGVSc / HGVSp

HGVS nomenclature. HGVSc is the coding-level description (e.g., c.1067A>G); HGVSp is the protein-level description (e.g., p.Asp356Gly). Both are required in clinical reports.

CLIN_SIG

ClinVar clinical significance: pathogenic, likely_pathogenic, uncertain_significance, likely_benign, benign.
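The CSQ layout described above can be read with a few lines of Python. This is a minimal sketch: the header line, the INFO string, the gene symbol, and the transcript IDs (ENST_A, ENST_B) are invented for illustration, and a real file declares its own field order, which will usually be longer than this one.

```python
import re

# Hypothetical header and INFO strings shaped like VEP output.
header_line = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
               'annotations from Ensembl VEP. Format: '
               'Allele|Consequence|IMPACT|SYMBOL|Feature|HGVSc|HGVSp|CLIN_SIG">')
info = ("DP=35;CSQ="
        "G|missense_variant|MODERATE|GENE1|ENST_A|c.1067A>G|p.Asp356Gly|"
        "uncertain_significance,"
        "G|intron_variant|MODIFIER|GENE1|ENST_B|||")


def csq_fields(header):
    """Extract the ordered CSQ field names from the ##INFO=<ID=CSQ,...> line."""
    match = re.search(r'Format: ([^">]+)', header)
    if match is None:
        raise ValueError("no CSQ Format declaration found")
    return match.group(1).strip().split("|")


def parse_csq(info_str, fields):
    """Return one dict per transcript record found in the CSQ key."""
    for entry in info_str.split(";"):
        if entry.startswith("CSQ="):
            records = entry[len("CSQ="):].split(",")
            return [dict(zip(fields, rec.split("|"))) for rec in records]
    return []  # variant has no CSQ annotation


fields = csq_fields(header_line)
records = parse_csq(info, fields)
for rec in records:
    # Empty fields (e.g. HGVSp for an intronic hit) are printed as "-".
    print(rec["Feature"], rec["Consequence"], rec["IMPACT"], rec["HGVSp"] or "-")
```

Reading the field order from the header rather than hard-coding it keeps the parser working when VEP is run with different options.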

Interactive demo

4. Interactive demo: ranking consequences by severity

Pick different variant consequences to compare VEP's IMPACT with the general genetic intuition about pathogenicity. A variant usually overlaps several transcripts and gets one consequence per transcript. VEP reports all of them by default. --pick reduces this to one record per variant using a ranked list of criteria (transcript status such as MANE Select and canonical is considered before consequence severity), while --most_severe reports only the most severe consequence.

Select a consequence:

Blue = VEP IMPACT score (HIGH=4, MODERATE=3, LOW=2, MODIFIER=1); Orange = relative enrichment of this consequence among ClinVar Pathogenic variants (illustrative values).
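The numeric IMPACT scale used in the demo can also rank the per-transcript records of a single variant. The sketch below only illustrates that ranking; the records and transcript IDs are hypothetical, and it does not reproduce VEP's own record-selection logic.

```python
# Numeric scale taken from the demo legend above.
IMPACT_SCORE = {"HIGH": 4, "MODERATE": 3, "LOW": 2, "MODIFIER": 1}

# Hypothetical per-transcript records for one variant.
records = [
    {"Feature": "ENST_A", "Consequence": "intron_variant", "IMPACT": "MODIFIER"},
    {"Feature": "ENST_B", "Consequence": "stop_gained", "IMPACT": "HIGH"},
    {"Feature": "ENST_C", "Consequence": "missense_variant", "IMPACT": "MODERATE"},
]


def rank_by_impact(recs):
    """Sort records from most to least severe IMPACT tier (stable for ties)."""
    return sorted(recs, key=lambda r: IMPACT_SCORE.get(r["IMPACT"], 0), reverse=True)


def most_severe(recs):
    """Return the record with the highest IMPACT tier, or None if empty."""
    ranked = rank_by_impact(recs)
    return ranked[0] if ranked else None


top = most_severe(records)
print(top["Feature"], top["Consequence"], top["IMPACT"])
```

Ranking by IMPACT alone can surface a HIGH hit on a minor transcript, which is one reason transcript choice matters for reporting.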

Technique

5. In silico pathogenicity predictors

SIFT

One of the earliest missense predictors, based on cross-species conservation. Scores range from 0 to 1; values below 0.05 are called "deleterious".

PolyPhen-2

Combines conservation with protein structure features. Scores range from 0 to 1; values above 0.85 are commonly labeled "probably damaging". Often reported alongside SIFT.

CADD

A machine-learning model that integrates 60+ features and assigns every SNV/indel a PHRED-scaled score. CADD ≥ 20 corresponds to the top 1% most deleterious possible substitutions; a cutoff around 25 is often used as supporting evidence in ACMG/AMP classification.

REVEL

A missense-specific ensemble combining 13 tools. Continuous 0-1 score; values above 0.5 are commonly treated as likely damaging. Widely used in clinical workflows.

SpliceAI

A deep residual CNN developed at Illumina that predicts splice-site gain and loss. A delta score above 0.2 is worth a closer look. Can uncover deep-intronic cryptic splicing variants.

AlphaMissense

Released in 2023 by DeepMind; builds on AlphaFold to predict missense pathogenicity. Continuous 0-1 score covering all ~71 million possible human missense variants. Its use as ACMG/AMP computational evidence has been proposed; check current guidance before relying on it clinically.
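To make the cutoffs in this section concrete, the sketch below converts a CADD PHRED score back into the fraction of substitutions it ranks above, and flags scores that cross the thresholds quoted here. The dictionary keys are hypothetical names, not the column names of any particular tool, and the cutoffs are the tutorial's rules of thumb rather than calibrated clinical thresholds.

```python
def cadd_phred_to_top_fraction(phred):
    """PHRED scaling: score = -10 * log10(rank fraction), so invert it."""
    return 10 ** (-phred / 10)


# Cutoffs quoted in this section; SIFT is the only score where lower is worse.
RULES = {
    "SIFT": lambda v: v < 0.05,
    "PolyPhen2": lambda v: v > 0.85,
    "CADD_PHRED": lambda v: v >= 20,
    "REVEL": lambda v: v > 0.5,
    "SpliceAI_delta": lambda v: v > 0.2,
}


def flag_scores(scores):
    """Return the names of predictors whose score crosses its cutoff.

    Missing predictors (absent key or None) are skipped, not treated as benign.
    """
    return [name for name, rule in RULES.items()
            if scores.get(name) is not None and rule(scores[name])]


# Hypothetical scores for one missense variant.
example = {"SIFT": 0.01, "PolyPhen2": None, "CADD_PHRED": 26.0, "REVEL": 0.3}
print(flag_scores(example))
print(cadd_phred_to_top_fraction(30))
```

Because several of these predictors share training features, counting flags as independent lines of evidence overstates the support; the flags are better used for triage.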

Practical commands

6. Example pipeline commands

# Full VEP annotation (with plugins)
vep \
    -i cohort.pass.vcf.gz \
    -o cohort.vep.vcf.gz \
    --vcf --compress_output bgzip \
    --cache --offline --dir_cache /data/vep_cache \
    --assembly GRCh38 \
    --species homo_sapiens \
    --everything \
    --pick_allele_gene \
    --plugin REVEL,/data/plugins/revel_grch38.tsv.gz \
    --plugin SpliceAI,snv=/data/spliceai_scores.snv.vcf.gz,indel=/data/spliceai_scores.indel.vcf.gz \
    --plugin AlphaMissense,file=/data/AlphaMissense_hg38.tsv.gz \
    --custom /data/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNDN \
    --custom /data/gnomad.exomes.v4.1.sites.vcf.bgz,gnomAD,vcf,exact,0,AF,AF_grpmax \
    --fork 8

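# ANNOVAR: convert the VCF to ANNOVAR input (only needed for tools that take .avinput)
convert2annovar.pl -format vcf4 cohort.pass.vcf.gz \
    -outfile cohort.avinput -allsample -withfreq

# Annotate against several databases at once (table_annovar).
# -vcfinput expects a VCF as input, so the VCF itself is passed here rather than cohort.avinput
# (decompress it first if your ANNOVAR version does not read .gz).
table_annovar.pl cohort.pass.vcf.gz humandb/ \
    -buildver hg38 \
    -out cohort \
    -remove \
    -protocol refGene,knownGene,clinvar_20240611,gnomad41_exome,dbnsfp42a,dbscsnv11 \
    -operation g,g,f,f,f,f \
    -nastring . \
    -vcfinput \
    -polish

# Output: cohort.hg38_multianno.txt (tab-separated table)
#         cohort.hg38_multianno.vcf (VCF with the annotations in INFO)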

# SnpEff: download the database (once)
snpEff download -v GRCh38.105

# Annotate
snpEff -v GRCh38.105 cohort.pass.vcf.gz > cohort.snpeff.vcf
# Also writes the snpEff_summary.html and snpEff_genes.txt summary reports

# Add ClinVar annotations (SnpSift)
SnpSift annotate clinvar.vcf.gz cohort.snpeff.vcf > cohort.snpeff.clinvar.vcf

# Add dbNSFP (CADD, REVEL, SIFT, PolyPhen, etc.)
SnpSift dbnsfp -db dbNSFP4.7a.txt.gz cohort.snpeff.clinvar.vcf > cohort.full.vcf

# SnpSift filter: find high-impact rare variants
SnpSift filter \
    "(ANN[0].IMPACT = 'HIGH') && (dbNSFP_gnomAD_exomes_AF < 0.001 || na dbNSFP_gnomAD_exomes_AF)" \
    cohort.full.vcf > rare_lof.vcf

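# bcftools +split-vep: extract VEP CSQ subfields as a table
bcftools +split-vep cohort.vep.vcf.gz \
    -d -f '%CHROM\t%POS\t%SYMBOL\t%Consequence\t%IMPACT\t%HGVSp\t%gnomAD_AF\t%CLIN_SIG\n' \
    | head

# PVS1 candidates: HIGH impact and gnomAD AF < 0.0001 (or absent from gnomAD).
# ClinVar status is not filtered here; review the CLIN_SIG column in the output.
# -c extracts the subfields used by -i; gnomAD_AF is typed as Float so the numeric comparison works.
bcftools +split-vep cohort.vep.vcf.gz \
    -c IMPACT,gnomAD_AF:Float \
    -i 'IMPACT="HIGH" & (gnomAD_AF<0.0001 | gnomAD_AF=".")' \
    -f '%CHROM\t%POS\t%SYMBOL\t%Consequence\t%HGVSp\t%CLIN_SIG\n'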
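The same rare, high-impact filter can be applied in Python to the table produced by the first +split-vep command. This is a sketch under assumptions: the column order follows that command's -f string, the rows are invented, and each gnomAD_AF cell is assumed to hold a single value or "." (multi-valued cells would need extra handling).

```python
import csv
import io

# Column order taken from the -f format string of the table-extraction command.
COLUMNS = ["CHROM", "POS", "SYMBOL", "Consequence", "IMPACT",
           "HGVSp", "gnomAD_AF", "CLIN_SIG"]

# Hypothetical rows standing in for the command's output.
tsv_text = (
    "chr1\t1000\tGENE1\tstop_gained\tHIGH\tp.Gln10Ter\t.\t.\n"
    "chr2\t2000\tGENE2\tframeshift_variant\tHIGH\tp.Leu5fs\t0.02\tbenign\n"
    "chr3\t3000\tGENE3\tmissense_variant\tMODERATE\tp.Asp356Gly\t0.00001\t.\n"
)


def rare_high_impact(handle, max_af=1e-4):
    """Yield rows with IMPACT=HIGH whose gnomAD AF is below max_af or missing."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        af_text = row["gnomAD_AF"]
        # "." or an empty cell means the variant is absent from gnomAD.
        af = None if af_text in (".", "") else float(af_text)
        if row["IMPACT"] == "HIGH" and (af is None or af < max_af):
            yield row


hits = list(rare_high_impact(io.StringIO(tsv_text)))
for row in hits:
    print(row["CHROM"], row["POS"], row["SYMBOL"], row["Consequence"])
```

To run it on real output, replace the io.StringIO object with an open file handle or sys.stdin.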

FAQ

7. FAQ

Why does one variant have several CSQ records?

Because a gene can have several transcripts (alternative splicing), and the same variant can have a different consequence in each: missense in transcript A, intronic in transcript B. VEP reports all of them by default; use --pick, --pick_allele_gene, or --per_gene to reduce the output. Clinical reports typically use the MANE Select or Ensembl canonical transcript.
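The transcript preference described in this answer can be sketched as a simple fallback chain. The sketch assumes records parsed into dictionaries that carry VEP's MANE_SELECT and CANONICAL fields (populated when the corresponding options are enabled); the transcript and RefSeq IDs are placeholders.

```python
def choose_reporting_record(recs):
    """Prefer the MANE Select transcript, then canonical, then the first record."""
    for rec in recs:
        if rec.get("MANE_SELECT"):  # non-empty when the transcript is MANE Select
            return rec
    for rec in recs:
        if rec.get("CANONICAL") == "YES":
            return rec
    return recs[0] if recs else None


# Hypothetical records for one variant; IDs are placeholders.
records = [
    {"Feature": "ENST_A", "Consequence": "intron_variant",
     "CANONICAL": "", "MANE_SELECT": ""},
    {"Feature": "ENST_B", "Consequence": "missense_variant",
     "CANONICAL": "YES", "MANE_SELECT": "NM_PLACEHOLDER.1"},
    {"Feature": "ENST_C", "Consequence": "stop_gained",
     "CANONICAL": "", "MANE_SELECT": ""},
]

chosen = choose_reporting_record(records)
print(chosen["Feature"], chosen["Consequence"])
```

Note that the chosen record here is the missense hit even though another transcript carries a stop_gained: reporting on a fixed reference transcript and screening for the most severe consequence are separate questions, and a workflow usually needs both.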

How much disk space does the VEP cache need?

The full human GRCh38 cache takes roughly 25-30 GB, including regulatory features; the exact size varies by release. A minimal indexed cache is about 5 GB. Running offline against the cache is far faster than the REST API (on the order of 100 times), so it is strongly recommended.

What if CLIN_SIG shows conflicting interpretations?

ClinVar often has several submitters for the same variant, and their classifications can mix P / LP / VUS / LB / B. In that case the field reads conflicting_interpretations_of_pathogenicity. In practice, look the variant up on the ClinVar website, check the review status (0-4 stars) and the individual submitter details, and give most weight to the most recent assertion with the highest review status.
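The "highest review status, then most recent" rule can be written as a sort key. In this sketch the submission records, their field names, and the dates are hypothetical; the star mapping follows ClinVar's review-status levels as they appear in its VCF (underscore-joined) form.

```python
from datetime import date

# ClinVar review status mapped to stars; unknown statuses count as 0.
REVIEW_STARS = {
    "practice_guideline": 4,
    "reviewed_by_expert_panel": 3,
    "criteria_provided,_multiple_submitters,_no_conflicts": 2,
    "criteria_provided,_single_submitter": 1,
    "no_assertion_criteria_provided": 0,
}

# Hypothetical submissions for one variant.
submissions = [
    {"classification": "Uncertain_significance",
     "review_status": "criteria_provided,_single_submitter",
     "evaluated": date(2019, 5, 1)},
    {"classification": "Likely_pathogenic",
     "review_status": "criteria_provided,_single_submitter",
     "evaluated": date(2023, 2, 10)},
    {"classification": "Pathogenic",
     "review_status": "reviewed_by_expert_panel",
     "evaluated": date(2021, 8, 3)},
]


def preferred_assertion(subs):
    """Pick the assertion with the most stars; break ties by latest date."""
    return max(subs, key=lambda s: (REVIEW_STARS.get(s["review_status"], 0),
                                    s["evaluated"]))


best = preferred_assertion(submissions)
print(best["classification"], best["review_status"], best["evaluated"])
```

This only orders the evidence for a human reviewer; it does not resolve a genuine conflict between submitters of equal standing.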

What is MANE Select?

A standard maintained jointly by NCBI and EMBL-EBI that designates one representative transcript per protein-coding gene. The MANE Select transcript is typically the gene's most clinically relevant and most widely used transcript. Clinical reports should give HGVS descriptions against the MANE Select transcript ID. In VEP, add the --mane flag to mark it.

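Self-check

8. Quiz

Q1. VEP, ANNOVAR, and SnpEff sometimes give different LoF calls for the same variant. What is the main reason?

Q2. Which class of variants does VEP IMPACT=HIGH usually correspond to?

Q3. Why does ACMG encourage using the MANE Select transcript?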