1. Why is annotation needed?
A filtered VCF is still just a list of positions and genotypes. To answer whether a variant is likely pathogenic, how common it is in the population, and which codon of which transcript it affects, the VCF must be annotated:
- Gene context: which gene, transcript, exon, or intron does the variant fall in? Is it coding or UTR?
- Functional consequence: synonymous, missense, nonsense, frameshift, or splice site?
- Population frequency: what is the allele frequency in gnomAD, TOPMed, and 1000 Genomes?
- Clinical significance: is it already in ClinVar? Is it Pathogenic, Benign, or a VUS?
- In silico predictions: pathogenicity scores from SIFT, PolyPhen-2, CADD, REVEL, and AlphaMissense
Downstream ACMG/AMP variant classification (Step 9) is built almost entirely on annotation output.
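2. Comparing the three major annotation tools
| Tool | Developer | Output | Strengths | Weaknesses |
|---|---|---|---|---|
| VEP | Ensembl / EBI | VCF (CSQ field) / TXT / JSON | Most active community; supports both Ensembl and RefSeq transcripts; plugin system (REVEL, SpliceAI, AlphaMissense, ...); standardized Sequence Ontology terms | More involved installation (needs a cache or API access); slower to run |
| ANNOVAR | USC | TXT / CSV (multi-anno table) | Fastest; table output is easy for people to read; large database ecosystem (refGene, dbNSFP, ClinVar, gnomAD) | Not free software (commercial use requires a paid license); non-standard annotation terms; native output is a table rather than VCF |
| SnpEff | Open-source community | VCF (ANN field) | Fully open source; simplest installation; many prebuilt species databases (~70,000 genomes); powerful querying with SnpSift | Human databases are updated less promptly than VEP's; fewer plugins |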
Published comparisons report only about 65% concordance between annotators for loss-of-function (LoF) calls. The differences trace to three sources: (1) different default transcript sets (VEP annotates against Ensembl transcripts by default, while ANNOVAR setups usually use RefSeq); (2) different definitions of splice-site boundaries; and (3) the mapping between Sequence Ontology terms and ANNOVAR's own terminology. Clinical workflows therefore often run two tools in parallel and cross-check the results.
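The cross-check described above can be sketched in a few lines of shell. This is only an illustration: it assumes each tool's LoF calls have already been exported as one `chrom:pos:ref:alt` key per line (the file names and keys are hypothetical), and it measures concordance as shared calls divided by the union of calls, which may differ from the definition used in published comparisons.

```shell
#!/usr/bin/env bash
# Sketch: compare LoF calls from two annotators.
# Assumption: each input file holds one variant key (chrom:pos:ref:alt) per line.

lof_concordance() {
  # $1, $2: files of variant keys. Prints "shared union percent".
  local a b shared union
  a=$(mktemp); b=$(mktemp)
  LC_ALL=C sort -u "$1" > "$a"
  LC_ALL=C sort -u "$2" > "$b"
  shared=$(LC_ALL=C comm -12 "$a" "$b" | wc -l)     # keys called LoF by both tools
  union=$(LC_ALL=C sort -u "$a" "$b" | wc -l)       # keys called LoF by either tool
  rm -f "$a" "$b"
  awk -v s="$shared" -v u="$union" \
    'BEGIN { printf "%d %d %.1f\n", s, u, (u ? 100 * s / u : 0) }'
}

# Made-up example data
demo=$(mktemp -d)
printf '%s\n' chr1:100:A:T chr2:200:G:A chr3:300:C:CT > "$demo/vep_lof.txt"
printf '%s\n' chr1:100:A:T chr2:200:G:A chr4:400:T:C  > "$demo/annovar_lof.txt"
lof_concordance "$demo/vep_lof.txt" "$demo/annovar_lof.txt"
```

Variants that appear in only one list are the ones worth inspecting by hand, since they usually point to a transcript or splice-boundary difference.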
3. Structure of the VEP CSQ field
VEP packs all annotations into the CSQ key of the INFO column. Each variant gets one entry per affected transcript; entries are separated by commas, and the fields within an entry are separated by |. The ##INFO=<ID=CSQ,...> header line declares the field order.
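To make the layout concrete, here is a minimal awk sketch that reads the field order from the header and prints selected sub-fields for every CSQ entry. The function name `csq_extract` and the sample VCF are made up for illustration (real HGVSc/HGVSp values also carry a transcript or protein ID prefix), and the sketch assumes the header description contains `Format: ` followed by the |-separated field names.

```shell
#!/usr/bin/env bash
# Sketch: pull named CSQ sub-fields out of a VEP-annotated VCF with plain awk.

csq_extract() {
  # $1: uncompressed VCF path; $2: comma-separated CSQ sub-field names to print
  awk -v want="$2" '
    BEGIN { FS = "\t"; nw = split(want, w, ",") }
    /^##INFO=<ID=CSQ,/ {
      # The field order follows "Format: " in the header description
      fmt = $0
      sub(/.*Format: /, "", fmt)
      sub(/".*/, "", fmt)
      n = split(fmt, name, "|")
      for (i = 1; i <= n; i++) col[name[i]] = i
      next
    }
    /^#/ { next }
    {
      csq = ""
      m = split($8, info, ";")
      for (i = 1; i <= m; i++)
        if (info[i] ~ /^CSQ=/) csq = substr(info[i], 5)
      if (csq == "") next
      k = split(csq, entry, ",")            # one entry per allele/transcript pair
      for (e = 1; e <= k; e++) {
        split(entry[e], f, "|")
        line = $1 "\t" $2
        for (j = 1; j <= nw; j++) line = line "\t" f[col[w[j]]]
        print line
      }
    }
  ' "$1"
}

# Made-up two-transcript example
demo_vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Feature|HGVSc|HGVSp">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;CSQ=G|missense_variant|MODERATE|GENE1|TX1|c.1067A>G|p.Asp356Gly,G|intron_variant|MODIFIER|GENE1|TX2||\n'
} > "$demo_vcf"
csq_extract "$demo_vcf" SYMBOL,Feature,Consequence,IMPACT
```

For real data, `bcftools +split-vep` does the same job more robustly; the sketch is only meant to show how the header and the entries relate.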
Consequence
Standardized Sequence Ontology terms: missense_variant, stop_gained, frameshift_variant, splice_donor_variant, synonymous_variant, etc.
IMPACT
VEP's own severity tier: HIGH (LoF), MODERATE (missense), LOW (synonymous), MODIFIER (intron/UTR).
HGVSc / HGVSp
HGVS standard nomenclature. HGVSc is the coding-DNA description (e.g., c.1067A>G); HGVSp is the protein-level description (e.g., p.Asp356Gly). Both are required in clinical reports.
CLIN_SIG
ClinVar clinical significance: pathogenic, likely_pathogenic, uncertain_significance, likely_benign, benign.
4. Interactive simulation: ranking consequences by severity
Select different variant consequences to compare VEP's IMPACT tier with the general genetic intuition about pathogenicity. A variant usually overlaps several transcripts and gets one consequence per transcript. VEP reports all of them by default; --pick keeps a single entry per variant, chosen by an ordered list of criteria that includes consequence severity.
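As a rough illustration of severity ranking, the sketch below keeps, for each variant, the per-transcript row with the highest IMPACT tier. It is a deliberate simplification and not VEP's actual --pick logic; the function name, the four-column input layout, and the sample rows are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch: keep the highest-IMPACT row per variant (HIGH > MODERATE > LOW > MODIFIER).

most_severe_by_impact() {
  # stdin:  variant<TAB>transcript<TAB>consequence<TAB>IMPACT
  # stdout: one row per variant, the first row seen with the highest tier
  awk -F'\t' '
    BEGIN { rank["HIGH"] = 4; rank["MODERATE"] = 3; rank["LOW"] = 2; rank["MODIFIER"] = 1 }
    {
      r = ($4 in rank) ? rank[$4] : 0
      if (!($1 in best)) order[++n] = $1        # remember the input order of variants
      if (!($1 in best) || r > best[$1]) { best[$1] = r; row[$1] = $0 }
    }
    END { for (i = 1; i <= n; i++) print row[order[i]] }
  '
}

# Made-up per-transcript rows for two variants
printf '%b\n' \
  'chr1:12345:A:G\tTX1\tintron_variant\tMODIFIER' \
  'chr1:12345:A:G\tTX2\tmissense_variant\tMODERATE' \
  'chr2:500:C:T\tTX9\tstop_gained\tHIGH' \
  'chr2:500:C:T\tTX8\tsynonymous_variant\tLOW' \
  | most_severe_by_impact
```

Ties within a tier are resolved by input order here, which is one reason a transcript-aware rule (for example preferring MANE Select) is usually a better choice for reporting.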
5. In silico pathogenicity predictors
SIFT
One of the earliest missense predictors, based on cross-species conservation. Scores range from 0 to 1; values below 0.05 are called "deleterious".
PolyPhen-2
Combines conservation with protein structure features. Scores range from 0 to 1; values above 0.85 are called "probably damaging". Often reported alongside SIFT.
CADD
A machine-learning model that integrates 60+ features and assigns every SNV/indel a PHRED-scaled score. CADD ≥ 20 corresponds to roughly the top 1% most deleterious variants; ≥ 25 is often used as supporting evidence in ACMG/AMP classification.
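The link between a PHRED-scaled score and a percentile can be checked with a one-line calculation. The sketch assumes the usual PHRED scaling, score = -10 · log10(rank / total), where rank 1 is the most deleterious variant; the helper name `cadd_phred` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: PHRED scaling of a rank, score = -10 * log10(rank / total).

cadd_phred() {
  # $1: rank of the variant (1 = most deleterious); $2: number of ranked variants
  awk -v r="$1" -v n="$2" 'BEGIN { printf "%.1f\n", -10 * log(r / n) / log(10) }'
}

cadd_phred 1 100     # a variant at the top 1% boundary
cadd_phred 1 1000    # a variant at the top 0.1% boundary
```

Under this scaling every 10 points corresponds to another factor of ten in rank, so a score of 30 sits in the top 0.1%.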
REVEL
An ensemble built specifically for missense variants, combining 13 tools. Continuous score from 0 to 1; > 0.5 is commonly treated as likely pathogenic. Widely used in clinical diagnostics.
SpliceAI
A deep convolutional neural network, developed at Illumina, that predicts splice-site gain and loss. A delta score > 0.2 is worth a closer look. Can uncover deep-intronic cryptic splicing variants.
AlphaMissense
Released by DeepMind in 2023; builds on AlphaFold to predict missense pathogenicity. Continuous score from 0 to 1, covering all 71 million possible human missense variants; recommendations for its use under the ACMG framework already exist.
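The cutoffs quoted in this section can be applied to a score table with a short awk sketch. The column layout, the function name `flag_predictions`, and the sample scores are assumptions made for illustration, and the thresholds are simply the screening values mentioned above, not calibrated clinical cutoffs.

```shell
#!/usr/bin/env bash
# Sketch: flag which predictors cross the screening cutoffs quoted in this section.

flag_predictions() {
  # stdin: variant<TAB>SIFT<TAB>PolyPhen<TAB>CADD<TAB>REVEL<TAB>SpliceAI ("." = missing)
  awk -F'\t' '
    function hit(v, op, cut) {
      if (v == "." || v == "") return 0      # a missing score never counts as a hit
      v += 0
      if (op == "lt") return v < cut
      if (op == "ge") return v >= cut
      return v > cut
    }
    {
      flags = ""
      if (hit($2, "lt", 0.05)) flags = flags "SIFT,"        # SIFT: low score = deleterious
      if (hit($3, "gt", 0.85)) flags = flags "PolyPhen2,"
      if (hit($4, "ge", 25))   flags = flags "CADD,"
      if (hit($5, "gt", 0.5))  flags = flags "REVEL,"
      if (hit($6, "gt", 0.2))  flags = flags "SpliceAI,"
      sub(/,$/, "", flags)
      print $1 "\t" (flags == "" ? "none" : flags)
    }
  '
}

# Made-up scores for three variants
printf '%b\n' \
  'chr1:12345:A:G\t0.01\t0.97\t28.4\t0.81\t0.02' \
  'chr2:500:C:T\t0.40\t0.10\t12.0\t0.12\t.' \
  'chr3:900:G:A\t.\t.\t17.5\t.\t0.65' \
  | flag_predictions
```

Note that SIFT runs in the opposite direction from the other scores, which is why it gets a "less than" test; mixing up the direction is a common source of filtering mistakes.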
6. Pipeline command examples
# VEP: full annotation (with plugins)
# Note: gnomAD v4 names the max-group frequency field AF_grpmax (AF_popmax in earlier releases)
vep \
  -i cohort.pass.vcf.gz \
  -o cohort.vep.vcf.gz \
  --vcf --compress_output bgzip \
  --cache --offline --dir_cache /data/vep_cache \
  --assembly GRCh38 \
  --species homo_sapiens \
  --everything \
  --pick_allele_gene \
  --plugin REVEL,/data/plugins/revel_grch38.tsv.gz \
  --plugin SpliceAI,snv=/data/spliceai_scores.snv.vcf.gz,indel=/data/spliceai_scores.indel.vcf.gz \
  --plugin AlphaMissense,file=/data/AlphaMissense_hg38.tsv.gz \
  --custom /data/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNDN \
  --custom /data/gnomad.exomes.v4.1.sites.vcf.bgz,gnomAD,vcf,exact,0,AF,AF_grpmax \
  --fork 8

# ANNOVAR: convert the VCF to ANNOVAR input
convert2annovar.pl -format vcf4 cohort.pass.vcf.gz \
  -outfile cohort.avinput -allsample -withfreq

# Annotate against several databases at once (table_annovar)
# With -vcfinput, table_annovar.pl takes the VCF itself and converts it internally,
# so the .avinput file above is only needed when running without -vcfinput.
table_annovar.pl cohort.pass.vcf.gz humandb/ \
  -buildver hg38 \
  -out cohort \
  -remove \
  -protocol refGene,knownGene,clinvar_20240611,gnomad41_exome,dbnsfp42a,dbscsnv11 \
  -operation g,g,f,f,f,f \
  -nastring . \
  -vcfinput \
  -polish
# Output: cohort.hg38_multianno.txt (tab-separated table)
#         cohort.hg38_multianno.vcf (VCF with the annotations in INFO)

# SnpEff: download the database (once)
snpEff download -v GRCh38.105

# Annotate
snpEff -v GRCh38.105 cohort.pass.vcf.gz > cohort.snpeff.vcf
# This also writes the snpEff_summary.html and snpEff_genes.txt summary reports

# Add ClinVar annotations (SnpSift)
SnpSift annotate clinvar.vcf.gz cohort.snpeff.vcf > cohort.snpeff.clinvar.vcf

# Add dbNSFP (CADD, REVEL, SIFT, PolyPhen, etc.)
SnpSift dbnsfp -db dbNSFP4.7a.txt.gz cohort.snpeff.clinvar.vcf > cohort.full.vcf

# SnpSift filter: find high-impact rare variants
SnpSift filter \
  "(ANN[0].IMPACT = 'HIGH') && (dbNSFP_gnomAD_exomes_AF < 0.001 || na dbNSFP_gnomAD_exomes_AF)" \
  cohort.full.vcf > rare_lof.vcf

# bcftools +split-vep: extract the VEP CSQ field as a table
bcftools +split-vep cohort.vep.vcf.gz \
  -d -f '%CHROM\t%POS\t%SYMBOL\t%Consequence\t%IMPACT\t%HGVSp\t%gnomAD_AF\t%CLIN_SIG\n' \
  | head

# PVS1 candidates: HIGH impact and gnomAD AF < 0.0001 (or absent from gnomAD).
# CLIN_SIG is printed so the ClinVar status (pathogenic or no entry) can be reviewed per variant.
# -c gnomAD_AF:Float types the sub-field as a number so the AF comparison is numeric.
bcftools +split-vep cohort.vep.vcf.gz \
  -c gnomAD_AF:Float \
  -i 'IMPACT="HIGH" & (gnomAD_AF<0.0001 | gnomAD_AF=".")' \
  -f '%CHROM\t%POS\t%SYMBOL\t%Consequence\t%HGVSp\t%CLIN_SIG\n'

7. FAQ
Why can one variant have multiple CSQ entries?
A gene often has several transcripts, and VEP writes one CSQ entry for each transcript the variant overlaps. Use --pick / --pick_allele_gene / --per_gene to reduce them. Clinical reports typically use the MANE Select or Ensembl canonical transcript.
How much disk space does the VEP cache need?
What if CLIN_SIG shows conflicting interpretations?
Such variants carry the value conflicting_interpretations_of_pathogenicity. In practice, check the ClinVar website for the review status (1-4 stars) and submitter details, and defer to the most recent assertion with the highest review status.
What is MANE Select?
MANE Select is a single representative transcript per protein-coding gene, agreed on jointly by NCBI (RefSeq) and EMBL-EBI (Ensembl) and matching in both transcript sets. Use the --mane flag in VEP to mark it.
8. Quiz
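Q1. VEP, ANNOVAR, and SnpEff sometimes give different LoF calls for the same variant. What is the main reason?
Q2. Which class of variants does VEP IMPACT=HIGH usually correspond to?
Q3. Why does ACMG encourage using the MANE Select transcript?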