一、VCF 檔案格式
VCF (Variant Call Format) 是儲存變異 (SNP / indel / SV) 的標準格式。一行一個變異位點:
- 必填 8 欄:CHROM、POS、ID、REF、ALT、QUAL、FILTER、INFO
- FORMAT 欄定義樣本資訊(如 GT、DP、GQ)
- 每個樣本一欄:
0/0同型 ref;0/1雜合;1/1同型 alt;./.missing
VCF (Variant Call Format) is the standard for variants (SNPs / indels / SVs). One row per locus:
- 8 mandatory fields: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
- FORMAT defines per-sample fields (GT, DP, GQ...)
- One column per sample:
0/0hom-ref,0/1het,1/1hom-alt,./.missing
##fileformat=VCFv4.3 ##contig=<ID=chr1,length=248956422> ##INFO=<ID=DP,Number=1,Type=Integer> ##FORMAT=<ID=GT,Number=1,Type=String> ##FORMAT=<ID=DP,Number=1,Type=Integer> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 chr1 1000 rs1 A G 99 PASS DP=42 GT:DP 0/1:30 1/1:25 chr1 1500 . C T 65 PASS DP=18 GT:DP 0/0:18 0/1:14 chr2 2500 rs2 G A 80 LowQual DP=11 GT:DP ./.:0 0/1:11
二、用 vcfR 讀寫 VCF
# install.packages('vcfR')
library(vcfR)
# 讀檔(支援 .gz)/ Read (supports .gz)
vcf <- read.vcfR('data/sample.vcf.gz', verbose = FALSE)
vcf # 摘要:n variants × n samples
# 三大區段 / Three sections
queryMETA(vcf) # header meta
head(getFIX(vcf)) # 8 個固定欄
head(extract.gt(vcf)) # 樣本 genotypes
head(vcf@gt) # 原始 FORMAT 欄
# 過濾 / Filter
gt <- extract.gt(vcf, element = 'GT', as.numeric = FALSE)
dp <- extract.gt(vcf, element = 'DP', as.numeric = TRUE)
# 範例:過濾 DP < 10 設為 NA
gt[dp < 10] <- NA
# 變異到 tidy data.frame
df <- vcfR2tidy(vcf, single_frame = TRUE)$dat
head(df)
# 寫回 VCF / Write VCF
# write.vcf(vcf, 'data/filtered.vcf.gz')
三、VariantAnnotation:Bioc 標準
# BiocManager::install(c('VariantAnnotation','SNPlocs.Hsapiens.dbSNP155.GRCh38'))
# library(VariantAnnotation)
# 讀為 VCF S4 物件 / Read as VCF S4 object
# vcf <- readVcf('data/sample.vcf.gz', genome = 'hg38')
# vcf
# rowRanges(vcf) # GRanges of variants
# info(vcf) # INFO fields as DataFrame
# geno(vcf)$GT[1:5, ] # genotype matrix
# header(vcf)
# 篩選範圍(chromosome × position)/ Subset by region
# library(GenomicRanges)
# region <- GRanges('chr17', IRanges(7670000, 7680000)) # TP53 區域
# vcf_tp53 <- vcf[overlapsAny(rowRanges(vcf), region)]
# 註解變異對基因組元素的影響 / Annotate variant impact
# library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# loc <- locateVariants(vcf, txdb, AllVariants()) # exonic / intronic / intergenic / ...
# table(loc$LOCATION)
# 預測蛋白質改變 / Predict coding consequences
# library(BSgenome.Hsapiens.UCSC.hg38)
# coding <- predictCoding(vcf, txdb, seqSource = BSgenome.Hsapiens.UCSC.hg38)
# table(coding$CONSEQUENCE) # synonymous / nonsynonymous / nonsense / frameshift / ...
# 進階註解工具 / Advanced
# - VEP (command-line):最完整,offline 註解
# - SnpEff (command-line):類似 VEP
# - myvariant (R API):線上查詢 ClinVar / dbNSFP / gnomAD
cat('VariantAnnotation demo — install Bioc packages above to actually run.\n')
四、GWAS 全流程概覽
定型 (Genotyping)
SNP array (Illumina/Affymetrix) 或 WGS。原始輸出 PLINK .bed/.bim/.fam 或 VCF。
SNP array (Illumina/Affymetrix) or WGS. Outputs PLINK .bed/.bim/.fam or VCF.
QC 過濾
移除 call rate < 95% 的 SNP/樣本、MAF < 1%、HWE p < 1e-6、IBD 太相似(避免重抽親屬)。
Drop SNPs/samples with call rate < 95%, MAF < 1%, HWE p < 1e-6, related individuals (IBD).
群體分層校正
用 PCA 把祖源差異萃取成主成分,當共變項放進迴歸(避免假陽性)。
PCA on genotypes; include top PCs as covariates in regression to control population stratification.
關聯檢定
每個 SNP 對 phenotype 跑迴歸(連續→linear;二元→logistic)。校正 PCs、性別、年齡。
Regress phenotype on each SNP (linear for continuous, logistic for binary). Adjust for PCs, sex, age.
多重檢定校正
GWAS 顯著閾值 p < 5×10⁻⁸(Bonferroni for ~10⁶ 獨立 SNP)。
GWAS significance: p < 5×10⁻⁸ (Bonferroni for ~10⁶ independent SNPs).
視覺化
Manhattan plot + Q-Q plot;對 lead SNP 做 LD 區域圖、條件分析、micro-meta。
Manhattan + Q-Q; for lead SNPs do LocusZoom, conditional analysis, meta-analysis.
五、R 中跑 GWAS:SNPRelate / GWASTools / PLINK
# BiocManager::install(c('SNPRelate','GWASTools'))
# library(SNPRelate)
# 1) 把 PLINK / VCF 轉成 GDS(高效格式)
# snpgdsBED2GDS('data/cohort.bed','data/cohort.fam','data/cohort.bim',
# 'data/cohort.gds')
# snpgdsVCF2GDS('data/cohort.vcf.gz','data/cohort.gds', method='biallelic.only')
# 2) PCA on 基因型
# genofile <- snpgdsOpen('data/cohort.gds')
# pca <- snpgdsPCA(genofile, num.thread = 4)
# pop_pcs <- pca$eigenvect[, 1:5]
# # 把 pop_pcs 加進 phenotype 表,作為 GWAS 共變項
# 3) IBD:找親屬 / Kinship
# ibd <- snpgdsIBDMoM(genofile, maf = 0.05, missing.rate = 0.05)
# pairs <- snpgdsIBDSelection(ibd, kinship.cutoff = 0.0884)
# # 移除 kinship > 0.0884 的關係 (大於 second-degree relatives)
# 4) 簡單 GWAS:對每個 SNP 跑 logistic / linear regression
# library(GWASTools)
# 進階做法:用 GENESIS / SAIGE 處理混合模型
# 5) 最普及的是命令列 PLINK:
# plink2 --bfile cohort --pheno phenos.txt --covar covars.txt \
# --glm hide-covar --out gwas_results
# 然後在 R 中讀 gwas_results.glm.linear / .glm.logistic
cat('GWAS demo — production runs use PLINK + R for visualization.\n')
六、Manhattan plot 與 Q-Q plot
# install.packages('qqman')
library(qqman)
# 模擬 GWAS 結果 (chromosome, position, SNP, p)
set.seed(7)
gwas <- data.frame(
CHR = sample(1:22, 5000, replace = TRUE),
BP = sample(1e6:2.5e8, 5000),
SNP = paste0('rs', 1:5000),
P = c(runif(4990), 10^-runif(10, 8, 15))
)
# Manhattan plot
manhattan(gwas, chr = 'CHR', bp = 'BP', p = 'P', snp = 'SNP',
suggestiveline = -log10(1e-5),
genomewideline = -log10(5e-8),
annotatePval = 5e-8,
col = c('#1f4e8c','#888'))
# Q-Q plot:檢查群體分層或膨脹 (inflation)
qq(gwas$P)
# 計算 lambda inflation factor
chisq <- qchisq(1 - gwas$P, df = 1)
lambda <- median(chisq) / qchisq(0.5, df = 1)
cat('Lambda inflation =', round(lambda, 3),
ifelse(lambda > 1.05, '⚠ 可能未充分校正', 'OK'), '\n')
# 進階替代:CMplot 套件 (更靈活的 Manhattan + heatmap)
# install.packages('CMplot')
# CMplot(gwas, plot.type = 'm', threshold = 5e-8, file = 'jpg')
七、Manhattan plot 互動模擬
調整真實風險 SNP 數量與效應大小,看 Manhattan plot 上的「尖峰」如何浮現。
Adjust true causal SNPs & effect size, watch the Manhattan peaks emerge.
虛線:紅 = genome-wide significance (5×10⁻⁸);橘 = suggestive (1×10⁻⁵)
Dashed: red = genome-wide (5×10⁻⁸); orange = suggestive (1×10⁻⁵)
📝 自我檢測
1. GWAS 的「全基因組顯著閾值」是?
1. The "genome-wide significance" threshold in GWAS?
2. GWAS 中為什麼要把 PCA 主成分放進迴歸當共變項?
2. Why include PCA components as covariates in GWAS regression?
3. Q-Q plot 的尾部偏離對角線往上,且 lambda > 1.1,最可能原因?
3. Q-Q plot tail above diagonal, lambda > 1.1 — most likely cause?