Step 13: Variant & GWAS — R Bioinformatics Tutorial

VCF 概念

一、VCF 檔案格式

VCF (Variant Call Format) 是儲存變異 (SNP / indel / SV) 的標準格式。一行一個變異位點：

必填 8 欄：CHROM、POS、ID、REF、ALT、QUAL、FILTER、INFO
FORMAT 欄定義樣本資訊（如 GT、DP、GQ）
每個樣本一欄：0/0 同型 ref；0/1 雜合；1/1 同型 alt；./. missing

VCF (Variant Call Format) is the standard for variants (SNPs / indels / SVs). One row per locus:

8 mandatory fields: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
FORMAT defines per-sample fields (GT, DP, GQ...)
One column per sample: 0/0 hom-ref, 0/1 het, 1/1 hom-alt, ./. missing

##fileformat=VCFv4.3
##contig=<ID=chr1,length=248956422>
##INFO=<ID=DP,Number=1,Type=Integer>
##FORMAT=<ID=GT,Number=1,Type=String>
##FORMAT=<ID=DP,Number=1,Type=Integer>
#CHROM  POS    ID         REF  ALT  QUAL  FILTER  INFO       FORMAT     S1       S2
chr1    1000   rs1        A    G    99    PASS    DP=42      GT:DP      0/1:30   1/1:25
chr1    1500   .          C    T    65    PASS    DP=18      GT:DP      0/0:18   0/1:14
chr2    2500   rs2        G    A    80    LowQual DP=11      GT:DP      ./.:0    0/1:11

讀 VCF

二、用 vcfR 讀寫 VCF

# install.packages('vcfR')
library(vcfR)

# 讀檔（支援 .gz）/ Read (supports .gz)
vcf <- read.vcfR('data/sample.vcf.gz', verbose = FALSE)
vcf            # 摘要：n variants × n samples

# 三大區段 / Three sections
queryMETA(vcf)            # header meta
head(getFIX(vcf))         # 8 個固定欄
head(extract.gt(vcf))     # 樣本 genotypes
head(vcf@gt)              # 原始 FORMAT 欄

# 過濾 / Filter
gt <- extract.gt(vcf, element = 'GT', as.numeric = FALSE)
dp <- extract.gt(vcf, element = 'DP', as.numeric = TRUE)

# 範例：過濾 DP < 10 設為 NA
gt[dp < 10] <- NA

# 變異到 tidy data.frame
df <- vcfR2tidy(vcf, single_frame = TRUE)$dat
head(df)

# 寫回 VCF / Write VCF
# write.vcf(vcf, 'data/filtered.vcf.gz')

VariantAnnotation

三、VariantAnnotation：Bioc 標準

# BiocManager::install(c('VariantAnnotation','SNPlocs.Hsapiens.dbSNP155.GRCh38'))
# library(VariantAnnotation)

# 讀為 VCF S4 物件 / Read as VCF S4 object
# vcf <- readVcf('data/sample.vcf.gz', genome = 'hg38')
# vcf
# rowRanges(vcf)              # GRanges of variants
# info(vcf)                   # INFO fields as DataFrame
# geno(vcf)$GT[1:5, ]         # genotype matrix
# header(vcf)

# 篩選範圍（chromosome × position）/ Subset by region
# library(GenomicRanges)
# region <- GRanges('chr17', IRanges(7670000, 7680000))    # TP53 區域
# vcf_tp53 <- vcf[overlapsAny(rowRanges(vcf), region)]

# 註解變異對基因組元素的影響 / Annotate variant impact
# library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# loc <- locateVariants(vcf, txdb, AllVariants())     # exonic / intronic / intergenic / ...
# table(loc$LOCATION)

# 預測蛋白質改變 / Predict coding consequences
# library(BSgenome.Hsapiens.UCSC.hg38)
# coding <- predictCoding(vcf, txdb, seqSource = BSgenome.Hsapiens.UCSC.hg38)
# table(coding$CONSEQUENCE)   # synonymous / nonsynonymous / nonsense / frameshift / ...

# 進階註解工具 / Advanced
# - VEP (command-line)：最完整，offline 註解
# - SnpEff (command-line)：類似 VEP
# - myvariant (R API)：線上查詢 ClinVar / dbNSFP / gnomAD

cat('VariantAnnotation demo — install Bioc packages above to actually run.\n')

GWAS 流程

四、GWAS 全流程概覽

定型 (Genotyping)

SNP array (Illumina/Affymetrix) 或 WGS。原始輸出 PLINK .bed/.bim/.fam 或 VCF。

SNP array (Illumina/Affymetrix) or WGS. Outputs PLINK .bed/.bim/.fam or VCF.

QC 過濾

移除 call rate < 95% 的 SNP/樣本、MAF < 1%、HWE p < 1e-6、IBD 太相似（避免重抽親屬）。

Drop SNPs/samples with call rate < 95%, MAF < 1%, HWE p < 1e-6, related individuals (IBD).

群體分層校正

用 PCA 把祖源差異萃取成主成分，當共變項放進迴歸（避免假陽性）。

PCA on genotypes; include top PCs as covariates in regression to control population stratification.

關聯檢定

每個 SNP 對 phenotype 跑迴歸（連續→linear；二元→logistic）。校正 PCs、性別、年齡。

Regress phenotype on each SNP (linear for continuous, logistic for binary). Adjust for PCs, sex, age.

多重檢定校正

GWAS 顯著閾值 p < 5×10⁻⁸（Bonferroni for ~10⁶ 獨立 SNP）。

GWAS significance: p < 5×10⁻⁸ (Bonferroni for ~10⁶ independent SNPs).

視覺化

Manhattan plot + Q-Q plot；對 lead SNP 做 LD 區域圖、條件分析、micro-meta。

Manhattan + Q-Q; for lead SNPs do LocusZoom, conditional analysis, meta-analysis.

GWAS 工具

五、R 中跑 GWAS：SNPRelate / GWASTools / PLINK

# BiocManager::install(c('SNPRelate','GWASTools'))
# library(SNPRelate)

# 1) 把 PLINK / VCF 轉成 GDS（高效格式）
# snpgdsBED2GDS('data/cohort.bed','data/cohort.fam','data/cohort.bim',
#               'data/cohort.gds')
# snpgdsVCF2GDS('data/cohort.vcf.gz','data/cohort.gds', method='biallelic.only')

# 2) PCA on 基因型
# genofile <- snpgdsOpen('data/cohort.gds')
# pca <- snpgdsPCA(genofile, num.thread = 4)
# pop_pcs <- pca$eigenvect[, 1:5]
# # 把 pop_pcs 加進 phenotype 表，作為 GWAS 共變項

# 3) IBD：找親屬 / Kinship
# ibd <- snpgdsIBDMoM(genofile, maf = 0.05, missing.rate = 0.05)
# pairs <- snpgdsIBDSelection(ibd, kinship.cutoff = 0.0884)
# # 移除 kinship > 0.0884 的關係 (大於 second-degree relatives)

# 4) 簡單 GWAS：對每個 SNP 跑 logistic / linear regression
# library(GWASTools)
# 進階做法：用 GENESIS / SAIGE 處理混合模型

# 5) 最普及的是命令列 PLINK：
#   plink2 --bfile cohort --pheno phenos.txt --covar covars.txt \
#          --glm hide-covar --out gwas_results
# 然後在 R 中讀 gwas_results.glm.linear / .glm.logistic

cat('GWAS demo — production runs use PLINK + R for visualization.\n')

視覺化

六、Manhattan plot 與 Q-Q plot

# install.packages('qqman')
library(qqman)

# 模擬 GWAS 結果 (chromosome, position, SNP, p)
set.seed(7)
gwas <- data.frame(
  CHR = sample(1:22, 5000, replace = TRUE),
  BP  = sample(1e6:2.5e8, 5000),
  SNP = paste0('rs', 1:5000),
  P   = c(runif(4990), 10^-runif(10, 8, 15))
)

# Manhattan plot
manhattan(gwas, chr = 'CHR', bp = 'BP', p = 'P', snp = 'SNP',
          suggestiveline = -log10(1e-5),
          genomewideline = -log10(5e-8),
          annotatePval   = 5e-8,
          col = c('#1f4e8c','#888'))

# Q-Q plot：檢查群體分層或膨脹 (inflation)
qq(gwas$P)

# 計算 lambda inflation factor
chisq <- qchisq(1 - gwas$P, df = 1)
lambda <- median(chisq) / qchisq(0.5, df = 1)
cat('Lambda inflation =', round(lambda, 3),
    ifelse(lambda > 1.05, '⚠ 可能未充分校正', 'OK'), '\n')

# 進階替代：CMplot 套件 (更靈活的 Manhattan + heatmap)
# install.packages('CMplot')
# CMplot(gwas, plot.type = 'm', threshold = 5e-8, file = 'jpg')

互動

七、Manhattan plot 互動模擬

調整真實風險 SNP 數量與效應大小，看 Manhattan plot 上的「尖峰」如何浮現。

Adjust true causal SNPs & effect size, watch the Manhattan peaks emerge.

總 SNP 數 3000

真實風險 SNP 5

效應 (-log10 p) 10

— —

虛線：紅 = genome-wide significance (5×10⁻⁸)；橘 = suggestive (1×10⁻⁵)

Dashed: red = genome-wide (5×10⁻⁸); orange = suggestive (1×10⁻⁵)

📝 自我檢測

1. GWAS 的「全基因組顯著閾值」是？

1. The "genome-wide significance" threshold in GWAS?

A. p < 0.05A. p < 0.05

B. p < 0.01B. p < 0.01

C. p < 5×10⁻⁸（Bonferroni for ~10⁶ 獨立 SNP）C. p < 5×10⁻⁸ (Bonferroni for ~10⁶ independent SNPs)

D. FDR < 0.05D. FDR < 0.05

2. GWAS 中為什麼要把 PCA 主成分放進迴歸當共變項？

2. Why include PCA components as covariates in GWAS regression?

A. 沒原因，只是傳統A. No reason, just tradition

B. 為了讓計算更快B. To speed up computation

C. 校正群體分層──不同祖源族群會讓無關 SNP 假陽性顯著C. Adjust for population stratification — ancestry differences create spurious associations

D. 為了降低樣本數需求D. To reduce required sample size

3. Q-Q plot 的尾部偏離對角線往上，且 lambda > 1.1，最可能原因？

3. Q-Q plot tail above diagonal, lambda > 1.1 — most likely cause?

A. 結果完美A. Results are perfect

B. P-value 膨脹（如群體分層、cryptic relatedness 未校正）B. P-value inflation (e.g. uncorrected stratification or cryptic relatedness)

C. 樣本數太多C. Too many samples

D. SNP 太少D. Too few SNPs