一、DNA 甲基化是什麼?
DNA 甲基化是一種表觀遺傳修飾,主要發生在 CpG 位點的胞嘧啶 (5-methylcytosine, 5mC)。它調控基因表現、染色質結構,並在發育、老化、癌症中扮演關鍵角色。
常見定量值:
- β value:甲基化比例 (0-1),0 = 完全未甲基化、1 = 完全甲基化。直觀。
- M value:log₂(β / (1-β)),統計上更穩健(變異數齊一),用於 t-test、回歸。
- 實務口訣:分析用 M value、視覺化用 β value。
DNA methylation is an epigenetic modification, mostly at CpG cytosines (5-methylcytosine, 5mC). It regulates gene expression and chromatin structure, with key roles in development, aging, and cancer.
Common units:
- β value: methylation fraction (0-1). Intuitive.
- M value: log₂(β / (1-β)). Statistically better behaved (homoscedastic) — use for t-tests/regression.
- Rule of thumb: M for stats, β for plotting.
| 平台 | 探針數 | 特色 |
|---|---|---|
| Illumina 27K | 27,578 | 第一代 |
| Illumina 450K | ~485,000 | 老資料常見 |
| Illumina EPIC v1 | ~865,000 | 450K 升級版 |
| Illumina EPIC v2 | ~935,000 | 最新 |
| WGBS / RRBS | 全基因組 | 單鹼基 |
二、minfi:Illumina array 標準流程
# BiocManager::install(c('minfi','IlluminaHumanMethylationEPICmanifest',
# 'IlluminaHumanMethylationEPICanno.ilm10b4.hg19'))
# library(minfi); library(IlluminaHumanMethylationEPICmanifest)
# 1) 讀 IDAT 檔(每樣本兩個:紅、綠 channel)
# baseDir <- 'data/idat'
# targets <- read.metharray.sheet(baseDir) # 讀 sample sheet (samples.csv)
# rgSet <- read.metharray.exp(targets = targets)
# rgSet
# # RGChannelSet × n 樣本
# 2) QC 報告
# qcReport(rgSet, sampNames = targets$Sample_Name, pdf = 'results/qc.pdf')
# 3) 計算 detection p-values,剔除壞探針
# detP <- detectionP(rgSet)
# keep <- colMeans(detP) < 0.05 # 樣本層 QC
# rgSet <- rgSet[, keep]
# 4) 標準化 (推薦:preprocessFunnorm 用於跨組比較;preprocessQuantile 適合單組內)
# mset <- preprocessFunnorm(rgSet) # GenomicRatioSet
# # mset <- preprocessQuantile(rgSet)
# # mset <- preprocessNoob(rgSet) # background correction only
# 5) 取出 β / M
# beta <- getBeta(mset) # 0-1
# M <- getM(mset) # log2(β/(1-β))
# 6) 過濾:跨反應 (cross-reactive) 探針、SNP 干擾、X/Y 染色體
# library(minfiData)
# annot <- getAnnotation(mset)
# keep_pr <- !annot$Name %in% xreactive_probes &
# annot$chr %in% paste0('chr', 1:22)
# mset <- mset[keep_pr, ]
cat('minfi demo — install Bioc packages above to run on real .idat data.\n')
三、DMP(差異甲基化探針)與 DMR(區域)
- DMP (Differentially Methylated Position):單一 CpG 探針的差異。用 limma 配 M value。
- DMR (Differentially Methylated Region):相鄰多個 CpG 同向差異(更生物意義)。常用
DMRcate、bumphunter。
- DMP (Differentially Methylated Position): single-probe differences via limma on M values.
- DMR (Differentially Methylated Region): contiguous CpGs differing in the same direction (more biological). Use
DMRcateorbumphunter.
# DMP via limma / DMP detection
# library(limma); library(minfi)
# group <- factor(targets$Sample_Group, levels = c('ctrl','case'))
# design <- model.matrix(~ group)
# fit <- lmFit(M, design); fit <- eBayes(fit)
# dmp <- topTable(fit, coef = 2, number = Inf,
# genelist = annot[match(rownames(M), annot$Name), c('chr','pos','UCSC_RefGene_Name')])
# head(dmp[order(dmp$adj.P.Val), ])
# DMR via DMRcate
# library(DMRcate)
# myAnno <- cpg.annotate(object = M, datatype = 'array',
# what = 'M', arraytype = 'EPIC',
# analysis.type = 'differential',
# design = design, coef = 2)
# dmrs <- dmrcate(myAnno, lambda = 1000, C = 2) # lambda = 1kb 視窗, C = ≥2 cpg
# results.ranges <- extractRanges(dmrs)
# results.ranges # GRanges of DMRs
# 視覺化區域 / Visualize a DMR
# DMR.plot(ranges = results.ranges, dmr = 1,
# CpGs = beta, what = 'Beta',
# arraytype = 'EPIC', phen.col = c('blue','red')[group])
cat('DMP/DMR demo — install limma + DMRcate to actually run.\n')
四、ChAMP:一鍵流程
ChAMP 把 minfi 的所有步驟(讀檔 → QC → 標準化 → 批次校正 → DMP/DMR → 富集)包成一條 pipeline,適合初學者或快速跑分析。
ChAMP wraps the whole minfi pipeline (load → QC → norm → batch correct → DMP/DMR → enrichment) — great for beginners or quick exploratory runs.
# BiocManager::install('ChAMP')
# library(ChAMP)
# myLoad <- champ.load(directory = 'data/idat', arraytype = 'EPIC')
# champ.QC(beta = myLoad$beta, pheno = myLoad$pd$Sample_Group)
# myNorm <- champ.norm(beta = myLoad$beta, arraytype = 'EPIC', method = 'BMIQ')
# champ.SVD(beta = myNorm, pd = myLoad$pd) # 看主要變異來源
# myCombat <- champ.runCombat(beta = myNorm, pd = myLoad$pd, batchname = 'Slide')
# myDMP <- champ.DMP(beta = myCombat, pheno = myLoad$pd$Sample_Group)
# myDMR <- champ.DMR(beta = myCombat, pheno = myLoad$pd$Sample_Group, method = 'Bumphunter')
# myBlock <- champ.Block(beta = myCombat, pheno = myLoad$pd$Sample_Group)
# myGSEA <- champ.GSEA(beta = myCombat, DMP = myDMP[[1]], DMR = myDMR, arraytype = 'EPIC')
cat('ChAMP demo — install ChAMP for end-to-end pipeline.\n')
五、Bisulfite-seq (WGBS / RRBS) 工具
| 工具 | 用途 |
|---|---|
methylKit | 從 BAM/cov 計算與 DMC 偵測 |
bsseq | BSseq + smoothing |
DSS | 貝氏 DMR |
methrix | 超大型 WGBS |
# library(methylKit)
# files <- list('s1' = 'data/s1.cov', 's2' = 'data/s2.cov')
# myobj <- methRead(files, sample.id = list('s1','s2'),
# assembly = 'hg38', treatment = c(0,1),
# context = 'CpG', mincov = 4)
# myobj_norm <- normalizeCoverage(myobj)
# meth <- unite(myobj_norm)
# diff <- calculateDiffMeth(meth)
# diff_25 <- getMethylDiff(diff, difference = 25, qvalue = 0.01)
cat('WGBS demo — install methylKit / bsseq for real WGBS analysis.\n')
六、表觀遺傳年齡 (Epigenetic clock)
從 DNA 甲基化資料估計「生物年齡」──Horvath (2013) 用 353 個 CpG 預測年齡,誤差 < 4 歲。後續 Hannum、PhenoAge、GrimAge 增加疾病/壽命相關 CpG。生物年齡與實際年齡的差距 (age acceleration) 與多種疾病風險相關。
Estimate "biological age" from DNA methylation. Horvath (2013) used 353 CpGs with error < 4 years. Hannum, PhenoAge, GrimAge add disease/lifespan signatures. The gap between biological and chronological age (age acceleration) correlates with many disease risks.
# 線上工具:https://dnamage.genetics.ucla.edu (Horvath)
# R 套件 / R packages
# install.packages('methylclock')
# library(methylclock)
# ages <- DNAmAge(beta_matrix, clocks = c('Horvath','Hannum','PhenoAge','GrimAge'))
# head(ages)
# 自己對照 phenotype 看 age acceleration
# pheno$age_accel <- residuals(lm(ages$Horvath ~ pheno$Age))
# 然後與疾病/處死率做 Cox 等迴歸
cat('Epigenetic age demo — install methylclock for clock estimation.\n')
📝 自我檢測
1. 為什麼建議分析用 M value、視覺化用 β value?
1. Why use M for stats and β for plotting?
2. DMP 與 DMR 的關係?
2. Relationship between DMP and DMR?
3. 想分析 EPIC array 但只想用一條 pipeline 包辦所有步驟,最方便的選擇?
3. To analyze an EPIC array end-to-end with one pipeline — best choice?