STEP 14 / 16

DNA 甲基化分析

Illumina 450K/EPIC array、minfi、ChAMP──DMP/DMR 偵測與表觀年齡。

Illumina 450K/EPIC arrays, minfi, ChAMP — DMP/DMR detection & epigenetic age.

一、DNA 甲基化是什麼?

DNA 甲基化是一種表觀遺傳修飾,主要發生在 CpG 位點的胞嘧啶 (5-methylcytosine, 5mC)。它調控基因表現、染色質結構,並在發育、老化、癌症中扮演關鍵角色。

常見定量值

  • β value:甲基化比例 (0-1),0 = 完全未甲基化、1 = 完全甲基化。直觀。
  • M value:log₂(β / (1-β)),統計上更穩健(變異數齊一),用於 t-test、回歸。
  • 實務口訣:分析用 M value、視覺化用 β value。

DNA methylation is an epigenetic modification, mostly at CpG cytosines (5-methylcytosine, 5mC). It regulates gene expression and chromatin structure, with key roles in development, aging, and cancer.

Common units:

  • β value: methylation fraction (0-1). Intuitive.
  • M value: log₂(β / (1-β)). Statistically better behaved (homoscedastic) — use for t-tests/regression.
  • Rule of thumb: M for stats, β for plotting.
平台探針數特色
Illumina 27K27,578第一代
Illumina 450K~485,000老資料常見
Illumina EPIC v1~865,000450K 升級版
Illumina EPIC v2~935,000最新
WGBS / RRBS全基因組單鹼基

二、minfi:Illumina array 標準流程

# BiocManager::install(c('minfi','IlluminaHumanMethylationEPICmanifest',
#                        'IlluminaHumanMethylationEPICanno.ilm10b4.hg19'))
# library(minfi); library(IlluminaHumanMethylationEPICmanifest)

# 1) 讀 IDAT 檔(每樣本兩個:紅、綠 channel)
# baseDir <- 'data/idat'
# targets <- read.metharray.sheet(baseDir)        # 讀 sample sheet (samples.csv)
# rgSet   <- read.metharray.exp(targets = targets)
# rgSet
# # RGChannelSet × n 樣本

# 2) QC 報告
# qcReport(rgSet, sampNames = targets$Sample_Name, pdf = 'results/qc.pdf')

# 3) 計算 detection p-values,剔除壞探針
# detP <- detectionP(rgSet)
# keep <- colMeans(detP) < 0.05      # 樣本層 QC
# rgSet <- rgSet[, keep]

# 4) 標準化 (推薦:preprocessFunnorm 用於跨組比較;preprocessQuantile 適合單組內)
# mset <- preprocessFunnorm(rgSet)         # GenomicRatioSet
# # mset <- preprocessQuantile(rgSet)
# # mset <- preprocessNoob(rgSet)          # background correction only

# 5) 取出 β / M
# beta <- getBeta(mset)              # 0-1
# M    <- getM(mset)                 # log2(β/(1-β))

# 6) 過濾:跨反應 (cross-reactive) 探針、SNP 干擾、X/Y 染色體
# library(minfiData)
# annot <- getAnnotation(mset)
# keep_pr <- !annot$Name %in% xreactive_probes &
#            annot$chr %in% paste0('chr', 1:22)
# mset <- mset[keep_pr, ]

cat('minfi demo — install Bioc packages above to run on real .idat data.\n')

三、DMP(差異甲基化探針)與 DMR(區域)

  • DMP (Differentially Methylated Position):單一 CpG 探針的差異。用 limma 配 M value。
  • DMR (Differentially Methylated Region):相鄰多個 CpG 同向差異(更生物意義)。常用 DMRcatebumphunter
  • DMP (Differentially Methylated Position): single-probe differences via limma on M values.
  • DMR (Differentially Methylated Region): contiguous CpGs differing in the same direction (more biological). Use DMRcate or bumphunter.
# DMP via limma / DMP detection
# library(limma); library(minfi)
# group <- factor(targets$Sample_Group, levels = c('ctrl','case'))
# design <- model.matrix(~ group)
# fit <- lmFit(M, design); fit <- eBayes(fit)
# dmp <- topTable(fit, coef = 2, number = Inf,
#                 genelist = annot[match(rownames(M), annot$Name), c('chr','pos','UCSC_RefGene_Name')])
# head(dmp[order(dmp$adj.P.Val), ])

# DMR via DMRcate
# library(DMRcate)
# myAnno <- cpg.annotate(object = M, datatype = 'array',
#                        what = 'M', arraytype = 'EPIC',
#                        analysis.type = 'differential',
#                        design = design, coef = 2)
# dmrs <- dmrcate(myAnno, lambda = 1000, C = 2)     # lambda = 1kb 視窗, C = ≥2 cpg
# results.ranges <- extractRanges(dmrs)
# results.ranges        # GRanges of DMRs

# 視覺化區域 / Visualize a DMR
# DMR.plot(ranges = results.ranges, dmr = 1,
#          CpGs = beta, what = 'Beta',
#          arraytype = 'EPIC', phen.col = c('blue','red')[group])

cat('DMP/DMR demo — install limma + DMRcate to actually run.\n')

四、ChAMP:一鍵流程

ChAMP 把 minfi 的所有步驟(讀檔 → QC → 標準化 → 批次校正 → DMP/DMR → 富集)包成一條 pipeline,適合初學者或快速跑分析。

ChAMP wraps the whole minfi pipeline (load → QC → norm → batch correct → DMP/DMR → enrichment) — great for beginners or quick exploratory runs.

# BiocManager::install('ChAMP')
# library(ChAMP)

# myLoad <- champ.load(directory = 'data/idat', arraytype = 'EPIC')
# champ.QC(beta = myLoad$beta, pheno = myLoad$pd$Sample_Group)
# myNorm <- champ.norm(beta = myLoad$beta, arraytype = 'EPIC', method = 'BMIQ')
# champ.SVD(beta = myNorm, pd = myLoad$pd)              # 看主要變異來源
# myCombat <- champ.runCombat(beta = myNorm, pd = myLoad$pd, batchname = 'Slide')
# myDMP <- champ.DMP(beta = myCombat, pheno = myLoad$pd$Sample_Group)
# myDMR <- champ.DMR(beta = myCombat, pheno = myLoad$pd$Sample_Group, method = 'Bumphunter')
# myBlock <- champ.Block(beta = myCombat, pheno = myLoad$pd$Sample_Group)
# myGSEA <- champ.GSEA(beta = myCombat, DMP = myDMP[[1]], DMR = myDMR, arraytype = 'EPIC')

cat('ChAMP demo — install ChAMP for end-to-end pipeline.\n')

五、Bisulfite-seq (WGBS / RRBS) 工具

工具用途
methylKit從 BAM/cov 計算與 DMC 偵測
bsseqBSseq + smoothing
DSS貝氏 DMR
methrix超大型 WGBS
# library(methylKit)
# files <- list('s1' = 'data/s1.cov', 's2' = 'data/s2.cov')
# myobj <- methRead(files, sample.id = list('s1','s2'),
#                   assembly = 'hg38', treatment = c(0,1),
#                   context = 'CpG', mincov = 4)
# myobj_norm <- normalizeCoverage(myobj)
# meth <- unite(myobj_norm)
# diff <- calculateDiffMeth(meth)
# diff_25 <- getMethylDiff(diff, difference = 25, qvalue = 0.01)

cat('WGBS demo — install methylKit / bsseq for real WGBS analysis.\n')

六、表觀遺傳年齡 (Epigenetic clock)

從 DNA 甲基化資料估計「生物年齡」──Horvath (2013) 用 353 個 CpG 預測年齡,誤差 < 4 歲。後續 Hannum、PhenoAge、GrimAge 增加疾病/壽命相關 CpG。生物年齡與實際年齡的差距 (age acceleration) 與多種疾病風險相關。

Estimate "biological age" from DNA methylation. Horvath (2013) used 353 CpGs with error < 4 years. Hannum, PhenoAge, GrimAge add disease/lifespan signatures. The gap between biological and chronological age (age acceleration) correlates with many disease risks.

# 線上工具:https://dnamage.genetics.ucla.edu (Horvath)

# R 套件 / R packages
# install.packages('methylclock')
# library(methylclock)
# ages <- DNAmAge(beta_matrix, clocks = c('Horvath','Hannum','PhenoAge','GrimAge'))
# head(ages)

# 自己對照 phenotype 看 age acceleration
# pheno$age_accel <- residuals(lm(ages$Horvath ~ pheno$Age))
# 然後與疾病/處死率做 Cox 等迴歸

cat('Epigenetic age demo — install methylclock for clock estimation.\n')

📝 自我檢測

1. 為什麼建議分析用 M value、視覺化用 β value?

1. Why use M for stats and β for plotting?

A. M 比較好看A. M looks prettier
B. β 算起來比較快B. β is faster to compute
C. M 在 0/1 兩端較對稱、變異數齊一,符合 t-test 假設;β 直觀,適合視覺化C. M is symmetric near 0/1 and homoscedastic (better for t-test); β is intuitive for plots
D. 兩者無差別D. No real difference

2. DMP 與 DMR 的關係?

2. Relationship between DMP and DMR?

A. 完全等價A. Identical
B. DMP 看區域、DMR 看單點B. DMP = region, DMR = point
C. DMP 看單點 CpG、DMR 看相鄰 CpG 共同變化的區域,後者通常更具生物學意義C. DMP = single CpG, DMR = co-varying neighbors — DMR is usually more biologically meaningful
D. DMP 是新版、DMR 是舊版D. DMP is new; DMR is old

3. 想分析 EPIC array 但只想用一條 pipeline 包辦所有步驟,最方便的選擇?

3. To analyze an EPIC array end-to-end with one pipeline — best choice?

A. ChAMPA. ChAMP
B. DESeq2B. DESeq2
C. vcfRC. vcfR
D. tidyverseD. tidyverse