Step 14: DNA Methylation — R Bioinformatics Tutorial

基本概念

一、DNA 甲基化是什麼？

DNA 甲基化是一種表觀遺傳修飾，主要發生在 CpG 位點的胞嘧啶 (5-methylcytosine, 5mC)。它調控基因表現、染色質結構，並在發育、老化、癌症中扮演關鍵角色。

常見定量值：

β value：甲基化比例 (0-1)，0 = 完全未甲基化、1 = 完全甲基化。直觀。
M value：log₂(β / (1-β))，統計上更穩健（變異數齊一），用於 t-test、回歸。
實務口訣：分析用 M value、視覺化用 β value。

DNA methylation is an epigenetic modification, mostly at CpG cytosines (5-methylcytosine, 5mC). It regulates gene expression and chromatin structure, with key roles in development, aging, and cancer.

Common units:

β value: methylation fraction (0-1). Intuitive.
M value: log₂(β / (1-β)). Statistically better behaved (homoscedastic) — use for t-tests/regression.
Rule of thumb: M for stats, β for plotting.

平台	探針數	特色
Illumina 27K	27,578	第一代
Illumina 450K	~485,000	老資料常見
Illumina EPIC v1	~865,000	450K 升級版
Illumina EPIC v2	~935,000	最新
WGBS / RRBS	全基因組	單鹼基

minfi 流程

二、minfi：Illumina array 標準流程

# BiocManager::install(c('minfi','IlluminaHumanMethylationEPICmanifest',
#                        'IlluminaHumanMethylationEPICanno.ilm10b4.hg19'))
# library(minfi); library(IlluminaHumanMethylationEPICmanifest)

# 1) 讀 IDAT 檔（每樣本兩個：紅、綠 channel）
# baseDir <- 'data/idat'
# targets <- read.metharray.sheet(baseDir)        # 讀 sample sheet (samples.csv)
# rgSet   <- read.metharray.exp(targets = targets)
# rgSet
# # RGChannelSet × n 樣本

# 2) QC 報告
# qcReport(rgSet, sampNames = targets$Sample_Name, pdf = 'results/qc.pdf')

# 3) 計算 detection p-values，剔除壞探針
# detP <- detectionP(rgSet)
# keep <- colMeans(detP) < 0.05      # 樣本層 QC
# rgSet <- rgSet[, keep]

# 4) 標準化 (推薦：preprocessFunnorm 用於跨組比較；preprocessQuantile 適合單組內)
# mset <- preprocessFunnorm(rgSet)         # GenomicRatioSet
# # mset <- preprocessQuantile(rgSet)
# # mset <- preprocessNoob(rgSet)          # background correction only

# 5) 取出 β / M
# beta <- getBeta(mset)              # 0-1
# M    <- getM(mset)                 # log2(β/(1-β))

# 6) 過濾：跨反應 (cross-reactive) 探針、SNP 干擾、X/Y 染色體
# library(minfiData)
# annot <- getAnnotation(mset)
# keep_pr <- !annot$Name %in% xreactive_probes &
#            annot$chr %in% paste0('chr', 1:22)
# mset <- mset[keep_pr, ]

cat('minfi demo — install Bioc packages above to run on real .idat data.\n')

DMP / DMR

三、DMP（差異甲基化探針）與 DMR（區域）

DMP (Differentially Methylated Position)：單一 CpG 探針的差異。用 limma 配 M value。
DMR (Differentially Methylated Region)：相鄰多個 CpG 同向差異（更生物意義）。常用 DMRcate、bumphunter。

DMP (Differentially Methylated Position): single-probe differences via limma on M values.
DMR (Differentially Methylated Region): contiguous CpGs differing in the same direction (more biological). Use DMRcate or bumphunter.

# DMP via limma / DMP detection
# library(limma); library(minfi)
# group <- factor(targets$Sample_Group, levels = c('ctrl','case'))
# design <- model.matrix(~ group)
# fit <- lmFit(M, design); fit <- eBayes(fit)
# dmp <- topTable(fit, coef = 2, number = Inf,
#                 genelist = annot[match(rownames(M), annot$Name), c('chr','pos','UCSC_RefGene_Name')])
# head(dmp[order(dmp$adj.P.Val), ])

# DMR via DMRcate
# library(DMRcate)
# myAnno <- cpg.annotate(object = M, datatype = 'array',
#                        what = 'M', arraytype = 'EPIC',
#                        analysis.type = 'differential',
#                        design = design, coef = 2)
# dmrs <- dmrcate(myAnno, lambda = 1000, C = 2)     # lambda = 1kb 視窗, C = ≥2 cpg
# results.ranges <- extractRanges(dmrs)
# results.ranges        # GRanges of DMRs

# 視覺化區域 / Visualize a DMR
# DMR.plot(ranges = results.ranges, dmr = 1,
#          CpGs = beta, what = 'Beta',
#          arraytype = 'EPIC', phen.col = c('blue','red')[group])

cat('DMP/DMR demo — install limma + DMRcate to actually run.\n')

ChAMP

四、ChAMP：一鍵流程

ChAMP 把 minfi 的所有步驟（讀檔 → QC → 標準化 → 批次校正 → DMP/DMR → 富集）包成一條 pipeline，適合初學者或快速跑分析。

ChAMP wraps the whole minfi pipeline (load → QC → norm → batch correct → DMP/DMR → enrichment) — great for beginners or quick exploratory runs.

# BiocManager::install('ChAMP')
# library(ChAMP)

# myLoad <- champ.load(directory = 'data/idat', arraytype = 'EPIC')
# champ.QC(beta = myLoad$beta, pheno = myLoad$pd$Sample_Group)
# myNorm <- champ.norm(beta = myLoad$beta, arraytype = 'EPIC', method = 'BMIQ')
# champ.SVD(beta = myNorm, pd = myLoad$pd)              # 看主要變異來源
# myCombat <- champ.runCombat(beta = myNorm, pd = myLoad$pd, batchname = 'Slide')
# myDMP <- champ.DMP(beta = myCombat, pheno = myLoad$pd$Sample_Group)
# myDMR <- champ.DMR(beta = myCombat, pheno = myLoad$pd$Sample_Group, method = 'Bumphunter')
# myBlock <- champ.Block(beta = myCombat, pheno = myLoad$pd$Sample_Group)
# myGSEA <- champ.GSEA(beta = myCombat, DMP = myDMP[[1]], DMR = myDMR, arraytype = 'EPIC')

cat('ChAMP demo — install ChAMP for end-to-end pipeline.\n')

WGBS / RRBS

五、Bisulfite-seq (WGBS / RRBS) 工具

工具	用途
`methylKit`	從 BAM/cov 計算與 DMC 偵測
`bsseq`	BSseq + smoothing
`DSS`	貝氏 DMR
`methrix`	超大型 WGBS

# library(methylKit)
# files <- list('s1' = 'data/s1.cov', 's2' = 'data/s2.cov')
# myobj <- methRead(files, sample.id = list('s1','s2'),
#                   assembly = 'hg38', treatment = c(0,1),
#                   context = 'CpG', mincov = 4)
# myobj_norm <- normalizeCoverage(myobj)
# meth <- unite(myobj_norm)
# diff <- calculateDiffMeth(meth)
# diff_25 <- getMethylDiff(diff, difference = 25, qvalue = 0.01)

cat('WGBS demo — install methylKit / bsseq for real WGBS analysis.\n')

表觀年齡

六、表觀遺傳年齡 (Epigenetic clock)

從 DNA 甲基化資料估計「生物年齡」──Horvath (2013) 用 353 個 CpG 預測年齡，誤差 < 4 歲。後續 Hannum、PhenoAge、GrimAge 增加疾病/壽命相關 CpG。生物年齡與實際年齡的差距 (age acceleration) 與多種疾病風險相關。

Estimate "biological age" from DNA methylation. Horvath (2013) used 353 CpGs with error < 4 years. Hannum, PhenoAge, GrimAge add disease/lifespan signatures. The gap between biological and chronological age (age acceleration) correlates with many disease risks.

# 線上工具：https://dnamage.genetics.ucla.edu (Horvath)

# R 套件 / R packages
# install.packages('methylclock')
# library(methylclock)
# ages <- DNAmAge(beta_matrix, clocks = c('Horvath','Hannum','PhenoAge','GrimAge'))
# head(ages)

# 自己對照 phenotype 看 age acceleration
# pheno$age_accel <- residuals(lm(ages$Horvath ~ pheno$Age))
# 然後與疾病/處死率做 Cox 等迴歸

cat('Epigenetic age demo — install methylclock for clock estimation.\n')

📝 自我檢測

1. 為什麼建議分析用 M value、視覺化用 β value？

1. Why use M for stats and β for plotting?

A. M 比較好看A. M looks prettier

B. β 算起來比較快B. β is faster to compute

C. M 在 0/1 兩端較對稱、變異數齊一，符合 t-test 假設；β 直觀，適合視覺化C. M is symmetric near 0/1 and homoscedastic (better for t-test); β is intuitive for plots

D. 兩者無差別D. No real difference

2. DMP 與 DMR 的關係？

2. Relationship between DMP and DMR?

A. 完全等價A. Identical

B. DMP 看區域、DMR 看單點B. DMP = region, DMR = point

C. DMP 看單點 CpG、DMR 看相鄰 CpG 共同變化的區域，後者通常更具生物學意義C. DMP = single CpG, DMR = co-varying neighbors — DMR is usually more biologically meaningful

D. DMP 是新版、DMR 是舊版D. DMP is new; DMR is old

3. 想分析 EPIC array 但只想用一條 pipeline 包辦所有步驟，最方便的選擇？

3. To analyze an EPIC array end-to-end with one pipeline — best choice?

A. ChAMPA. ChAMP

B. DESeq2B. DESeq2

C. vcfRC. vcfR

D. tidyverseD. tidyverse