Step 11: Enrichment — R Bioinformatics Tutorial

兩大流派

一、ORA vs GSEA──兩種富集思路

—	ORA (Over-Representation)	GSEA (Gene Set Enrichment)
輸入	顯著基因 list（binary）	所有基因的排序統計量（連續）
原理	超幾何 / Fisher：pathway 在 list 中是否過度出現	KS：pathway 基因是否集中在排序兩端
優點	簡單、解讀快	不需切閾值；抓微小一致訊號
缺點	閾值敏感	較慢；ranking metric 影響大
適用	DEG 多 (50–500)	微弱訊號或全基因組分析

💡

實務建議：兩個都跑！ORA 給你「明顯命中」的 pathway，GSEA 補強「整體趨勢」。論文常常兩個都報。 Practical tip: run both! ORA shows clear hits, GSEA adds coordinated trends. Papers commonly report both.

ORA 範例

二、clusterProfiler：GO / KEGG ORA

# BiocManager::install(c('clusterProfiler','org.Hs.eg.db','enrichplot'))
library(clusterProfiler); library(org.Hs.eg.db); library(enrichplot)

# 1) 準備 / Prepare
sig_symbols <- c("TP53","BRCA1","EGFR","MYC","CDKN2A","BCL2","CCND1",
                 "ALK","RB1","NF1","PTEN","KRAS","ATM","CHEK2",
                 "MDM2","CDK4","CDK6","E2F1","RBL1","SMAD4")
universe    <- keys(org.Hs.eg.db, keytype = "SYMBOL")  # 全 background

# 2) GO biological process 富集 / GO BP enrichment
ego <- enrichGO(gene          = sig_symbols,
                universe      = universe,
                OrgDb         = org.Hs.eg.db,
                keyType       = "SYMBOL",
                ont           = "BP",          # BP / MF / CC / ALL
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.10,
                readable      = TRUE)
head(ego)
as.data.frame(ego)[1:5, c("Description","GeneRatio","BgRatio","p.adjust","Count")]

# 3) KEGG pathway 富集 (需 ENTREZ ID)
entrez <- mapIds(org.Hs.eg.db, sig_symbols, "ENTREZID", "SYMBOL", multiVals = "first")
ekegg <- enrichKEGG(gene = na.omit(entrez), organism = "hsa",
                    pvalueCutoff = 0.05)
head(ekegg)

# 4) 視覺化 / Visualize (各種圖一鍵產出！)
dotplot(ego, showCategory = 15)         # 點圖：x=GeneRatio, color=padj, size=count
barplot(ego, showCategory = 15)
emapplot(pairwise_termsim(ego))         # 富集 term 之間相似度網路
cnetplot(ego, foldChange = NULL, showCategory = 5)   # term ↔ gene 網路
treeplot(pairwise_termsim(ego))         # 階層樹

# 存 / Save
# write.csv(as.data.frame(ego), 'results/ego_BP.csv', row.names = FALSE)

⚠️

universe 很重要！ 預設 universe 是「整個基因組」，但若你只測了 20,000 個基因（其他被過濾），universe 應該是這 20,000 個。錯誤的 universe 會大幅扭曲 p-value。 Universe matters! The default is "the whole genome", but if you only measured 20,000 genes (others filtered out), universe should be those 20,000. Wrong universe heavily skews p-values.

GSEA 範例

三、clusterProfiler：GSEA

# GSEA 需要「所有基因」與其「排序統計量」（如 log2FC、t-statistic）
# GSEA needs ALL genes with a ranking metric (log2FC, t-stat...)

# 1) 從 DESeq2 結果建立 ranked list
# library(clusterProfiler); library(org.Hs.eg.db); library(DESeq2)
# res <- results(dds)
# gene_list <- res$log2FoldChange
# names(gene_list) <- mapIds(org.Hs.eg.db, rownames(res), 'ENTREZID', 'ENSEMBL')
# gene_list <- sort(gene_list[!is.na(names(gene_list))], decreasing = TRUE)

# 模擬 ranked list / Simulated
set.seed(42)
gene_list <- sort(rnorm(500, sd = 1.5), decreasing = TRUE)
names(gene_list) <- as.character(sample(1000:9999, 500))

# 2) GSEA on GO BP
# gsea_go <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db,
#                  keyType = 'ENTREZID', ont = 'BP',
#                  pvalueCutoff = 0.10)
# head(gsea_go)

# 3) GSEA on KEGG
# gsea_kg <- gseKEGG(geneList = gene_list, organism = 'hsa',
#                    pvalueCutoff = 0.10)

# 4) GSEA on MSigDB hallmark gene sets (需先下載 .gmt)
# library(msigdbr)
# h <- msigdbr(species = 'Homo sapiens', category = 'H')   # hallmark
# h_t2g <- h[, c('gs_name','entrez_gene')]
# gsea_h <- GSEA(geneList = gene_list, TERM2GENE = h_t2g, pvalueCutoff = 0.10)

# 5) 視覺化
# gseaplot2(gsea_go, geneSetID = 1)               # 經典 enrichment 曲線
# ridgeplot(gsea_go)                              # ridge density per pathway
# dotplot(gsea_go, split = '.sign') + facet_grid(.~.sign)   # 上調 vs 下調

cat('GSEA demo — install clusterProfiler / org.Hs.eg.db to actually run.\n')

💡

選 ranking metric：常用 log2FoldChange（簡單）或 sign(log2FC) × -log10(pvalue)（兼顧方向與顯著性，更穩）。務必去除 NA 並 sort(decreasing = TRUE)。 Pick the ranking metric: commonly log2FoldChange (simple) or sign(log2FC) × -log10(pvalue) (combines direction + significance). Must drop NAs and sort(decreasing = TRUE).

自訂 gene set

四、msigdbr 與自訂 gene set

MSigDB（Broad Institute 的標準 gene set 資料庫）有 50,000+ 集合，分八大類：

H Hallmark — 50 個精挑代表生物程序（推薦首選）
C1 染色體位置
C2 Curated（含 KEGG、Reactome、BioCarta、化學/基因擾動）
C3 Motif（TF / miRNA target）
C5 GO terms
C6 Oncogenic signatures
C7 Immunologic signatures
C8 Cell type signatures

MSigDB (Broad Institute) hosts 50,000+ gene sets in 8 collections:

H Hallmark — 50 curated processes (start here!)
C1 Positional
C2 Curated (KEGG, Reactome, BioCarta, chemical/genetic perturbations)
C3 Motif (TF / miRNA targets)
C5 GO
C6 Oncogenic signatures
C7 Immunologic signatures
C8 Cell-type signatures

# install.packages('msigdbr')
# library(msigdbr); library(clusterProfiler)

# msigdbr_show_species()                # 支援的物種
# h <- msigdbr(species = 'Homo sapiens', category = 'H')
# c2_kegg <- msigdbr(species = 'Homo sapiens', category = 'C2', subcategory = 'CP:KEGG')

# 把 msigdbr 轉成 clusterProfiler 用的 TERM2GENE 表
# t2g <- h[, c('gs_name','gene_symbol')]
# enricher(sig_symbols, TERM2GENE = t2g)
# GSEA(gene_list, TERM2GENE = t2g)

# 自訂 gene set：自製 .gmt 或 data.frame
# my_sets <- data.frame(
#   term  = c('PathA','PathA','PathB','PathB'),
#   gene  = c('TP53','BRCA1','MYC','EGFR')
# )
# enricher(sig_symbols, TERM2GENE = my_sets)

cat('Demo — install msigdbr + clusterProfiler to run.\n')

其他工具

五、其他富集分析工具

工具	特色
`topGO`	GO 專用，考慮 DAG 結構
`ReactomePA`	Reactome 專用
`fgsea`	極快 GSEA
`WebGestaltR`	WebGestalt R 版
`enrichR`	EnrichR API
`SPIA`	考慮 pathway 結構
`GOplot`	GO 視覺化

互動

六、ORA cutoff 敏感度

調整 padj 與 |log2FC| 閾值看「顯著基因 list」如何變化──富集結果的可重現性其實高度依賴閾值。

Adjust padj / |log2FC| thresholds and watch the "significant gene list" change — enrichment reproducibility hinges on cutoff choice.

padj 閾值 0.05

|log2FC| 閾值 1

— —

模擬：500 個基因，padj ~ Beta、log2FC ~ Normal。每組「pathway」隨機抽 30 個基因。

Simulated: 500 genes, padj ~ Beta, log2FC ~ Normal. Each "pathway" samples 30 random genes.

📝 自我檢測

1. 下列何種情境最適合 GSEA 而非 ORA？

1. Which scenario fits GSEA better than ORA?

A. 有 500 個 padj < 0.05 的 DEGA. 500 DEGs with padj < 0.05

B. 想直接看「顯著」哪些 pathway 命中B. Want to see which pathways are clearly hit

C. DEG 數量很少但懷疑某些 pathway 有「整體微小但一致」的趨勢C. Few DEGs but suspect coordinated subtle changes in pathways

D. 想跑得很快D. Want speed

2. enrichGO 的 universe 該設成什麼？

2. What should you pass as universe to enrichGO?

A. 永遠用預設（整個基因組）A. Always use default (whole genome)

B. 你實際測到（過濾後仍保留）的所有基因B. All genes you actually measured (after filter)

C. 顯著 DEG 自己C. The significant DEGs themselves

D. 不重要，怎麼設都行D. Doesn't matter

3. 想做 GSEA on Hallmark gene sets，最簡單的取得 gene sets 套件？

3. To run GSEA on Hallmark gene sets — easiest source?

A. 自己手動把 50 個 hallmark 抄下來A. Hand-copy all 50 hallmarks

B. org.Hs.eg.dbB. org.Hs.eg.db

C. msigdbr::msigdbr(species='Homo sapiens', category='H')C. msigdbr::msigdbr(species='Homo sapiens', category='H')

D. ggplot2D. ggplot2