STEP 11 / 16

功能富集分析

GO、KEGG、Reactome、GSEA──從一堆 DEG 看出生物學故事的工具。

GO, KEGG, Reactome, GSEA — turn a list of DEGs into biology.

一、ORA vs GSEA──兩種富集思路

ORA (Over-Representation)GSEA (Gene Set Enrichment)
輸入顯著基因 list(binary)所有基因的排序統計量(連續)
原理超幾何 / Fisher:pathway 在 list 中是否過度出現KS:pathway 基因是否集中在排序兩端
優點簡單、解讀快不需切閾值;抓微小一致訊號
缺點閾值敏感較慢;ranking metric 影響大
適用DEG 多 (50–500)微弱訊號或全基因組分析
💡
實務建議:兩個都跑!ORA 給你「明顯命中」的 pathway,GSEA 補強「整體趨勢」。論文常常兩個都報。 Practical tip: run both! ORA shows clear hits, GSEA adds coordinated trends. Papers commonly report both.

二、clusterProfiler:GO / KEGG ORA

# BiocManager::install(c('clusterProfiler','org.Hs.eg.db','enrichplot'))
library(clusterProfiler); library(org.Hs.eg.db); library(enrichplot)

# 1) 準備 / Prepare
sig_symbols <- c("TP53","BRCA1","EGFR","MYC","CDKN2A","BCL2","CCND1",
                 "ALK","RB1","NF1","PTEN","KRAS","ATM","CHEK2",
                 "MDM2","CDK4","CDK6","E2F1","RBL1","SMAD4")
universe    <- keys(org.Hs.eg.db, keytype = "SYMBOL")  # 全 background

# 2) GO biological process 富集 / GO BP enrichment
ego <- enrichGO(gene          = sig_symbols,
                universe      = universe,
                OrgDb         = org.Hs.eg.db,
                keyType       = "SYMBOL",
                ont           = "BP",          # BP / MF / CC / ALL
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.10,
                readable      = TRUE)
head(ego)
as.data.frame(ego)[1:5, c("Description","GeneRatio","BgRatio","p.adjust","Count")]

# 3) KEGG pathway 富集 (需 ENTREZ ID)
entrez <- mapIds(org.Hs.eg.db, sig_symbols, "ENTREZID", "SYMBOL", multiVals = "first")
ekegg <- enrichKEGG(gene = na.omit(entrez), organism = "hsa",
                    pvalueCutoff = 0.05)
head(ekegg)

# 4) 視覺化 / Visualize (各種圖一鍵產出!)
dotplot(ego, showCategory = 15)         # 點圖:x=GeneRatio, color=padj, size=count
barplot(ego, showCategory = 15)
emapplot(pairwise_termsim(ego))         # 富集 term 之間相似度網路
cnetplot(ego, foldChange = NULL, showCategory = 5)   # term ↔ gene 網路
treeplot(pairwise_termsim(ego))         # 階層樹

# 存 / Save
# write.csv(as.data.frame(ego), 'results/ego_BP.csv', row.names = FALSE)
⚠️
universe 很重要! 預設 universe 是「整個基因組」,但若你只測了 20,000 個基因(其他被過濾),universe 應該是這 20,000 個。錯誤的 universe 會大幅扭曲 p-value。 Universe matters! The default is "the whole genome", but if you only measured 20,000 genes (others filtered out), universe should be those 20,000. Wrong universe heavily skews p-values.

三、clusterProfiler:GSEA

# GSEA 需要「所有基因」與其「排序統計量」(如 log2FC、t-statistic)
# GSEA needs ALL genes with a ranking metric (log2FC, t-stat...)

# 1) 從 DESeq2 結果建立 ranked list
# library(clusterProfiler); library(org.Hs.eg.db); library(DESeq2)
# res <- results(dds)
# gene_list <- res$log2FoldChange
# names(gene_list) <- mapIds(org.Hs.eg.db, rownames(res), 'ENTREZID', 'ENSEMBL')
# gene_list <- sort(gene_list[!is.na(names(gene_list))], decreasing = TRUE)

# 模擬 ranked list / Simulated
set.seed(42)
gene_list <- sort(rnorm(500, sd = 1.5), decreasing = TRUE)
names(gene_list) <- as.character(sample(1000:9999, 500))

# 2) GSEA on GO BP
# gsea_go <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db,
#                  keyType = 'ENTREZID', ont = 'BP',
#                  pvalueCutoff = 0.10)
# head(gsea_go)

# 3) GSEA on KEGG
# gsea_kg <- gseKEGG(geneList = gene_list, organism = 'hsa',
#                    pvalueCutoff = 0.10)

# 4) GSEA on MSigDB hallmark gene sets (需先下載 .gmt)
# library(msigdbr)
# h <- msigdbr(species = 'Homo sapiens', category = 'H')   # hallmark
# h_t2g <- h[, c('gs_name','entrez_gene')]
# gsea_h <- GSEA(geneList = gene_list, TERM2GENE = h_t2g, pvalueCutoff = 0.10)

# 5) 視覺化
# gseaplot2(gsea_go, geneSetID = 1)               # 經典 enrichment 曲線
# ridgeplot(gsea_go)                              # ridge density per pathway
# dotplot(gsea_go, split = '.sign') + facet_grid(.~.sign)   # 上調 vs 下調

cat('GSEA demo — install clusterProfiler / org.Hs.eg.db to actually run.\n')
💡
選 ranking metric:常用 log2FoldChange(簡單)或 sign(log2FC) × -log10(pvalue)(兼顧方向與顯著性,更穩)。務必去除 NAsort(decreasing = TRUE) Pick the ranking metric: commonly log2FoldChange (simple) or sign(log2FC) × -log10(pvalue) (combines direction + significance). Must drop NAs and sort(decreasing = TRUE).

四、msigdbr 與自訂 gene set

MSigDB(Broad Institute 的標準 gene set 資料庫)有 50,000+ 集合,分八大類:

  • H Hallmark — 50 個精挑代表生物程序(推薦首選)
  • C1 染色體位置
  • C2 Curated(含 KEGG、Reactome、BioCarta、化學/基因擾動)
  • C3 Motif(TF / miRNA target)
  • C5 GO terms
  • C6 Oncogenic signatures
  • C7 Immunologic signatures
  • C8 Cell type signatures

MSigDB (Broad Institute) hosts 50,000+ gene sets in 8 collections:

  • H Hallmark — 50 curated processes (start here!)
  • C1 Positional
  • C2 Curated (KEGG, Reactome, BioCarta, chemical/genetic perturbations)
  • C3 Motif (TF / miRNA targets)
  • C5 GO
  • C6 Oncogenic signatures
  • C7 Immunologic signatures
  • C8 Cell-type signatures
# install.packages('msigdbr')
# library(msigdbr); library(clusterProfiler)

# msigdbr_show_species()                # 支援的物種
# h <- msigdbr(species = 'Homo sapiens', category = 'H')
# c2_kegg <- msigdbr(species = 'Homo sapiens', category = 'C2', subcategory = 'CP:KEGG')

# 把 msigdbr 轉成 clusterProfiler 用的 TERM2GENE 表
# t2g <- h[, c('gs_name','gene_symbol')]
# enricher(sig_symbols, TERM2GENE = t2g)
# GSEA(gene_list, TERM2GENE = t2g)

# 自訂 gene set:自製 .gmt 或 data.frame
# my_sets <- data.frame(
#   term  = c('PathA','PathA','PathB','PathB'),
#   gene  = c('TP53','BRCA1','MYC','EGFR')
# )
# enricher(sig_symbols, TERM2GENE = my_sets)

cat('Demo — install msigdbr + clusterProfiler to run.\n')

五、其他富集分析工具

工具特色
topGOGO 專用,考慮 DAG 結構
ReactomePAReactome 專用
fgsea極快 GSEA
WebGestaltRWebGestalt R 版
enrichREnrichR API
SPIA考慮 pathway 結構
GOplotGO 視覺化

六、ORA cutoff 敏感度

調整 padj 與 |log2FC| 閾值看「顯著基因 list」如何變化──富集結果的可重現性其實高度依賴閾值。

Adjust padj / |log2FC| thresholds and watch the "significant gene list" change — enrichment reproducibility hinges on cutoff choice.

模擬:500 個基因,padj ~ Beta、log2FC ~ Normal。每組「pathway」隨機抽 30 個基因。

Simulated: 500 genes, padj ~ Beta, log2FC ~ Normal. Each "pathway" samples 30 random genes.

📝 自我檢測

1. 下列何種情境最適合 GSEA 而非 ORA?

1. Which scenario fits GSEA better than ORA?

A. 有 500 個 padj < 0.05 的 DEGA. 500 DEGs with padj < 0.05
B. 想直接看「顯著」哪些 pathway 命中B. Want to see which pathways are clearly hit
C. DEG 數量很少但懷疑某些 pathway 有「整體微小但一致」的趨勢C. Few DEGs but suspect coordinated subtle changes in pathways
D. 想跑得很快D. Want speed

2. enrichGO 的 universe 該設成什麼?

2. What should you pass as universe to enrichGO?

A. 永遠用預設(整個基因組)A. Always use default (whole genome)
B. 你實際測到(過濾後仍保留)的所有基因B. All genes you actually measured (after filter)
C. 顯著 DEG 自己C. The significant DEGs themselves
D. 不重要,怎麼設都行D. Doesn't matter

3. 想做 GSEA on Hallmark gene sets,最簡單的取得 gene sets 套件?

3. To run GSEA on Hallmark gene sets — easiest source?

A. 自己手動把 50 個 hallmark 抄下來A. Hand-copy all 50 hallmarks
B. org.Hs.eg.dbB. org.Hs.eg.db
C. msigdbr::msigdbr(species='Homo sapiens', category='H')C. msigdbr::msigdbr(species='Homo sapiens', category='H')
D. ggplot2D. ggplot2