一、ORA vs GSEA──兩種富集思路
| — | ORA (Over-Representation) | GSEA (Gene Set Enrichment) |
|---|---|---|
| 輸入 | 顯著基因 list(binary) | 所有基因的排序統計量(連續) |
| 原理 | 超幾何 / Fisher:pathway 在 list 中是否過度出現 | KS:pathway 基因是否集中在排序兩端 |
| 優點 | 簡單、解讀快 | 不需切閾值;抓微小一致訊號 |
| 缺點 | 閾值敏感 | 較慢;ranking metric 影響大 |
| 適用 | DEG 多 (50–500) | 微弱訊號或全基因組分析 |
二、clusterProfiler:GO / KEGG ORA
# BiocManager::install(c('clusterProfiler','org.Hs.eg.db','enrichplot'))
library(clusterProfiler); library(org.Hs.eg.db); library(enrichplot)
# 1) 準備 / Prepare
sig_symbols <- c("TP53","BRCA1","EGFR","MYC","CDKN2A","BCL2","CCND1",
"ALK","RB1","NF1","PTEN","KRAS","ATM","CHEK2",
"MDM2","CDK4","CDK6","E2F1","RBL1","SMAD4")
universe <- keys(org.Hs.eg.db, keytype = "SYMBOL") # 全 background
# 2) GO biological process 富集 / GO BP enrichment
ego <- enrichGO(gene = sig_symbols,
universe = universe,
OrgDb = org.Hs.eg.db,
keyType = "SYMBOL",
ont = "BP", # BP / MF / CC / ALL
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.10,
readable = TRUE)
head(ego)
as.data.frame(ego)[1:5, c("Description","GeneRatio","BgRatio","p.adjust","Count")]
# 3) KEGG pathway 富集 (需 ENTREZ ID)
entrez <- mapIds(org.Hs.eg.db, sig_symbols, "ENTREZID", "SYMBOL", multiVals = "first")
ekegg <- enrichKEGG(gene = na.omit(entrez), organism = "hsa",
pvalueCutoff = 0.05)
head(ekegg)
# 4) 視覺化 / Visualize (各種圖一鍵產出!)
dotplot(ego, showCategory = 15) # 點圖:x=GeneRatio, color=padj, size=count
barplot(ego, showCategory = 15)
emapplot(pairwise_termsim(ego)) # 富集 term 之間相似度網路
cnetplot(ego, foldChange = NULL, showCategory = 5) # term ↔ gene 網路
treeplot(pairwise_termsim(ego)) # 階層樹
# 存 / Save
# write.csv(as.data.frame(ego), 'results/ego_BP.csv', row.names = FALSE)
三、clusterProfiler:GSEA
# GSEA 需要「所有基因」與其「排序統計量」(如 log2FC、t-statistic)
# GSEA needs ALL genes with a ranking metric (log2FC, t-stat...)
# 1) 從 DESeq2 結果建立 ranked list
# library(clusterProfiler); library(org.Hs.eg.db); library(DESeq2)
# res <- results(dds)
# gene_list <- res$log2FoldChange
# names(gene_list) <- mapIds(org.Hs.eg.db, rownames(res), 'ENTREZID', 'ENSEMBL')
# gene_list <- sort(gene_list[!is.na(names(gene_list))], decreasing = TRUE)
# 模擬 ranked list / Simulated
set.seed(42)
gene_list <- sort(rnorm(500, sd = 1.5), decreasing = TRUE)
names(gene_list) <- as.character(sample(1000:9999, 500))
# 2) GSEA on GO BP
# gsea_go <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db,
# keyType = 'ENTREZID', ont = 'BP',
# pvalueCutoff = 0.10)
# head(gsea_go)
# 3) GSEA on KEGG
# gsea_kg <- gseKEGG(geneList = gene_list, organism = 'hsa',
# pvalueCutoff = 0.10)
# 4) GSEA on MSigDB hallmark gene sets (需先下載 .gmt)
# library(msigdbr)
# h <- msigdbr(species = 'Homo sapiens', category = 'H') # hallmark
# h_t2g <- h[, c('gs_name','entrez_gene')]
# gsea_h <- GSEA(geneList = gene_list, TERM2GENE = h_t2g, pvalueCutoff = 0.10)
# 5) 視覺化
# gseaplot2(gsea_go, geneSetID = 1) # 經典 enrichment 曲線
# ridgeplot(gsea_go) # ridge density per pathway
# dotplot(gsea_go, split = '.sign') + facet_grid(.~.sign) # 上調 vs 下調
cat('GSEA demo — install clusterProfiler / org.Hs.eg.db to actually run.\n')
log2FoldChange(簡單)或 sign(log2FC) × -log10(pvalue)(兼顧方向與顯著性,更穩)。務必去除 NA 並 sort(decreasing = TRUE)。
Pick the ranking metric: commonly log2FoldChange (simple) or sign(log2FC) × -log10(pvalue) (combines direction + significance). Must drop NAs and sort(decreasing = TRUE).
四、msigdbr 與自訂 gene set
MSigDB(Broad Institute 的標準 gene set 資料庫)有 50,000+ 集合,分八大類:
- H Hallmark — 50 個精挑代表生物程序(推薦首選)
- C1 染色體位置
- C2 Curated(含 KEGG、Reactome、BioCarta、化學/基因擾動)
- C3 Motif(TF / miRNA target)
- C5 GO terms
- C6 Oncogenic signatures
- C7 Immunologic signatures
- C8 Cell type signatures
MSigDB (Broad Institute) hosts 50,000+ gene sets in 8 collections:
- H Hallmark — 50 curated processes (start here!)
- C1 Positional
- C2 Curated (KEGG, Reactome, BioCarta, chemical/genetic perturbations)
- C3 Motif (TF / miRNA targets)
- C5 GO
- C6 Oncogenic signatures
- C7 Immunologic signatures
- C8 Cell-type signatures
# install.packages('msigdbr')
# library(msigdbr); library(clusterProfiler)
# msigdbr_show_species() # 支援的物種
# h <- msigdbr(species = 'Homo sapiens', category = 'H')
# c2_kegg <- msigdbr(species = 'Homo sapiens', category = 'C2', subcategory = 'CP:KEGG')
# 把 msigdbr 轉成 clusterProfiler 用的 TERM2GENE 表
# t2g <- h[, c('gs_name','gene_symbol')]
# enricher(sig_symbols, TERM2GENE = t2g)
# GSEA(gene_list, TERM2GENE = t2g)
# 自訂 gene set:自製 .gmt 或 data.frame
# my_sets <- data.frame(
# term = c('PathA','PathA','PathB','PathB'),
# gene = c('TP53','BRCA1','MYC','EGFR')
# )
# enricher(sig_symbols, TERM2GENE = my_sets)
cat('Demo — install msigdbr + clusterProfiler to run.\n')
五、其他富集分析工具
| 工具 | 特色 |
|---|---|
topGO | GO 專用,考慮 DAG 結構 |
ReactomePA | Reactome 專用 |
fgsea | 極快 GSEA |
WebGestaltR | WebGestalt R 版 |
enrichR | EnrichR API |
SPIA | 考慮 pathway 結構 |
GOplot | GO 視覺化 |
六、ORA cutoff 敏感度
調整 padj 與 |log2FC| 閾值看「顯著基因 list」如何變化──富集結果的可重現性其實高度依賴閾值。
Adjust padj / |log2FC| thresholds and watch the "significant gene list" change — enrichment reproducibility hinges on cutoff choice.
模擬:500 個基因,padj ~ Beta、log2FC ~ Normal。每組「pathway」隨機抽 30 個基因。
Simulated: 500 genes, padj ~ Beta, log2FC ~ Normal. Each "pathway" samples 30 random genes.
📝 自我檢測
1. 下列何種情境最適合 GSEA 而非 ORA?
1. Which scenario fits GSEA better than ORA?
2. enrichGO 的 universe 該設成什麼?
2. What should you pass as universe to enrichGO?
3. 想做 GSEA on Hallmark gene sets,最簡單的取得 gene sets 套件?
3. To run GSEA on Hallmark gene sets — easiest source?