STEP 10 / 16

Bulk RNA-seq 分析

DESeq2 / edgeR / limma-voom──從 count matrix 到 DEGs 的完整工作流。

DESeq2 / edgeR / limma-voom — counts to DEGs, the full workflow.

一、Bulk RNA-seq 分析全流程

FASTQ → BAM

對 reads 比對到參考基因組(HISAT2、STAR、Salmon)。本章不討論這步──通常在 Linux/HPC 完成,輸出 BAM 或 transcript abundance。

Align reads to reference (HISAT2, STAR, Salmon). Not covered here — done on Linux/HPC, outputs BAM or abundance.

BAM → counts matrix

Rsubread::featureCounts()tximport 把 BAM/Salmon 結果聚合成「基因 × 樣本」整數矩陣。

Aggregate BAM/Salmon to gene × sample integer matrix via Rsubread::featureCounts() or tximport.

QC + filter

移除低表現基因(如 rowSums(counts) < 10);PCA 看樣本群聚是否合理。

Filter low-expression genes (e.g. rowSums(counts) < 10); PCA to sanity-check sample groupings.

Normalization + DGE

DESeq2 / edgeR / limma-voom 三選一。本章主推 DESeq2。

Pick DESeq2 / edgeR / limma-voom. We focus on DESeq2.

視覺化

Volcano、MA plot、heatmap of top DEGs(見第 12 章)。

Volcano, MA plot, heatmap of top DEGs (see Step 12).

功能富集

用 clusterProfiler 做 GO / KEGG / GSEA(見第 11 章)。

GO / KEGG / GSEA via clusterProfiler (see Step 11).

二、DESeq2 vs edgeR vs limma-voom

DESeq2edgeRlimma-voom
模型Negative BinomialNegative BinomialLinear (after voom)
輸入Raw integer countsRaw integer countsRaw integer counts
標準化Median-of-ratiosTMMTMM + voom (mean-variance)
速度中等
樣本數建議3-1003-100100+ 大 cohort
特色自動 outlier;shrinkage LFCcomplex design 穩大資料快
💡
選哪個? 樣本 ≤ 50 → DESeq2(最普及,文件最詳);> 100 大 cohort → limma-voom(速度優勢)。三者結果通常 80–90% 重疊;遇到爭議結果可以三個都跑、取交集。 Which? n ≤ 50 → DESeq2 (most popular, best docs); n > 100 → limma-voom (much faster). Results overlap 80–90%; for borderline genes run all three and intersect.

三、DESeq2 完整範例

# BiocManager::install(c('DESeq2','airway'))
library(DESeq2); library(airway)
data(airway)            # 內建 SummarizedExperiment 範例
airway$dex <- relevel(airway$dex, ref = 'untrt')   # 設定 reference

## 1) 建立 DESeqDataSet
dds <- DESeqDataSet(airway, design = ~ cell + dex)   # cell 是批次校正
dds

## 2) 過濾低表現基因 (建議:至少 10 個樣本中有 >= 10 reads)
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]
nrow(dds)

## 3) 執行整套分析(一行搞定!)
dds <- DESeq(dds)        # 估計 size factors → dispersion → GLM fit → Wald

## 4) 取出結果
res <- results(dds, name = 'dex_trt_vs_untrt')   # 或 contrast = c('dex','trt','untrt')
summary(res)
head(res)

## 5) shrunken log2FC (推薦寫進論文)
res_shrunk <- lfcShrink(dds, coef = 'dex_trt_vs_untrt', type = 'apeglm')
plotMA(res_shrunk, ylim = c(-5, 5))

## 6) 排序、篩選 / Sort & filter
res_df <- as.data.frame(res_shrunk)
res_df$gene <- rownames(res_df)
sig <- subset(res_df, padj < 0.05 & abs(log2FoldChange) > 1)
nrow(sig); head(sig[order(sig$padj), ])

## 7) 視覺化用的 transformed counts(log-like,已穩定變異)
vsd <- vst(dds, blind = FALSE)        # 推薦
# rld <- rlog(dds, blind = FALSE)     # 老方法,較慢

## 8) PCA
plotPCA(vsd, intgroup = c('dex','cell'))

## 9) 樣本 distance 熱圖
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
# pheatmap::pheatmap(sampleDistMatrix)

## 10) 存結果 / Save
# write.csv(res_df, 'results/dge_DESeq2.csv', row.names = FALSE)
# saveRDS(dds, 'results/dds.rds')

四、design formula 是設計的靈魂

design = ~ X + Y + Z 中,最後一項通常是「你想比較的條件」(DESeq2 預設用最後一項做 contrast)。前面項是要校正的共變項(如批次、性別)。

In design = ~ X + Y + Z, the last term is usually "the condition of interest" (DESeq2 contrasts on the last term by default). Earlier terms are covariates to adjust for (batch, sex...).

design意義
~ condition只有處理組差異
~ batch + condition校正批次後比較處理
~ sex + age + condition校正性別與年齡
~ patient + conditionpaired design:同病人不同時點
~ condition + condition:time條件 × 時間 交互作用
# 多 contrast 同時拿 / Multiple contrasts
res_AvB <- results(dds, contrast = c('condition', 'A', 'B'))   # A vs B
res_AvC <- results(dds, contrast = c('condition', 'A', 'C'))   # A vs C

# Likelihood Ratio Test (LRT):檢驗整個因子是否重要
# 適合「3+ 組」或「時序」資料
dds_lrt <- DESeq(dds, test = 'LRT', reduced = ~ batch)
res_lrt <- results(dds_lrt)

# 互動項 design / Interaction design
# design(dds) <- ~ time + condition + time:condition
# dds <- DESeq(dds)
# res_int <- results(dds, name = 'timeT2.conditionTrt')   # 看交互作用 p

五、edgeR 與 limma-voom 速覽

# BiocManager::install('edgeR')
library(edgeR)

y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y, method = "TMM")

design <- model.matrix(~ group)
y <- estimateDisp(y, design)

fit <- glmQLFit(y, design)               # quasi-likelihood (推薦)
qlf <- glmQLFTest(fit, coef = 2)
topTags(qlf, n = 10)
res <- topTags(qlf, n = Inf)$table
res$padj <- res$FDR              # edgeR 用 FDR 欄
# BiocManager::install('limma')
library(limma); library(edgeR)

y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

design <- model.matrix(~ group)
v <- voom(y, design, plot = TRUE)        # mean-variance 轉換

fit <- lmFit(v, design)
fit <- eBayes(fit)                          # empirical Bayes shrinkage
topTable(fit, coef = 2, number = 10)
res <- topTable(fit, coef = 2, number = Inf, sort.by = "P")
# 欄名:logFC, AveExpr, t, P.Value, adj.P.Val, B

七、為什麼用 Negative Binomial?

Poisson 假設「平均 = 變異」;但真實 RNA-seq counts 變異 >> 平均(過度離散 / overdispersion)。NB 加一個 dispersion φ 來描述這。下圖比較同樣平均下,調整 φ 觀察分佈如何變寬。

Poisson assumes "mean = variance"; real RNA-seq counts have variance ≫ mean (overdispersion). NB adds a dispersion φ. Below: same mean, vary φ to see how the distribution widens.

📝 自我檢測

1. DESeq2 的輸入應該是?

1. DESeq2 expects which input?

A. log2 TPMA. log2 TPM
B. Z-score 後的表現量B. Z-scored expression
C. 原始整數 counts(raw integer counts)C. Raw integer counts
D. FPKMD. FPKM

2. 想在比較處理 vs 對照時校正批次效應,design 應寫?

2. To compare treated vs control while adjusting for batch — design?

A. ~ conditionA. ~ condition
B. ~ condition + batchB. ~ condition + batch
C. ~ batch + condition(要比較的放最後)C. ~ batch + condition (variable of interest last)
D. ~ batch * conditionD. ~ batch * condition

3. 為什麼 DESeq2 用 Negative Binomial 而非 Poisson?

3. Why does DESeq2 use Negative Binomial instead of Poisson?

A. NB 比較好寫A. NB is easier to code
B. RNA-seq counts 過度離散,變異 >> 平均;NB 多一個 dispersion 參數來描述B. RNA-seq counts are overdispersed (variance ≫ mean); NB adds a dispersion parameter
C. Poisson 不能處理整數C. Poisson can't handle integers
D. NB 比較快D. NB is faster