Step 15: DE Statistics (limma / edgeR / DESeq2) — Statistical Inference Tutorial

概覽

DE 的核心問題

差異表現（Differential Expression, DE）的目標：在兩組（或多組）樣本之間，找出哪些基因的表現量有統計上顯著的差異。表面看是兩樣本 t 檢定的延伸，但 RNA-seq 資料有三個關鍵挑戰：

計數本質：reads 是非負整數、過度離散 → 不能用 Gaussian t 檢定，必須用 NB GLM。
基因數遠大於樣本數（m=20,000, n=6）→ 每基因的變異估計極不穩 → 必須借力 empirical Bayes 收縮。
大量多重檢定 → 必須 FDR 校正。

三大工具用三種策略解決這個問題：DESeq2 用 NB GLM + Wald / LRT + dispersion-trend 收縮；edgeR 用 NB GLM + quasi-likelihood F 檢定 + tag-wise dispersion 收縮；limma-voom 用對數轉換 + precision weight + 線性模型 + moderated t。三者數學基礎略異，但在大多數情境下結果高度一致。

Differential Expression (DE) asks: between two (or more) conditions, which genes change significantly? It looks like an extended t-test, but RNA-seq data has three critical wrinkles:

Count nature: reads are non-negative integers and overdispersed → Gaussian t-tests are inappropriate, NB GLM is needed.
m ≫ n (20,000 genes vs 6 samples) → per-gene variance is wildly unstable → empirical Bayes shrinkage is essential.
Massive multiple testing → FDR correction is mandatory.

The three workhorse tools handle these differently: DESeq2 uses NB GLM + Wald / LRT + dispersion-trend shrinkage; edgeR uses NB GLM + quasi-likelihood F-test + tag-wise dispersion shrinkage; limma-voom log-transforms, applies precision weights, fits a linear model, and computes a moderated t. The maths differ, but for most analyses they agree closely.

💡

選擇原則：大型樣本 (n > 12 / 組) — limma-voom 速度快、診斷工具最齊全；小型 (n=3–6 / 組) — DESeq2 / edgeR 的變異收縮更穩；scRNA-seq 偽-bulk — DESeq2 / edgeR；複雜設計（時間序列、互動） — limma + duplicateCorrelation 最靈活。 How to choose: large samples (n > 12 / group) → limma-voom is fast and has the richest diagnostics; small samples (n=3–6 / group) → DESeq2 / edgeR shrinkage is more reliable; pseudo-bulk scRNA-seq → DESeq2 / edgeR; complex designs (time-course, interactions) → limma with duplicateCorrelation is most flexible.

核心概念

一、三大工具的數學引擎

🧬

DESeq2

NB GLM：log μ = X β + log(s)，s 為 size factor。
Dispersion-mean trend：估計 trend → MLE → 朝 trend 收縮（empirical Bayes）。
Wald 檢定 (預設) 或 LRT（兩模型比較）。
Cook's distance 自動偵測極端 outlier 基因。
lfcShrink：apeglm / ashr / normal——讓低 count 基因的 LFC 不再亂跳。

NB GLM: log μ = X β + log(s), s = size factor.
Dispersion-mean trend: fit trend → MLE → shrink to trend (empirical Bayes).
Wald (default) or LRT (model comparison).
Cook's distance flags outlier genes automatically.
lfcShrink with apeglm / ashr / normal calms low-count LFCs.

⚙️

edgeR

NB GLM，TMM normalization（trimmed-mean）。
Tag-wise dispersion + common / trended 三層收縮。
Quasi-likelihood F 檢定（glmQLFTest）——對小樣本提供額外 Type I 控制。
適合複雜設計矩陣與 contrasts。

NB GLM with TMM (trimmed mean of M-values) normalization.
Tag-wise dispersion + common / trended three-tier shrinkage.
Quasi-likelihood F-test (glmQLFTest) — better small-sample Type I control.
Handles complex design matrices and contrasts elegantly.

📈

limma-voom

對 log2(CPM+0.5) 跑線性模型——重用 limma 的 Gaussian 工具。
voom：用 mean-variance trend 計算 precision weights。
moderated t：variance shrinkage → 借力其他基因 → 更穩。
速度最快，適合 n > 12 或時間序列／random effect 設計。

Linear model on log2(CPM+0.5) — reuses limma's Gaussian toolkit.
voom: uses mean-variance trend to compute precision weights.
moderated t: variance shrinkage borrowing across genes → more stable.
Fastest; suits n > 12 or time-course / random-effect designs.

二、變異與 LFC 收縮

對 count = 2 vs 50 的基因，原始 log2 FC ≈ 4.6，但這個估計變異極大——很可能是抽樣噪音。原始 LFC 在低 count 區域呈「火山圖底部噴泉」狀，把基因按 |LFC| 排名會把噪音排到前面。

收縮把每個基因的 LFC 朝整體分布的中心（通常是 0）拉近，拉近的程度與該基因 LFC 的不確定度成正比：高 count、穩定的基因幾乎不被收縮；低 count、雜訊大的基因被強力收縮。結果——MA plot 變得乾淨、排名穩定、可重複性提升。

DESeq2 的 lfcShrink(coef=, type="apeglm") 是當前最常見的選擇；ashr 支援多 contrast；normal（舊版預設）通常過度收縮，現已不推薦。

For a gene with counts 2 vs 50 the raw log2 FC ≈ 4.6, but this estimate has enormous variance — likely sampling noise. Raw LFCs produce a "fountain at the bottom of the volcano plot"; ranking genes by |LFC| pushes noise to the top.

Shrinkage pulls each gene's LFC toward the centre of the overall distribution (usually 0), with strength proportional to that gene's uncertainty: high-count, stable genes barely move; low-count, noisy genes are heavily shrunk. The result — cleaner MA plots, stable rankings, reproducible top-hits.

DESeq2's lfcShrink(coef=, type="apeglm") is the current default of choice; ashr handles multi-contrast scenarios; normal (the legacy default) over-shrinks and is no longer recommended.

⚠️

批次效應陷阱：絕對不要先用 ComBat 移除批次，再對殘差做 DE——這會破壞 SE 計算、造成假陽性。正確作法是把批次當共變數放進設計矩陣 ~ batch + condition，讓 DESeq2 / limma 在估計效應時同時校正批次。

Batch-effect trap: do NOT first remove batch via ComBat and then run DE on residuals — this breaks SE computation and inflates false positives. The correct approach is to add batch as a covariate in the design matrix ~ batch + condition, letting DESeq2 / limma adjust during estimation.

三、獨立過濾與設計

獨立過濾：在做 FDR 之前移除「幾乎沒被表達」的基因（如全部樣本 count < 10）——這個過濾準則（mean count）與 H₀ 下的 p-value 獨立，所以不破壞 FDR，但會大幅增加 BH 的檢力（因為 m 變小）。DESeq2 預設啟用，results(dds) 會自動找最佳閾值。

設計矩陣是 DE 分析的靈魂。常見模式：~ condition（單因子）；~ batch + condition（含批次校正）；~ condition * time（互動）；~ subject + condition（配對設計，每位個體自己當對照）；~ batch + condition + limma::duplicateCorrelation（技術重複的 random effect）。第一個變數通常是 nuisance（先校正掉），最後一個是要檢定的主效應。

Independent filtering: remove very-low-expression genes (e.g. count < 10 across all samples) before FDR. The filter (mean count) is independent of the null p-value, so FDR remains valid, but m shrinks — boosting BH power. DESeq2 enables it by default; results(dds) auto-selects the optimal threshold.

The design matrix is the heart of DE analysis. Common patterns: ~ condition (single factor); ~ batch + condition (with batch adjustment); ~ condition * time (interaction); ~ subject + condition (paired); ~ batch + condition + limma::duplicateCorrelation (random effect for technical replicates). Leading terms are usually nuisance variables to adjust away; the last term is the main effect of interest.

互動模擬

火山圖：原始 vs 收縮 LFC

下方為合成 RNA-seq 資料（5,000 基因，每組 n=4）。切換到 shrunken LFC 觀察：低 count 區的「噴泉」被壓平，高信號高表達的真實 DE 基因仍清晰可見。增加每組 n 觀察整體火山圖如何收窄、p-value 普遍降低。

Below is synthetic RNA-seq data (5,000 genes, n=4 per group). Toggle to shrunken LFC: the low-count fountain collapses while high-signal, well-expressed true DE genes stay clearly visible. Increase n to watch the volcano tighten and p-values drop.

DE @ FDR 5%: — n — —

n / group 4

% true DE 10

使用收縮 LFC

X: log2 FC | Y: −log10 p | 紅點：FDR < 5%

程式碼

實作：三大標準流程

# --- R --- 三大 DE 工具的標準骨架
# counts: gene × sample 整數矩陣; col_data: 樣本後設資料 (含 condition, batch)

# ===== DESeq2 =====
library(DESeq2)
dds <- DESeqDataSetFromMatrix(counts, col_data, design = ~ batch + condition)
dds <- DESeq(dds)                                 # size factor + dispersion + GLM
res <- results(dds, contrast = c("condition","KO","WT"))
res_shrunk <- lfcShrink(dds, coef="condition_KO_vs_WT", type="apeglm")
summary(res_shrunk); plotMA(res_shrunk)

# ===== edgeR (NB QL F) =====
library(edgeR)
dge <- DGEList(counts, group = col_data$condition)
keep <- filterByExpr(dge, group = col_data$condition); dge <- dge[keep, ]
dge <- calcNormFactors(dge)                          # TMM
design <- model.matrix(~ batch + condition, col_data)
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = "conditionKO")
topTags(qlf, n = 20)

# ===== limma-voom =====
library(limma)
v   <- voom(dge, design, plot = TRUE)              # precision weights
fit <- lmFit(v, design)
fit <- eBayes(fit)                                 # moderated t
topTable(fit, coef = "conditionKO", adjust = "BH")

# --- Python --- pydeseq2 + rpy2 fallback
import pandas as pd, numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# counts: genes × samples; metadata: samples × covariates
dds = DeseqDataSet(counts=counts.T, metadata=metadata, design_factors=["batch","condition"])
dds.deseq2()
stats = DeseqStats(dds, contrast=["condition","KO","WT"])
stats.summary()
stats.lfc_shrink(coeff="condition_KO_vs_WT")
res = stats.results_df                                # pandas DF

# scRNA-seq: scanpy / decoupler
import scanpy as sc
sc.tl.rank_genes_groups(adata, groupby="cluster", method="wilcoxon")
# 偽-bulk DE 仍推薦 DESeq2 / edgeR via rpy2

from rpy2.robjects import r, pandas2ri; pandas2ri.activate()
r.assign("counts", counts); r.assign("cd", metadata)
r('library(edgeR); dge <- DGEList(counts, group=cd$condition); ...')

🚫

常見錯誤：(1) 把原始 LFC 用於熱圖排名 → 低 count 雜訊衝到前面 → 必須先 lfcShrink。(2) 先 ComBat 校正再做 DE → SE 失準 → 應該把批次放進 design。(3) 對低 count 基因 (count < 10) 沒做獨立過濾就直接報 padj → BH 檢力被吃掉。(4) DESeq2 對單一複製 (n=1 vs n=1) 不會跑——它需要至少每組 n ≥ 2 才能估 dispersion。 Common mistakes: (1) ranking heatmaps by raw LFC — low-count noise floods the top — always lfcShrink first. (2) ComBat-correcting before DE — SE breaks — put batch in the design instead. (3) Skipping independent filtering for low-count genes — wastes BH power. (4) DESeq2 won't run with single replicates (n=1 vs n=1) — at least n ≥ 2 per group is needed for dispersion estimation.

📝 自我檢測

1. 為什麼 voom 需要對 RNA-seq counts 計算 precision weights，再交給 limma 的 Gaussian 工具？

1. Why does voom precision-weight RNA-seq counts before feeding them into limma's Gaussian toolbox?

A. 因為 limma 不能處理多重檢定A. Because limma cannot handle multiple testing

B. 因為 voom 不能用於 RNA-seqB. Because voom does not apply to RNA-seq

C. 因為 log-CPM 後不同基因的變異仍隨平均不同；precision weights 把這個 mean-variance trend 內建到線性模型C. Because after log-CPM, variance still depends on mean; precision weights encode this mean-variance trend into the linear model

D. 因為 limma 需要整數輸入D. Because limma requires integer input

2. edgeR 的 QL F 檢定與 DESeq2 的 Wald 檢定的主要實務差異？

2. Practical difference between edgeR's QL F-test and DESeq2's Wald test?

A. 兩者用不同的虛無假設A. They test different null hypotheses

B. QL F 用 quasi-likelihood 把 dispersion 不確定度納入分母，小樣本下 Type I 錯誤率較保守；Wald 在小樣本下偶爾過寬B. QL F uses quasi-likelihood to embed dispersion uncertainty into its denominator, giving tighter Type I control at small n; Wald can be a touch liberal at small n

C. QL F 不需要校正多重檢定C. QL F does not require multiple-testing correction

D. DESeq2 不能處理 batch effectD. DESeq2 cannot adjust for batch effects

3. 為什麼用 fold change 排名基因時必須先做 lfcShrink？

3. Why is lfcShrink essential when ranking genes by fold change?

A. lfcShrink 可以同時做 FDR 校正A. lfcShrink simultaneously performs FDR correction

B. 沒有它 DESeq2 無法跑B. DESeq2 will not run without it

C. 低 count 基因的原始 LFC 變異極大，會被排到頂端造成「噴泉」雜訊；shrinkage 依不確定度收縮 LFC 還原可靠排名C. Low-count genes have huge LFC variance and rise to the top as "fountain" noise; shrinkage tames LFC by its uncertainty, restoring a meaningful ranking

D. lfcShrink 會把 p-value 變更小D. lfcShrink makes p-values smaller