DE 的核心問題
差異表現(Differential Expression, DE)的目標:在兩組(或多組)樣本之間,找出哪些基因的表現量有統計上顯著的差異。表面看是兩樣本 t 檢定的延伸,但 RNA-seq 資料有三個關鍵挑戰:
- 計數本質:reads 是非負整數、過度離散 → 不能用 Gaussian t 檢定,必須用 NB GLM。
- 基因數遠大於樣本數(m=20,000, n=6)→ 每基因的變異估計極不穩 → 必須借力 empirical Bayes 收縮。
- 大量多重檢定 → 必須 FDR 校正。
三大工具用三種策略解決這個問題:DESeq2 用 NB GLM + Wald / LRT + dispersion-trend 收縮;edgeR 用 NB GLM + quasi-likelihood F 檢定 + tag-wise dispersion 收縮;limma-voom 用對數轉換 + precision weight + 線性模型 + moderated t。三者數學基礎略異,但在大多數情境下結果高度一致。
Differential Expression (DE) asks: between two (or more) conditions, which genes change significantly? It looks like an extended t-test, but RNA-seq data has three critical wrinkles:
- Count nature: reads are non-negative integers and overdispersed → Gaussian t-tests are inappropriate, NB GLM is needed.
- m ≫ n (20,000 genes vs 6 samples) → per-gene variance is wildly unstable → empirical Bayes shrinkage is essential.
- Massive multiple testing → FDR correction is mandatory.
The three workhorse tools handle these differently: DESeq2 uses NB GLM + Wald / LRT + dispersion-trend shrinkage; edgeR uses NB GLM + quasi-likelihood F-test + tag-wise dispersion shrinkage; limma-voom log-transforms, applies precision weights, fits a linear model, and computes a moderated t. The maths differ, but for most analyses they agree closely.
一、三大工具的數學引擎
DESeq2
- NB GLM:
log μ = X β + log(s),s為 size factor。 - Dispersion-mean trend:估計 trend → MLE → 朝 trend 收縮(empirical Bayes)。
- Wald 檢定 (預設) 或 LRT(兩模型比較)。
- Cook's distance 自動偵測極端 outlier 基因。
lfcShrink:apeglm / ashr / normal——讓低 count 基因的 LFC 不再亂跳。
- NB GLM:
log μ = X β + log(s),s= size factor. - Dispersion-mean trend: fit trend → MLE → shrink to trend (empirical Bayes).
- Wald (default) or LRT (model comparison).
- Cook's distance flags outlier genes automatically.
lfcShrinkwith apeglm / ashr / normal calms low-count LFCs.
edgeR
- NB GLM,TMM normalization(trimmed-mean)。
- Tag-wise dispersion + common / trended 三層收縮。
- Quasi-likelihood F 檢定(
glmQLFTest)——對小樣本提供額外 Type I 控制。 - 適合複雜設計矩陣與 contrasts。
- NB GLM with TMM (trimmed mean of M-values) normalization.
- Tag-wise dispersion + common / trended three-tier shrinkage.
- Quasi-likelihood F-test (
glmQLFTest) — better small-sample Type I control. - Handles complex design matrices and contrasts elegantly.
limma-voom
- 對
log2(CPM+0.5)跑線性模型——重用 limma 的 Gaussian 工具。 voom:用 mean-variance trend 計算 precision weights。- moderated t:variance shrinkage → 借力其他基因 → 更穩。
- 速度最快,適合 n > 12 或時間序列/random effect 設計。
- Linear model on
log2(CPM+0.5)— reuses limma's Gaussian toolkit. voom: uses mean-variance trend to compute precision weights.- moderated t: variance shrinkage borrowing across genes → more stable.
- Fastest; suits n > 12 or time-course / random-effect designs.
二、變異與 LFC 收縮
對 count = 2 vs 50 的基因,原始 log2 FC ≈ 4.6,但這個估計變異極大——很可能是抽樣噪音。原始 LFC 在低 count 區域呈「火山圖底部噴泉」狀,把基因按 |LFC| 排名會把噪音排到前面。
收縮把每個基因的 LFC 朝整體分布的中心(通常是 0)拉近,拉近的程度與該基因 LFC 的不確定度成正比:高 count、穩定的基因幾乎不被收縮;低 count、雜訊大的基因被強力收縮。結果——MA plot 變得乾淨、排名穩定、可重複性提升。
DESeq2 的 lfcShrink(coef=, type="apeglm") 是當前最常見的選擇;ashr 支援多 contrast;normal(舊版預設)通常過度收縮,現已不推薦。
For a gene with counts 2 vs 50 the raw log2 FC ≈ 4.6, but this estimate has enormous variance — likely sampling noise. Raw LFCs produce a "fountain at the bottom of the volcano plot"; ranking genes by |LFC| pushes noise to the top.
Shrinkage pulls each gene's LFC toward the centre of the overall distribution (usually 0), with strength proportional to that gene's uncertainty: high-count, stable genes barely move; low-count, noisy genes are heavily shrunk. The result — cleaner MA plots, stable rankings, reproducible top-hits.
DESeq2's lfcShrink(coef=, type="apeglm") is the current default of choice; ashr handles multi-contrast scenarios; normal (the legacy default) over-shrinks and is no longer recommended.
~ batch + condition,讓 DESeq2 / limma 在估計效應時同時校正批次。~ batch + condition, letting DESeq2 / limma adjust during estimation.三、獨立過濾與設計
獨立過濾:在做 FDR 之前移除「幾乎沒被表達」的基因(如全部樣本 count < 10)——這個過濾準則(mean count)與 H₀ 下的 p-value 獨立,所以不破壞 FDR,但會大幅增加 BH 的檢力(因為 m 變小)。DESeq2 預設啟用,results(dds) 會自動找最佳閾值。
設計矩陣是 DE 分析的靈魂。常見模式:~ condition(單因子);~ batch + condition(含批次校正);~ condition * time(互動);~ subject + condition(配對設計,每位個體自己當對照);~ batch + condition + limma::duplicateCorrelation(技術重複的 random effect)。第一個變數通常是 nuisance(先校正掉),最後一個是要檢定的主效應。
Independent filtering: remove very-low-expression genes (e.g. count < 10 across all samples) before FDR. The filter (mean count) is independent of the null p-value, so FDR remains valid, but m shrinks — boosting BH power. DESeq2 enables it by default; results(dds) auto-selects the optimal threshold.
The design matrix is the heart of DE analysis. Common patterns: ~ condition (single factor); ~ batch + condition (with batch adjustment); ~ condition * time (interaction); ~ subject + condition (paired); ~ batch + condition + limma::duplicateCorrelation (random effect for technical replicates). Leading terms are usually nuisance variables to adjust away; the last term is the main effect of interest.
火山圖:原始 vs 收縮 LFC
下方為合成 RNA-seq 資料(5,000 基因,每組 n=4)。切換到 shrunken LFC 觀察:低 count 區的「噴泉」被壓平,高信號高表達的真實 DE 基因仍清晰可見。增加每組 n 觀察整體火山圖如何收窄、p-value 普遍降低。
Below is synthetic RNA-seq data (5,000 genes, n=4 per group). Toggle to shrunken LFC: the low-count fountain collapses while high-signal, well-expressed true DE genes stay clearly visible. Increase n to watch the volcano tighten and p-values drop.
X: log2 FC | Y: −log10 p | 紅點:FDR < 5%
實作:三大標準流程
# --- R --- 三大 DE 工具的標準骨架 # counts: gene × sample 整數矩陣; col_data: 樣本後設資料 (含 condition, batch) # ===== DESeq2 ===== library(DESeq2) dds <- DESeqDataSetFromMatrix(counts, col_data, design = ~ batch + condition) dds <- DESeq(dds) # size factor + dispersion + GLM res <- results(dds, contrast = c("condition","KO","WT")) res_shrunk <- lfcShrink(dds, coef="condition_KO_vs_WT", type="apeglm") summary(res_shrunk); plotMA(res_shrunk) # ===== edgeR (NB QL F) ===== library(edgeR) dge <- DGEList(counts, group = col_data$condition) keep <- filterByExpr(dge, group = col_data$condition); dge <- dge[keep, ] dge <- calcNormFactors(dge) # TMM design <- model.matrix(~ batch + condition, col_data) dge <- estimateDisp(dge, design) fit <- glmQLFit(dge, design) qlf <- glmQLFTest(fit, coef = "conditionKO") topTags(qlf, n = 20) # ===== limma-voom ===== library(limma) v <- voom(dge, design, plot = TRUE) # precision weights fit <- lmFit(v, design) fit <- eBayes(fit) # moderated t topTable(fit, coef = "conditionKO", adjust = "BH")
# --- Python --- pydeseq2 + rpy2 fallback import pandas as pd, numpy as np from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats # counts: genes × samples; metadata: samples × covariates dds = DeseqDataSet(counts=counts.T, metadata=metadata, design_factors=["batch","condition"]) dds.deseq2() stats = DeseqStats(dds, contrast=["condition","KO","WT"]) stats.summary() stats.lfc_shrink(coeff="condition_KO_vs_WT") res = stats.results_df # pandas DF # scRNA-seq: scanpy / decoupler import scanpy as sc sc.tl.rank_genes_groups(adata, groupby="cluster", method="wilcoxon") # 偽-bulk DE 仍推薦 DESeq2 / edgeR via rpy2 from rpy2.robjects import r, pandas2ri; pandas2ri.activate() r.assign("counts", counts); r.assign("cd", metadata) r('library(edgeR); dge <- DGEList(counts, group=cd$condition); ...')
lfcShrink。(2) 先 ComBat 校正再做 DE → SE 失準 → 應該把批次放進 design。(3) 對低 count 基因 (count < 10) 沒做獨立過濾就直接報 padj → BH 檢力被吃掉。(4) DESeq2 對單一複製 (n=1 vs n=1) 不會跑——它需要至少每組 n ≥ 2 才能估 dispersion。
Common mistakes: (1) ranking heatmaps by raw LFC — low-count noise floods the top — always lfcShrink first. (2) ComBat-correcting before DE — SE breaks — put batch in the design instead. (3) Skipping independent filtering for low-count genes — wastes BH power. (4) DESeq2 won't run with single replicates (n=1 vs n=1) — at least n ≥ 2 per group is needed for dispersion estimation.
📝 自我檢測
1. 為什麼 voom 需要對 RNA-seq counts 計算 precision weights,再交給 limma 的 Gaussian 工具?
1. Why does voom precision-weight RNA-seq counts before feeding them into limma's Gaussian toolbox?
2. edgeR 的 QL F 檢定與 DESeq2 的 Wald 檢定的主要實務差異?
2. Practical difference between edgeR's QL F-test and DESeq2's Wald test?
3. 為什麼用 fold change 排名基因時必須先做 lfcShrink?
3. Why is lfcShrink essential when ranking genes by fold change?