GLM 三件套
廣義線性模型(GLM)用三個元件統一了非常多統計模型:
- 隨機元件:Y 屬於指數族(Normal、Bernoulli、Poisson、Gamma、NB…)
- 系統元件:線性預測子
η = Xβ - 連結函數 g:
g(E[Y]) = η——把均值映射到實數線
OLS = Normal + identity;logistic = Bernoulli + logit;count regression = Poisson/NB + log。本章焦點:計數資料的 GLM——這正是 RNA-seq 差異表達 (DE) 分析的數學基礎,DESeq2 與 edgeR 對每個基因擬合的就是 NB GLM。
The GLM framework unifies many regression models via three components:
- Random component: Y belongs to an exponential family (Normal, Bernoulli, Poisson, Gamma, NB, …)
- Systematic component: linear predictor
η = Xβ - Link function g:
g(E[Y]) = η— maps the mean onto the real line
OLS = Normal + identity; logistic = Bernoulli + logit; count regression = Poisson/NB + log. This chapter's focus: GLMs for count data — the mathematical foundation of RNA-seq differential expression. DESeq2 and edgeR fit a NB GLM per gene.
一、計數 GLM 四要點
Poisson GLM
log E[Y] = Xβ;Var = E[Y]- 適合:稀有事件、純技術變異、實驗良好控制
- 過度離散診斷:Pearson χ²/df > 1.5 警示
- quasi-Poisson:把 dispersion 當常數放大 SE,但無真實似然
log E[Y] = Xβ; Var = E[Y]- Good for: rare events, purely technical variation, controlled designs
- Overdispersion diagnostic: Pearson χ²/df > 1.5 is a flag
- quasi-Poisson scales SE by a constant; no true likelihood
NB GLM
- Var = μ + α·μ²,α = dispersion
- α → 0 退化為 Poisson
MASS::glm.nb估 α(theta = 1/α)- DESeq2、edgeR 用 shrunken α:跨基因借力
- Var = μ + α·μ², α = dispersion
- α → 0 reduces to Poisson
MASS::glm.nbestimates α (theta = 1/α)- DESeq2 / edgeR use shrunken α — empirical Bayes across genes
零膨脹 / Hurdle
- Zero-inflated:兩階段——「結構性零」+ 計數分布
- Hurdle:先二元(0 vs ≥1),再正計數
- 單細胞、生態學常見
- 但 ZINB vs NB 在低表現基因常難以區分;過度建模反而傷
- Zero-inflated: two-stage — "structural zero" mixture + count
- Hurdle: binary (0 vs ≥1) then positive count
- Common in scRNA-seq, ecology
- ZINB vs NB often indistinguishable for low-expressed genes; overfit hurts
Offset 與 Deviance
- Offset = 已知係數 = 1 的協變數,如
log(library_size)、log(person-years) - 把「計數」變成「率」而不耗自由度
- Deviance D = 2(logLsat − logLmodel),類似 RSS
- 巢狀模型比較用 D 差 → χ²;非巢狀用 AIC
- Offset = covariate with fixed coefficient 1, e.g.
log(library_size),log(person-years) - Converts "counts" to "rates" without spending df
- Deviance D = 2(logLsat − logLmodel), the GLM analog of RSS
- Nested models: difference in D ~ χ²; non-nested: AIC
為什麼必須用 NB?
下方模擬器生成 100 個基因 × 兩組各 5 個樣本的計數資料,真實來自 NB(μ=20, α=滑桿值)。對每個基因跑兩個 DE 檢定(兩組均值 vs χ²):(1)誤把資料當 Poisson;(2)正確用 NB。當 α=0 → 兩者一樣;α 升高 → Poisson 的 χ²/df 飆升,假陽性率從 5% 漲到極端值——這就是 DESeq2 / edgeR 存在的原因。
The simulator below generates 100 genes × two groups (n=5 each) of counts truly from NB(μ=20, α=slider). For each gene we run a DE test two ways: (1) assuming Poisson; (2) correctly using NB. When α = 0 the two match; as α grows, Poisson's χ²/df explodes and the false-positive rate shoots far past 5% — the entire reason DESeq2 / edgeR exist.
長條:100 基因 Pearson χ²/df
二、與 DE 工具的完整連線
DESeq2 / edgeR 的內部數學完全是本章的東西:
- 每個基因擬合
log μ = Xβ + log(size_factor)——後者就是log(library_size)類的 offset。 - NB 隨機元件處理 overdispersion;dispersion α 用 empirical Bayes 跨基因收縮(DESeq2 對 mean-dispersion 趨勢線收縮;edgeR 用 quantile-adjusted CML + Bayesian shrinkage)。
- 檢定:DESeq2 預設 Wald;可選 LR。edgeR 經典 quasi-likelihood F-test(
glmQLFit)對小樣本更穩。 - 邏輯 fold-change:log2FC = β / log(2)。DESeq2 可再做 LFC shrinkage(
apeglm,ashr)降低低計數基因的雜訊。
所以本章不只「另一種 GLM」——它就是 RNA-seq 統計學的引擎室。
DESeq2 / edgeR's internal math is exactly what this chapter covers:
- Per gene they fit
log μ = Xβ + log(size_factor)— the size factor is alog(library_size)-style offset. - NB random component handles overdispersion; dispersion α is empirically-Bayes shrunken across genes (DESeq2: shrinkage toward a mean-dispersion trend; edgeR: quantile-adjusted CML + Bayesian shrinkage).
- Tests: DESeq2 defaults to Wald, optionally LR. edgeR's classical quasi-likelihood F-test (
glmQLFit) is more robust at small n. - Log-fold-change: log2FC = β / log(2). DESeq2 can additionally shrink LFC (
apeglm,ashr) to tame noise at low counts.
This chapter is not "another GLM" — it is the engine room of RNA-seq statistics.
| 工具 | 模型 | 特點 | |||
|---|---|---|---|---|---|
| DESeq2 | NB GLM | size-factor offset + dispersion shrink + LFC shrink | DESeq2 | NB GLM | size factors + dispersion shrink + LFC shrink |
| edgeR (QL) | NB QL GLM | quasi-likelihood F;小樣本佳 | edgeR (QL) | NB QL GLM | quasi-likelihood F; great for small n |
| limma-voom | Normal LM + 加權 | 把 NB 轉成加權常態——速度快 | limma-voom | Weighted Normal LM | approximates NB via weights — fast |
| DESeq2 + glmGamPoi | NB GLM (fast) | scRNA-seq 大規模擬合 | DESeq2 + glmGamPoi | NB GLM (fast) | large-scale scRNA-seq fitting |
| ZINB-WaVE / NEBULA | ZINB / NB mixed | scRNA-seq dropout、 donor 隨機效應 | ZINB-WaVE / NEBULA | ZINB / NB mixed | scRNA dropout, donor RE |
實作:Poisson / NB / ZINB
# --- R --- 計數 GLM 全套 library(MASS); library(AER); library(pscl); library(broom) # 1) Poisson with offset for rate fit_p <- glm(count ~ x + offset(log(size)), family = poisson(link = "log"), data = df) # 2) 過度離散診斷 sum(residuals(fit_p, "pearson")^2) / fit_p$df.residual # χ²/df AER::dispersiontest(fit_p) # 正式檢定 # 3) 改用 NB GLM fit_nb <- MASS::glm.nb(count ~ x + offset(log(size)), data = df) fit_nb$theta # theta = 1/α broom::tidy(fit_nb, exponentiate = TRUE, conf.int = TRUE) # 4) Zero-inflated NB(單細胞、生態學) fit_zinb <- pscl::zeroinfl(count ~ x | x, data = df, dist = "negbin") # 5) DESeq2 同樣的概念——對 RNA-seq 計數矩陣 # library(DESeq2) # dds <- DESeqDataSetFromMatrix(counts, coldata, ~ condition) # dds <- DESeq(dds) # estimateSizeFactors + Disp + Wald # results(dds) |> lfcShrink(dds, coef = 2, type = "apeglm")
# --- Python --- import numpy as np import statsmodels.api as sm import statsmodels.formula.api as smf from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP # 1) Poisson with offset fit_p = smf.glm("count ~ x", data=df, family=sm.families.Poisson(), offset=np.log(df["size"])).fit() # 2) Overdispersion 診斷 chi2 = (fit_p.resid_pearson**2).sum() chi2 / fit_p.df_resid # > 1.5 警示 # 3) NB GLM(α 須先估計) fit_nb = smf.glm("count ~ x", data=df, family=sm.families.NegativeBinomial(alpha=0.3), offset=np.log(df["size"])).fit() np.exp(fit_nb.params) # IRR (incidence rate ratio) # 4) ZINB zinb = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X).fit() # 5) 進階:pydeseq2 — DESeq2 的 Python 移植(NB GLM + EB shrinkage) # from pydeseq2.dds import DeseqDataSet; dds.deseq2()
log(library_size) offset——估出來的是「絕對計數」而非「率」,跨樣本不可比。
Common mistakes: ① Fitting Poisson to RNA-seq counts → flood of false positives. ② quasi-Poisson has no true likelihood, so no AIC / LR — use deviance F for nested comparison. ③ Forcing ZINB on mildly overdispersed data — NB usually suffices; the extra ZI component just adds noise to SE. ④ Omitting log(library_size) offset — you'd be modeling raw counts, not rates, and cross-sample comparison breaks.
📝 自我檢測
1. 為什麼 Poisson GLM 不適合 RNA-seq?
1. Why does the Poisson GLM fail on RNA-seq counts?
2. 在 glm(count ~ x + offset(log(library_size)), family=poisson) 中加入 offset 是為了什麼?
2. What does offset(log(library_size)) accomplish in glm(..., family=poisson)?
3. 如何從 Poisson 擬合診斷過度離散?
3. How do you diagnose overdispersion from a Poisson fit?