Step 9: GLM — Poisson / NB / Zero-Inflated — Statistical Inference Tutorial

概覽

GLM 三件套

廣義線性模型（GLM）用三個元件統一了非常多統計模型：

隨機元件：Y 屬於指數族（Normal、Bernoulli、Poisson、Gamma、NB…）
系統元件：線性預測子 η = Xβ
連結函數 g：g(E[Y]) = η——把均值映射到實數線

OLS = Normal + identity；logistic = Bernoulli + logit；count regression = Poisson/NB + log。本章焦點：計數資料的 GLM——這正是 RNA-seq 差異表達 (DE) 分析的數學基礎，DESeq2 與 edgeR 對每個基因擬合的就是 NB GLM。

The GLM framework unifies many regression models via three components:

Random component: Y belongs to an exponential family (Normal, Bernoulli, Poisson, Gamma, NB, …)
Systematic component: linear predictor η = Xβ
Link function g: g(E[Y]) = η — maps the mean onto the real line

OLS = Normal + identity; logistic = Bernoulli + logit; count regression = Poisson/NB + log. This chapter's focus: GLMs for count data — the mathematical foundation of RNA-seq differential expression. DESeq2 and edgeR fit a NB GLM per gene.

💡

關鍵 takeaway：「為什麼 RNA-seq 不能用 t 檢定？」因為計數資料不是常態。「為什麼 DESeq2 用 NB 不用 Poisson？」因為基因表現的生物與技術變異 → 過度離散 (overdispersion)，Var/Mean > 1。用 Poisson 擬合過度離散資料 → SE 嚴重低估 → 大量假陽性。NB 把 dispersion 當成第二個參數修正之。 Key takeaway: Why can't we t-test RNA-seq? Counts aren't Gaussian. Why does DESeq2 use NB instead of Poisson? Because biological + technical variation produces overdispersion: Var/Mean > 1. Fitting Poisson to overdispersed data underestimates SE → flood of false positives. NB adds a second parameter for dispersion that fixes this.

核心概念

一、計數 GLM 四要點

📊

Poisson GLM

log E[Y] = Xβ；Var = E[Y]
適合：稀有事件、純技術變異、實驗良好控制
過度離散診斷：Pearson χ²/df > 1.5 警示
quasi-Poisson：把 dispersion 當常數放大 SE，但無真實似然

log E[Y] = Xβ; Var = E[Y]
Good for: rare events, purely technical variation, controlled designs
Overdispersion diagnostic: Pearson χ²/df > 1.5 is a flag
quasi-Poisson scales SE by a constant; no true likelihood

📈

NB GLM

Var = μ + α·μ²，α = dispersion
α → 0 退化為 Poisson
MASS::glm.nb 估 α（theta = 1/α）
DESeq2、edgeR 用 shrunken α：跨基因借力

Var = μ + α·μ², α = dispersion
α → 0 reduces to Poisson
MASS::glm.nb estimates α (theta = 1/α)
DESeq2 / edgeR use shrunken α — empirical Bayes across genes

🕳️

零膨脹 / Hurdle

Zero-inflated：兩階段——「結構性零」+ 計數分布
Hurdle：先二元（0 vs ≥1），再正計數
單細胞、生態學常見
但 ZINB vs NB 在低表現基因常難以區分；過度建模反而傷

Zero-inflated: two-stage — "structural zero" mixture + count
Hurdle: binary (0 vs ≥1) then positive count
Common in scRNA-seq, ecology
ZINB vs NB often indistinguishable for low-expressed genes; overfit hurts

🧮

Offset 與 Deviance

Offset = 已知係數 = 1 的協變數，如 log(library_size)、log(person-years)
把「計數」變成「率」而不耗自由度
Deviance D = 2(logL_sat − logL_model)，類似 RSS
巢狀模型比較用 D 差 → χ²；非巢狀用 AIC

Offset = covariate with fixed coefficient 1, e.g. log(library_size), log(person-years)
Converts "counts" to "rates" without spending df
Deviance D = 2(logL_sat − logL_model), the GLM analog of RSS
Nested models: difference in D ~ χ²; non-nested: AIC

互動模擬

為什麼必須用 NB？

下方模擬器生成 100 個基因 × 兩組各 5 個樣本的計數資料，真實來自 NB(μ=20, α=滑桿值)。對每個基因跑兩個 DE 檢定（兩組均值 vs χ²）：（1）誤把資料當 Poisson；（2）正確用 NB。當 α=0 → 兩者一樣；α 升高 → Poisson 的 χ²/df 飆升，假陽性率從 5% 漲到極端值——這就是 DESeq2 / edgeR 存在的原因。

The simulator below generates 100 genes × two groups (n=5 each) of counts truly from NB(μ=20, α=slider). For each gene we run a DE test two ways: (1) assuming Poisson; (2) correctly using NB. When α = 0 the two match; as α grows, Poisson's χ²/df explodes and the false-positive rate shoots far past 5% — the entire reason DESeq2 / edgeR exist.

— — —

真實 α 0.30

μ 20

長條：100 基因 Pearson χ²/df

DESeq2 / edgeR 連結

二、與 DE 工具的完整連線

DESeq2 / edgeR 的內部數學完全是本章的東西：

每個基因擬合 log μ = Xβ + log(size_factor)——後者就是 log(library_size) 類的 offset。
NB 隨機元件處理 overdispersion；dispersion α 用 empirical Bayes 跨基因收縮（DESeq2 對 mean-dispersion 趨勢線收縮；edgeR 用 quantile-adjusted CML + Bayesian shrinkage）。
檢定：DESeq2 預設 Wald；可選 LR。edgeR 經典 quasi-likelihood F-test（glmQLFit）對小樣本更穩。
邏輯 fold-change：log₂FC = β / log(2)。DESeq2 可再做 LFC shrinkage（apeglm, ashr）降低低計數基因的雜訊。

所以本章不只「另一種 GLM」——它就是 RNA-seq 統計學的引擎室。

DESeq2 / edgeR's internal math is exactly what this chapter covers:

Per gene they fit log μ = Xβ + log(size_factor) — the size factor is a log(library_size)-style offset.
NB random component handles overdispersion; dispersion α is empirically-Bayes shrunken across genes (DESeq2: shrinkage toward a mean-dispersion trend; edgeR: quantile-adjusted CML + Bayesian shrinkage).
Tests: DESeq2 defaults to Wald, optionally LR. edgeR's classical quasi-likelihood F-test (glmQLFit) is more robust at small n.
Log-fold-change: log₂FC = β / log(2). DESeq2 can additionally shrink LFC (apeglm, ashr) to tame noise at low counts.

This chapter is not "another GLM" — it is the engine room of RNA-seq statistics.

工具	模型	特點
DESeq2	NB GLM	size-factor offset + dispersion shrink + LFC shrink	DESeq2	NB GLM	size factors + dispersion shrink + LFC shrink
edgeR (QL)	NB QL GLM	quasi-likelihood F；小樣本佳	edgeR (QL)	NB QL GLM	quasi-likelihood F; great for small n
limma-voom	Normal LM + 加權	把 NB 轉成加權常態——速度快	limma-voom	Weighted Normal LM	approximates NB via weights — fast
DESeq2 + glmGamPoi	NB GLM (fast)	scRNA-seq 大規模擬合	DESeq2 + glmGamPoi	NB GLM (fast)	large-scale scRNA-seq fitting
ZINB-WaVE / NEBULA	ZINB / NB mixed	scRNA-seq dropout、 donor 隨機效應	ZINB-WaVE / NEBULA	ZINB / NB mixed	scRNA dropout, donor RE

程式碼

實作：Poisson / NB / ZINB

# --- R --- 計數 GLM 全套
library(MASS); library(AER); library(pscl); library(broom)

# 1) Poisson with offset for rate
fit_p <- glm(count ~ x + offset(log(size)),
              family = poisson(link = "log"), data = df)

# 2) 過度離散診斷
sum(residuals(fit_p, "pearson")^2) / fit_p$df.residual    # χ²/df
AER::dispersiontest(fit_p)                                  # 正式檢定

# 3) 改用 NB GLM
fit_nb <- MASS::glm.nb(count ~ x + offset(log(size)), data = df)
fit_nb$theta                                            # theta = 1/α
broom::tidy(fit_nb, exponentiate = TRUE, conf.int = TRUE)

# 4) Zero-inflated NB（單細胞、生態學）
fit_zinb <- pscl::zeroinfl(count ~ x | x, data = df, dist = "negbin")

# 5) DESeq2 同樣的概念——對 RNA-seq 計數矩陣
# library(DESeq2)
# dds <- DESeqDataSetFromMatrix(counts, coldata, ~ condition)
# dds <- DESeq(dds)        # estimateSizeFactors + Disp + Wald
# results(dds) |> lfcShrink(dds, coef = 2, type = "apeglm")

# --- Python ---
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

# 1) Poisson with offset
fit_p = smf.glm("count ~ x", data=df,
                 family=sm.families.Poisson(),
                 offset=np.log(df["size"])).fit()

# 2) Overdispersion 診斷
chi2 = (fit_p.resid_pearson**2).sum()
chi2 / fit_p.df_resid                          # > 1.5 警示

# 3) NB GLM（α 須先估計）
fit_nb = smf.glm("count ~ x", data=df,
                  family=sm.families.NegativeBinomial(alpha=0.3),
                  offset=np.log(df["size"])).fit()
np.exp(fit_nb.params)                           # IRR (incidence rate ratio)

# 4) ZINB
zinb = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X).fit()

# 5) 進階：pydeseq2 — DESeq2 的 Python 移植（NB GLM + EB shrinkage）
# from pydeseq2.dds import DeseqDataSet; dds.deseq2()

🚫

常見錯誤：① 用 Poisson 跑 RNA-seq → 假陽性洪水。② quasi-Poisson 沒有「正規」似然，因此無 AIC / LR 檢定——只能用 deviance F 比較巢狀模型。③ ZINB 套到僅輕度過度離散的資料上——通常 NB 已足，ZINB 多估了一個機率部件反而抖動 SE。④ 不放 log(library_size) offset——估出來的是「絕對計數」而非「率」，跨樣本不可比。 Common mistakes: ① Fitting Poisson to RNA-seq counts → flood of false positives. ② quasi-Poisson has no true likelihood, so no AIC / LR — use deviance F for nested comparison. ③ Forcing ZINB on mildly overdispersed data — NB usually suffices; the extra ZI component just adds noise to SE. ④ Omitting log(library_size) offset — you'd be modeling raw counts, not rates, and cross-sample comparison breaks.

📝 自我檢測

1. 為什麼 Poisson GLM 不適合 RNA-seq？

1. Why does the Poisson GLM fail on RNA-seq counts?

A. 計數可能是負數A. Counts could be negative

B. Poisson 的 link 是 logitB. Poisson's link is logit

C. Poisson 假設 Var = Mean，但 RNA-seq 過度離散；SE 低估 → 假陽性洪水C. Poisson assumes Var = Mean, but RNA-seq is overdispersed; SE underestimated → false-positive flood

D. Poisson 沒有 MLED. Poisson has no MLE

2. 在 glm(count ~ x + offset(log(library_size)), family=poisson) 中加入 offset 是為了什麼？

2. What does offset(log(library_size)) accomplish in glm(..., family=poisson)?

A. 修正過度離散A. Corrects overdispersion

B. 把計數變成連續B. Makes counts continuous

C. 把模型從「計數」改為「率」(count / size)，係數可跨樣本比較，且不消耗自由度C. Reframes counts as rates (count / size), making coefficients comparable across samples without spending df

D. 加入互動項D. Adds an interaction term

3. 如何從 Poisson 擬合診斷過度離散？

3. How do you diagnose overdispersion from a Poisson fit?

A. 計算 Pearson χ² 殘差平方和除以殘差自由度（≈ 1 為合理；> 1.5 警示）；或 AER::dispersiontestA. Pearson χ² residuals² summed, divided by residual df (≈ 1 OK; > 1.5 flag); or AER::dispersiontest

B. 看 R²B. Look at R²

C. QQ plot 殘差是否常態C. QQ plot for normality of residuals

D. Shapiro-Wilk 檢定D. Shapiro-Wilk test