STEP 9 / 17

GLM:Poisson / NB / Zero-Inflated

廣義線性模型統一了 OLS、logistic、count regression——也是 DESeq2 / edgeR 的數學引擎。

The GLM framework unifies OLS, logistic, and count regression — and is the mathematical engine inside DESeq2 / edgeR.

GLM 三件套

廣義線性模型(GLM)用三個元件統一了非常多統計模型:

  1. 隨機元件:Y 屬於指數族(Normal、Bernoulli、Poisson、Gamma、NB…)
  2. 系統元件:線性預測子 η = Xβ
  3. 連結函數 gg(E[Y]) = η——把均值映射到實數線

OLS = Normal + identity;logistic = Bernoulli + logit;count regression = Poisson/NB + log。本章焦點:計數資料的 GLM——這正是 RNA-seq 差異表達 (DE) 分析的數學基礎,DESeq2edgeR每個基因擬合的就是 NB GLM。

The GLM framework unifies many regression models via three components:

  1. Random component: Y belongs to an exponential family (Normal, Bernoulli, Poisson, Gamma, NB, …)
  2. Systematic component: linear predictor η = Xβ
  3. Link function g: g(E[Y]) = η — maps the mean onto the real line

OLS = Normal + identity; logistic = Bernoulli + logit; count regression = Poisson/NB + log. This chapter's focus: GLMs for count data — the mathematical foundation of RNA-seq differential expression. DESeq2 and edgeR fit a NB GLM per gene.

💡
關鍵 takeaway:「為什麼 RNA-seq 不能用 t 檢定?」因為計數資料不是常態。「為什麼 DESeq2 用 NB 不用 Poisson?」因為基因表現的生物與技術變異 → 過度離散 (overdispersion),Var/Mean > 1。用 Poisson 擬合過度離散資料 → SE 嚴重低估 → 大量假陽性。NB 把 dispersion 當成第二個參數修正之。 Key takeaway: Why can't we t-test RNA-seq? Counts aren't Gaussian. Why does DESeq2 use NB instead of Poisson? Because biological + technical variation produces overdispersion: Var/Mean > 1. Fitting Poisson to overdispersed data underestimates SE → flood of false positives. NB adds a second parameter for dispersion that fixes this.

一、計數 GLM 四要點

📊

Poisson GLM

  • log E[Y] = Xβ;Var = E[Y]
  • 適合:稀有事件、純技術變異、實驗良好控制
  • 過度離散診斷:Pearson χ²/df > 1.5 警示
  • quasi-Poisson:把 dispersion 當常數放大 SE,但無真實似然
  • log E[Y] = Xβ; Var = E[Y]
  • Good for: rare events, purely technical variation, controlled designs
  • Overdispersion diagnostic: Pearson χ²/df > 1.5 is a flag
  • quasi-Poisson scales SE by a constant; no true likelihood
📈

NB GLM

  • Var = μ + α·μ²,α = dispersion
  • α → 0 退化為 Poisson
  • MASS::glm.nb 估 α(theta = 1/α)
  • DESeq2、edgeR 用 shrunken α:跨基因借力
  • Var = μ + α·μ², α = dispersion
  • α → 0 reduces to Poisson
  • MASS::glm.nb estimates α (theta = 1/α)
  • DESeq2 / edgeR use shrunken α — empirical Bayes across genes
🕳️

零膨脹 / Hurdle

  • Zero-inflated:兩階段——「結構性零」+ 計數分布
  • Hurdle:先二元(0 vs ≥1),再正計數
  • 單細胞、生態學常見
  • 但 ZINB vs NB 在低表現基因常難以區分;過度建模反而傷
  • Zero-inflated: two-stage — "structural zero" mixture + count
  • Hurdle: binary (0 vs ≥1) then positive count
  • Common in scRNA-seq, ecology
  • ZINB vs NB often indistinguishable for low-expressed genes; overfit hurts
🧮

Offset 與 Deviance

  • Offset = 已知係數 = 1 的協變數,如 log(library_size)log(person-years)
  • 把「計數」變成「率」而耗自由度
  • Deviance D = 2(logLsat − logLmodel),類似 RSS
  • 巢狀模型比較用 D 差 → χ²;非巢狀用 AIC
  • Offset = covariate with fixed coefficient 1, e.g. log(library_size), log(person-years)
  • Converts "counts" to "rates" without spending df
  • Deviance D = 2(logLsat − logLmodel), the GLM analog of RSS
  • Nested models: difference in D ~ χ²; non-nested: AIC

為什麼必須用 NB?

下方模擬器生成 100 個基因 × 兩組各 5 個樣本的計數資料,真實來自 NB(μ=20, α=滑桿值)。對每個基因跑兩個 DE 檢定(兩組均值 vs χ²):(1)誤把資料當 Poisson;(2)正確用 NB。當 α=0 → 兩者一樣;α 升高 → Poisson 的 χ²/df 飆升,假陽性率從 5% 漲到極端值——這就是 DESeq2 / edgeR 存在的原因。

The simulator below generates 100 genes × two groups (n=5 each) of counts truly from NB(μ=20, α=slider). For each gene we run a DE test two ways: (1) assuming Poisson; (2) correctly using NB. When α = 0 the two match; as α grows, Poisson's χ²/df explodes and the false-positive rate shoots far past 5% — the entire reason DESeq2 / edgeR exist.

長條:100 基因 Pearson χ²/df

實作:Poisson / NB / ZINB

# --- R --- 計數 GLM 全套
library(MASS); library(AER); library(pscl); library(broom)

# 1) Poisson with offset for rate
fit_p <- glm(count ~ x + offset(log(size)),
              family = poisson(link = "log"), data = df)

# 2) 過度離散診斷
sum(residuals(fit_p, "pearson")^2) / fit_p$df.residual    # χ²/df
AER::dispersiontest(fit_p)                                  # 正式檢定

# 3) 改用 NB GLM
fit_nb <- MASS::glm.nb(count ~ x + offset(log(size)), data = df)
fit_nb$theta                                            # theta = 1/α
broom::tidy(fit_nb, exponentiate = TRUE, conf.int = TRUE)

# 4) Zero-inflated NB(單細胞、生態學)
fit_zinb <- pscl::zeroinfl(count ~ x | x, data = df, dist = "negbin")

# 5) DESeq2 同樣的概念——對 RNA-seq 計數矩陣
# library(DESeq2)
# dds <- DESeqDataSetFromMatrix(counts, coldata, ~ condition)
# dds <- DESeq(dds)        # estimateSizeFactors + Disp + Wald
# results(dds) |> lfcShrink(dds, coef = 2, type = "apeglm")
# --- Python ---
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

# 1) Poisson with offset
fit_p = smf.glm("count ~ x", data=df,
                 family=sm.families.Poisson(),
                 offset=np.log(df["size"])).fit()

# 2) Overdispersion 診斷
chi2 = (fit_p.resid_pearson**2).sum()
chi2 / fit_p.df_resid                          # > 1.5 警示

# 3) NB GLM(α 須先估計)
fit_nb = smf.glm("count ~ x", data=df,
                  family=sm.families.NegativeBinomial(alpha=0.3),
                  offset=np.log(df["size"])).fit()
np.exp(fit_nb.params)                           # IRR (incidence rate ratio)

# 4) ZINB
zinb = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X).fit()

# 5) 進階:pydeseq2 — DESeq2 的 Python 移植(NB GLM + EB shrinkage)
# from pydeseq2.dds import DeseqDataSet; dds.deseq2()
🚫
常見錯誤:① 用 Poisson 跑 RNA-seq → 假陽性洪水。② quasi-Poisson 沒有「正規」似然,因此無 AIC / LR 檢定——只能用 deviance F 比較巢狀模型。③ ZINB 套到僅輕度過度離散的資料上——通常 NB 已足,ZINB 多估了一個機率部件反而抖動 SE。④ 不放 log(library_size) offset——估出來的是「絕對計數」而非「率」,跨樣本不可比。 Common mistakes: ① Fitting Poisson to RNA-seq counts → flood of false positives. ② quasi-Poisson has no true likelihood, so no AIC / LR — use deviance F for nested comparison. ③ Forcing ZINB on mildly overdispersed data — NB usually suffices; the extra ZI component just adds noise to SE. ④ Omitting log(library_size) offset — you'd be modeling raw counts, not rates, and cross-sample comparison breaks.

📝 自我檢測

1. 為什麼 Poisson GLM 不適合 RNA-seq?

1. Why does the Poisson GLM fail on RNA-seq counts?

A. 計數可能是負數A. Counts could be negative
B. Poisson 的 link 是 logitB. Poisson's link is logit
C. Poisson 假設 Var = Mean,但 RNA-seq 過度離散;SE 低估 → 假陽性洪水C. Poisson assumes Var = Mean, but RNA-seq is overdispersed; SE underestimated → false-positive flood
D. Poisson 沒有 MLED. Poisson has no MLE

2. 在 glm(count ~ x + offset(log(library_size)), family=poisson) 中加入 offset 是為了什麼?

2. What does offset(log(library_size)) accomplish in glm(..., family=poisson)?

A. 修正過度離散A. Corrects overdispersion
B. 把計數變成連續B. Makes counts continuous
C. 把模型從「計數」改為「率」(count / size),係數可跨樣本比較,且不消耗自由度C. Reframes counts as rates (count / size), making coefficients comparable across samples without spending df
D. 加入互動項D. Adds an interaction term

3. 如何從 Poisson 擬合診斷過度離散?

3. How do you diagnose overdispersion from a Poisson fit?

A. 計算 Pearson χ² 殘差平方和除以殘差自由度(≈ 1 為合理;> 1.5 警示);或 AER::dispersiontestA. Pearson χ² residuals² summed, divided by residual df (≈ 1 OK; > 1.5 flag); or AER::dispersiontest
B. 看 R²B. Look at R²
C. QQ plot 殘差是否常態C. QQ plot for normality of residuals
D. Shapiro-Wilk 檢定D. Shapiro-Wilk test