Step 13: Hierarchical & Empirical Bayes (limma shrinkage)

概覽

從 James-Stein 到 limma：收縮為何贏

1961 年 Stein 證明了一個令所有頻率學家震驚的結果：當你要同時估計 k ≥ 3 個獨立常態均值，把每個樣本均值朝整體中心收縮得到的估計量，在所有真值組合下、均勻地比 MLE 有更小的均方誤差 (MSE)。MLE 雖然無偏，卻不是最佳——這就是James-Stein paradox。

同樣的邏輯讓 limma::eBayes() 成為微陣列／RNA-seq 差異表達分析的事實標準：每個基因 g 的樣本變異 s_g² 在 n=3 時極不穩定，t 統計量會被小變異基因「人為放大」假陽性；eBayes 將 s_g² 朝整體變異趨勢線收縮，得到 moderated t——分母更穩定、推論更可靠。

這一章把階層模型 (full Bayesian) 與經驗貝氏 (EB) 並列：兩者都靠 partial pooling / borrow strength，只是 EB 用資料估計超先驗參數，不傳播超參數的不確定性。

In 1961 Stein proved a result that shocked frequentists: when jointly estimating k ≥ 3 independent normal means, shrinking each sample mean toward the overall centre uniformly beats the MLE in mean squared error for every true parameter vector. The MLE is unbiased but inadmissible — this is James-Stein paradox.

The same logic makes limma::eBayes() the de-facto standard for microarray / RNA-seq differential expression. With n = 3 per group the per-gene variance s_g² is wildly unstable, and t-statistics get artificially inflated for low-variance genes. eBayes shrinks s_g² toward the global mean-variance trend, producing the moderated t — a stabler denominator and far more reliable inference.

This chapter sets hierarchical (full Bayes) next to empirical Bayes (EB): both partial-pool by borrowing strength; EB plugs in hyperparameters estimated from the data instead of propagating their uncertainty.

💡

核心原則：「我的估計量無偏」不是金科玉律。當你同時估計許多參數時，偏差換變異的交換幾乎總是划算——這就是收縮的精神，也是基因組學裡 partial pooling 普及的主因。 Core principle: "my estimator is unbiased" is not a sacred goal. When you estimate many parameters at once, trading bias for variance almost always pays — the essence of shrinkage and the reason partial pooling pervades genomics.

核心概念

一、階層、EB、shrinkage

🏛️

階層模型 (partial pooling)

每個 g 的參數 θ_g 來自一個共同的母體分布 θ_g ~ N(μ, τ²)，其中 μ, τ² 也有自己的先驗（hyperprior）。後驗 p(θ_g|y) 自動把「單獨估計 y_g」與「整體平均 μ̂」做加權平均——這就是 partial pooling。

極限：τ → 0 退化為 complete pooling（所有基因一個值）；τ → ∞ 退化為 no pooling（每個基因獨立）。「適度收縮」是兩端之間最優。

Each gene's parameter θ_g is drawn from a common population θ_g ~ N(μ, τ²) with its own hyperprior. The posterior p(θ_g|y) automatically takes a weighted average of "fit gene g alone" and "the overall μ̂" — that is partial pooling.

Limits: τ → 0 → complete pooling (one value for all); τ → ∞ → no pooling (each gene independent). Moderate shrinkage sits in the middle and is provably better.

📐

James-Stein paradox

對 X_i ~ N(θ_i, 1), i=1..k (k ≥ 3)，James-Stein 估計量

θ̂_JS = (1 − (k−2)/Σ X_i²) · X

在任何真值組合下都有比 MLE = X 更低的 MSE。違反直覺處：即使各組獨立、看似無關，「同時收縮」仍嚴格更好。

這提供了 limma / DESeq2 shrinkage 的理論基礎——只是把「均值朝中心收縮」改為「變異 / log-FC 朝整體趨勢收縮」。

For X_i ~ N(θ_i, 1), i = 1..k (k ≥ 3), the James-Stein estimator

θ̂_JS = (1 − (k−2)/Σ X_i²) · X

has strictly lower MSE than the MLE for every θ. Counter-intuitive bit: even when the groups are independent and seemingly unrelated, joint shrinkage strictly improves total MSE.

This is the theoretical backbone for limma / DESeq2 shrinkage — just applied to variances or log-FCs instead of means.

🔬

limma 的 eBayes() 到底做什麼？

對 expression matrix E[g, sample] 做基因內 OLS，得到 β̂_g 與 s_g²。
EB shrinkage 變異：假設 s_g² ~ s₀² · χ²/d₀（scaled inverse-χ²），從所有基因估出 s₀² 與超先驗自由度 d₀。
後驗變異 = 加權平均：s̃_g² = (d₀ · s₀² + d_g · s_g²) / (d₀ + d_g)。
moderated t = β̂_g / (s̃_g · sqrt(v_g))，自由度 d₀ + d_g — 比原 t 多出 d₀ 度自由度。
結果：n=3 的實驗也能跑「等效於 n=3+d₀」的穩定 t 檢定。

Per-gene OLS on the expression matrix E[g, sample] gives β̂_g and s_g².
EB shrinks the variance: assume s_g² ~ s₀² · χ²/d₀ (scaled inverse-χ²); estimate s₀² and prior df d₀ from all genes.
Posterior variance = weighted average: s̃_g² = (d₀ · s₀² + d_g · s_g²) / (d₀ + d_g).
moderated t = β̂_g / (s̃_g · √v_g), df = d₀ + d_g — i.e. extra d₀ degrees of freedom borrowed from the genome.
Net effect: even an n = 3 experiment behaves like "n = 3 + d₀" for the variance estimator.

🔧

ashr 與 lfcShrink

ashr (Adaptive Shrinkage)：直接對效應大小 β̂_g 做 EB，先驗用非對稱混合常態，靈活適應稀疏 vs 密集真效應分布。
DESeq2 的 lfcShrink(type="apeglm")：使用 apeglm 對 NB GLM 的 log-FC 做 EB 收縮，減少低 count 基因的雜訊噴發。
產出的 LFC 適合做 effect-size 排序、ranked GSEA。

ashr (Adaptive SHRinkage): EB on the effect sizes β̂_g directly, with a flexible asymmetric mixture-normal prior — adapts to sparse vs dense true effects.
DESeq2's lfcShrink(type="apeglm"): EB shrinks the NB-GLM log-FCs, taming the noisy explosions at low counts.
Shrunken LFCs are ideal for effect-size ranking and ranked-GSEA.

⚠️

EB ≠ 完整貝氏：EB 把超先驗參數當作「點估計」插回去，忽略了它本身的不確定性。結果是 EB 的可信區間系統性偏窄——對 SNP n=3 的高雜訊資料尤其明顯。需嚴格 UQ 時改用 hierarchical full Bayes（brms / Stan）。

EB ≠ full Bayes. EB plugs in point estimates of hyperparameters and ignores their uncertainty. Consequence: EB credible intervals are systematically too narrow, especially for very noisy small-n data. For rigorous UQ, switch to hierarchical full Bayes (brms / Stan).

互動模擬

eBayes 變異收縮視覺化

每個圓點是一個基因的 (均值, 原始變異)。拉動 d₀ 滑桿增加先驗自由度，觀察每個 s_g² 被拉向整體中心 s₀²；連動下方的 t-統計量散布——收縮後 t-stat 不再被低變異基因人為放大。

Each dot is a gene at (mean, raw variance). Push the d₀ slider to inject more prior df and watch every s_g² migrate toward the global s₀². The t-statistic panel updates in real time — moderated t no longer explodes at low-variance genes.

———

先驗自由度 d₀ 4

每組 n 3

X: 原始 log(s_g²)；Y: 收縮後 log(s̃_g²)

決策引導

二、EB 還是 full Bayes？

🌳 收縮策略決策樹

Q1:

n_per_group ≤ 5 的微陣列／RNA-seq？→ limma::eBayes() 或 DESeq2/edgeR 內建變異收縮——幾乎強制。

Q2:

關心 log-FC 排序 (GSEA, ranking, visualization)？→ lfcShrink(type="apeglm") 或 ashr。

Q3:

研究設計有強烈分組（多 batch、多組織、多病人）？→ 階層模型 (brms / lme4)，避免一概收縮到全基因平均。

Q4:

需要嚴謹的 UQ（臨床決策、發報告區間）？→ full Bayesian (Stan)，不是 EB——EB 區間偏窄。

Q5:

少量基因 (k < 100)？→ EB 超參數估計不穩——慎用，或併入弱先驗的 Stan 模型。

Q1:

n_per_group ≤ 5 microarray / RNA-seq? → limma::eBayes() or DESeq2/edgeR's built-in variance shrinkage — virtually mandatory.

Q2:

Need ranked / displayed log-FCs (GSEA, MA plots)? → lfcShrink(type="apeglm") or ashr.

Q3:

Strong structure in design (batches, tissues, donors)? → Hierarchical model (brms / lme4) — avoid pooling toward one global mean.

Q4:

Rigorous UQ (clinical decision, reported intervals)? → Full Bayesian (Stan), not EB — EB intervals are too narrow.

Q5:

Few genes (k < 100)? → EB hyperparameter estimates are unstable — use cautiously or roll into a Stan model with weak priors.

場景	方法	收縮什麼
微陣列 DE (n=3 vs 3)	limma::eBayes(trend=TRUE)	基因內變異 s_g²	Microarray DE (n=3 vs 3)	limma::eBayes(trend=TRUE)	per-gene variance s_g²
RNA-seq DE，需要顯示 LFC	DESeq2 + lfcShrink(apeglm)	log fold-change β̂_g	RNA-seq DE with LFC display	DESeq2 + lfcShrink(apeglm)	log fold-change β̂_g
非對稱效應分布 (sparse hits)	ashr::ash()	β̂_g 與 lfsr (false sign rate)	Asymmetric / sparse effects	ashr::ash()	β̂_g and lfsr
多病人 / 多組織階層	brms / Stan hierarchical	θ_g 朝亞群均值收縮	Multi-donor / multi-tissue	brms / Stan hierarchical	θ_g toward subgroup means
少基因 + 臨床決策	full Bayes + 弱先驗	完整 hyperparameter UQ	Few genes + clinical decision	Full Bayes + weak priors	full hyperparameter UQ

程式碼

實作：limma eBayes、ashr、PyMC 階層

# --- R --- 經典 limma + ashr
library(limma); library(ashr)

# 1) 標準 limma 流程
design  <- model.matrix(~ group, data=metadata)
fit     <- lmFit(expr, design)                      # expr: log2 expression matrix
contr   <- makeContrasts(groupB - groupA, levels=design)
fit2    <- contrasts.fit(fit, contr)
fit2    <- eBayes(fit2, trend=TRUE)                # 平均-變異 trended prior
topTable(fit2, coef=1, number=20, adjust="BH")

# 2) 看看 eBayes 學到的超參數
fit2$df.prior         # d_0 — 從基因組借來的「額外」自由度
fit2$s2.prior         # s_0^2 — 收縮的目標
plotSA(fit2)          # 平均 vs 變異 + trend

# 3) 直接對效應大小做 ashr
ash_out <- ash(fit2$coefficients[, 1], fit2$stdev.unscaled[, 1] * sqrt(fit2$s2.post))
head(ash_out$result[, c("PosteriorMean", "lfsr")])

# 4) DESeq2 對應：lfcShrink
# dds <- DESeqDataSetFromMatrix(...); dds <- DESeq(dds)
# res <- lfcShrink(dds, coef="group_B_vs_A", type="apeglm")

# --- Python ---
import numpy as np, statsmodels.api as sm
import pymc as pm, arviz as az

# 1) 手動 EB 變異收縮 (簡化版 limma)
# expr: (G genes, N samples), design: (N, p)
G, N = expr.shape; p = design.shape[1]
betas, s2 = np.zeros((G, p)), np.zeros(G)
for g in range(G):
    res = sm.OLS(expr[g], design).fit()
    betas[g], s2[g] = res.params, res.mse_resid
d_g = N - p
# 估計 d_0, s_0^2 via method-of-moments on log(s2) (Smyth 2004)
z = np.log(s2)
e = z - np.log(d_g) + np.special.digamma(d_g/2)        # adjusted moments
e_mean = e.mean(); e_var = e.var(ddof=1)
# solve for d_0 from trigamma(d_0/2) = e_var - trigamma(d_g/2)
from scipy.special import polygamma
from scipy.optimize import brentq
target = e_var - polygamma(1, d_g/2)
d0 = brentq(lambda d: polygamma(1, d/2) - target, 0.01, 1000) if target > 0 else np.inf
s0_2 = np.exp(e_mean + polygamma(0, d0/2) - np.log(d0))
# shrunken variance
s2_post = (d0 * s0_2 + d_g * s2) / (d0 + d_g)

# 2) 階層 PyMC：θ_g | μ, τ ~ Normal(μ, τ)
with pm.Model() as hb:
    mu   = pm.Normal("mu", 0, 2)
    tau  = pm.HalfNormal("tau", 1)
    z    = pm.Normal("z", 0, 1, shape=G)
    theta= pm.Deterministic("theta", mu + tau*z)
    sigma= pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu=theta[gene_idx], sigma=sigma, observed=y_obs)
    idata = pm.sample(target_accept=0.95)

# 3) ashr：透過 rpy2
# from rpy2.robjects.packages import importr; ashr = importr("ashr")
# res = ashr.ash(betas, ses)

🚫

常見錯誤：n=2 vs 2 直接套 t.test 並 BH 校正——這幾乎一定會生出大量假陽性。原始 s_g 自由度只有 1，分母可以任意小。改用 limma + eBayes 立刻穩定下來，這也是領域共識。 Common mistake: running per-gene t.test + BH on an n=2 vs 2 study. The raw s_g has only 1 df and can be arbitrarily small, inflating t and producing massive false positives. Always use limma + eBayes (or DESeq2 / edgeR) — this is field consensus.

📝 自我檢測

1. 既然每個基因的 MLE 無偏，為什麼收縮估計量在 MSE 上仍然更好？

1. Each gene's MLE is unbiased — why does shrinkage still improve total MSE?

A. 收縮降低偏差A. Shrinkage reduces bias

B. 收縮引入小偏差，但變異下降更多，總 MSE 仍降；James-Stein 對 k ≥ 3 證明這恆成立B. Shrinkage introduces a small bias but reduces variance more — total MSE drops; James-Stein proves this for k ≥ 3

C. 收縮使估計量也無偏C. Shrinkage keeps the estimator unbiased

D. 收縮不影響 MSE，只是視覺化技巧D. Shrinkage does not affect MSE, it is purely cosmetic

2. limma 的 eBayes() 實際上收縮什麼？這帶來什麼好處？

2. What does limma's eBayes() actually shrink, and what does that buy you?

A. 收縮 log fold-change 朝零，避免高 LFC 基因A. Shrinks log fold-changes toward zero to suppress high-LFC genes

B. 收縮 p-value 朝 0.5B. Shrinks p-values toward 0.5

C. 收縮每個基因的變異 s_g² 朝全基因組趨勢，獲得穩定 moderated t 與 d_0 額外自由度C. Shrinks the per-gene variance s_g² toward a global trend, yielding stable moderated t and extra d₀ degrees of freedom

D. 收縮 design matrixD. Shrinks the design matrix

3. 什麼時候 EB 不夠、需要完整階層貝氏？

3. When is empirical Bayes inadequate compared with full hierarchical Bayes?

A. 需要正確的不確定性量化（如臨床區間），或基因數小、超參數本身估計不穩時A. When proper uncertainty quantification matters (e.g. clinical intervals) or when k is small and the hyperparameter estimates are themselves unstable

B. 一律不需要 — EB 永遠等價於 full BayesB. Never — EB is always equivalent to full Bayes

C. 只在沒有電腦時C. Only when you lack a computer

D. 沒人用 full BayesD. Nobody uses full Bayes