STEP 2 / 17

點估計與抽樣分布

從 MoM 到 MLE,從 Fisher 資訊到 Cramér-Rao 下限——以及為何「有偏卻收縮」的估計在 RNA-seq 上反而勝過 MLE。

From MoM to MLE, from Fisher information to the Cramér-Rao lower bound — and why biased shrinkage estimators beat the MLE in RNA-seq MSE.

為什麼要分清楚「估計量」與「估計值」?

估計量 (estimator) 是一個函數──把資料 X 映射到參數空間,例如 X̄ 是 μ 的估計量;估計值 (estimate) 是把實際資料代入後得到的數字,例如 x̄ = 4.32。談分布、偏誤、變異數的對象永遠是估計量;談「我這次測得多少」的對象則是估計值。

本章建立四根支柱:(1) 動差法 (MoM) 與最大概似 (MLE),(2) MLE 的一致性與漸近常態性,(3) 偏誤-變異數-MSE 分解、Fisher 資訊與 Cramér-Rao 下限 (CRLB),(4) 為何 DESeq2/edgeR/limma 寧可用有偏的收縮估計,也不要 MLE。

An estimator is a function mapping data X to the parameter space — e.g. X̄ is an estimator of μ. An estimate is the number you get after plugging the actual data in, e.g. x̄ = 4.32. Distributions, bias, and variance are properties of the estimator; "what I measured this time" is the estimate.

The four pillars: (1) method of moments (MoM) and maximum likelihood (MLE), (2) MLE consistency and asymptotic normality, (3) bias-variance-MSE decomposition, Fisher information and the Cramér-Rao lower bound (CRLB), (4) why DESeq2 / edgeR / limma deliberately choose a biased shrinkage estimator over the MLE.

💡
核心原則:「無偏」不等於「最好」。判定優劣的是均方誤差 MSE = bias² + variance──稍微的偏誤若能換來大量變異數下降,整體誤差更小。James-Stein 與 limma 的 moderated 統計量正是這個哲學的實現。 Core principle: "Unbiased" is not "best". Quality is judged by MSE = bias² + variance — a tiny bias that buys a huge variance reduction wins. James-Stein and limma's moderated statistics are exactly this trade in practice.

一、兩大估計法:MoM 與 MLE

⚖️

動差法 (MoM)

樣本動差等同理論動差,反解參數。例:NB 平均 μ̂ = x̄,方差 σ̂² = s²,由 σ² = μ + αμ² 解出 α̂。優點:不需迭代、計算快、能當 MLE 的起始值。缺點:通常變異數較大、對重尾敏感。

Equate sample moments to theoretical moments and invert. e.g. for NB, set μ̂ = x̄, σ̂² = s², solve σ² = μ + αμ² for α̂. Pros: closed-form, fast, makes a great MLE warm start. Cons: generally higher variance, sensitive to heavy tails.

🎯

最大概似 (MLE)

選擇使資料概似函數 L(θ) = ∏ f(xᵢ; θ) 最大的 θ̂。實作上對 ℓ(θ) = log L(θ) 求極值或解 ∂ℓ/∂θ = 0 (score equation)。性質:一致 (θ̂ →ᵖ θ)、漸近常態 √n(θ̂ − θ) → N(0, I(θ)⁻¹)、漸近有效 (達到 CRLB)。

Choose θ̂ that maximises the likelihood L(θ) = ∏ f(xᵢ; θ). Implementations maximise ℓ(θ) = log L(θ) or solve the score equation ∂ℓ/∂θ = 0. Properties: consistent (θ̂ →ᵖ θ), asymptotically normal √n(θ̂ − θ) → N(0, I(θ)⁻¹), asymptotically efficient (attains the CRLB).

二、Fisher 資訊與 CRLB

Fisher 資訊 I(θ) = −E[∂²ℓ/∂θ²] 量化每筆觀測對 θ 的「銳利度」。CRLB:任何無偏估計量 T 的變異數滿足 Var(T) ≥ 1 / (n·I(θ))。對 Gaussian 均值 (σ² 已知),I(μ) = 1/σ²,故 Var(X̄) ≥ σ²/n──而 X̄ 剛好達到,所以是有效估計量 (UMVUE)

偏誤-變異數分解: MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + Bias(θ̂)²。CRLB 只規範「無偏」家族;放棄無偏限制後,有偏估計量可低於 CRLB──這就是收縮估計的數學空間。

Fisher information I(θ) = −E[∂²ℓ/∂θ²] measures how sharply each observation pins down θ. CRLB: for any unbiased estimator T, Var(T) ≥ 1 / (n·I(θ)). For Gaussian mean (σ² known) I(μ) = 1/σ² so Var(X̄) ≥ σ²/n — and X̄ achieves it, making it the UMVUE.

Bias-variance decomposition: MSE(θ̂) = Var(θ̂) + Bias(θ̂)². CRLB constrains only the unbiased family; allowing bias lets you dip below the CRLB — the mathematical room where shrinkage lives.

⚠️
非線性變換下 MLE 有偏: MLE 對等變變換不變 (e.g. ĉ = g(θ̂)),但若 g 是非線性 (例如 odds ratio p/(1−p)、log-fold-change 後指數還原),樣本期望會與 g(θ) 偏離 — 需用 delta method 校正或直接 bootstrap。
MLE is biased through nonlinear transforms: MLE is equivariant under transforms (ĉ = g(θ̂)), but if g is nonlinear (odds ratio p/(1−p), exponentiating a log-fold-change) the expectation departs from g(θ) — correct via the delta method or bootstrap.

抽樣分布模擬器

對 N(μ=5, σ=2) 重複抽 B = 500 次樣本,每次計算 μ̂_MLE = x̄ 並畫直方圖。拖動 n 觀察:(1) 分布逐漸集中於真值 5 (一致性)、(2) 標準差以 σ/√n 收縮、(3) 形狀漸近常態 (asymptotic normality)。

Repeatedly sample from N(μ=5, σ=2), B = 500 times. Each replicate computes μ̂_MLE = x̄; we histogram them. Drag n to see: (1) the distribution concentrates on the true 5 (consistency), (2) SD shrinks like σ/√n, (3) the shape becomes Gaussian (asymptotic normality).

n = 10SE ≈ —x̄ of x̄ ≈ —

X:估計值 μ̂;Y:頻率。紅線:真值。

三、收縮估計:limma 與 DESeq2 為何不用 MLE?

多參數問題 (e.g. 20,000 個基因的變異數 σ²ⱼ) 中,每個 σ²ⱼ 的 per-gene MLE 在 n=3 樣本下變異極大;James-Stein (1961) 證明把估計值「往全體均值收縮」一律 dominates MLE (MSE 更低,p ≥ 3)。limma 的 squeezeVar、DESeq2 的 estimateDispersionsMAP 都是同一精神:把 per-gene MLE 用全體 (或趨勢) 作為先驗壓回去。

DESeq2 的兩步流程:estimateDispersionsGeneEst 算 per-gene MLE α̂ⱼ → 擬合 dispersion-mean trend → estimateDispersionsMAP 用 trend 當先驗做 MAP 收縮。低表達基因樣本少、MLE 雜訊大,收縮帶來最大改善。

In multi-parameter problems (e.g. 20,000 per-gene variances σ²ⱼ) the per-gene MLE has huge variance at n=3. James-Stein (1961) proved that shrinking estimates toward the group mean strictly dominates the MLE in total MSE (for p ≥ 3). limma's squeezeVar and DESeq2's estimateDispersionsMAP apply this idea: pull per-gene MLEs toward the overall (or trend) prior.

DESeq2's two-step: estimateDispersionsGeneEst computes per-gene MLE α̂ⱼ → fit a dispersion-vs-mean trend → estimateDispersionsMAP does MAP shrinkage with that trend as prior. Low-count genes (noisy MLE) gain the most from shrinkage.

估計量 偏誤 變異數 MSE 排名
Per-gene MLE α̂≈ 0極高最差Per-gene MLE α̂≈ 0very highworst
Trend (fitted)偏向 trend極低Fitted trendtrend-biasedvery lowmiddle
MAP shrinkage (DESeq2)最佳MAP shrinkage (DESeq2)smalllowbest
🚫
常見錯誤:對 NB dispersion 在 n=3 條件下取 per-gene MLE,然後直接代入 Wald 檢定──低 count 基因會被噪音放大成假陽性。解:必須做 dispersion shrinkage (DESeq2 預設、edgeR 的 estimateGLMTrendedDisp)。 Common mistake: taking per-gene MLE NB dispersion at n=3, then feeding it straight into Wald tests — low-count genes become noise-amplified false positives. Fix: dispersion shrinkage (DESeq2 default; edgeR's estimateGLMTrendedDisp).

實作:MLE 與 bootstrap

# --- R --- MLE 與漸近 SE
# 1) 寫負對數概似函數 (negative log-likelihood)
nll <- function(theta, x) {
  mu <- theta[1]; sigma <- exp(theta[2])   # reparam 保證 sigma > 0
  -sum(dnorm(x, mu, sigma, log=TRUE))
}
fit <- optim(c(0, 0), nll, x=x, hessian=TRUE, method="BFGS")
mle  <- fit$par                                  # [μ̂, log σ̂]
se   <- sqrt(diag(solve(fit$hessian)))     # asymptotic SE

# 2) 捷徑:MASS::fitdistr 直接擬合家族
library(MASS)
fitdistr(x, "normal")$estimate
fitdistr(counts, "negative binomial")$estimate   # mu, size

# 3) Bootstrap SE
B <- 2000
boot_mu <- replicate(B, mean(sample(x, length(x), replace=TRUE)))
sd(boot_mu)                                      # bootstrap SE of μ̂
# --- Python --- MLE 與漸近 SE
import numpy as np
from scipy import optimize, stats

# 1) 自寫 NLL 與 optimize.minimize
def nll(theta, x):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(x, mu, sigma))

res = optimize.minimize(nll, x0=[0, 0], args=(x,), method="BFGS")
mle  = res.x
# 漸近 SE:Hessian 的反矩陣對角線開根號
import numdifftools as nd
H  = nd.Hessian(lambda t: nll(t, x))(mle)
se = np.sqrt(np.diag(np.linalg.inv(H)))

# 2) statsmodels GLM / GenericLikelihoodModel: 自動回傳 SE + Wald
import statsmodels.api as sm
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
nb.summary()

# 3) Bootstrap SE
boot_mu = np.array([np.mean(np.random.choice(x, len(x), replace=True)) for _ in range(2000)])
boot_mu.std()

📝 自我檢測

1. 為什麼 MLE 估出來的母體變異數 (用 1/n 而非 1/(n−1)) 是有偏低估

1. Why is the MLE of population variance (using 1/n rather than 1/(n−1)) biased downward?

A. MLE 在大樣本下永遠有偏A. MLE is always biased in large samples
B. 我們用樣本均值 x̄ 取代未知 μ,「估計」自身會吸收一個自由度B. We replace unknown μ by the sample mean x̄, which absorbs one degree of freedom
C. 因為 σ² 必須 > 0C. Because σ² must be > 0
D. 因為 CLT 失效D. Because CLT fails

2. 對 Gaussian (σ² 已知) 的均值 μ,CRLB 等於?

2. For Gaussian mean μ (σ² known), the CRLB equals?

A. σ² / nA. σ² / n
B. σ² · nB. σ² · n
C. 1 / σ²C. 1 / σ²
D. n / σD. n / σ

3. 何時有偏估計可以打敗無偏估計?

3. When can a biased estimator beat an unbiased one?

A. 永遠不行,無偏一定最好A. Never — unbiased is always best
B. 只在小樣本下B. Only in small samples
C. 當 bias² 增加幅度 < 變異數下降幅度 → 整體 MSE 更小C. When the increase in bias² is less than the drop in variance — overall MSE drops
D. 當資料量無限D. When the data is infinite