為什麼要分清楚「估計量」與「估計值」?
估計量 (estimator) 是一個函數──把資料 X 映射到參數空間,例如 X̄ 是 μ 的估計量;估計值 (estimate) 是把實際資料代入後得到的數字,例如 x̄ = 4.32。談分布、偏誤、變異數的對象永遠是估計量;談「我這次測得多少」的對象則是估計值。
本章建立四根支柱:(1) 動差法 (MoM) 與最大概似 (MLE),(2) MLE 的一致性與漸近常態性,(3) 偏誤-變異數-MSE 分解、Fisher 資訊與 Cramér-Rao 下限 (CRLB),(4) 為何 DESeq2/edgeR/limma 寧可用有偏的收縮估計,也不要 MLE。
An estimator is a function mapping data X to the parameter space — e.g. X̄ is an estimator of μ. An estimate is the number you get after plugging the actual data in, e.g. x̄ = 4.32. Distributions, bias, and variance are properties of the estimator; "what I measured this time" is the estimate.
The four pillars: (1) method of moments (MoM) and maximum likelihood (MLE), (2) MLE consistency and asymptotic normality, (3) bias-variance-MSE decomposition, Fisher information and the Cramér-Rao lower bound (CRLB), (4) why DESeq2 / edgeR / limma deliberately choose a biased shrinkage estimator over the MLE.
一、兩大估計法:MoM 與 MLE
動差法 (MoM)
用樣本動差等同理論動差,反解參數。例:NB 平均 μ̂ = x̄,方差 σ̂² = s²,由 σ² = μ + αμ² 解出 α̂。優點:不需迭代、計算快、能當 MLE 的起始值。缺點:通常變異數較大、對重尾敏感。
Equate sample moments to theoretical moments and invert. e.g. for NB, set μ̂ = x̄, σ̂² = s², solve σ² = μ + αμ² for α̂. Pros: closed-form, fast, makes a great MLE warm start. Cons: generally higher variance, sensitive to heavy tails.
最大概似 (MLE)
選擇使資料概似函數 L(θ) = ∏ f(xᵢ; θ) 最大的 θ̂。實作上對 ℓ(θ) = log L(θ) 求極值或解 ∂ℓ/∂θ = 0 (score equation)。性質:一致 (θ̂ →ᵖ θ)、漸近常態 √n(θ̂ − θ) → N(0, I(θ)⁻¹)、漸近有效 (達到 CRLB)。
Choose θ̂ that maximises the likelihood L(θ) = ∏ f(xᵢ; θ). Implementations maximise ℓ(θ) = log L(θ) or solve the score equation ∂ℓ/∂θ = 0. Properties: consistent (θ̂ →ᵖ θ), asymptotically normal √n(θ̂ − θ) → N(0, I(θ)⁻¹), asymptotically efficient (attains the CRLB).
二、Fisher 資訊與 CRLB
Fisher 資訊 I(θ) = −E[∂²ℓ/∂θ²] 量化每筆觀測對 θ 的「銳利度」。CRLB:任何無偏估計量 T 的變異數滿足 Var(T) ≥ 1 / (n·I(θ))。對 Gaussian 均值 (σ² 已知),I(μ) = 1/σ²,故 Var(X̄) ≥ σ²/n──而 X̄ 剛好達到,所以是有效估計量 (UMVUE)。
偏誤-變異數分解: MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + Bias(θ̂)²。CRLB 只規範「無偏」家族;放棄無偏限制後,有偏估計量可低於 CRLB──這就是收縮估計的數學空間。
Fisher information I(θ) = −E[∂²ℓ/∂θ²] measures how sharply each observation pins down θ. CRLB: for any unbiased estimator T, Var(T) ≥ 1 / (n·I(θ)). For Gaussian mean (σ² known) I(μ) = 1/σ² so Var(X̄) ≥ σ²/n — and X̄ achieves it, making it the UMVUE.
Bias-variance decomposition: MSE(θ̂) = Var(θ̂) + Bias(θ̂)². CRLB constrains only the unbiased family; allowing bias lets you dip below the CRLB — the mathematical room where shrinkage lives.
ĉ = g(θ̂)),但若 g 是非線性 (例如 odds ratio p/(1−p)、log-fold-change 後指數還原),樣本期望會與 g(θ) 偏離 — 需用 delta method 校正或直接 bootstrap。ĉ = g(θ̂)), but if g is nonlinear (odds ratio p/(1−p), exponentiating a log-fold-change) the expectation departs from g(θ) — correct via the delta method or bootstrap.抽樣分布模擬器
對 N(μ=5, σ=2) 重複抽 B = 500 次樣本,每次計算 μ̂_MLE = x̄ 並畫直方圖。拖動 n 觀察:(1) 分布逐漸集中於真值 5 (一致性)、(2) 標準差以 σ/√n 收縮、(3) 形狀漸近常態 (asymptotic normality)。
Repeatedly sample from N(μ=5, σ=2), B = 500 times. Each replicate computes μ̂_MLE = x̄; we histogram them. Drag n to see: (1) the distribution concentrates on the true 5 (consistency), (2) SD shrinks like σ/√n, (3) the shape becomes Gaussian (asymptotic normality).
X:估計值 μ̂;Y:頻率。紅線:真值。
三、收縮估計:limma 與 DESeq2 為何不用 MLE?
在多參數問題 (e.g. 20,000 個基因的變異數 σ²ⱼ) 中,每個 σ²ⱼ 的 per-gene MLE 在 n=3 樣本下變異極大;James-Stein (1961) 證明把估計值「往全體均值收縮」一律 dominates MLE (MSE 更低,p ≥ 3)。limma 的 squeezeVar、DESeq2 的 estimateDispersionsMAP 都是同一精神:把 per-gene MLE 用全體 (或趨勢) 作為先驗壓回去。
DESeq2 的兩步流程:estimateDispersionsGeneEst 算 per-gene MLE α̂ⱼ → 擬合 dispersion-mean trend → estimateDispersionsMAP 用 trend 當先驗做 MAP 收縮。低表達基因樣本少、MLE 雜訊大,收縮帶來最大改善。
In multi-parameter problems (e.g. 20,000 per-gene variances σ²ⱼ) the per-gene MLE has huge variance at n=3. James-Stein (1961) proved that shrinking estimates toward the group mean strictly dominates the MLE in total MSE (for p ≥ 3). limma's squeezeVar and DESeq2's estimateDispersionsMAP apply this idea: pull per-gene MLEs toward the overall (or trend) prior.
DESeq2's two-step: estimateDispersionsGeneEst computes per-gene MLE α̂ⱼ → fit a dispersion-vs-mean trend → estimateDispersionsMAP does MAP shrinkage with that trend as prior. Low-count genes (noisy MLE) gain the most from shrinkage.
| 估計量 | 偏誤 | 變異數 | MSE 排名 | ||||
|---|---|---|---|---|---|---|---|
| Per-gene MLE α̂ | ≈ 0 | 極高 | 最差 | Per-gene MLE α̂ | ≈ 0 | very high | worst |
| Trend (fitted) | 偏向 trend | 極低 | 中 | Fitted trend | trend-biased | very low | middle |
| MAP shrinkage (DESeq2) | 小 | 低 | 最佳 | MAP shrinkage (DESeq2) | small | low | best |
estimateGLMTrendedDisp)。
Common mistake: taking per-gene MLE NB dispersion at n=3, then feeding it straight into Wald tests — low-count genes become noise-amplified false positives. Fix: dispersion shrinkage (DESeq2 default; edgeR's estimateGLMTrendedDisp).
實作:MLE 與 bootstrap
# --- R --- MLE 與漸近 SE # 1) 寫負對數概似函數 (negative log-likelihood) nll <- function(theta, x) { mu <- theta[1]; sigma <- exp(theta[2]) # reparam 保證 sigma > 0 -sum(dnorm(x, mu, sigma, log=TRUE)) } fit <- optim(c(0, 0), nll, x=x, hessian=TRUE, method="BFGS") mle <- fit$par # [μ̂, log σ̂] se <- sqrt(diag(solve(fit$hessian))) # asymptotic SE # 2) 捷徑:MASS::fitdistr 直接擬合家族 library(MASS) fitdistr(x, "normal")$estimate fitdistr(counts, "negative binomial")$estimate # mu, size # 3) Bootstrap SE B <- 2000 boot_mu <- replicate(B, mean(sample(x, length(x), replace=TRUE))) sd(boot_mu) # bootstrap SE of μ̂
# --- Python --- MLE 與漸近 SE import numpy as np from scipy import optimize, stats # 1) 自寫 NLL 與 optimize.minimize def nll(theta, x): mu, log_sigma = theta sigma = np.exp(log_sigma) return -np.sum(stats.norm.logpdf(x, mu, sigma)) res = optimize.minimize(nll, x0=[0, 0], args=(x,), method="BFGS") mle = res.x # 漸近 SE:Hessian 的反矩陣對角線開根號 import numdifftools as nd H = nd.Hessian(lambda t: nll(t, x))(mle) se = np.sqrt(np.diag(np.linalg.inv(H))) # 2) statsmodels GLM / GenericLikelihoodModel: 自動回傳 SE + Wald import statsmodels.api as sm nb = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit() nb.summary() # 3) Bootstrap SE boot_mu = np.array([np.mean(np.random.choice(x, len(x), replace=True)) for _ in range(2000)]) boot_mu.std()
📝 自我檢測
1. 為什麼 MLE 估出來的母體變異數 (用 1/n 而非 1/(n−1)) 是有偏低估?
1. Why is the MLE of population variance (using 1/n rather than 1/(n−1)) biased downward?
2. 對 Gaussian (σ² 已知) 的均值 μ,CRLB 等於?
2. For Gaussian mean μ (σ² known), the CRLB equals?
3. 何時有偏估計可以打敗無偏估計?
3. When can a biased estimator beat an unbiased one?