為什麼需要 MCMC?
除了共軛模型,幾乎所有實用的貝氏模型都沒有封閉解。p(θ|y) = p(y|θ)p(θ) / p(y) 的分母 p(y) = ∫ p(y|θ)p(θ) dθ 是高維積分;MCMC 不去算這個積分,而是從後驗分布抽樣,用樣本逼近任何後驗摘要。
三個演算家族:Metropolis-Hastings (MH) 是最一般化的提議—接受機制,但對高度相關的後驗收斂極慢;Gibbs sampling 在能寫出條件後驗時很方便,但同樣困於相關性;Hamiltonian Monte Carlo (HMC) / NUTS 利用對數後驗的梯度沿著等能線移動,在高維與強相關後驗下效率遠勝前兩者——這是 Stan、PyMC、NumPyro 的預設引擎。
Outside conjugate models almost every useful Bayesian posterior has no closed form. The normalizer p(y) = ∫ p(y|θ)p(θ) dθ in p(θ|y) = p(y|θ)p(θ)/p(y) is a high-dimensional integral. MCMC sidesteps it: instead of integrating, it draws samples from the posterior and approximates any summary by Monte Carlo.
Three families: Metropolis-Hastings (MH) is the most general but mixes painfully slowly under correlated posteriors; Gibbs is handy when conditional posteriors are tractable but suffers the same correlation problem; Hamiltonian Monte Carlo (HMC) / NUTS uses the gradient of the log-posterior to move along level sets, vastly outperforming the others in high dimensions and strong correlations — the default engine in Stan, PyMC, NumPyro.
一、演算法與收斂徵兆
Metropolis-Hastings / Gibbs
- MH:從提議分布抽樣,按
min(1, π(θ*)q(θ|θ*) / π(θ)q(θ*|θ))接受。 - Gibbs:輪流從每個條件後驗
p(θⱼ|θ₋ⱼ, y)取樣。 - 優點:簡單、通用;缺點:當後驗強相關,鏈條像「走步」一樣黏在一條主軸上,ESS 極低。
- MH: propose θ*, accept with prob
min(1, π(θ*)q(θ|θ*) / π(θ)q(θ*|θ)). - Gibbs: cycle through full conditionals
p(θⱼ|θ₋ⱼ, y). - Pros: simple, general. Cons: under correlated posteriors the chain shuffles along one axis at a crawl — terrible ESS.
HMC 與 NUTS
- 用
∇ log π(θ)構造哈密頓動力,沿等能線跳遠距、低自相關。 - NUTS (No-U-Turn Sampler) 自動調整路徑長度,免去手動 tuning。
- 關鍵代價:必須能微分 — 不接受離散參數(須邊際化)。
- Stan / PyMC / NumPyro 的預設;高維與強相關下唯一可行。
- Uses
∇ log π(θ)to follow Hamiltonian trajectories — long jumps with low autocorrelation. - NUTS adapts path length automatically — no manual tuning.
- Catch: needs differentiability — discrete params must be marginalized out.
- The default in Stan / PyMC / NumPyro; the only viable choice in high-D, correlated geometry.
收斂診斷三件套
- R-hat (Ȓ):鏈間 vs 鏈內變異比,目標
< 1.01。 - ESS (有效樣本量):分 bulk ESS(中段密度)與 tail ESS(5/95 百分位);至少
> 400。 - Trace plot / rank plot:肉眼看「毛毛蟲」是否糾纏在一起;rank plot 比 trace 更敏感。
- R-hat (Ȓ): between- vs within-chain variance; target
< 1.01. - ESS: split into bulk ESS (central density) and tail ESS (5/95 percentiles); aim for
> 400each. - Trace / rank plots: chains should overlap like tangled caterpillars; rank plots are more sensitive than traces.
Divergent transitions
- HMC 在後驗幾何極端尖銳 (funnel) 時,積分誤差爆炸;Stan 會回報「divergent」。
- 不能忽略——它指出某個區域抽樣偏差,後驗摘要可能錯。
- 對策:(1) non-centered parametrization(階層模型的關鍵技巧);(2) 提高
adapt_delta(0.95 / 0.99);(3) 對正參數取 log。
- When posterior geometry has sharp features (funnels), HMC's integrator blows up and Stan reports a "divergent".
- Never ignore them — they flag biased exploration of some region.
- Fixes: (1) non-centered parametrization (key trick for hierarchical models); (2) raise
adapt_deltato 0.95 or 0.99; (3) log-transform positive params.
HMC vs Metropolis:強相關後驗探索
後驗是一個強相關的二維高斯(ρ≈0.95)。選擇取樣器、調整步長,然後按「Step」——Metropolis 沿主軸蠕動 (高拒絕率),HMC 沿等能線一跳就跨越長軸。觀察右上角的接受率與ESS 估計。
The target is a strongly correlated 2-D Gaussian (ρ≈0.95). Pick a sampler, tune step size, hit Step — Metropolis crawls along the major axis with many rejections; HMC traverses the long axis in a single leap. Watch acceptance rate and a rough ESS in the panel.
灰色等高線:目標後驗;紅點:鏈條軌跡
二、收到警告怎麼辦?
🌳 MCMC 警告排除流程
iter(warmup + sampling 各加倍);若仍不收斂,模型可能不可辨識。adapt_delta;② 階層 σ 改 non-centered;③ 對正參數取 log;④ 檢查先驗是否過鬆。max_treedepth 即可;但常伴隨 funnel 幾何,宜重參數化。adapt_delta; non-center hierarchical σ; log-transform positive params; check overly-wide priors.max_treedepth, but it often co-occurs with funnels; reparameterize.| 症狀 | 可能原因 | 處方 | |||
|---|---|---|---|---|---|
| R-hat 一致 > 1.05 | 非識別性 / 先驗矛盾 | 重新建模、加強先驗 | R-hat > 1.05 across params | Non-identifiable / conflicting priors | Reparameterize, tighten priors |
| 階層模型大量 divergences | Neal's funnel | non-centered parametrization | Many divergences in hierarchical model | Neal's funnel | Non-centered parametrization |
| ESS 很低、trace 黏成一坨 | 強相關後驗、Metropolis 步長不當 | 改用 HMC / NUTS | Low ESS, traces glued together | Highly correlated posterior, bad MH step | Switch to HMC / NUTS |
| 接受率 100% 或 1% | 步長太小 / 太大 | 調整步長,HMC 用 NUTS 自適應 | Accept rate ~100% or ~1% | Step size too small / large | Re-tune; let NUTS auto-adapt |
| 兩鏈分到兩個眾數 | 多峰後驗 | 回到模型——是否真該多峰?考慮 mixture 表示 | Two chains in different modes | Multimodal posterior | Revisit the model — is multimodality real? Try mixture formulation |
實作:brms / PyMC + 診斷
# --- R --- brms 階層模型 + 收斂診斷 library(brms); library(posterior); library(bayesplot) fit <- brm(y ~ x + (1|batch), data=df, family=gaussian(), chains=4, iter=2000, warmup=1000, cores=4, control=list(adapt_delta=0.95, max_treedepth=12)) summary(fit) # R-hat 與 bulk/tail ESS 都在表格中 plot(fit) # trace + 密度 # 顯式診斷 draws <- as_draws_df(fit) summarise_draws(draws, rhat, ess_bulk, ess_tail) # 視覺化 mcmc_trace(fit, pars=c("b_x", "sd_batch__Intercept")) mcmc_rank_overlay(fit, pars="b_x") # rank plot 更敏感 # Divergence 檢查 nuts_params(fit) |> subset(Parameter == "divergent__") |> sum() # 非中心化(brms 自動處理;rstan 寫法) # mu_g = mu0 + tau * z_g; z_g ~ N(0, 1)
# --- Python --- PyMC NUTS + ArviZ 診斷 import pymc as pm, arviz as az, numpy as np with pm.Model() as m: mu0 = pm.Normal("mu0", 0, 5) tau = pm.HalfNormal("tau", 1) z = pm.Normal("z", 0, 1, shape=n_batch) # non-centered mu_g = pm.Deterministic("mu_g", mu0 + tau * z) sigma = pm.HalfNormal("sigma", 1) pm.Normal("y", mu=mu_g[batch_idx], sigma=sigma, observed=y) idata = pm.sample(2000, tune=1000, chains=4, target_accept=0.95, random_seed=42) # 1) 表格診斷 az.summary(idata, hdi_prob=0.95) # r_hat, ess_bulk, ess_tail # 2) 視覺化 az.plot_trace(idata); az.plot_rank(idata) az.plot_energy(idata) # Bayesian Fraction of Missing Info # 3) Divergences n_div = idata.sample_stats.diverging.sum().item() print("divergent transitions:", n_div) # 4) 對問題參數對 pair-plot:看 funnel az.plot_pair(idata, var_names=["tau", "z"], divergences=True)
📝 自我檢測
1. R-hat (Ȓ) 衡量什麼?可接受的閾值是?
1. What does R-hat measure, and what threshold is acceptable?
2. Divergent transitions 為什麼重要?如何減少?
2. Why do divergent transitions matter, and how do you reduce them?
3. NUTS 在何種情況下顯著優於 Gibbs sampling?
3. When does NUTS clearly outperform Gibbs sampling?