Step 11: MCMC, HMC & Convergence Diagnostics — Statistical Inference Tutorial

概覽

為什麼需要 MCMC？

除了共軛模型，幾乎所有實用的貝氏模型都沒有封閉解。p(θ|y) = p(y|θ)p(θ) / p(y) 的分母 p(y) = ∫ p(y|θ)p(θ) dθ 是高維積分；MCMC 不去算這個積分，而是從後驗分布抽樣，用樣本逼近任何後驗摘要。

三個演算家族：Metropolis-Hastings (MH) 是最一般化的提議—接受機制，但對高度相關的後驗收斂極慢；Gibbs sampling 在能寫出條件後驗時很方便，但同樣困於相關性；Hamiltonian Monte Carlo (HMC) / NUTS 利用對數後驗的梯度沿著等能線移動，在高維與強相關後驗下效率遠勝前兩者——這是 Stan、PyMC、NumPyro 的預設引擎。

Outside conjugate models almost every useful Bayesian posterior has no closed form. The normalizer p(y) = ∫ p(y|θ)p(θ) dθ in p(θ|y) = p(y|θ)p(θ)/p(y) is a high-dimensional integral. MCMC sidesteps it: instead of integrating, it draws samples from the posterior and approximates any summary by Monte Carlo.

Three families: Metropolis-Hastings (MH) is the most general but mixes painfully slowly under correlated posteriors; Gibbs is handy when conditional posteriors are tractable but suffers the same correlation problem; Hamiltonian Monte Carlo (HMC) / NUTS uses the gradient of the log-posterior to move along level sets, vastly outperforming the others in high dimensions and strong correlations — the default engine in Stan, PyMC, NumPyro.

💡

核心原則：「抽到的樣本看起來像」並不等於「後驗已收斂」。單一鏈條看起來再漂亮，也不能斷言收斂。 至少跑 4 條鏈、檢查 R-hat、ESS、divergences——這是現代貝氏的最低衛生標準。 Core principle: "draws look plausible" is not the same as "the posterior has converged". A single chain that looks fine is not converged. Run ≥4 chains and check R-hat, ESS, and divergences — the minimum hygiene of modern Bayesian inference.

核心概念

一、演算法與收斂徵兆

🎰

Metropolis-Hastings / Gibbs

MH：從提議分布抽樣，按 min(1, π(θ*)q(θ|θ*) / π(θ)q(θ*|θ)) 接受。
Gibbs：輪流從每個條件後驗 p(θⱼ|θ₋ⱼ, y) 取樣。
優點：簡單、通用；缺點：當後驗強相關，鏈條像「走步」一樣黏在一條主軸上，ESS 極低。

MH: propose θ*, accept with prob min(1, π(θ*)q(θ|θ*) / π(θ)q(θ*|θ)).
Gibbs: cycle through full conditionals p(θⱼ|θ₋ⱼ, y).
Pros: simple, general. Cons: under correlated posteriors the chain shuffles along one axis at a crawl — terrible ESS.

⚡

HMC 與 NUTS

用 ∇ log π(θ) 構造哈密頓動力，沿等能線跳遠距、低自相關。
NUTS (No-U-Turn Sampler) 自動調整路徑長度，免去手動 tuning。
關鍵代價：必須能微分 — 不接受離散參數（須邊際化）。
Stan / PyMC / NumPyro 的預設；高維與強相關下唯一可行。

Uses ∇ log π(θ) to follow Hamiltonian trajectories — long jumps with low autocorrelation.
NUTS adapts path length automatically — no manual tuning.
Catch: needs differentiability — discrete params must be marginalized out.
The default in Stan / PyMC / NumPyro; the only viable choice in high-D, correlated geometry.

📏

收斂診斷三件套

R-hat (Ȓ)：鏈間 vs 鏈內變異比，目標 < 1.01。
ESS (有效樣本量)：分 bulk ESS（中段密度）與 tail ESS（5/95 百分位）；至少 > 400。
Trace plot / rank plot：肉眼看「毛毛蟲」是否糾纏在一起；rank plot 比 trace 更敏感。

R-hat (Ȓ): between- vs within-chain variance; target < 1.01.
ESS: split into bulk ESS (central density) and tail ESS (5/95 percentiles); aim for > 400 each.
Trace / rank plots: chains should overlap like tangled caterpillars; rank plots are more sensitive than traces.

💥

Divergent transitions

HMC 在後驗幾何極端尖銳 (funnel) 時，積分誤差爆炸；Stan 會回報「divergent」。
不能忽略——它指出某個區域抽樣偏差，後驗摘要可能錯。
對策：(1) non-centered parametrization（階層模型的關鍵技巧）；(2) 提高 adapt_delta (0.95 / 0.99)；(3) 對正參數取 log。

When posterior geometry has sharp features (funnels), HMC's integrator blows up and Stan reports a "divergent".
Never ignore them — they flag biased exploration of some region.
Fixes: (1) non-centered parametrization (key trick for hierarchical models); (2) raise adapt_delta to 0.95 or 0.99; (3) log-transform positive params.

⚠️

warmup ≠ burn-in：HMC 的 warmup 同時做兩件事——讓鏈條離開初始位置，並調整步長與質量矩陣。warmup 樣本不能用；它們不來自固定的提議分布。預設 1000 warmup + 1000 sampling 是常見配置。

Warmup ≠ burn-in. HMC's warmup phase does two jobs — moves the chain away from initialization and tunes step size and mass matrix. Warmup draws must be discarded because they don't come from a fixed proposal. The usual setup is 1000 warmup + 1000 sampling.

互動模擬

HMC vs Metropolis：強相關後驗探索

後驗是一個強相關的二維高斯（ρ≈0.95）。選擇取樣器、調整步長，然後按「Step」——Metropolis 沿主軸蠕動 (高拒絕率)，HMC 沿等能線一跳就跨越長軸。觀察右上角的接受率與ESS 估計。

The target is a strongly correlated 2-D Gaussian (ρ≈0.95). Pick a sampler, tune step size, hit Step — Metropolis crawls along the major axis with many rejections; HMC traverses the long axis in a single leap. Watch acceptance rate and a rough ESS in the panel.

———

取樣器

步長 ε 0.25

HMC L 20

灰色等高線：目標後驗；紅點：鏈條軌跡

決策引導

二、收到警告怎麼辦？

🌳 MCMC 警告排除流程

Q1:

R-hat > 1.01？→ 是 → 加長 iter（warmup + sampling 各加倍）；若仍不收斂，模型可能不可辨識。

Q2:

ESS_bulk < 400？→ 是 → 增加抽樣次數；考慮非中心化、log 轉換。

Q3:

ESS_tail < ESS_bulk / 2？→ 是 → 尾巴混合差——對極端百分位數的推論不可靠，特別小心。

Q4:

HMC divergences > 0？→ 是 → ① 提高 adapt_delta；② 階層 σ 改 non-centered；③ 對正參數取 log；④ 檢查先驗是否過鬆。

Q5:

達到最大 treedepth？→ 是 → 多為效率問題，提高 max_treedepth 即可；但常伴隨 funnel 幾何，宜重參數化。

Q1:

R-hat > 1.01? → Yes → double warmup + sampling iterations; if still not converged, model may be non-identifiable.

Q2:

ESS_bulk < 400? → Yes → draw more samples; consider non-centered parametrization or log-transforms.

Q3:

ESS_tail < ESS_bulk / 2? → Yes → tails mix poorly — extreme quantile inference is unreliable.

Q4:

Divergent transitions? → Yes → raise adapt_delta; non-center hierarchical σ; log-transform positive params; check overly-wide priors.

Q5:

Max treedepth saturated? → Yes → usually an efficiency issue — bump max_treedepth, but it often co-occurs with funnels; reparameterize.

症狀	可能原因	處方
R-hat 一致 > 1.05	非識別性 / 先驗矛盾	重新建模、加強先驗	R-hat > 1.05 across params	Non-identifiable / conflicting priors	Reparameterize, tighten priors
階層模型大量 divergences	Neal's funnel	non-centered parametrization	Many divergences in hierarchical model	Neal's funnel	Non-centered parametrization
ESS 很低、trace 黏成一坨	強相關後驗、Metropolis 步長不當	改用 HMC / NUTS	Low ESS, traces glued together	Highly correlated posterior, bad MH step	Switch to HMC / NUTS
接受率 100% 或 1%	步長太小 / 太大	調整步長，HMC 用 NUTS 自適應	Accept rate ~100% or ~1%	Step size too small / large	Re-tune; let NUTS auto-adapt
兩鏈分到兩個眾數	多峰後驗	回到模型——是否真該多峰？考慮 mixture 表示	Two chains in different modes	Multimodal posterior	Revisit the model — is multimodality real? Try mixture formulation

程式碼

實作：brms / PyMC + 診斷

# --- R --- brms 階層模型 + 收斂診斷
library(brms); library(posterior); library(bayesplot)

fit <- brm(y ~ x + (1|batch), data=df, family=gaussian(),
           chains=4, iter=2000, warmup=1000, cores=4,
           control=list(adapt_delta=0.95, max_treedepth=12))

summary(fit)                       # R-hat 與 bulk/tail ESS 都在表格中
plot(fit)                          # trace + 密度

# 顯式診斷
draws <- as_draws_df(fit)
summarise_draws(draws, rhat, ess_bulk, ess_tail)

# 視覺化
mcmc_trace(fit, pars=c("b_x", "sd_batch__Intercept"))
mcmc_rank_overlay(fit, pars="b_x")   # rank plot 更敏感

# Divergence 檢查
nuts_params(fit) |> subset(Parameter == "divergent__") |> sum()

# 非中心化（brms 自動處理；rstan 寫法）
# mu_g = mu0 + tau * z_g; z_g ~ N(0, 1)

# --- Python --- PyMC NUTS + ArviZ 診斷
import pymc as pm, arviz as az, numpy as np

with pm.Model() as m:
    mu0   = pm.Normal("mu0", 0, 5)
    tau   = pm.HalfNormal("tau", 1)
    z     = pm.Normal("z", 0, 1, shape=n_batch)   # non-centered
    mu_g  = pm.Deterministic("mu_g", mu0 + tau * z)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu=mu_g[batch_idx], sigma=sigma, observed=y)
    idata = pm.sample(2000, tune=1000, chains=4,
                      target_accept=0.95, random_seed=42)

# 1) 表格診斷
az.summary(idata, hdi_prob=0.95)            # r_hat, ess_bulk, ess_tail

# 2) 視覺化
az.plot_trace(idata); az.plot_rank(idata)
az.plot_energy(idata)                          # Bayesian Fraction of Missing Info

# 3) Divergences
n_div = idata.sample_stats.diverging.sum().item()
print("divergent transitions:", n_div)

# 4) 對問題參數對 pair-plot：看 funnel
az.plot_pair(idata, var_names=["tau", "z"], divergences=True)

🚫

常見錯誤：「我看 trace 是個漂亮毛毛蟲，所以沒問題。」單鏈 trace 漂亮不代表已收斂——不同鏈可能落入不同眾數。永遠多鏈、永遠看 R-hat 與 ESS。 Common mistake: "The trace looks like a nice fuzzy caterpillar, we're fine." A single tidy trace does not imply convergence — different chains may land in different modes. Always run multiple chains and check R-hat and ESS.

📝 自我檢測

1. R-hat (Ȓ) 衡量什麼？可接受的閾值是？

1. What does R-hat measure, and what threshold is acceptable?

A. 樣本量；> 1000 可接受A. Sample size; > 1000 is OK

B. 後驗均值的 SE；< 0.05 可接受B. SE of posterior mean; < 0.05 OK

C. 鏈間 vs 鏈內變異比；< 1.01 可接受C. Ratio of between-chain to within-chain variance; < 1.01 acceptable

D. 接受率；> 0.5 可接受D. Acceptance rate; > 0.5 acceptable

2. Divergent transitions 為什麼重要？如何減少？

2. Why do divergent transitions matter, and how do you reduce them?

A. 顯示後驗幾何有問題、可能造成偏差；用 non-centered 與更高 adapt_delta 緩解A. They signal pathological posterior geometry — fixes include non-centered parametrization and a higher adapt_delta

B. 它們只是雜訊，可忽略B. They are mere noise and can be ignored

C. 永遠用更多 iter 就能解決C. Simply increasing iter always fixes them

D. 只發生在 Metropolis，與 HMC 無關D. They occur only in Metropolis, not HMC

3. NUTS 在何種情況下顯著優於 Gibbs sampling？

3. When does NUTS clearly outperform Gibbs sampling?

A. 參數空間是離散的A. The parameter space is discrete

B. 高維、後驗強相關或具非線性等高線時B. High-dimensional, strongly correlated, or nonlinearly curved posteriors

C. 一維 Beta-Binomial 模型C. A one-dimensional Beta-Binomial model

D. 永遠優於——沒有例外D. Always — no exceptions