STEP 12 / 17

Posterior Summaries, Bayes Factors, and PPC

From credible intervals to PPC and LOO: turning posteriors into usable conclusions and catching model misspecification with posterior predictive checks.

How should a posterior be read?

The posterior distribution contains all the information about θ after seeing the data. In practice we compress it into a few summaries: point estimates, interval estimates, direction probabilities, and model comparisons. Each summary answers a different question, and mixing them up leads to misinterpretation.

Key ideas in this chapter: (1) the credible interval (CrI) is a genuine "the probability that θ lies in this range" statement, which is the interpretation people wrongly attach to frequentist CIs; (2) for skewed posteriors, the HDI is preferable to the ETI; (3) the direction probability P(θ > 0 | y) is a natural Bayesian alternative to a p-value, although the two answer different questions; (4) Bayes factors compare models but are extremely sensitive to priors; (5) posterior predictive checks (PPC) and LOO-CV are the tools that actually catch a wrong model.

💡
Core principle: "Given the data and the model, there is a 95% probability that θ lies in this interval" is the correct reading of a credible interval. A frequentist 95% CI does not support that reading. This is the most popular selling point of Bayesian inference, and also the place where students most often conflate the two.

I. Four posterior summaries

📏

ETI vs HDI intervals

  • ETI (Equal-Tailed Interval): take the 2.5% and 97.5% quantiles, cutting 2.5% off each tail. Equal tail mass on both sides, trivial to compute.
  • HDI (Highest Density Interval): the narrowest 95% interval for a unimodal posterior; every point inside has density ≥ every point outside.
  • For symmetric unimodal posteriors, ETI = HDI; for skewed or bounded posteriors, the HDI tracks the high-density region more closely and is easier to interpret.
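The difference between the two intervals is easiest to see on a skewed sample. The following is a minimal sketch using only the Python standard library; the Gamma(2, 1) draws stand in for a hypothetical right-skewed posterior, and the helper names `eti` and `hdi` are illustrative, not taken from any library. The HDI is found by sliding a fixed-coverage window over the sorted draws, which assumes a unimodal posterior.

```python
import random

def eti(draws, prob=0.95):
    """Equal-tailed interval: cut (1 - prob) / 2 of the draws off each tail."""
    s = sorted(draws)
    n = len(s)
    tail = (1.0 - prob) / 2.0
    return s[int(tail * (n - 1))], s[int((1.0 - tail) * (n - 1))]

def hdi(draws, prob=0.95):
    """Narrowest window covering `prob` of the sorted draws (unimodal case only)."""
    s = sorted(draws)
    n = len(s)
    k = int(round(prob * n))  # number of index steps each candidate window spans
    # Slide the window across the sorted draws and keep the narrowest one.
    start = min(range(n - k), key=lambda i: s[i + k] - s[i])
    return s[start], s[start + k]

rng = random.Random(1)
# Hypothetical right-skewed posterior: Gamma(shape=2, scale=1) draws.
draws = [rng.gammavariate(2.0, 1.0) for _ in range(20000)]

eti_lo, eti_hi = eti(draws)
hdi_lo, hdi_hi = hdi(draws)
print(f"ETI: ({eti_lo:.3f}, {eti_hi:.3f})  width {eti_hi - eti_lo:.3f}")
print(f"HDI: ({hdi_lo:.3f}, {hdi_hi:.3f})  width {hdi_hi - hdi_lo:.3f}")
```

For this right-skewed sample the HDI sits closer to zero than the ETI and is never wider, because the ETI is itself one of the candidate windows the search considers.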
🎯

Direction probability and ROPE

  • P(θ > 0 | y): a direct statement of "the posterior probability that the effect is positive", which is far easier to read than a p-value.
  • ROPE (Region of Practical Equivalence): first define a small "practically zero" range (e.g. |β| < 0.1), then compute P(θ ∈ ROPE | y).
  • Tools: bayestestR::p_direction and rope. Note that p_direction reports pd = max(P(θ > 0 | y), P(θ < 0 | y)), not P(θ > 0 | y) itself. ArviZ has no dedicated ROPE summary function (az.plot_posterior accepts a rope argument for plotting), so compute it directly on the draws in idata.
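Both summaries are simple proportions of posterior draws. A minimal sketch, assuming hypothetical Normal(0.15, 0.10) draws for a slope β and the ROPE |β| < 0.1 used in the bullets above; in a real analysis `draws` would come from the fitted model.

```python
import random

rng = random.Random(7)
# Hypothetical posterior draws for a slope beta (stand-in for MCMC output).
draws = [rng.gauss(0.15, 0.10) for _ in range(20000)]

# Direction probability: share of draws above zero, P(beta > 0 | y).
p_pos = sum(d > 0 for d in draws) / len(draws)

# Probability of direction (pd): the larger of the two one-sided probabilities.
pd = max(p_pos, 1.0 - p_pos)

# ROPE probability: share of draws inside the "practically zero" range.
rope_lo, rope_hi = -0.1, 0.1
p_rope = sum(rope_lo < d < rope_hi for d in draws) / len(draws)

print(f"P(beta > 0 | y)     = {p_pos:.3f}")
print(f"pd                  = {pd:.3f}")
print(f"P(beta in ROPE | y) = {p_rope:.3f}")
```

Here the effect is very probably positive, yet a substantial share of the posterior still lies inside the ROPE, which is why direction and practical equivalence are reported as separate summaries.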
⚖️

Bayes factors

  • BF₁₀ = p(y|M₁) / p(y|M₀): the ratio of the two models' marginal likelihoods.
  • Jeffreys scale (common reading): BF₁₀ of 1 to 3 is anecdotal; 3 to 10 moderate; 10 to 30 strong; 30 to 100 very strong; > 100 decisive.
  • Extremely sensitive to priors (Lindley's paradox): widening the prior directly penalizes the richer model and can push the BF arbitrarily far toward H₀.
  • Marginal likelihoods are hard integrals; bridge sampling and the Savage-Dickey ratio are the usual ways to approximate them.
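The prior sensitivity can be checked in a case where the marginal likelihoods are available in closed form. The sketch below assumes a Normal mean with known σ, a point null H₀: θ = 0, and H₁: θ ~ Normal(0, τ); the data summary (ȳ = 0.25, n = 100, σ = 1) is hypothetical, and the function names are illustrative. It also computes the same Bayes factor through the Savage-Dickey ratio (prior density at 0 divided by posterior density at 0) as a consistency check.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bf10_marginal(ybar, n, sigma, tau):
    """BF10 as a ratio of marginal likelihoods of the sample mean."""
    se = sigma / math.sqrt(n)
    m1 = normal_pdf(ybar, 0.0, math.sqrt(tau ** 2 + se ** 2))  # theta integrated out under H1
    m0 = normal_pdf(ybar, 0.0, se)                             # theta fixed at 0 under H0
    return m1 / m0

def bf10_savage_dickey(ybar, n, sigma, tau):
    """BF10 as prior density at 0 over posterior density at 0 (under H1)."""
    se2 = sigma ** 2 / n
    post_var = 1.0 / (1.0 / tau ** 2 + 1.0 / se2)
    post_mean = post_var * ybar / se2
    return normal_pdf(0.0, 0.0, tau) / normal_pdf(0.0, post_mean, math.sqrt(post_var))

ybar, n, sigma = 0.25, 100, 1.0   # hypothetical data summary (z = 2.5)
taus = [0.5, 1.0, 10.0, 100.0, 1000.0]
bfs = [bf10_marginal(ybar, n, sigma, tau) for tau in taus]
for tau, bf in zip(taus, bfs):
    sd_ratio = bf10_savage_dickey(ybar, n, sigma, tau)
    print(f"tau = {tau:7.1f}   BF10 = {bf:.4f}   Savage-Dickey = {sd_ratio:.4f}")
```

With the data held fixed, BF₁₀ falls from above 3 at τ = 0.5 to well below 1 for the wide priors, so the same evidence ends up "favoring" H₀ purely because the prior under H₁ was widened.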
🔁

PPC and LOO-CV

  • PPC: draw y_rep from the posterior predictive distribution and compare its distributional features with the observed y. This is one of the most effective tools for detecting model misfit.
  • Pick test statistics that match the question: for count models, use the proportion of zeros, the maximum, and Var/Mean.
  • LOO-CV (PSIS): Pareto-smoothed importance sampling approximates the leave-one-out elpd without refitting the model.
  • k_hat > 0.7 flags an observation that is so influential that the importance-sampling approximation is unreliable for it; refit for those observations with reloo.
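The idea behind importance-sampling LOO can be illustrated on a model where the exact leave-one-out answer is known. The sketch below is a deliberately simplified version: it uses plain importance sampling, p(y_i | y_-i) ≈ 1 / mean_s[1 / p(y_i | θ_s)], without the Pareto smoothing or k_hat diagnostic that PSIS adds. The model (Normal mean, known σ, flat prior), the simulated data, and the function names are all assumptions made for illustration.

```python
import math
import random
import statistics

def normal_logpdf(x, mean, sd):
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))

rng = random.Random(3)
sigma = 1.0                                      # assumed known
y = [rng.gauss(0.5, sigma) for _ in range(50)]   # hypothetical data
n, ybar = len(y), statistics.fmean(y)

# Exact posterior for mu under a flat prior: Normal(ybar, sigma / sqrt(n)).
mu_draws = [rng.gauss(ybar, sigma / math.sqrt(n)) for _ in range(4000)]

def is_loo_elpd(y_i):
    """Unsmoothed importance-sampling estimate of log p(y_i | y_-i)."""
    log_ratios = [-normal_logpdf(y_i, mu, sigma) for mu in mu_draws]
    m = max(log_ratios)                          # log-sum-exp for numerical stability
    mean_ratio = sum(math.exp(r - m) for r in log_ratios) / len(log_ratios)
    return -(m + math.log(mean_ratio))

def exact_loo_elpd(i):
    """Closed-form log p(y_i | y_-i) for this conjugate model."""
    mean_minus_i = (n * ybar - y[i]) / (n - 1)
    sd_pred = sigma * math.sqrt(1.0 + 1.0 / (n - 1))
    return normal_logpdf(y[i], mean_minus_i, sd_pred)

elpd_is = sum(is_loo_elpd(y_i) for y_i in y)
elpd_exact = sum(exact_loo_elpd(i) for i in range(n))
print(f"importance-sampling elpd_loo: {elpd_is:.3f}")
print(f"exact elpd_loo:               {elpd_exact:.3f}")
```

In this well-behaved model the two totals should agree closely. When a single observation dominates the posterior, the importance ratios become heavy-tailed and the plain estimate degrades, which is the situation the k_hat diagnostic is designed to flag.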
⚠️
R-hat passing does not mean the model is right. Convergence diagnostics only tell you whether the chains explored the posterior; PPC tells you whether that posterior belongs to the right model. The two checks operate at completely different levels. Even with every R-hat = 1.00 and ample ESS, PPC can still expose a badly misspecified likelihood, which is why it cannot be skipped.

PPC simulator: Normal vs NB

The observed data are overdispersed counts (drawn from an NB). Toggle the fitted model: a misspecified Normal generates symmetric, possibly negative y_rep that clearly clash with the observed histogram, whereas the NB model covers it well. This is exactly the kind of situation where PPC flags a problem immediately.

Grey bars: observed y; coloured lines: 8 y_rep draws
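The misfit shown in the simulator can be reproduced numerically with a test statistic instead of a plot. The following is a minimal standard-library sketch, not the simulator's own code: the "observed" counts are simulated from an NB with a hypothetical shape 2 and mean 5, and the misspecified Normal model is given the reference prior p(μ, σ²) ∝ 1/σ² so that its posterior can be sampled exactly without MCMC. The helper names are illustrative.

```python
import math
import random
import statistics

rng = random.Random(42)

def rnbinom(rng, r, mu):
    """NB draw with integer shape r and mean mu, as a sum of r geometric draws."""
    p = r / (r + mu)
    return sum(int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) for _ in range(r))

# "Observed" overdispersed counts (simulated; shape and mean are hypothetical).
y = [rnbinom(rng, 2, 5.0) for _ in range(200)]
n, ybar, s2 = len(y), statistics.fmean(y), statistics.variance(y)

def normal_yrep(rng):
    """One replicated dataset from the Normal model's posterior predictive."""
    # sigma^2 | y is scaled inverse chi-square; chi-square(df) = Gamma(df / 2, scale 2).
    sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2.0, 2.0)
    mu = rng.gauss(ybar, math.sqrt(sigma2 / n))
    return [rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n)]

reps = [normal_yrep(rng) for _ in range(500)]

# Test statistics suited to counts: the minimum (counts cannot go below 0) and the maximum.
p_min = sum(min(rep) >= min(y) for rep in reps) / len(reps)   # Bayesian p-value for T = min
p_max = sum(max(rep) >= max(y) for rep in reps) / len(reps)   # Bayesian p-value for T = max
frac_negative = sum(min(rep) < 0 for rep in reps) / len(reps)

print(f"observed min = {min(y)}, observed max = {max(y)}")
print(f"P(min(y_rep) >= min(y)) = {p_min:.3f}")
print(f"P(max(y_rep) >= max(y)) = {p_max:.3f}")
print(f"share of y_rep datasets containing negative values = {frac_negative:.3f}")
```

A Bayesian p-value near 0 or 1 means the replicated datasets essentially never reproduce that feature of the observed data. Here the Normal model keeps producing negative "counts", which is the numerical counterpart of the mismatch visible in the histogram.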

II. Which summary and comparison should you report?

🌳 Posterior summary decision tree

Q1: Want to state the direction of an effect? Report P(θ > 0 | y) and the posterior median.
Q2: Want an effect-size range? Use the ETI for symmetric posteriors and the HDI for skewed or bounded ones.
Q3: Want to compare two models? For predictive performance, use LOO / WAIC; when the priors are substantively meaningful, use a Bayes factor (and always run a prior-sensitivity check).
Q4: Want to check whether the model fits? Always start with PPC, using a test statistic relevant to the question.
Q5: Want to know whether an effect is practically equivalent to zero? Define a ROPE first, then report P(θ ∈ ROPE | y).
Question                          Summary / tool                   Caveat
Direction of effect               P(θ > 0 | y), median             Do not conflate with a p-value
Effect-size range                 95% HDI / ETI                    Use the HDI for skewed posteriors
Model comparison (predictive)     LOO / WAIC + SE                  Refit with reloo where k_hat > 0.7
Model comparison (hypothesis)     Bayes factor (bridge / SD)       Prior-sensitive, Lindley's paradox
Goodness of fit                   PPC + targeted test statistic    Choose test statistics aligned with the question

Implementation: HDI, PPC, LOO, Bayes factor

# --- R --- brms / tidybayes / loo / bayestestR
library(brms); library(tidybayes); library(bayesplot)
library(loo);  library(bayestestR); library(bridgesampling)

# 1) Posterior summary: median + HDI
posterior_summary(fit)                       # mean, SD, and 2.5% / 97.5% quantiles (ETI)
fit |> spread_draws(b_x) |> median_hdci(b_x, .width = .95)

# 2) Direction probability and ROPE
p_direction(fit)                             # pd = max(P(θ > 0), P(θ < 0))
rope(fit, range = c(-0.1, 0.1))

# 3) PPC: density overlay + test statistics
y     <- df$y
yrep  <- posterior_predict(fit, ndraws = 100)        # recent brms versions use ndraws
ppc_dens_overlay(y, yrep[1:50, ])
ppc_stat(y, yrep, stat = function(x) mean(x == 0))   # proportion of zeros
ppc_stat(y, yrep, stat = "max")

# 4) LOO and model comparison
loo1 <- loo(fit_normal)
loo2 <- loo(fit_nb)
loo_compare(loo1, loo2)                       # elpd difference and its SE
pareto_k_table(loo2)                          # how many observations have high k_hat

# 5) Bayes factor (bridge sampling)
# brms models must be fit with save_pars = save_pars(all = TRUE) for bridge sampling
bf_obj <- bayesfactor_models(fit1, fit2, denominator = 2)
# --- Python --- PyMC + ArviZ
import pymc as pm, arviz as az, numpy as np

# `model` is assumed to be a pm.Model defined earlier
with model:
    # store the pointwise log-likelihood so that az.loo can run later
    idata = pm.sample(idata_kwargs={"log_likelihood": True})
    pm.sample_posterior_predictive(idata, extend_inferencedata=True)

# 1) HDI / summary
az.summary(idata, hdi_prob=0.95)
hdi = az.hdi(idata, hdi_prob=0.95)

# 2) Direction probability and ROPE (computed by hand on the draws)
draws = idata.posterior["beta"].values.flatten()
p_dir = (draws > 0).mean()
p_rope = ((draws > -0.1) & (draws < 0.1)).mean()

# 3) PPC
az.plot_ppc(idata, num_pp_samples=50)
# custom test statistic: proportion of zeros
y     = idata.observed_data["y"].values
y_rep = idata.posterior_predictive["y"].values.reshape(-1, y.size)
zero_obs  = (y == 0).mean()
zero_rep  = (y_rep == 0).mean(axis=1)
p_bayes_p = (zero_rep >= zero_obs).mean()       # Bayesian p-value

# 4) LOO and comparison
# idata_nb / idata_normal: one InferenceData per model, sampled as above
loo_nb     = az.loo(idata_nb)
loo_normal = az.loo(idata_normal)
az.compare({"NB": idata_nb, "Normal": idata_normal}, ic="loo")

# 5) Bayes factor (Savage-Dickey ratio at θ = 0, nested models with a point null)
# BF10 = prior_density_at_0 / posterior_density_at_0
🚫
Common mistake: making the priors very wide in order to "be fair" when comparing models with a Bayes factor. This systematically tips the BF toward the simpler H₀ (Lindley's paradox). Always state the priors clearly and report a sensitivity analysis; any paper that reports a single BF without this should be treated with suspicion.

📝 Self-check

1. What is the correct interpretation of a 95% credible interval?

A. If we repeated the experiment many times, about 95% of the intervals would contain the true value
B. Given the data and model, θ lies in this interval with 95% probability
C. p-value ≤ 0.05
D. The 95% confidence level of the estimator

2. Why is the Bayes factor so sensitive to the prior? Is that a feature or a bug?

A. It is independent of the prior and purely data-driven
B. It is just MCMC sampling noise
C. The marginal likelihood ∫p(y|θ)p(θ)dθ directly penalises models that spread prior mass over implausible regions; this is a real feature, and also a frequently misused pitfall
D. Because it is expensive to compute

3. What does a posterior predictive check catch that R-hat does not?

A. Misspecification of the likelihood (e.g. fitting a Normal to NB counts)
B. Poor mixing of chains
C. Excessive step size
D. Divergent transitions