Step 12: Posterior Summaries, Bayes Factors, PPC — Statistical Inference Tutorial

Overview

How should a posterior be read?

The posterior distribution contains everything we know about θ after seeing the data. In practice we compress it into a few summaries: point estimates, interval estimates, direction probabilities, and model comparisons. Each summary answers a different question, and mixing them up leads to misinterpretation.

Key ideas in this chapter: (1) a credible interval (CrI) is a genuine "θ lies in this range with this probability" statement, which is the interpretation people wrongly attach to frequentist CIs; (2) for skewed posteriors, the HDI is usually preferable to the ETI; (3) the direction probability P(θ > 0 | y) is a natural Bayesian counterpart to a p-value; (4) Bayes factors compare models but are highly sensitive to priors; (5) posterior predictive checks (PPC) and LOO-CV are the tools that can actually reveal a wrong model.

💡

Core principle: "Given the data and the model, there is a 95% probability that θ lies in this interval" is the correct reading of a 95% credible interval. A frequentist 95% CI does not support that reading: its 95% describes the long-run coverage of the procedure, not the probability attached to any one computed interval. This is one of the most popular selling points of Bayesian inference, and also where students most often conflate the two.

Core concepts

1. Four kinds of posterior summary

📏

ETI vs HDI intervals

ETI (Equal-Tailed Interval): take the 2.5% and 97.5% quantiles, cutting 2.5% off each tail. Equal tail probabilities, trivial to compute.
HDI (Highest Density Interval): the narrowest 95% interval; every point inside has density ≥ every point outside.
For symmetric unimodal posteriors, ETI = HDI; for skewed or bounded posteriors, the HDI tracks the high-density region more closely and is easier to interpret.
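The gap between the two intervals is easiest to see on a skewed sample. The following is a minimal sketch using NumPy only; the Gamma draws stand in for posterior draws, and the helper names `eti` and `hdi` are illustrative, not a library API. The `hdi` helper assumes a unimodal posterior, so the HDI is a single interval.

```python
import numpy as np

def eti(draws, prob=0.95):
    """Equal-tailed interval: cut (1 - prob) / 2 off each tail."""
    tail = (1.0 - prob) / 2.0
    lo, hi = np.quantile(draws, [tail, 1.0 - tail])
    return float(lo), float(hi)

def hdi(draws, prob=0.95):
    """Narrowest interval holding `prob` of the draws (unimodal case only)."""
    x = np.sort(np.asarray(draws))
    n = x.size
    k = int(np.ceil(prob * n))             # draws the interval must contain
    widths = x[k - 1:] - x[: n - k + 1]    # width of every window of k sorted draws
    i = int(np.argmin(widths))             # start of the narrowest window
    return float(x[i]), float(x[i + k - 1])

# Stand-in for posterior draws: a right-skewed Gamma(2, 1) sample (assumption).
rng = np.random.default_rng(0)
draws = rng.gamma(shape=2.0, scale=1.0, size=20_000)

eti_lo, eti_hi = eti(draws)
hdi_lo, hdi_hi = hdi(draws)
print(f"ETI: [{eti_lo:.2f}, {eti_hi:.2f}]  width {eti_hi - eti_lo:.2f}")
print(f"HDI: [{hdi_lo:.2f}, {hdi_hi:.2f}]  width {hdi_hi - hdi_lo:.2f}")
```

On a right-skewed sample like this one, the HDI shifts toward the mode near zero and comes out narrower than the ETI, which spends part of its width on the long right tail.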

🎯

Direction probability and ROPE

P(θ > 0 | y): the posterior probability that the effect is positive, which is easier to interpret than a p-value.
ROPE (Region of Practical Equivalence): first define a "practically zero" range (e.g. |β| < 0.1), then compute P(θ ∈ ROPE | y).
Tools: bayestestR::p_direction, rope. ArviZ's plot_posterior accepts a rope argument for display; for a numeric summary, compute the probability directly from the draws in idata.

⚖️

Bayes factors

BF₁₀ = p(y|M₁) / p(y|M₀): the ratio of the two models' marginal likelihoods.
Jeffreys scale (approximate): BF₁₀ between 1 and 3 is barely worth mentioning; 3 to 10 substantial; 10 to 100 strong to very strong; > 100 decisive.
Extremely sensitive to priors (Lindley's paradox): widening the prior on the extra parameters directly penalizes the richer model, so the BF can be pushed arbitrarily far toward H₀.
Marginal likelihoods are hard integrals; common tools are bridge sampling and, for nested models, the Savage-Dickey density ratio.
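The prior sensitivity can be checked by hand in a case where the marginal likelihoods are available in closed form. The sketch below assumes a toy setup that is not from this chapter: y_i ~ Normal(θ, σ²) with σ known, M₀: θ = 0, and M₁: θ ~ Normal(0, τ²). It computes BF₁₀ two ways (ratio of marginal likelihoods of the sample mean, and the Savage-Dickey ratio prior density at 0 over posterior density at 0) and then widens τ. All function names are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    """Density of Normal(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bf10_marginal(ybar, n, sigma, tau):
    """BF10 as a ratio of marginal likelihoods of the sample mean."""
    s = sigma ** 2 / n                      # sampling variance of ybar
    return normal_pdf(ybar, 0.0, tau ** 2 + s) / normal_pdf(ybar, 0.0, s)

def bf10_savage_dickey(ybar, n, sigma, tau):
    """BF10 as prior density at 0 divided by posterior density at 0 (under M1)."""
    s = sigma ** 2 / n
    post_var = 1.0 / (1.0 / tau ** 2 + 1.0 / s)   # conjugate Normal update
    post_mean = post_var * ybar / s
    return normal_pdf(0.0, 0.0, tau ** 2) / normal_pdf(0.0, post_mean, post_var)

# Hypothetical data summary: ybar is 2.5 standard errors away from zero.
ybar, n, sigma = 0.25, 100, 1.0
taus = [0.5, 1.0, 10.0, 100.0]              # prior sd under M1, from tight to very wide
bfs = [bf10_marginal(ybar, n, sigma, t) for t in taus]

for t, bf in zip(taus, bfs):
    sd = bf10_savage_dickey(ybar, n, sigma, t)
    print(f"tau = {t:6.1f}   BF10 = {bf:.4f}   (Savage-Dickey: {sd:.4f})")
```

The data are identical in every row; only the prior width changes. As τ grows, BF₁₀ falls below 1 and the same data end up "supporting" H₀, which is why a Bayes factor should be reported together with the prior and a sensitivity analysis.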

🔁

PPC and LOO-CV

PPC: draw y_rep from the posterior predictive distribution and compare its distributional features with the observed y; one of the most effective tools for detecting model misfit.
Pick test statistics that match the question: for count models, the zero fraction, the maximum, and Var/Mean.
LOO-CV (PSIS): Pareto-smoothed importance sampling approximates the leave-one-out elpd.
k_hat > 0.7 flags an observation for which the importance-sampling approximation is unreliable (often a highly influential point); refit exactly for those points with reloo, or revise the model.
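A posterior predictive check with a targeted test statistic needs nothing more than draws of y_rep. The sketch below is an assumption-laden toy, not the chapter's model: overdispersed counts are simulated, a Poisson model with a conjugate Gamma(a0, b0) prior is fitted in closed form, and Var/Mean is used as the test statistic. The prior values and the `dispersion` helper are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated "observed" data: overdispersed counts with mean 8 (assumption).
n_obs, mu, size = 200, 8.0, 1.5
y = rng.negative_binomial(size, size / (size + mu), size=n_obs)

# Poisson likelihood + Gamma(a0, b0) prior gives a Gamma posterior for the rate.
a0, b0 = 0.01, 0.01                          # weak prior, illustrative values
n_draws = 1000
lam = rng.gamma(shape=a0 + y.sum(), scale=1.0 / (b0 + n_obs), size=n_draws)

# One replicated dataset per posterior draw: shape (n_draws, n_obs).
y_rep = rng.poisson(lam[:, None], size=(n_draws, n_obs))

def dispersion(x, axis=-1):
    """Var/Mean; about 1 for Poisson data, larger under overdispersion."""
    return x.var(axis=axis) / x.mean(axis=axis)

t_obs = dispersion(y)
t_rep = dispersion(y_rep, axis=1)
p_value = float((t_rep >= t_obs).mean())     # posterior predictive p-value

print(f"observed Var/Mean    : {t_obs:.2f}")
print(f"replicated Var/Mean  : {t_rep.mean():.2f} (mean over draws)")
print(f"P(T(y_rep) >= T(y))  : {p_value:.3f}")
```

A posterior predictive p-value close to 0 or 1 means the model almost never reproduces the observed statistic. Here the Poisson model cannot generate the observed dispersion, even though its posterior for the rate is perfectly well behaved.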

⚠️

Passing R-hat does not mean the model is right. Convergence diagnostics only tell you that the chains explored the posterior; PPC tells you whether that posterior comes from an adequate model. The two checks operate at completely different levels. Even with every R-hat = 1.00 and ample ESS, PPC can still expose a badly misspecified likelihood, which is why it cannot be skipped.

Interactive simulation

PPC simulator: Normal vs NB

The observed data are overdispersed counts drawn from a negative binomial (NB). Toggle the fitted model: a misspecified Normal model generates symmetric, possibly negative y_rep that clearly disagree with the observed histogram, whereas the NB model covers it well. This is exactly the kind of situation in which PPC flags a problem immediately.
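The Normal half of this scenario can be reproduced offline. The sketch below makes several assumptions: the NB is parameterized PyMC-style with variance μ + μ²/α (the page does not state its parameterization), the sample size is arbitrary, and the Normal model uses the standard reference prior p(μ, σ²) ∝ 1/σ² so the posterior can be sampled in closed form.

```python
import numpy as np

rng = np.random.default_rng(2)

# "Observed" NB counts; mu and alpha mirror the simulator's default values.
n_obs, mu_true, alpha_true = 300, 8.0, 0.6
y = rng.negative_binomial(alpha_true, alpha_true / (alpha_true + mu_true), size=n_obs)

# Normal model, reference prior: sigma^2 | y is scaled inverse chi-square,
# and mu | sigma^2, y is Normal(ybar, sigma^2 / n).
n_rep = 8                                    # eight replicated datasets
ybar, s2 = y.mean(), y.var(ddof=1)
sigma2 = (n_obs - 1) * s2 / rng.chisquare(n_obs - 1, size=n_rep)
mu = rng.normal(ybar, np.sqrt(sigma2 / n_obs))

# Posterior predictive draws under the (wrong) Normal model: shape (n_rep, n_obs).
y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_rep, n_obs))

neg_frac_rep = (y_rep < 0).mean(axis=1)
print("negative values in observed y:", int((y < 0).sum()))
print("share of negative values per y_rep:", np.round(neg_frac_rep, 2))
```

Counts can never be negative, yet every replicated dataset from the Normal model contains a sizeable share of negative values. That mismatch is the same one the overlay plot makes visible.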

[Interactive simulator. Controls: fitted model; true μ (8); true α (0.6). Plot: grey bars show the observed y; colored lines show 8 y_rep draws.]

Decision guide

2. Which summary and comparison should you report?

🌳 Posterior summary decision tree

Q1: Want to state the direction of an effect? → Report P(θ > 0 | y) and the posterior median.

Q2: Want to state an effect-size range? → ETI for symmetric posteriors, HDI for skewed or bounded ones.

Q3: Want to compare two models? → If predictive performance matters, use LOO / WAIC; if the priors are substantively meaningful, use a Bayes factor (and always run a prior-sensitivity check).

Q4: Want to check whether the model fits? → Always start with PPC, using a test statistic relevant to the question.

Q5: Want to know whether the effect is "practically equivalent to zero"? → Define a ROPE first, then report P(θ ∈ ROPE | y).

Question	Summary / tool	Note
Direction of effect	P(θ>0|y), median	do not conflate with a p-value
Effect-size range	95% HDI / ETI	HDI for skewed posteriors
Model comparison (predictive)	LOO / WAIC + SE	refit with reloo where k_hat > 0.7
Model comparison (hypothesis)	Bayes factor (bridge sampling / Savage-Dickey)	prior-sensitive, Lindley's paradox
Goodness of fit	PPC + targeted test statistic	choose test statistics aligned with the question

Code

Implementation: HDI, PPC, LOO, Bayes factor

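# --- R --- brms / tidybayes / bayesplot / loo / bayestestR
library(brms); library(tidybayes); library(bayesplot)
library(loo);  library(bayestestR); library(bridgesampling)

# Assumes fit, fit_normal, fit_nb, fit1, fit2 are fitted brmsfit objects
# and df is the data frame they were fitted to.

# 1) Posterior summary: median + HDI
posterior_summary(fit)                       # reports ETI (quantiles) by default
fit |> spread_draws(b_x) |> median_hdci(b_x, .width = .95)

# 2) Direction probability and ROPE
p_direction(fit)
# By default rope() reports the share of the 95% credible interval inside the ROPE;
# pass ci = 1 to use the full posterior, i.e. P(θ ∈ ROPE | y).
rope(fit, range = c(-0.1, 0.1))

# 3) PPC: density overlay + test statistics
y     <- df$y
yrep  <- posterior_predict(fit, ndraws = 100)        # brms takes ndraws, not draws
ppc_dens_overlay(y, yrep[1:50, ])
ppc_stat(y, yrep, stat = function(x) mean(x == 0))   # zero fraction
ppc_stat(y, yrep, stat = "max")

# 4) LOO and model comparison
loo1 <- loo(fit_normal)
loo2 <- loo(fit_nb)
loo_compare(loo1, loo2)                       # elpd difference and its SE
pareto_k_table(loo2)

# 5) Bayes factor (bridge sampling)
# brms models need save_pars = save_pars(all = TRUE) at fit time for bridge sampling
bf_obj <- bayesfactor_models(fit1, fit2, denominator = 2)

# --- Python --- PyMC + ArviZ
import pymc as pm, arviz as az, numpy as np

# Assumes `model` is a PyMC model with a scalar parameter "beta" and observed variable "y"
with model:
    # store the pointwise log-likelihood so that az.loo / az.compare can run later
    idata = pm.sample(idata_kwargs={"log_likelihood": True})
    pm.sample_posterior_predictive(idata, extend_inferencedata=True)

# 1) HDI / summary
az.summary(idata, hdi_prob=0.95)
hdi = az.hdi(idata, hdi_prob=0.95)

# 2) Direction probability and ROPE (manual)
draws = idata.posterior["beta"].values.flatten()
p_dir = (draws > 0).mean()
p_rope = ((draws > -0.1) & (draws < 0.1)).mean()

# 3) PPC
az.plot_ppc(idata, num_pp_samples=50)
# custom test statistic: zero fraction
y     = idata.observed_data["y"].values
y_rep = idata.posterior_predictive["y"].values.reshape(-1, y.size)
zero_obs  = (y == 0).mean()
zero_rep  = (y_rep == 0).mean(axis=1)
p_bayes_p = (zero_rep >= zero_obs).mean()       # posterior predictive (Bayesian) p-value

# 4) LOO and comparison
# idata_nb / idata_normal: InferenceData from each model, sampled with log_likelihood stored
loo_nb     = az.loo(idata_nb)
loo_normal = az.loo(idata_normal)
az.compare({"NB": idata_nb, "Normal": idata_normal}, ic="loo")

# 5) Bayes factor (Savage-Dickey ratio at θ=0)
# BF10 = prior_density_at_0 / posterior_density_at_0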

🚫

Common mistake: making the priors "very weak" (very wide) in the hope of being "fair" when comparing models with a Bayes factor. This systematically tips the BF toward the simpler H₀ (Lindley's paradox). When reporting a BF, always state the priors and include a sensitivity analysis; a paper that reports a single BF without these should be treated with suspicion.

📝 Self-check

1. What is the correct interpretation of a 95% credible interval (CrI)?

A. If we repeated the experiment many times, about 95% of the intervals would contain the true value

B. Given the data and the model, θ lies in this interval with 95% probability

C. p-value ≤ 0.05

D. The 95% confidence level of the estimator

2. Why is the Bayes factor so sensitive to the prior, and is that a feature or a bug?

A. It is independent of the prior and purely data-driven

B. It is just MCMC sampling error

C. The marginal likelihood ∫p(y|θ)p(θ)dθ directly penalizes models that spread prior mass over implausible regions; this is a real feature and also a frequently misused pitfall

D. Because it is expensive to compute

3. What can a posterior predictive check catch that R-hat cannot?

A. Misspecification of the likelihood (e.g. fitting a Normal to NB counts)

B. Poor mixing of chains

C. Excessive step size

D. Divergent transitions