STEP 10 / 17

Bayes' Theorem, Priors, and Posteriors

Starting from p(θ|y) ∝ p(y|θ)·p(θ): place a prior, let the data update it into a posterior, and carry uncertainty on the parameters rather than on p-values.

Why go Bayesian?

Frequentists treat θ as a fixed unknown and reason about the long-run behavior of estimators under repeated experiments. Bayesians treat θ as a random quantity with a distribution and use the posterior p(θ|y) to state directly: "given the data, the probability that θ falls in this range is …". That is what most experimenters actually want to know, and it is not a valid reading of a frequentist CI or p-value: a 95% CI describes the coverage of the procedure over repeated experiments, not the probability that θ lies in the one interval you computed.

The whole machinery is one equation: p(θ|y) = p(y|θ)·p(θ) / p(y), or in proportional form posterior ∝ likelihood × prior. Three pieces govern everything: (1) the prior p(θ) encodes what you know about θ before seeing the data; (2) the likelihood p(y|θ) is the probability model for the data; (3) the posterior combines the two into the uncertainty about the parameters, and from it you can derive predictive distributions.
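To make posterior ∝ likelihood × prior concrete, here is a minimal grid-approximation sketch. The data (k = 7 successes in n = 10 trials) and the Beta(2, 2) prior are illustrative assumptions, and the grid size is arbitrary. The code multiplies prior by likelihood pointwise, normalizes numerically to estimate p(y), and compares the result with the known closed-form Beta posterior.

```python
import numpy as np
from scipy import stats

# Assumed example values: k successes in n trials, Beta(a, b) prior on theta.
k, n = 7, 10
a, b = 2.0, 2.0

# Midpoints of 1000 equal-width bins covering (0, 1).
theta = np.linspace(0.0005, 0.9995, 1000)
d_theta = theta[1] - theta[0]

prior = stats.beta.pdf(theta, a, b)
likelihood = stats.binom.pmf(k, n, theta)

unnormalized = likelihood * prior            # numerator of Bayes' theorem
evidence = unnormalized.sum() * d_theta      # numerical estimate of p(y)
posterior = unnormalized / evidence          # p(theta | y) on the grid

# Closed-form answer for this conjugate case, used only as a check.
exact = stats.beta.pdf(theta, a + k, b + n - k)
max_abs_error = np.max(np.abs(posterior - exact))
print(f"p(y) ~ {evidence:.4f}, max |grid - exact| = {max_abs_error:.2e}")
```

The same three lines (multiply, sum, divide) work for any one-dimensional prior and likelihood; conjugacy is only needed here to have an exact answer to compare against.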

💡
Core principle: the posterior is a compromise between prior and likelihood. When n is small the prior dominates; when n is large the likelihood (the data) dominates. Prior choice therefore matters most in the small-sample regime, which is the norm for early-stage biomedical experiments.
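For the Beta–Binomial case the compromise can be written out exactly: the posterior mean is a weighted average of the prior mean and the sample proportion, with weight (α+β)/(α+β+n) on the prior. The sketch below uses an assumed Beta(20, 80) prior and hypothetical data with a 50% success rate; the helper name `beta_posterior_mean` is made up for illustration.

```python
def beta_posterior_mean(a, b, k, n):
    """Posterior mean of Beta(a + k, b + n - k) and the weight placed on the prior mean."""
    w_prior = (a + b) / (a + b + n)      # prior pseudo-count share
    prior_mean = a / (a + b)
    sample_prop = k / n
    post_mean = w_prior * prior_mean + (1 - w_prior) * sample_prop
    return post_mean, w_prior

# Assumed scenario: prior mean 0.20, observed success rate 0.50 at every n.
for n in (10, 100, 1000, 10000):
    mean, w = beta_posterior_mean(20, 80, k=n // 2, n=n)
    print(f"n={n:>5}  weight on prior={w:.3f}  posterior mean={mean:.3f}")
```

With n = 10 the posterior mean stays near 0.23; by n = 10000 it has moved to about 0.50, because the prior's 100 pseudo-observations are swamped by the data.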

I. Four building blocks of Bayesian inference

Start with conjugate priors: they have closed-form posteriors, so you can see "prior + data = posterior" at work without MCMC. Three conjugate pairs cover most exploratory analyses.

🎲

Three key conjugate pairs

  • Beta–Binomial: prior Beta(α, β), observe k successes in n trials, posterior Beta(α+k, β+n−k).
  • Gamma–Poisson: prior Gamma(a, b) (shape a, rate b), observe n counts y₁, …, yₙ with total Σyᵢ, posterior Gamma(a+Σyᵢ, b+n).
  • Normal–Normal (σ² known): prior N(μ₀, τ₀²), posterior N(μₙ, τₙ²), where 1/τₙ² = 1/τ₀² + n/σ² and μₙ is the precision-weighted average of the prior mean and the sample mean.
  • Why it matters: conjugacy lets us read α and β as counts of "pseudo-observations", which gives prior strength an interpretable scale.
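The three updates above are short enough to write as plain functions. This is a sketch with made-up function names and toy inputs; the Gamma prior is assumed to use the shape/rate parameterization, which is what the posterior Gamma(a+Σyᵢ, b+n) implies.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Beta(alpha, beta) prior + k successes in n trials -> Beta posterior parameters."""
    return alpha + k, beta + n - k

def gamma_poisson_update(a, b, counts):
    """Gamma(shape=a, rate=b) prior + Poisson counts -> Gamma posterior parameters."""
    return a + sum(counts), b + len(counts)

def normal_normal_update(mu0, tau0_sq, ys, sigma_sq):
    """N(mu0, tau0_sq) prior on the mean, known variance sigma_sq -> posterior mean and variance."""
    n = len(ys)
    ybar = sum(ys) / n
    prior_precision = 1.0 / tau0_sq
    data_precision = n / sigma_sq
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * mu0 + data_precision * ybar)
    return post_mean, post_var

print(beta_binomial_update(2, 2, k=7, n=10))                    # (9, 5)
print(gamma_poisson_update(1, 1, counts=[3, 5, 4]))             # (13, 4)
print(normal_normal_update(0.0, 1.0, [2, 2, 2, 2], sigma_sq=4)) # (1.0, 0.5)
```

In the last call the prior and the data carry equal precision (1 each), so the posterior mean lands exactly halfway between the prior mean 0 and the sample mean 2.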
🧭

Three types of prior

  • Informative: carries real prior knowledge from earlier experiments or the literature; the risk is that any wrong belief gets baked into the result.
  • Weakly informative: rules out absurd values without dominating the posterior (e.g. N(0, 2.5) on the logit scale). rstanarm's default coefficient prior is N(0, 2.5) rescaled by the predictor's SD; brms defaults to flat priors on regression coefficients, so set one explicitly there.
  • Flat / improper: looks "non-informative", but is often very informative on a transformed scale. This is the classic beginner trap.
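One way to see what "rules out absurd values" means is to push a logit-scale prior through the inverse-logit and look at the implied prior on the probability itself. The sketch below is a simplification: it treats the whole linear predictor as a single N(0, sd) term, and the thresholds 0.01 / 0.99 and the helper name `extreme_mass` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)

def extreme_mass(sd, draws=200_000):
    """Share of implied prior mass on p < 0.01 or p > 0.99 when logit(p) ~ N(0, sd)."""
    p = expit(rng.normal(0.0, sd, size=draws))
    return float(np.mean((p < 0.01) | (p > 0.99)))

for sd in (2.5, 10.0, 100.0):
    print(f"logit-scale sd={sd:>5}: P(p<0.01 or p>0.99) ~ {extreme_mass(sd):.3f}")
```

Under these assumptions a "vague" N(0, 100) on the logit scale puts nearly all of its mass on probabilities that are essentially 0 or 1, while N(0, 2.5) keeps most of its mass on moderate values.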
🔮

Predictive distributions

Prior predictive p(ỹ) = ∫ p(ỹ|θ) p(θ) dθ: "before seeing any data, what data does the model expect to see?" This is the key diagnostic for checking whether a prior is plausible.

Posterior predictive p(ỹ|y) = ∫ p(ỹ|θ) p(θ|y) dθ: "given what we observed, what should new data look like?" This is the basis of posterior predictive checks (PPC, Step 12).
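Both integrals can be approximated by simulation: draw θ from the prior (or posterior), then draw ỹ given that θ. The sketch below does this for a Beta–Binomial model with assumed values (Beta(2, 2) prior, 7 successes in 10 trials, and a future batch of 10 trials).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed example values.
a, b = 2, 2          # Beta prior
k, n = 7, 10         # observed data
n_new = 10           # size of a future batch
draws = 100_000

# Prior predictive: theta ~ prior, then y_new ~ Binomial(n_new, theta).
theta_prior = rng.beta(a, b, size=draws)
y_prior_pred = rng.binomial(n_new, theta_prior)

# Posterior predictive: theta ~ Beta(a + k, b + n - k), then y_new | theta.
theta_post = rng.beta(a + k, b + n - k, size=draws)
y_post_pred = rng.binomial(n_new, theta_post)

print("prior predictive mean:    ", y_prior_pred.mean())
print("posterior predictive mean:", y_post_pred.mean())
print("posterior predictive pmf: ", np.bincount(y_post_pred, minlength=n_new + 1) / draws)
```

The predictive distributions are wider than a Binomial with a fixed θ, because they average over the remaining uncertainty in θ.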

🎯

Point estimates and decisions

  • MAP: the posterior mode. It depends on the parameterization and is not the recommended Bayesian point summary.
  • Posterior mean: minimizes expected squared loss.
  • Posterior median: minimizes expected absolute loss; more robust for skewed posteriors.
  • Decision theory: given a loss function L(θ, a), pick the action a* that minimizes E_{θ|y}[L(θ, a)].
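The link between loss functions and point summaries can be checked numerically. The sketch below assumes a Beta(9, 5) posterior (for example, a Beta(2, 2) prior updated with 7 successes in 10 trials), approximates the expected loss with posterior draws, and searches a grid of candidate actions. The asymmetric loss, where underestimating costs three times as much, is a made-up example.

```python
import numpy as np
from scipy import stats

post = stats.beta(9, 5)                  # assumed posterior
map_est = (9 - 1) / (9 + 5 - 2)          # mode of Beta(a, b), valid for a, b > 1
mean_est = post.mean()
median_est = post.median()

rng = np.random.default_rng(2)
theta = post.rvs(size=50_000, random_state=rng)   # posterior draws
actions = np.linspace(0.01, 0.99, 99)             # candidate point estimates

def best_action(loss):
    """Grid action with the smallest Monte Carlo estimate of E[L(theta, a) | y]."""
    expected = [np.mean(loss(theta, a)) for a in actions]
    return float(actions[int(np.argmin(expected))])

a_squared = best_action(lambda t, a: (t - a) ** 2)
a_absolute = best_action(lambda t, a: np.abs(t - a))
a_asym = best_action(lambda t, a: np.where(t > a, 3 * (t - a), a - t))

print(f"MAP={map_est:.3f}  mean={mean_est:.3f}  median={median_est:.3f}")
print(f"squared loss -> {a_squared:.2f}, absolute loss -> {a_absolute:.2f}, asymmetric -> {a_asym:.2f}")
```

Up to grid resolution, squared loss recovers the posterior mean and absolute loss the posterior median, while the asymmetric loss pushes the estimate upward toward a higher posterior quantile.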
⚠️
Flat ≠ non-informative. A Uniform(0, 1) prior on a probability p is equivalent to a Logistic(0, 1) prior on the log-odds, which is not flat. Likewise, a prior that is flat on σ, flat on log σ, or flat on σ² leads to a different posterior in each case. The phrase "non-informative prior" is a philosophical trap; modern Bayesian practice favors weakly informative priors (Gelman, 2008).
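The Uniform-to-Logistic claim is easy to verify by simulation: draw p uniformly, transform to log-odds, and compare against the standard logistic distribution. This is a sketch; the sample size and seed are arbitrary.

```python
import numpy as np
from scipy import stats
from scipy.special import logit

rng = np.random.default_rng(3)

p = rng.uniform(0.0, 1.0, size=100_000)   # flat prior on the probability scale
eta = logit(p)                            # same draws on the log-odds scale

# Kolmogorov-Smirnov distance to the standard Logistic(0, 1) distribution.
ks = stats.kstest(eta, "logistic")
print(f"KS distance to Logistic(0, 1): {ks.statistic:.4f}")

# The implied density on the log-odds scale is p * (1 - p), which is far from constant.
for x in (0.0, 2.0, 4.0):
    print(f"density at log-odds {x:.0f}: {stats.logistic.pdf(x):.4f}")
```

The density at log-odds 0 is 0.25 and falls to under 0.02 at log-odds 4, so "flat on p" is a fairly concentrated prior on the log-odds.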

Beta–Binomial posterior, updated live

[Interactive figure] Adjust the prior α, β, the observed number of successes k, and the number of trials n. Blue dashed line: prior; grey dotted line: scaled likelihood; red solid line: posterior. Raise n slowly and watch the posterior move from prior-dominated to data-dominated: this is what "data overwhelming the prior" looks like.

Axes: x = p ∈ [0, 1]; y = density.

II. Prior sensitivity and model misspecification

Prior sensitivity analysis: refit the same model under two or three defensible priors; if the key conclusion changes substantially with the prior, say so explicitly in the paper. The slogan "the posterior is what it is" only expresses Bayesian coherence. It does not make the posterior immune to misspecification: if the likelihood itself is wrong (e.g. a Gaussian model for counts), no prior can rescue the analysis.

When do priors matter most? (a) Very small samples (n < 30); (b) non-identifiable models; (c) inference on nonlinear functions of the parameters (a flat prior is rarely still flat after a change of scale).

🚫
Common mistake: putting Uniform(0, 1e6) on a variance parameter and calling it "non-informative". This is in fact very unfair to small variances, because almost all of the probability mass sits on very large values. Use a half-Normal or half-Cauchy prior on σ instead.
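The imbalance can be quantified directly. If σ² ~ Uniform(0, 1e6), then P(σ ≤ s) = s² / 1e6, so small standard deviations get almost no prior mass. The sketch below compares that with a half-Normal and a half-Cauchy on σ; the scales 1 and 2.5 are illustrative assumptions, not recommendations for any particular dataset.

```python
from scipy import stats

UPPER = 1e6   # upper bound of the Uniform prior on the variance

def uniform_variance_cdf(s, upper=UPPER):
    """P(sigma <= s) when sigma^2 ~ Uniform(0, upper)."""
    return min(s ** 2 / upper, 1.0)

half_normal = stats.halfnorm(scale=1.0)      # assumed scale
half_cauchy = stats.halfcauchy(scale=2.5)    # assumed scale

print("  s   Uniform(0,1e6) on var   half-Normal(0,1)   half-Cauchy(0,2.5)")
for s in (1, 10, 100):
    print(f"{s:>3}   {uniform_variance_cdf(s):>20.6f}   "
          f"{half_normal.cdf(s):>16.4f}   {half_cauchy.cdf(s):>18.4f}")
```

Under the Uniform prior only 1% of the mass lies below σ = 100, whereas both half-distributions place substantial mass on small σ while still allowing large values (the half-Cauchy especially, through its heavy tail).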

III. A decision tree for choosing priors

🌳 Prior selection workflow

Q1: Do you have prior experiments or literature giving a plausible range for θ? → Yes → informative prior (e.g. Beta(20, 80) means "about 100 prior cases, about 20% positive"); No → Q2.
Q2: Is the model a GLM (logistic / Poisson)? → Yes → weakly informative priors on the coefficients: N(0, 2.5) on the logit scale; N(0, 2) on the log-rate scale.
Q3: Is the parameter a variance or standard deviation? → Yes → half-Normal or half-Cauchy on σ (recommended by Gelman); do not use Uniform(0, large number).
Q4: Is the sample small (n < 30)? → Yes → always run a prior sensitivity check and report it.
Q5: Is the quantity you care about a nonlinear function of θ (e.g. hazard ratio, ROC AUC)? → Yes → run a prior predictive simulation and confirm that the distribution of the derived quantity also looks reasonable.
Scenario | Prior | Notes
A/B test success rate (Bayesian A/B) | Beta(1, 1) or Beta(2, 2) | with small samples, prior strength can be read as pseudo-counts
CRISPR screen log fold change | N(0, 2) or Cauchy(0, 1) | weakly informative, allows large effects
RNA-seq NB dispersion | log-Normal(−3, 1) | DESeq2-style trended prior
Hierarchical variance component σ | half-Normal(0, 1) or half-Cauchy(0, 2.5) | avoid Uniform(0, large)
Bayesian GWAS (BGLR) | spike-and-slab, BayesB | a few large effects, many null effects

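Implementation: closed form + probabilistic programming

# --- R --- Beta–Binomial closed form
a <- 2; b <- 2; k <- 7; n <- 10
# Posterior = Beta(a + k, b + n - k)
post_a <- a + k; post_b <- b + n - k
# Draw the taller posterior first so the y-axis range fits both curves
curve(dbeta(x, post_a, post_b), from = 0, to = 1, col = "firebrick", lwd = 2,
      xlab = "p", ylab = "density")
curve(dbeta(x, a, b), add = TRUE, col = "steelblue", lty = 2)   # prior
qbeta(c(.025, .975), post_a, post_b)   # 95% equal-tailed credible interval

# --- General case: rstanarm logistic regression + weakly informative priors
# Assumes `df` is a data frame with a binary outcome `y` and a predictor `x`
library(rstanarm)
fit <- stan_glm(y ~ x, data = df, family = binomial(),
                prior = normal(0, 2.5),
                prior_intercept = normal(0, 5),
                chains = 4, iter = 2000)
prior_summary(fit)                            # confirm the priors actually used
posterior_interval(fit, prob = .95)           # 95% credible intervals
yrep <- posterior_predict(fit, draws = 500)   # posterior predictive draws (500 x nrow(df))

# --- Prior sensitivity: compare two priors
fit2 <- update(fit, prior = normal(0, 10))
cbind(weak = coef(fit), wide = coef(fit2))    # posterior medians under each prior
posterior_interval(fit2, prob = .95)          # compare with the intervals from `fit`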
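# --- Python ---
import numpy as np, scipy.stats as st
import pymc as pm, arviz as az

# 1) Beta-Binomial closed form
a, b, k, n = 2, 2, 7, 10
post = st.beta(a + k, b + n - k)
print(post.mean(), post.interval(0.95))   # posterior mean, 95% equal-tailed interval

# 2) PyMC: prior predictive -> sampling -> posterior predictive
with pm.Model() as m:
    p   = pm.Beta("p", alpha=2, beta=2)
    obs = pm.Binomial("y", n=n, p=p, observed=k)
    prior_pred = pm.sample_prior_predictive(1000)
    idata      = pm.sample(2000, tune=1000, chains=4, target_accept=0.9)
    post_pred  = pm.sample_posterior_predictive(idata)

# Merge the prior and predictive groups so the plots below can find them
idata.extend(prior_pred)
idata.extend(post_pred)

print(az.summary(idata, var_names=["p"], hdi_prob=0.95))
az.plot_posterior(idata, var_names=["p"])
az.plot_dist_comparison(idata, var_names=["p"])   # prior vs posterior (needs the prior group)

# 3) Prior sensitivity: refit with a flat Beta(1, 1) prior
with pm.Model() as m_flat:
    p2 = pm.Beta("p", alpha=1, beta=1)
    pm.Binomial("y", n=n, p=p2, observed=k)
    idata2 = pm.sample(2000, tune=1000, chains=4, target_accept=0.9)

# Compare posterior summaries of p directly. az.compare ranks models by
# predictive fit (LOO/WAIC) and needs pointwise log-likelihoods, so it is
# not the right tool for a prior-sensitivity check.
print(az.summary(idata2, var_names=["p"], hdi_prob=0.95))
az.plot_forest([idata, idata2], model_names=["prior_Beta22", "prior_Beta11"],
               var_names=["p"], combined=True, hdi_prob=0.95)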

📝 Self-check

1. Why is a uniform prior on a probability p not the same as a uniform prior on its log-odds?

A. They are actually the same, just written differently
B. A change of variables introduces a Jacobian, so a density that is flat on one scale is not flat on the other
C. The log-odds has no probability distribution
D. Because the logistic function is odd

2. What is the conjugate prior for the Poisson rate λ, and what is the resulting posterior?

A. Beta prior → Beta posterior
B. Normal prior → Normal posterior
C. Gamma(a, b) prior → Gamma(a + Σyᵢ, b + n) posterior
D. Cauchy prior → Cauchy posterior

3. When does prior choice matter most?

A. With small samples, non-identifiable models, or inference on nonlinear functions of θ
B. When n > 10⁶
C. It is equally important in every regime
D. It never matters: the data always overwhelm the prior