Step 10: Bayes' Theorem, Priors, Posteriors — Statistical Inference Tutorial

Overview

Why go Bayesian?

Frequentists treat θ as a fixed unknown and reason about the long-run behavior of estimators under repeated experiments. Bayesians treat θ as a random quantity with a distribution and use the posterior p(θ|y) to state directly: "given the data, the probability that θ falls in this range is …". That is what most experimenters actually want to know, and it is not how a frequentist CI or p-value can be interpreted.

The whole machinery is one equation: p(θ|y) = p(y|θ)·p(θ) / p(y), or in proportional form, posterior ∝ likelihood × prior. Here p(y) = ∫ p(y|θ) p(θ) dθ is a normalizing constant that does not depend on θ. Three pieces govern everything: (1) the prior p(θ) encodes what you know about θ before seeing the data; (2) the likelihood p(y|θ) is the probability model for the data; (3) the posterior combines the two into a statement of uncertainty about θ, which can then be propagated into predictive distributions.
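The proportional form can be checked numerically with a grid approximation. The sketch below is illustrative only: it assumes a Binomial likelihood with a Beta prior, and the function name `grid_posterior` and the grid size are choices made here, not part of the tutorial.

```python
import math

def grid_posterior(k, n, alpha, beta, grid_size=1001):
    """Approximate p(theta | y) on a grid: posterior ∝ likelihood × prior."""
    # Midpoints of equal-width cells; this avoids log(0) at theta = 0 or 1.
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_unnorm = []
    for t in thetas:
        log_prior = (alpha - 1) * math.log(t) + (beta - 1) * math.log(1 - t)
        log_lik = k * math.log(t) + (n - k) * math.log(1 - t)
        log_unnorm.append(log_prior + log_lik)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_unnorm)
    weights = [math.exp(v - m) for v in log_unnorm]
    total = sum(weights)  # plays the role of p(y), up to a constant factor
    probs = [w / total for w in weights]
    return thetas, probs

thetas, probs = grid_posterior(k=7, n=10, alpha=2, beta=2)
grid_mean = sum(t * p for t, p in zip(thetas, probs))
exact_mean = (2 + 7) / (2 + 2 + 10)  # mean of the exact posterior Beta(9, 5)
print(grid_mean, exact_mean)
```

Dividing by `total` is the only place p(y) enters, which is why the proportional form is usually all that is needed.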

💡

Core principle: the posterior is a compromise between prior and likelihood. When n is small the prior dominates; when n is large the likelihood (the data) dominates. Prior choice therefore matters most in the small-sample regime, which is the norm for early-stage biomedical experiments.
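For the Beta–Binomial case the compromise can be written out exactly: the posterior mean is a weighted average of the prior mean and the observed proportion, with prior weight (α+β)/(α+β+n). The sketch below assumes a Beta(2, 2) prior and a fixed 70% success rate; the helper name `posterior_mean_weights` is made up for this illustration.

```python
def posterior_mean_weights(alpha, beta, k, n):
    """Return prior mean, observed proportion, prior weight, posterior mean."""
    prior_mean = alpha / (alpha + beta)
    observed = k / n
    w_prior = (alpha + beta) / (alpha + beta + n)  # weight given to the prior
    post_mean = (alpha + k) / (alpha + beta + n)   # mean of Beta(alpha+k, beta+n-k)
    return prior_mean, observed, w_prior, post_mean

results = {}
for n in (10, 100, 1000):
    k = 7 * n // 10  # keep the observed proportion at 0.7
    prior_mean, observed, w, post_mean = posterior_mean_weights(2, 2, k, n)
    # The posterior mean equals the weighted average of the two.
    assert abs(post_mean - (w * prior_mean + (1 - w) * observed)) < 1e-12
    results[n] = (w, post_mean)
    print(f"n={n:5d}  prior weight={w:.4f}  posterior mean={post_mean:.4f}")
```

As n grows the prior weight shrinks toward zero and the posterior mean approaches the observed proportion.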

Core concepts

I. Four building blocks of Bayesian inference

Start with conjugate priors: they have closed-form posteriors, so you can see "prior + data = posterior" at work without MCMC. Three conjugate pairings cover most exploratory analyses.

🎲

Three conjugate pairings

Beta–Binomial: prior Beta(α, β); observe k successes in n trials; posterior Beta(α+k, β+n−k).
Gamma–Poisson: prior Gamma(a, b) with b a rate parameter; observe n counts y₁, …, yₙ; posterior Gamma(a+Σyᵢ, b+n).
Normal–Normal (σ² known): prior N(μ₀, τ₀²); posterior N(μₙ, τₙ²), where 1/τₙ² = 1/τ₀² + n/σ² and μₙ = τₙ²·(μ₀/τ₀² + n·ȳ/σ²) is the precision-weighted average of the prior mean and the sample mean ȳ.
Why it matters: conjugacy lets us read the prior parameters as "pseudo-observations" (for example, α prior successes and β prior failures), which makes prior strength interpretable.
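The three updates are short enough to write as plain functions. This is a minimal sketch: the function names are invented here, the Gamma prior is assumed to use the rate parameterization, and the Normal case assumes σ² is known.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Beta(alpha, beta) prior, k successes in n trials."""
    return alpha + k, beta + n - k

def gamma_poisson_update(a, b, counts):
    """Gamma(a, b) prior with rate b, one Poisson count per observation."""
    return a + sum(counts), b + len(counts)

def normal_normal_update(mu0, tau0_sq, ys, sigma_sq):
    """N(mu0, tau0_sq) prior on the mean, known data variance sigma_sq."""
    n = len(ys)
    ybar = sum(ys) / n
    prior_prec = 1.0 / tau0_sq  # precision = 1 / variance
    data_prec = n / sigma_sq
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * mu0 + data_prec * ybar)
    return post_mean, post_var

print(beta_binomial_update(2, 2, k=7, n=10))
print(gamma_poisson_update(2, 1, counts=[3, 5, 4]))
print(normal_normal_update(0.0, 1.0, ys=[1.0, 2.0, 3.0], sigma_sq=3.0))
```

In the Normal example the prior precision and the data precision are both 1, so the posterior mean lands halfway between the prior mean 0 and the sample mean 2.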

🧭

Three types of prior

Informative: carries real information from earlier experiments or the literature; the risk is that wrong beliefs get baked into the result.
Weakly informative: rules out absurd values without dominating the posterior (e.g. N(0, 2.5) on the logit scale, similar to the default coefficient prior in rstanarm; brms leaves regression coefficients flat unless you set a prior).
Flat / improper: looks "non-informative" but is often highly informative on a transformed scale; this is the classic beginner trap.

🔮

Predictive distributions

Prior predictive p(ỹ) = ∫ p(ỹ|θ) p(θ) dθ: "before seeing any data, what data does the model expect?" This is the key diagnostic for checking whether a prior is plausible.

Posterior predictive p(ỹ|y) = ∫ p(ỹ|θ) p(θ|y) dθ: "given what we observed, what should new data look like?" It is the basis of posterior predictive checks (PPC, Step 12).
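Both integrals can be approximated by simulation: draw θ from the prior (or posterior), then draw ỹ given that θ. The sketch below assumes the Beta–Binomial setting with a Beta(2, 2) prior and 7 successes in 10 trials; the helper names and the number of draws are arbitrary choices for illustration.

```python
import random

def binomial_draw(rng, n, p):
    """One Binomial(n, p) draw as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))

def predictive_draws(rng, alpha, beta, n_new, draws=20000):
    """Draw p ~ Beta(alpha, beta), then y_new ~ Binomial(n_new, p)."""
    return [binomial_draw(rng, n_new, rng.betavariate(alpha, beta))
            for _ in range(draws)]

rng = random.Random(0)
prior_pred = predictive_draws(rng, 2, 2, n_new=10)         # before seeing data
post_pred = predictive_draws(rng, 2 + 7, 2 + 3, n_new=10)  # after 7 of 10 successes

prior_pred_mean = sum(prior_pred) / len(prior_pred)
post_pred_mean = sum(post_pred) / len(post_pred)
print(prior_pred_mean, post_pred_mean)
```

The simulated means should sit close to the exact values n·α/(α+β), which are 5 for the prior predictive and 10·9/14 for the posterior predictive.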

🎯

Point estimates and decisions

MAP: the posterior mode. It changes under reparameterization and is not the recommended Bayesian point summary.
Posterior mean: minimizes expected squared loss.
Posterior median: minimizes expected absolute loss; more robust for skewed posteriors.
Decision theory: given a loss function L(θ, a), choose the action a* that minimizes the posterior expected loss E_{θ|y}[L(θ, a)].
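The link between loss functions and point summaries can be checked by brute force: estimate the expected loss from posterior draws and minimize over a grid of candidate actions. The sketch below assumes a Beta(9, 5) posterior as a running example; the grid resolution and the helper names are illustrative choices.

```python
import random
import statistics

rng = random.Random(1)
draws = [rng.betavariate(9, 5) for _ in range(5000)]  # posterior draws of theta

def squared_loss(theta, action):
    return (theta - action) ** 2

def absolute_loss(theta, action):
    return abs(theta - action)

def expected_loss(action, loss):
    """Monte Carlo estimate of E[L(theta, action)] under the posterior."""
    return sum(loss(t, action) for t in draws) / len(draws)

actions = [i / 200 for i in range(201)]  # candidate actions on [0, 1], step 0.005
best_sq = min(actions, key=lambda a: expected_loss(a, squared_loss))
best_abs = min(actions, key=lambda a: expected_loss(a, absolute_loss))

print("squared loss  ->", best_sq, " sample mean  ", statistics.fmean(draws))
print("absolute loss ->", best_abs, " sample median", statistics.median(draws))
```

Up to the grid resolution, the squared-loss minimizer matches the mean of the draws and the absolute-loss minimizer matches their median.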

⚠️

Flat ≠ non-informative. A Uniform(0, 1) prior on a probability p is equivalent to a Logistic(0, 1) prior on the log-odds, which is not flat. Likewise, a flat prior on σ, on log σ, and on σ² give three different posteriors. "Non-informative prior" is a philosophical trap; modern Bayesian practice favors weakly informative priors (Gelman, 2008).
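The Uniform-to-Logistic claim is easy to verify by simulation, and the reverse direction is just as instructive. This sketch draws p uniformly, transforms to the log-odds, and compares against two known properties of Logistic(0, 1); the interval (−10, 10) used for the "flat on log-odds" case is an arbitrary assumption for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)

def logit(p):
    return math.log(p / (1.0 - p))

# p ~ Uniform(0, 1); the guard drops an exact 0.0, which logit cannot handle.
p_draws = [u for u in (rng.random() for _ in range(100_000)) if 0.0 < u < 1.0]
z_draws = [logit(p) for p in p_draws]

sample_var = statistics.pvariance(z_draws)
logistic_var = math.pi ** 2 / 3  # variance of Logistic(0, 1)
frac_inside = sum(-1.0 < z < 1.0 for z in z_draws) / len(z_draws)
logistic_mass = 1 / (1 + math.exp(-1)) - 1 / (1 + math.exp(1))  # P(-1 < Z < 1)
print(sample_var, logistic_var, frac_inside, logistic_mass)

# Reverse direction: a flat prior on the log-odds over (-10, 10) puts only this
# much mass on 0.01 < p < 0.99; the rest piles up near p = 0 and p = 1.
mass_mid = 2 * logit(0.99) / 20.0
print(mass_mid)
```

So a prior that is flat on the log-odds scale is strongly informative about p: under the assumed range, more than half of its mass lies on probabilities below 1% or above 99%.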

Interactive simulation

Live Beta–Binomial posterior update

Adjust the prior parameters α, β, the observed number of successes k, and the number of trials n. The blue dashed line is the prior, the grey dotted line is the scaled likelihood, and the red solid line is the posterior. Raise n slowly and watch the posterior move from prior-dominated to data-dominated: this is what "the data overwhelm the prior" looks like.

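[Interactive plot. Default settings: prior α = 2, prior β = 2, successes k = 7, trials n = 10. X axis: p ∈ [0, 1]; Y axis: density. Blue dashed = prior, grey dotted = likelihood, red solid = posterior.]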

Pitfalls and diagnostics

II. Prior sensitivity and model misspecification

Prior sensitivity analysis: fit the same model under two or three defensible priors. If the key conclusion changes substantially with the prior, say so explicitly in the paper. The slogan "the posterior is the posterior" expresses Bayesian coherence only; it does not make the analysis immune to misspecification. If the likelihood itself is wrong (e.g. a Gaussian model for counts), no prior can rescue it.

When do priors matter most? (a) Very small samples (as a rule of thumb, n < 30); (b) non-identifiable models; (c) inference on nonlinear functions of the parameters (a prior that is flat on one scale is no longer flat after the transform).
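With a conjugate model a sensitivity check takes only a few lines. The sketch below is a toy version under stated assumptions: the three Beta priors are hypothetical "defensible" choices, the data are 70% successes at two sample sizes, and P(p > 0.5) is estimated by Monte Carlo.

```python
import random

def posterior_summary(alpha, beta, k, n, rng, draws=20000):
    """Exact posterior mean and Monte Carlo estimate of P(p > 0.5)."""
    a_post, b_post = alpha + k, beta + n - k
    post_mean = a_post / (a_post + b_post)
    samples = (rng.betavariate(a_post, b_post) for _ in range(draws))
    prob_above_half = sum(s > 0.5 for s in samples) / draws
    return post_mean, prob_above_half

# Hypothetical candidate priors, chosen only for illustration.
priors = {"Beta(1, 1)": (1, 1), "Beta(2, 2)": (2, 2), "Beta(5, 5)": (5, 5)}
rng = random.Random(0)

def mean_spread(k, n):
    """Print one summary line per prior; return the range of posterior means."""
    means = []
    for name, (alpha, beta) in priors.items():
        post_mean, prob = posterior_summary(alpha, beta, k, n, rng)
        print(f"n={n:5d}  {name:11s} mean={post_mean:.3f}  P(p>0.5)={prob:.3f}")
        means.append(post_mean)
    return max(means) - min(means)

spread_small = mean_spread(k=7, n=10)
spread_large = mean_spread(k=700, n=1000)
print(spread_small, spread_large)
```

At n = 10 the posterior means differ by several percentage points across priors; at n = 1000 the difference is negligible, which is the pattern a sensitivity report should make visible.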

🚫

Common mistake: putting Uniform(0, 1e6) on a variance parameter and calling it "non-informative". This is in fact very unfair to small variances, because almost all of the prior mass sits on very large values. Use a half-Normal or half-Cauchy prior on σ instead.
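The imbalance can be quantified with closed-form CDFs. The sketch below asks how much prior mass each choice puts on "σ below 10" (variance below 100); the threshold and the scales 1 and 2.5 are assumptions that only make sense if the data are on roughly unit scale.

```python
import math

def uniform_cdf(x, upper):
    """CDF of Uniform(0, upper)."""
    return min(max(x / upper, 0.0), 1.0)

def half_normal_cdf(x, scale):
    """CDF of a half-Normal with the given scale, for x >= 0."""
    return math.erf(x / (scale * math.sqrt(2.0)))

def half_cauchy_cdf(x, scale):
    """CDF of a half-Cauchy with the given scale, for x >= 0."""
    return (2.0 / math.pi) * math.atan(x / scale)

mass_uniform_var = uniform_cdf(100.0, 1e6)    # Uniform(0, 1e6) on the variance
mass_half_normal = half_normal_cdf(10.0, 1.0)  # half-Normal(0, 1) on sigma
mass_half_cauchy = half_cauchy_cdf(10.0, 2.5)  # half-Cauchy(0, 2.5) on sigma

print(mass_uniform_var, mass_half_normal, mass_half_cauchy)
```

Under the uniform prior only 0.01% of the mass falls on variances below 100, while the half-Cauchy keeps most of its mass there and still leaves a heavy tail for large σ.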

Decision guide

III. Prior-selection decision tree

🌳 Prior-selection flow

Q1: Do you have earlier experiments or literature giving a plausible range for θ? → Yes → informative prior (e.g. Beta(20, 80) says "about 100 prior cases, about 20% positive"); No → Q2.

Q2: Is the model a GLM (logistic / Poisson)? → Yes → weakly informative priors on the coefficients: N(0, 2.5) on the logit scale; N(0, 2) on the log-rate scale.

Q3: Is the parameter a variance or standard deviation? → Yes → half-Normal or half-Cauchy on σ (recommended by Gelman); do not use Uniform(0, large number).

Q4: Is the sample small (n < 30)? → Yes → always run a prior sensitivity check and report it.

Q5: Is the quantity you care about a nonlinear function of θ (e.g. hazard ratio, ROC AUC)? → Yes → run a prior predictive simulation and confirm that the derived quantity also has a reasonable distribution.
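The Beta(20, 80) example in Q1 follows from a simple conversion: a prior mean m backed by roughly N earlier cases maps to α = m·N and β = (1 − m)·N. The helper below is a hypothetical convenience function written for this illustration, not part of any library.

```python
import math

def beta_from_mean_and_size(mean, pseudo_n):
    """Map a prior mean and an effective prior sample size to Beta(alpha, beta)."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if pseudo_n <= 0:
        raise ValueError("pseudo_n must be positive")
    return mean * pseudo_n, (1.0 - mean) * pseudo_n

alpha, beta = beta_from_mean_and_size(0.20, 100)  # about 100 cases, 20% positive
total = alpha + beta
prior_sd = math.sqrt(alpha * beta / (total ** 2 * (total + 1)))  # Beta standard deviation
print(alpha, beta, prior_sd)
```

The prior standard deviation of about 0.04 shows how strong such a prior is: it takes a comparable number of new observations to move it much.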

Scenario	Prior	Notes
A/B test success rate (Bayesian A/B)	Beta(1, 1) or Beta(2, 2)	With small samples, prior strength can be read as pseudo-counts
CRISPR screen log fold change	N(0, 2) or Cauchy(0, 1)	Weakly informative, allows large effects
RNA-seq NB dispersion	log-Normal(−3, 1)	DESeq2-style trended prior
Hierarchical variance component σ	half-Normal(0, 1) or half-Cauchy(0, 2.5)	Avoid Uniform(0, large number)
GWAS Bayesian (BGLR)	spike-and-slab, BayesB	A few large effects, many null effects

Code

Implementation: closed form + probabilistic programming

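# --- R --- Beta–Binomial closed form
a <- 2; b <- 2; k <- 7; n <- 10
# Posterior = Beta(a + k, b + n - k)
post_a <- a + k; post_b <- b + n - k
# Draw the taller posterior first so the y-axis range fits both curves
curve(dbeta(x, post_a, post_b), from=0, to=1, col="firebrick", lwd=2, ylab="density")
curve(dbeta(x, a, b), add=TRUE, col="steelblue", lty=2)
qbeta(c(.025, .975), post_a, post_b)   # 95% equal-tailed credible interval

# --- General case: rstanarm logistic regression with weakly informative priors
library(rstanarm)
# Simulated placeholder data (an assumption for illustration); replace with your own df
set.seed(1)
df <- data.frame(x = rnorm(50))
df$y <- rbinom(50, 1, plogis(-0.5 + 1.0 * df$x))
fit <- stan_glm(y ~ x, data=df, family=binomial(),
                prior = normal(0, 2.5),
                prior_intercept = normal(0, 5),
                chains=4, iter=2000)
prior_summary(fit)                  # confirm the priors actually used
posterior_interval(fit, prob=.95)   # 95% credible intervals
yrep <- posterior_predict(fit, draws=500)    # posterior predictive draws (500 x n matrix)

# --- Prior sensitivity: compare two priors
fit2 <- update(fit, prior = normal(0, 10))
cbind(coef(fit), coef(fit2))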

# --- Python ---
import numpy as np, scipy.stats as st
import pymc as pm, arviz as az

# 1) Beta-Binomial closed form
a, b, k, n = 2, 2, 7, 10
post = st.beta(a + k, b + n - k)
print(post.mean(), post.interval(0.95))

# 2) PyMC: prior predictive → sampling → posterior predictive
with pm.Model() as m:
    p   = pm.Beta("p", alpha=a, beta=b)
    obs = pm.Binomial("y", n=n, p=p, observed=k)
    prior_pred = pm.sample_prior_predictive(1000)
    idata      = pm.sample(2000, tune=1000, chains=4, target_accept=0.9)
    post_pred  = pm.sample_posterior_predictive(idata)

idata.extend(prior_pred)   # plot_dist_comparison needs the prior group
az.summary(idata, hdi_prob=0.95)
az.plot_posterior(idata, var_names=["p"])
az.plot_dist_comparison(idata, var_names=["p"])   # prior vs posterior

# 3) Prior sensitivity: same data, flat Beta(1, 1) prior
with pm.Model() as m_flat:
    p2 = pm.Beta("p", alpha=1, beta=1)
    pm.Binomial("y", n=n, p=p2, observed=k)
    idata2 = pm.sample(2000, tune=1000, chains=4)
# Compare posterior summaries side by side
# (az.compare ranks models by predictive fit; it is not a prior sensitivity check)
print(az.summary(idata,  var_names=["p"], hdi_prob=0.95))
print(az.summary(idata2, var_names=["p"], hdi_prob=0.95))

📝 Self-check

1. Why is a uniform prior on a probability p not equivalent to a uniform prior on its log-odds?

A. They are actually the same, just written differently

B. A change of variables introduces a Jacobian, so a density that is flat on one scale is not flat on the other

C. The log-odds has no probability distribution

D. Because the logistic function is odd

2. What is the conjugate prior for the Poisson rate λ, and what is the resulting posterior?

A. Beta prior → Beta posterior

B. Normal prior → Normal posterior

C. Gamma(a, b) prior → Gamma(a + Σyᵢ, b + n) posterior

D. Cauchy prior → Cauchy posterior

3. When does prior choice matter most?

A. With small samples, non-identifiable models, or inference on nonlinear functions of θ

B. When n > 10⁶

C. It is equally important in every regime

D. It never matters; the data always overwhelm the prior