Why go Bayesian?
Frequentists treat θ as a fixed unknown and reason about the long-run behavior of estimators under repeated experiments. Bayesians treat θ as a random quantity with a distribution and use the posterior p(θ|y) to state directly: "given the data, the probability that θ falls in this range is …". That is what most experimenters actually want to know, and it is not a valid reading of a frequentist CI or p-value.
The whole machinery is one equation: p(θ|y) = p(y|θ)·p(θ) / p(y), or in proportional form posterior ∝ likelihood × prior. Three pieces govern everything: (1) the prior p(θ) encodes what you know about θ before seeing data; (2) the likelihood p(y|θ) is the probability model for the data; (3) the posterior combines the two into a statement of uncertainty about θ, which then propagates into predictive distributions.
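The proportional form can be checked numerically with a grid approximation. The sketch below is illustrative only: it assumes a binomial likelihood with k = 7 successes in n = 10 trials and a Beta(2, 2) prior (the numbers used in this page's implementation section), and the helper names `grid_posterior` and `beta22_pdf` are hypothetical.

```python
import math


def grid_posterior(k, n, prior_pdf, grid_size=1001):
    """Approximate p(theta | y) on an evenly spaced grid for a binomial likelihood."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalized posterior: likelihood x prior at each grid point.
    unnorm = [
        math.comb(n, k) * t**k * (1 - t) ** (n - k) * prior_pdf(t) for t in thetas
    ]
    # The sum plays the role of p(y), up to the constant grid spacing.
    evidence = sum(unnorm)
    weights = [u / evidence for u in unnorm]
    return thetas, weights


def beta22_pdf(t):
    """Density of the assumed Beta(2, 2) prior: 6 * t * (1 - t)."""
    return 6.0 * t * (1.0 - t)


thetas, weights = grid_posterior(k=7, n=10, prior_pdf=beta22_pdf)
post_mean = sum(t * w for t, w in zip(thetas, weights))
print(round(post_mean, 3))  # closed form: (2 + 7) / (2 + 2 + 10) = 0.643
```

Dividing by the sum is the only place p(y) enters, which is why the proportional form is usually enough in practice.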
I. Four building blocks of Bayesian inference
Start with conjugate priors: they have closed-form posteriors, so you can see "prior + data = posterior" at work without MCMC. Three conjugate pairings cover most exploratory analyses.
The three main conjugate pairings
- Beta–Binomial: prior Beta(α, β); observe k successes in n trials; posterior Beta(α+k, β+n−k).
- Gamma–Poisson: prior Gamma(a, b) in the shape-rate parameterization; observe n counts with total Σyᵢ; posterior Gamma(a+Σyᵢ, b+n).
- Normal–Normal (σ² known): prior N(μ₀, τ₀²); posterior N(μₙ, τₙ²), where μₙ is the precision-weighted average of the prior mean and the sample mean.
- Why it matters: conjugacy lets us read α and β as counts of "pseudo-observations", which gives prior strength a direct interpretation.
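The three updates above reduce to a few lines of arithmetic. This is a minimal sketch with hypothetical function names; it assumes the shape-rate form of the Gamma distribution and uses made-up data values purely for illustration.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Beta(alpha, beta) prior + k successes in n trials -> Beta posterior."""
    return alpha + k, beta + n - k


def gamma_poisson_update(a, b, counts):
    """Gamma(shape a, rate b) prior + Poisson counts -> Gamma posterior."""
    return a + sum(counts), b + len(counts)


def normal_normal_update(mu0, tau0_sq, ys, sigma_sq):
    """N(mu0, tau0_sq) prior on the mean, known data variance sigma_sq."""
    n = len(ys)
    ybar = sum(ys) / n
    prior_precision = 1.0 / tau0_sq
    data_precision = n / sigma_sq
    post_var = 1.0 / (prior_precision + data_precision)
    # Posterior mean is the precision-weighted average of mu0 and ybar.
    post_mean = post_var * (prior_precision * mu0 + data_precision * ybar)
    return post_mean, post_var


print(beta_binomial_update(2, 2, k=7, n=10))                          # (9, 5)
print(gamma_poisson_update(1.0, 1.0, [3, 5, 4]))                      # (13.0, 4.0)
print(normal_normal_update(0.0, 1.0, [1.0, 2.0, 3.0], sigma_sq=1.0))  # (1.5, 0.25)
```

In the Normal–Normal case the prior N(0, 1) acts like one extra observation at 0, which pulls the sample mean of 2.0 down to 1.5.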
Three types of prior
- Informative: carries real information from earlier experiments or the literature. Risk: any wrong belief is baked into the result.
- Weakly informative: rules out absurd values without dominating the posterior (e.g. N(0, 2.5) on the logit scale). This is the scale rstanarm uses by default for coefficients; brms leaves regression coefficients flat unless you set a prior yourself.
- Flat / improper: looks "non-informative", but is often highly informative on a transformed scale. This is the classic beginner trap.
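The trap in the last bullet can be seen by simulation. The sketch below draws p from a flat Uniform(0, 1) prior, moves the draws to the log-odds scale, and compares their empirical CDF with the standard logistic CDF. The seed, sample size, and variable names are arbitrary choices made for illustration.

```python
import math
import random

random.seed(0)

# Draws from a "flat" prior on the probability scale.
draws = [random.random() for _ in range(100_000)]
# Same draws on the log-odds scale (skip an exact 0, where the logit is undefined).
log_odds = [math.log(p / (1 - p)) for p in draws if 0.0 < p < 1.0]


def logistic_cdf(x):
    """CDF of the standard Logistic(0, 1) distribution."""
    return 1.0 / (1.0 + math.exp(-x))


empirical_cdf = {}
for x in (-2.0, 0.0, 2.0):
    empirical_cdf[x] = sum(z <= x for z in log_odds) / len(log_odds)
    print(x, round(empirical_cdf[x], 3), round(logistic_cdf(x), 3))

# Share of prior mass with |log-odds| <= 3: far from uniform over the real line.
share_within_3 = sum(abs(z) <= 3 for z in log_odds) / len(log_odds)
print(round(share_within_3, 2))
```

The two CDF columns should agree closely, and roughly nine tenths of the mass lands within three log-odds units of zero, so the prior that is flat on p is strongly peaked on the log-odds scale.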
Predictive distributions
Prior predictive p(ỹ) = ∫ p(ỹ|θ) p(θ) dθ: "before seeing any data, what data does the model expect?" This is the key diagnostic for whether a prior is plausible.
Posterior predictive p(ỹ|y) = ∫ p(ỹ|θ) p(θ|y) dθ: "given what we observed, how should new data be distributed?" This is the basis of posterior predictive checks (PPC, Step 12).
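Both integrals can be approximated by the same two-step simulation: draw θ from the prior or the posterior, then draw new data given that θ. The sketch below assumes a Beta(2, 2) prior, observed data k = 7 out of n = 10, and a future batch of 10 trials; the helper names are hypothetical.

```python
import random

random.seed(1)


def binomial_draw(n, p):
    """One Binomial(n, p) draw as a sum of Bernoulli trials."""
    return sum(random.random() < p for _ in range(n))


def predictive_draws(alpha, beta, n_new, size):
    """Draw theta ~ Beta(alpha, beta), then y_new ~ Binomial(n_new, theta)."""
    return [
        binomial_draw(n_new, random.betavariate(alpha, beta)) for _ in range(size)
    ]


# Prior predictive: theta comes from the Beta(2, 2) prior.
prior_pred = predictive_draws(2, 2, n_new=10, size=20_000)
# Posterior predictive: theta comes from the Beta(2 + 7, 2 + 3) posterior.
post_pred = predictive_draws(9, 5, n_new=10, size=20_000)

prior_pred_mean = sum(prior_pred) / len(prior_pred)
post_pred_mean = sum(post_pred) / len(post_pred)
print(round(prior_pred_mean, 2), round(post_pred_mean, 2))
```

The prior predictive mean should sit near 10 × 0.5 = 5 and the posterior predictive mean near 10 × 9/14 ≈ 6.4. A histogram of `prior_pred` is a quick way to judge whether the prior implies plausible data.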
Point estimates and decisions
- MAP: the posterior mode. It depends on the parameterization and is not the recommended Bayesian point summary.
- Posterior mean: minimizes expected squared loss.
- Posterior median: minimizes expected absolute loss; more robust for skewed posteriors.
- Decision theory: given a loss function L(θ, a), choose the action a* that minimizes the posterior expected loss E_{θ|y}[L(θ, a)].
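As a sketch of how these summaries differ, the code below takes a Beta(9, 5) posterior (an assumption carried over from the Beta(2, 2) prior with k = 7, n = 10) and computes the mean, mode, and median, then picks the action that minimizes a hypothetical asymmetric loss in which underestimating θ costs three times as much as overestimating it.

```python
import random
import statistics

random.seed(2)

alpha, beta = 9, 5  # assumed posterior Beta(9, 5)
draws = [random.betavariate(alpha, beta) for _ in range(10_000)]

post_mean = alpha / (alpha + beta)            # optimal under squared loss
post_mode = (alpha - 1) / (alpha + beta - 2)  # MAP, valid for alpha, beta > 1
post_median = statistics.median(draws)        # optimal under absolute loss


def loss(theta, a):
    """Hypothetical asymmetric loss: underestimates cost 3x overestimates."""
    return 3.0 * (theta - a) if a < theta else (a - theta)


def expected_loss(a):
    """Monte Carlo estimate of E[L(theta, a)] under the posterior."""
    return sum(loss(t, a) for t in draws) / len(draws)


candidates = [i / 100 for i in range(101)]
best_action = min(candidates, key=expected_loss)
print(round(post_mean, 3), round(post_mode, 3), round(post_median, 3), best_action)
```

With this loss the best action is the posterior 0.75 quantile, so it lands above both the mean and the median: the loss function, not the posterior alone, decides which point summary to report.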
A Uniform(0, 1) prior on a probability p is equivalent to a Logistic(0, 1) prior on the log-odds, which is far from constant. Likewise, flat on σ, flat on log σ, and flat on σ² produce different posteriors. "Non-informative prior" is a philosophical trap; modern Bayesian practice favors weakly informative priors (Gelman, 2008).
Live Beta–Binomial posterior update
Tune the prior α, β, the observed successes k, and the number of trials n. Blue dashed = prior; grey dotted = scaled likelihood; red solid = posterior. Raise n slowly and watch the posterior move from prior-dominated to data-dominated: this is what "data overwhelming the prior" looks like.
[Interactive plot. x-axis: p ∈ [0, 1]; y-axis: density.]
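Without the interactive plot, the same shift can be followed in numbers. The sketch below assumes a fairly strong Beta(20, 80) prior (prior mean 0.2) and a hypothetical observed success rate of 0.7, then reports the posterior mean and the weight the prior still carries as n grows.

```python
def posterior_mean_and_prior_weight(alpha, beta, k, n):
    """Posterior mean of Beta(alpha + k, beta + n - k) and the prior's share of it."""
    post_mean = (alpha + k) / (alpha + beta + n)
    # The posterior mean equals w * prior_mean + (1 - w) * (k / n) with this w.
    prior_weight = (alpha + beta) / (alpha + beta + n)
    return post_mean, prior_weight


alpha, beta = 20, 80  # assumed prior, worth about 100 pseudo-observations
rows = []
for n in (10, 100, 1000, 10000):
    k = round(0.7 * n)  # hypothetical data: 70% successes at every sample size
    post_mean, prior_weight = posterior_mean_and_prior_weight(alpha, beta, k, n)
    rows.append((n, round(post_mean, 3), round(prior_weight, 3)))
    print(rows[-1])
# (10, 0.245, 0.909)
# (100, 0.45, 0.5)
# (1000, 0.655, 0.091)
# (10000, 0.695, 0.01)
```

The prior weight is just the ratio of pseudo-observations to total observations, which is why a prior worth 100 cases is split evenly with the data at n = 100 and nearly irrelevant at n = 10000.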
II. Prior sensitivity and model misspecification
Prior sensitivity analysis: fit the same model under two or three defensible priors. If the key conclusion shifts materially with the prior, say so explicitly in the paper. The slogan "the posterior is the posterior" expresses Bayesian coherence; it does not make the analysis immune to misspecification. If the likelihood itself is wrong (e.g. a Gaussian model for counts), no prior can rescue it.
When do priors matter most? (a) Very small samples (roughly n < 30); (b) non-identifiable models; (c) inference on nonlinear functions of the parameters, since a prior that is flat on one scale is no longer flat after a transformation.
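A sensitivity check can be as small as the sketch below. It assumes three candidate priors (Beta(1, 1), Beta(2, 2), and the Jeffreys prior Beta(0.5, 0.5)), a hypothetical conclusion of interest P(θ > 0.5 | y), and two made-up data sets with the same success rate but different sample sizes.

```python
import random

random.seed(3)

# Candidate priors, chosen for illustration only.
priors = {
    "Beta(1, 1)": (1.0, 1.0),
    "Beta(2, 2)": (2.0, 2.0),
    "Beta(0.5, 0.5)": (0.5, 0.5),
}


def prob_above_half(alpha, beta, k, n, draws=20_000):
    """Monte Carlo estimate of P(theta > 0.5 | y) under a Beta(alpha, beta) prior."""
    post_a, post_b = alpha + k, beta + n - k
    hits = sum(random.betavariate(post_a, post_b) > 0.5 for _ in range(draws))
    return hits / draws


def sensitivity(k, n):
    """Same data, every candidate prior."""
    return {name: prob_above_half(a, b, k, n) for name, (a, b) in priors.items()}


small = sensitivity(k=7, n=10)       # small sample: priors still matter
large = sensitivity(k=700, n=1000)   # large sample: data dominate

spread_small = max(small.values()) - min(small.values())
spread_large = max(large.values()) - min(large.values())
print(small, round(spread_small, 3))
print(large, round(spread_large, 3))
```

If the spread across priors is large relative to the decision threshold you care about, that is the result to disclose.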
Common mistake: putting Uniform(0, 1e6) on a variance parameter and calling it "non-informative". This prior is heavily biased against small variances, because almost all of its probability mass sits on very large values. Use a half-Normal or half-Cauchy prior on σ instead.
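The imbalance is easy to quantify. The sketch below compares how much prior mass each choice places on σ < 10 (that is, variance below 100); the threshold and the scales half-Normal(0, 1) and half-Cauchy(0, 2.5) are illustrative assumptions, and the function names are hypothetical.

```python
import math


def uniform_variance_mass_below(v, upper=1e6):
    """P(variance < v) under a Uniform(0, upper) prior on the variance."""
    return min(v, upper) / upper


def half_normal_mass_below(s, scale=1.0):
    """P(sigma < s) under a half-Normal(0, scale) prior on sigma."""
    return math.erf(s / (scale * math.sqrt(2.0)))


def half_cauchy_mass_below(s, scale=2.5):
    """P(sigma < s) under a half-Cauchy(0, scale) prior on sigma."""
    return 2.0 / math.pi * math.atan(s / scale)


sigma_limit = 10.0  # arbitrary threshold, for illustration only
print(uniform_variance_mass_below(sigma_limit**2))   # 0.0001
print(round(half_normal_mass_below(sigma_limit), 3))
print(round(half_cauchy_mass_below(sigma_limit), 3))  # 0.844
```

Under the uniform prior on the variance, only one part in ten thousand of the prior mass supports σ < 10, while the half-Cauchy keeps most of its mass there and still leaves a heavy tail for large values.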
III. Prior selection decision tree
🌳 Prior selection flow
- Q1: … (e.g. Beta(20, 80) = "about 100 prior cases, about 20% positive"); otherwise → Q2.
- Q2: … N(0, 2.5) on the logit scale; N(0, 2) on the log-rate scale.
- … Uniform(0, big).

| Scenario | Prior | Notes |
|---|---|---|
| A/B test success rate (Bayesian A/B) | Beta(1, 1) or Beta(2, 2) | With small samples, prior strength can be read as pseudo-counts |
| CRISPR screen log fold change | N(0, 2) or Cauchy(0, 1) | Weakly informative, allows large effects |
| RNA-seq NB dispersion | log-Normal(−3, 1) | DESeq2-style trended prior |
| Hierarchical variance component σ | half-Normal(0, 1) or half-Cauchy(0, 2.5) | Avoid Uniform(0, big) |
| Bayesian GWAS (BGLR) | spike-and-slab, BayesB | A few large effects, many null effects |
Implementation: closed form + probabilistic programming
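```r
# --- R ---
# Beta–Binomial closed form
a <- 2; b <- 2; k <- 7; n <- 10
# Posterior = Beta(a + k, b + n - k)
post_a <- a + k
post_b <- b + n - k
# ylim leaves room for the posterior, which peaks higher than the prior
curve(dbeta(x, a, b), from = 0, to = 1, ylim = c(0, 3.5),
      col = "steelblue", lty = 2, xlab = "p", ylab = "density")
curve(dbeta(x, post_a, post_b), add = TRUE, col = "firebrick", lwd = 2)
qbeta(c(.025, .975), post_a, post_b)   # 95% credible interval

# --- General case: rstanarm logistic regression with weakly informative priors
# `df` is assumed to be a data frame with a binary outcome `y` and a predictor `x`.
library(rstanarm)
fit <- stan_glm(y ~ x, data = df, family = binomial(),
                prior = normal(0, 2.5),
                prior_intercept = normal(0, 5),
                chains = 4, iter = 2000)
prior_summary(fit)                    # confirm the priors actually used
posterior_interval(fit, prob = .95)   # 95% credible intervals
posterior_predict(fit, draws = 500)   # posterior predictive draws

# --- Prior sensitivity: refit with a wider prior and compare coefficients
fit2 <- update(fit, prior = normal(0, 10))
cbind(coef(fit), coef(fit2))
```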
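```python
# --- Python ---
import scipy.stats as st
import pymc as pm
import arviz as az

# 1) Beta-Binomial closed form
a, b, k, n = 2, 2, 7, 10
post = st.beta(a + k, b + n - k)
print(post.mean(), post.interval(0.95))

# 2) PyMC: prior predictive -> sampling -> posterior predictive
with pm.Model() as m:
    p = pm.Beta("p", alpha=2, beta=2)
    obs = pm.Binomial("y", n=n, p=p, observed=k)
    prior_pred = pm.sample_prior_predictive(1000)
    idata = pm.sample(2000, tune=1000, chains=4, target_accept=0.9)
    post_pred = pm.sample_posterior_predictive(idata)

# Merge the prior and predictive groups so ArviZ can plot prior vs posterior.
idata.extend(prior_pred)
idata.extend(post_pred)
print(az.summary(idata, var_names=["p"], hdi_prob=0.95))
az.plot_posterior(idata, var_names=["p"])
az.plot_dist_comparison(idata, var_names=["p"])  # prior vs posterior

# 3) Prior sensitivity: same data, flat Beta(1, 1) prior
with pm.Model() as m_flat:
    p2 = pm.Beta("p", alpha=1, beta=1)
    pm.Binomial("y", n=n, p=p2, observed=k)
    idata2 = pm.sample(2000, tune=1000, chains=4)

# Compare the two posterior summaries directly. az.compare ranks models by
# LOO/WAIC and needs pointwise log-likelihoods, so it is not a sensitivity check.
print(az.summary(idata2, var_names=["p"], hdi_prob=0.95))
```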
📝 Self-check
1. Why is a uniform prior on a probability p not equivalent to a uniform prior on its log-odds?
2. What is the conjugate prior for a Poisson rate λ, and what form does the posterior take?
3. When does the choice of prior matter most?