References

Authoritative books, original papers, and package documentation

Fully cited sources for all 17 modules of the Statistical Inference Interactive Tutorial, with DOI links.

How to use this page

For every concept, test, package, and algorithm mentioned in the tutorial, this page lists the canonical source. Tag legend:

📚 Book: textbooks and free online books

📄 Paper: original papers (with DOI / PubMed)

📘 Doc: official docs, vignettes, tutorials

Best Practice: reviews and community recommendations

📊 Benchmark: independent benchmarking studies

Contents

Authoritative textbooks spanning the whole tutorial

These books cover the theoretical foundation of nearly every topic in this tutorial and are the best starting point for going deeper.

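🎲 Probability and distributions

🎯 Point estimation and sampling distributions

📏 Confidence intervals and the bootstrap

⚖️ Hypothesis testing and errors

🧪 Common tests

📈 Linear regression and diagnostics

🔀 ANOVA / mixed effects

🎚️ Logistic regression and odds ratios

🔢 GLM count models

🧠 Bayes' theorem / prior / posterior

⛓️ MCMC / HMC / diagnostics

📐 Posterior summaries / BF / PPC

🪢 Hierarchical / empirical Bayes

🚦 Multiple testing and FDR

🧬 Core statistics of differential expression (DE)

🔄 Permutation / bootstrap / GSEA

Survival analysis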

📌 Teaching notes and details

The notes below gather points that readers often overlook but that directly affect how Bayesian and frequentist inference is read. They are extended-reading items: the references they cite already appear in the Step sections above, and collecting the scattered notes in one place makes later look-up easier.

E1 · Welch's t as the New Default (Delacre, Lakens, Leys 2017, IRSP)

Traditional teaching first introduces Student's t-test (which assumes equal variances) and then uses Levene's test to decide whether to switch to Welch's t-test. Delacre, Lakens & Leys 2017 (Int Rev Soc Psychol; 10.5334/irsp.82) showed via Monte Carlo simulation that this two-stage workflow is inferior, because (1) Levene's test has its own type II error and can miss real variance heterogeneity; (2) Welch's t-test loses only a small amount of power (slightly smaller df) when the variances really are equal; (3) Welch's t-test is robust to the combination of unequal n and unequal variance. Conclusion: Welch's t-test (R: the `t.test()` default; Python: `scipy.stats.ttest_ind(equal_var=False)`) should be the *default* for two-group comparisons, with no preliminary Levene or F-test. Ruxton 2006 (Behav Ecol) made an earlier version of the same argument. Note that the Welch-Satterthwaite df is generally non-integer; that is correct behavior. For paired designs, use the paired t-test (a one-sample t on the differences). Under severe non-normality with small n, a rank-based test (Wilcoxon signed-rank for paired data, Mann-Whitney for independent groups) or a permutation test is preferable.
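To make the non-integer df concrete, here is a minimal standard-library sketch of the Welch statistic and the Welch-Satterthwaite degrees of freedom. The helper name `welch_t` and both data sets are hypothetical illustrations, not taken from the tutorial; in practice `scipy.stats.ttest_ind(equal_var=False)` also returns the p-value.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance uses n - 1).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; the result is generally not an integer.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df

# Hypothetical data: unequal group sizes and clearly unequal spread.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [6.9, 4.2, 8.1, 3.5, 7.7, 5.0, 9.2, 2.8]

t, df = welch_t(group_a, group_b)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The df always lies between the smaller group's n − 1 and the pooled n₁ + n₂ − 2, and it equals the pooled value only when the two groups have the same size and the same sample variance.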

E2 · MCMC Convergence ≠ Truth (Vehtari et al. 2021, Bayesian Analysis)

MCMC diagnostics (Gelman-Rubin R̂, effective sample size ESS, trace plots) are meant to flag chains that are not yet sampling from the stationary distribution, i.e., they address *computational convergence*. Convergence does not imply that the posterior reflects reality: if the model is misspecified, the prior dominates, or the likelihood is coded incorrectly, MCMC can still converge to the wrong posterior. Vehtari et al. 2021 (Bayesian Analysis 16(2):667-718) propose rank-normalized split-R̂ plus folded R̂ as improved diagnostics: (1) split each chain in half and compare within-chain and between-chain variance; (2) rank-transform the draws to handle heavy-tailed or non-Gaussian posteriors; (3) use the folded version to detect differences in scale rather than location. They recommend R̂ < 1.01 (the old 1.1 threshold is too lax) and ESS > 400 per parameter; for quantile estimates, examine bulk-ESS and tail-ESS separately. Complement these with prior predictive checks (Gelman 2017) to validate the prior, posterior predictive checks (PPC; Gabry 2019, JRSS A) to confirm that the model can generate data resembling the observations, and leave-one-out cross-validation (LOO; Vehtari 2017, Stats Comput) to compare predictive performance. Implementation: ArviZ `az.summary()`, `az.plot_trace()`, `az.loo()`; bayesplot `mcmc_trace()`, `ppc_dens_overlay()`.
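The split step in (1) can be sketched with the standard library alone. This is the basic split-R̂ only; it omits the rank normalization and folding that Vehtari et al. 2021 add, so treat it as an illustration of the idea rather than a substitute for ArviZ. The function name `split_rhat` and the simulated chains are hypothetical.

```python
import random
from statistics import mean, variance

def split_rhat(chains):
    """Basic split-R-hat: halve each chain, then compare between/within variance."""
    halves = []
    for chain in chains:
        half = len(chain) // 2
        halves += [chain[:half], chain[half:2 * half]]
    n = len(halves[0])
    within = mean(variance(h) for h in halves)          # W: average within-half variance
    between = n * variance([mean(h) for h in halves])   # B: variance of half means, scaled by n
    var_plus = (n - 1) / n * within + between / n       # pooled estimate of posterior variance
    return (var_plus / within) ** 0.5

rng = random.Random(0)
# Four well-mixed chains: independent N(0, 1) draws stand in for MCMC output.
good = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Same setup, but the last chain is stuck around a different mode.
bad = good[:3] + [[rng.gauss(3.0, 1.0) for _ in range(1000)]]

rhat_good = split_rhat(good)
rhat_bad = split_rhat(bad)
print(f"well-mixed: {rhat_good:.3f}, one stuck chain: {rhat_bad:.3f}")
```

A value near 1 only says that the chains agree with each other; as the note stresses, it says nothing about whether the model they agree on is the right one.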

E3 · BIC vs AIC — Asymptotic Limits

AIC = −2·log L + 2k; BIC = −2·log L + k·log(n). The forms look similar, but their asymptotic justifications are entirely different. AIC (Akaike 1973) estimates *out-of-sample predictive accuracy* (Kullback-Leibler divergence) and does not assume that the "true model" is in the candidate set; it selects the best available approximation. BIC (Schwarz 1978) approximates the *posterior model probability* under the assumption that the "true model" *is* in the candidate set; it is consistent as n → ∞ (the probability of selecting the true model → 1), at the cost of *under-fitting* at moderate n because of its heavier penalty. Practical guidance: (1) for prediction, use AIC or LOO-CV (Vehtari 2017); (2) for variable selection when the true model is assumed to be in the set, use BIC; (3) for hierarchical or random-effects models, DIC has largely been superseded by WAIC and PSIS-LOO; (4) compare models within one criterion only, never an AIC value against a BIC value, because the two are not on the same scale. Burnham & Anderson 2002 is the standard reference for frequentist multi-model inference. For Bayesian models, prefer LOO-CV (Vehtari 2017, Stats Comput 27:1413) with ELPD as the primary tool, because it uses the full posterior rather than a point estimate. ArviZ `az.compare()` provides standardized output for Stan and PyMC fits.
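A small worked example shows how the two penalties can point in opposite directions. The log-likelihoods, parameter counts, and sample size below are made-up numbers chosen for illustration; only the two formulas come from the text above.

```python
import math

def aic(loglik, k):
    """AIC = -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """BIC = -2 log L + k log(n)."""
    return -2.0 * loglik + k * math.log(n)

n = 200  # hypothetical sample size
# Hypothetical fits: the larger model gains 3.5 log-likelihood units for 2 extra parameters.
models = {"small": (-520.0, 3), "large": (-516.5, 5)}

for name, (loglik, k) in models.items():
    print(f"{name}: AIC = {aic(loglik, k):.1f}, BIC = {bic(loglik, k, n):.1f}")

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(*models[m], n))
print("AIC prefers:", best_aic, "| BIC prefers:", best_bic)
```

BIC's per-parameter penalty log(n) exceeds AIC's 2 as soon as n > e² ≈ 7.4, which is why BIC favors the smaller model here while AIC favors the larger one.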

E4 · Stan vs PyMC — NUTS / brms / JAX Backends

The two mainstream probabilistic programming platforms occupy different niches. (1) Stan (Carpenter et al. 2017, JSS 76(1):1-32) implements HMC with NUTS (the No-U-Turn Sampler, Hoffman & Gelman 2014) in C++, with R interfaces (rstan, rstanarm, brms; Bürkner 2017, JSS 80(1):1-28) and Python interfaces (PyStan, cmdstanpy). Pros: mature (since 2012), active community, comprehensive warning system (divergent transitions, E-BFMI). Cons: models must be compiled (a cold start of roughly 30 s). (2) PyMC (v5+; Salvatier 2016, PeerJ CS) is written in Python on top of PyTensor, whose computational backend is switchable (compiled C has been the default in the v5 series, with Numba and JAX as alternatives; JAX-based NUTS is reached through `pm.sample(nuts_sampler="numpyro")` or `nuts_sampler="blackjax"`). Pros: (a) the JAX backend gives GPU/TPU acceleration; (b) variational inference (ADVI, Pathfinder) and an SMC sampler are built in; (c) it integrates with deep-learning ecosystems (Flax, Haiku). NumPyro (Phan 2019), developed at Uber, is a JAX-native PPL that is faster but has a lower-level API. BlackJAX is a JAX-based library of samplers. Choosing: (i) hierarchical model with random slopes: brms (Stan), whose syntax is closest to lme4; (ii) deep generative models or amortized inference: NumPyro/Pyro; (iii) large datasets with a GPU: PyMC + JAX or NumPyro. All of these platforms can hand their results to ArviZ as a common diagnostics interface (Kumar 2019).

E5 · p-value vs Posterior — Jeffreys-Lindley Paradox

At large sample sizes, a frequentist test and a Bayesian model comparison can reach opposite conclusions: the Jeffreys-Lindley paradox (Lindley 1957, Biometrika). Classic example: H₀: μ = 0 vs H₁: μ ≠ 0 with n = 10⁶ and sample mean 0.001. The stated z = 2.0 implies σ = 0.5 (standard error 0.0005), and p = 0.046, so H₀ is rejected at α = 0.05. With the prior μ ~ N(0, 1) under H₁, however, the Bayes factor is BF₁₀ ≈ 0.004 (BF₀₁ ≈ 270): very strong evidence *for* H₀. Reason: as n grows, the p-value becomes sensitive to any deviation ε ≠ 0, however small, while the Bayes factor penalizes an alternative that spreads its predictive mass over a wide region (Occam's razor). Recommendations: (1) with large n and a point null, p-values are easy to misread; use a region of practical equivalence (ROPE; Kruschke 2018) or a smallest effect size of interest (SESOI). (2) Reporting an effect size with a 95% CI/CrI is more informative than a single p-value or BF. (3) Bayes factors are *sensitive* to the prior on the alternative (e.g., Cauchy(0, r = √2/2), the BayesFactor-package default; Rouder 2009), so run a sensitivity analysis. (4) ASA 2016/2019 and Greenland 2016 (Eur J Epidemiol) are the canonical references on p-value misinterpretations. Berger & Sellke 1987 (JASA) showed that p ≈ 0.05 corresponds to a BF₁₀ of at most about 2.5 (equivalently, a minimum BF₀₁ of about 0.4), which falls short of Jeffreys' "substantial" threshold of about 3.
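The paradox can be reproduced with the closed-form Bayes factor for a normal mean with known σ. In the sketch below, σ = 0.5 is an assumption (it is the value implied by z = 2.0 at n = 10⁶ with sample mean 0.001), the prior SD τ = 1 follows the N(0, 1) prior in the text, and the function names are hypothetical.

```python
import math

def bf10_normal(z, n, sigma, tau):
    """BF10 for H0: mu = 0 vs H1: mu ~ N(0, tau^2), with known sigma and observed z."""
    # Ratio of prior variance to the squared standard error of the sample mean.
    r = n * tau ** 2 / sigma ** 2
    # Marginal of the sample mean: N(0, sigma^2/n) under H0, N(0, tau^2 + sigma^2/n) under H1.
    bf01 = math.sqrt(1.0 + r) * math.exp(-0.5 * z ** 2 * r / (1.0 + r))
    return 1.0 / bf01

def two_sided_p(z):
    """Two-sided normal p-value."""
    return math.erfc(abs(z) / math.sqrt(2.0))

z = 2.0
p_value = two_sided_p(z)
print(f"p = {p_value:.4f} at every n, because z is held fixed")

# Hold z fixed and let n grow: the p-value never changes, the Bayes factor does.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: BF10 = {bf10_normal(z, n, sigma=0.5, tau=1.0):.4f}")

bf10_large_n = bf10_normal(z, 1_000_000, sigma=0.5, tau=1.0)
```

Holding z at 2 while n increases means the observed effect shrinks toward zero, so a diffuse alternative predicts the data progressively worse than the point null does.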

E6 · Prior Choice Sensitivity (Gabry et al. 2019, JRSS A)

"Bayesian results depend on the prior" is a frequent frequentist critique. The precise picture is: (1) in well-identified, large-sample settings the posterior is dominated by the likelihood and the prior matters little; (2) with small n, sparse data, or complex hierarchical models the prior can substantially affect the posterior, and a prior sensitivity analysis is then required. Gabry et al. 2019 (JRSS A 182(2):389-402) and Schad et al. 2021 (Psychol Methods) propose a Bayesian workflow: (i) prior predictive check (simulate the data implied by the prior before seeing the actual data and verify that the range is plausible); (ii) fit the model; (iii) MCMC diagnostics (E2); (iv) posterior predictive check (PPC; compare simulated with observed data); (v) leave-one-out cross-validation (LOO; Vehtari 2017) to assess predictive performance; (vi) prior sensitivity (re-fit with weakly informative and informative priors and compare). Default priors: (a) for fixed effects, use weakly informative priors on standardized predictors (e.g., Cauchy(0, 2.5) for logistic-regression coefficients, Gelman 2008 Ann Appl Stat; rstanarm defaults to Normal(0, 2.5) rescaled by SD(y)/SD(x)); (b) for variance components, avoid inv-gamma(ε, ε) and prefer half-normal, half-Cauchy, or half-t (Gelman 2006 Bayesian Anal); (c) brms defaults to student-t(3, ...) for the intercept and standard-deviation parameters, which is more stable than a "vague" N(0, 100), while regression coefficients stay flat unless you set a prior. The Stan Wiki page "Prior Choice Recommendations" is a practical cheat sheet.

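Step (vi) of the workflow in E6 can be illustrated without MCMC by using the conjugate normal model, where the posterior has a closed form. Everything below is a toy assumption: known σ = 2, sample mean 1.0, and two hypothetical priors, N(0, 10²) as the weakly informative one and N(0, 0.5²) as the informative one.

```python
def normal_posterior(prior_mean, prior_sd, ybar, sigma, n):
    """Posterior (mean, sd) of a normal mean with known sigma and a normal prior."""
    prior_prec = 1.0 / prior_sd ** 2          # precision = 1 / variance
    data_prec = n / sigma ** 2
    post_prec = prior_prec + data_prec        # precisions add in the conjugate model
    post_mean = (prior_prec * prior_mean + data_prec * ybar) / post_prec
    return post_mean, post_prec ** -0.5

def prior_gap(n, ybar=1.0, sigma=2.0):
    """Difference in posterior mean between the weak and the informative prior."""
    weak, _ = normal_posterior(0.0, 10.0, ybar, sigma, n)
    informative, _ = normal_posterior(0.0, 0.5, ybar, sigma, n)
    return weak - informative

gap_small = prior_gap(5)       # small n: the prior pulls the estimate noticeably
gap_large = prior_gap(5000)    # large n: the likelihood dominates
print(f"n = 5:    posterior means differ by {gap_small:.3f}")
print(f"n = 5000: posterior means differ by {gap_large:.4f}")
```

With real models the same comparison is done by re-fitting under each prior and comparing the posterior summaries, as step (vi) describes.
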
E7 · FDR vs FWER — Conceptual Distinction

FDR and FWER control different errors. (1) FWER (family-wise error rate) = P(at least one false positive) ≤ α. It is appropriate when any single error is serious (multi-endpoint clinical trials, drug approval). Methods: Bonferroni, Holm 1979, Hochberg 1988, single-step and step-down Šidák. (2) FDR (false discovery rate) = E[V/R | R>0]·P(R>0) ≤ q, the expected proportion of false positives among the results declared significant. It is appropriate for large-scale screening (transcriptomics, GWAS, proteomics). Methods: Benjamini-Hochberg 1995, Storey 2002 q-value, Ignatiadis 2016 IHW (data-driven hypothesis weighting). FDR control is *more permissive* than FWER control: at m = 10000 and q = 0.05, a BH list of about 500 discoveries is expected to contain at most about 25 false positives, whereas Bonferroni must test each hypothesis at α = 0.05/10000 = 5 × 10⁻⁶ and has far less power. Notes: (a) FDR gives no error guarantee for any single hypothesis, only for the *whole batch* of results; (b) BH assumes that p-values are uniform under H₀ and that the tests are independent or PRDS (positive regression dependent on a subset; Benjamini-Yekutieli 2001); under arbitrary dependence, including negatively correlated tests, use BY; (c) for discrete tests (Fisher exact), BH is over-conservative, and group-FDR or adjusted-BH can be used instead. Local FDR (Efron 2010, Large-Scale Inference) is the posterior FDR for a single test and connects to the Bayesian framework. In RNA-seq, IHW can increase the power of the multiple-testing step, while apeglm (Zhu 2018) improves the effect-size (log fold change) estimates; the two are complementary.
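The difference between the two kinds of control is easiest to see by running both procedures on the same p-values. The sketch below implements plain Bonferroni and the Benjamini-Hochberg step-up rule with the standard library; the ten p-values are invented for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """FWER control: reject p_i when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, q=0.05):
    """FDR control: find the largest rank k with p_(k) <= k/m * q, reject ranks 1..k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices from smallest to largest p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank                               # keep the largest rank that passes
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values, deliberately unsorted.
pvals = [0.2, 0.001, 0.019, 0.9, 0.012, 0.5, 0.014, 0.03, 0.95, 0.7]

bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print("Bonferroni rejections:", sum(bonf))
print("BH rejections:        ", sum(bh))
```

Note the step-up behavior: p = 0.012 misses its own rank-2 threshold of 0.01, yet it is still rejected because a larger p-value passes at a higher rank.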

E8 · James-Stein / Shrinkage — limma / ashr / DESeq2 Connection

James-Stein 1961 (Berkeley Symp) is one of the most counter-intuitive results in 20th-century statistics: when *three or more* independent normal means with equal known variance are estimated simultaneously, a shrinkage estimator that pulls every sample mean toward a fixed point by a data-determined fraction (or, with four or more means, toward the grand mean) strictly dominates the MLE (the separate sample means) in *total MSE*, whatever the true means are. The idea is "borrowing strength": even when the means have *no* direct relationship, pooled information still improves the overall estimate. Modern genomic statistics applies this idea extensively. (1) limma (Smyth 2004; Ritchie 2015) uses empirical Bayes to shrink each gene's variance estimate toward a global variance prior, which addresses the small per-gene n of microarray and RNA-seq data; the moderated t is more stable and more powerful than the ordinary t. (2) DESeq2 (Love 2014, Genome Biol) shrinks dispersion estimates toward a mean-dispersion trend; apeglm (Zhu 2018, Bioinformatics) uses a heavy-tailed (Cauchy) prior to shrink log2 fold changes, which keeps noisy low-count genes from dominating rankings. (3) edgeR (Robinson 2010) combines common and tagwise dispersion through a weighted likelihood. (4) ashr (Stephens 2017, Biostatistics 18(2):275) recasts the FDR framework itself as adaptive shrinkage: it estimates local FDR and effect size jointly under a unimodal-prior assumption and reports the LFSR (local false sign rate) in place of the q-value. (5) Bayesian hierarchical models are a more general shrinkage framework, since group-level random effects borrow strength automatically. Efron 2010 (Large-Scale Inference) is the full textbook treatment. In practice, in small-n, many-hypothesis, high-variance settings, shrinkage estimators are usually preferable to per-test MLE.
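The dominance claim is easy to check by simulation. The sketch below uses the positive-part James-Stein estimator that shrinks toward zero (a fixed point), with unit variance and one observation per mean; the ten true means, the seed, and the number of replications are arbitrary choices made for illustration.

```python
import random

def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein estimate shrinking toward 0 (requires len(x) >= 3)."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    # Shrinkage factor in [0, 1); clipping at 0 gives the positive-part variant.
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / norm2)
    return [shrink * v for v in x]

def simulate(theta, reps=2000, seed=0):
    """Monte Carlo estimate of total squared error for the MLE and for James-Stein."""
    rng = random.Random(seed)
    sse_mle = sse_js = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]          # one noisy observation per mean
        js = james_stein(x)
        sse_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        sse_js += sum((ji - t) ** 2 for ji, t in zip(js, theta))
    return sse_mle / reps, sse_js / reps

# Hypothetical true means; they need not be related to one another.
theta = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, -2.0, 0.3, 0.8]
risk_mle, risk_js = simulate(theta)
print(f"total MSE, MLE:         {risk_mle:.2f}")
print(f"total MSE, James-Stein: {risk_js:.2f}")
```

The gain is in the *total* error across all ten means; any single coordinate can be estimated worse than by its own sample mean, which is the trade-off the per-gene shrinkage methods above accept.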

📖 Lookup guide: All DOI links point to the publisher's page. To download a PDF, look for an open-access version via PubMed, PMC, arXiv, or bioRxiv.