Statistical Inference

Overview

How to use this page

For every concept, test, package, and algorithm mentioned in the tutorial, this page lists the canonical source. Tag legend:

📚 Book: textbooks and free online books
📄 Paper: original papers (with DOI / PubMed)
📘 Doc: official docs, vignettes, tutorials
⭐ Best Practice: reviews and community recommendations
📊 Benchmark: independent benchmarking studies

Contents

★ Core Textbooks · Step 1 Probability / Distributions · Step 2 Point Estimation · Step 3 CI / Bootstrap · Step 4 Hypothesis Testing · Step 5 Common Tests · Step 6 Linear Regression · Step 7 ANOVA / Mixed · Step 8 Logistic · Step 9 GLM Counts · Step 10 Bayes Basics · Step 11 MCMC / HMC · Step 12 Posterior / PPC · Step 13 Empirical Bayes · Step 14 FDR · Step 15 DE Statistics · Step 16 Permutation / GSEA · Step 17 Survival

Core Textbooks

⭐ Authoritative textbooks used throughout the tutorial

These books cover the theoretical foundation of nearly every topic in this tutorial and are the best starting point for going deeper.

📚 BOOK Wasserman L. All of Statistics: A Concise Course in Statistical Inference. Springer (2004). DOI: 10.1007/978-0-387-21736-9 · Fast-paced intro to frequentist and Bayesian inference
📚 BOOK Casella G, Berger RL. Statistical Inference, 2nd ed. Duxbury (2002). The standard graduate-level text on probability and statistics
📚 BOOK Efron B, Hastie T. Computer Age Statistical Inference: Algorithms, Evidence and Data Science. Cambridge UP (2016). hastie.su.domains/CASI (free PDF)
📚 BOOK Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis (BDA3), 3rd ed. CRC (2013). stat.columbia.edu/~gelman/book (free PDF)
📚 BOOK McElreath R. Statistical Rethinking: A Bayesian Course with Examples in R and Stan, 2nd ed. CRC (2020). The most-loved Bayesian intro
📚 BOOK Murphy KP. Probabilistic Machine Learning: An Introduction. MIT Press (2022). probml.github.io/pml-book (free PDF)
📚 BOOK Efron B. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge UP (2010). Foundational for genomic-scale inference
📚 BOOK Harrell FE. Regression Modeling Strategies, 2nd ed. Springer (2015). DOI: 10.1007/978-3-319-19425-7 · Authoritative for clinical/biostatistical modeling

Step 1

🎲 Probability and Distributions

PAPER Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology 11:R106 (2010). DOI: 10.1186/gb-2010-11-10-r106 · Foundational paper on the negative binomial (NB) model for RNA-seq
📚 Wasserman Ch. 1–3 · Casella & Berger Ch. 2–4 · Efron & Hastie Ch. 1
DOC R Documentation, Distributions: stat.ethz.ch/R-manual · scipy.stats docs

Step 2

🎯 Point Estimation and Sampling Distributions

📚 Casella & Berger Ch. 7 · Wasserman Ch. 9
PAPER James W, Stein C. Estimation with quadratic loss. Proc 4th Berkeley Symp 1:361–379 (1961). Founding paper on James-Stein shrinkage estimation
PAPER Efron B. Bootstrap methods: another look at the jackknife. Annals of Statistics 7(1):1–26 (1979). DOI: 10.1214/aos/1176344552
DOC MASS::fitdistr: CRAN MASS

Step 3

📏 Confidence Intervals and Bootstrap

📚 Efron B, Tibshirani RJ. An Introduction to the Bootstrap. CRC (1993).
📚 Davison AC, Hinkley DV. Bootstrap Methods and Their Application. Cambridge UP (1997).
PAPER DiCiccio TJ, Efron B. Bootstrap confidence intervals. Statistical Science 11(3):189–228 (1996). DOI: 10.1214/ss/1032280214
DOC R boot package: CRAN boot · scipy bootstrap docs

Step 4

⚖️ Hypothesis Testing and Errors

⭐ ASA Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. The American Statistician 70(2):129–133 (2016). DOI: 10.1080/00031305.2016.1154108
⭐ Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31:337–350 (2016). DOI: 10.1007/s10654-016-0149-3
📚 Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Routledge (1988).
DOC R pwr package (CRAN pwr) · statsmodels stats.power

Step 5

🧪 Common Tests

PAPER Welch BL. The generalization of "Student's" problem when several different population variances are involved. Biometrika 34:28–35 (1947). DOI: 10.1093/biomet/34.1-2.28
📚 Hollander M, Wolfe DA, Chicken E. Nonparametric Statistical Methods, 3rd ed. Wiley (2014).
📚 Agresti A. Categorical Data Analysis, 3rd ed. Wiley (2013).
DOC R stats::t.test, wilcox.test, fisher.test, kruskal.test · scipy ttest_ind, mannwhitneyu, chi2_contingency

Step 6

📈 Linear Regression and Diagnostics

📚 Faraway JJ. Linear Models with R, 2nd ed. CRC (2014).
📚 Fox J, Weisberg S. An R Companion to Applied Regression, 3rd ed. SAGE (2019). By the car package authors
📚 Belsley DA, Kuh E, Welsch RE. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley (1980).
DOC R stats::lm, plot.lm; car::vif, influencePlot; sandwich::vcovHC; lmtest::bptest · statsmodels ols + get_robustcov_results

Step 7

🔀 ANOVA / Mixed Effects

PAPER Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67(1):1–48 (2015). DOI: 10.18637/jss.v067.i01
PAPER Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest package: tests in linear mixed effects models. JSS 82(13):1–26 (2017). DOI: 10.18637/jss.v082.i13
📚 Pinheiro JC, Bates DM. Mixed-Effects Models in S and S-PLUS. Springer (2000).
📚 Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge UP (2007).
DOC emmeans vignettes (Lenth): CRAN emmeans

Step 8

🎚️ Logistic Regression and Odds Ratios

📚 Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression, 3rd ed. Wiley (2013).
PAPER Firth D. Bias reduction of maximum likelihood estimates. Biometrika 80(1):27–38 (1993). DOI: 10.1093/biomet/80.1.27
PAPER King G, Zeng L. Logistic regression in rare events data. Political Analysis 9:137–163 (2001). DOI: 10.1093/oxfordjournals.pan.a004868
DOC R glm(family=binomial); logistf; brglm2; pROC · scikit-learn LogisticRegression, calibration_curve

Step 9

🔢 GLMs for Count Data

📚 McCullagh P, Nelder JA. Generalized Linear Models, 2nd ed. CRC (1989).
📚 Hilbe JM. Negative Binomial Regression, 2nd ed. Cambridge UP (2011).
PAPER Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140 (2010). DOI: 10.1093/bioinformatics/btp616
PAPER Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550 (2014). DOI: 10.1186/s13059-014-0550-8
DOC R MASS::glm.nb, pscl::zeroinfl, AER::dispersiontest · statsmodels GLM, NegativeBinomial, ZeroInflatedNegativeBinomialP

Step 10

🧠 Bayes' Theorem / Prior / Posterior

📚 Gelman et al., BDA3 Ch. 1–3 · McElreath, Statistical Rethinking 2e Ch. 1–4
📚 Berger JO. Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer (1985).
DOC rstanarm: mc-stan.org/rstanarm · PyMC: pymc.io

Step 11

⛓️ MCMC / HMC / Diagnostics

PAPER Betancourt M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint 1701.02434 (2017). arxiv.org/abs/1701.02434
PAPER Carpenter B, Gelman A, Hoffman MD, et al. Stan: A probabilistic programming language. JSS 76(1):1–32 (2017). DOI: 10.18637/jss.v076.i01
PAPER Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner P-C. Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC. Bayesian Analysis 16(2):667–718 (2021). DOI: 10.1214/20-BA1221
PAPER Bürkner P-C. brms: An R package for Bayesian multilevel models using Stan. JSS 80(1):1–28 (2017). DOI: 10.18637/jss.v080.i01
DOC ArviZ; bayesplot: python.arviz.org · mc-stan.org/bayesplot

Step 12

📐 Posterior Summaries / BF / PPC

PAPER Vehtari A, Gelman A, Gabry J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27:1413–1432 (2017). DOI: 10.1007/s11222-016-9696-4
PAPER Gabry J, Simpson D, Vehtari A, Betancourt M, Gelman A. Visualization in Bayesian workflow. JRSS A 182(2):389–402 (2019). DOI: 10.1111/rssa.12378
📚 Kruschke JK. Doing Bayesian Data Analysis, 2nd ed. Academic Press (2015).
DOC R loo, bayestestR, tidybayes

Step 13

🪢 Hierarchical / Empirical Bayes

PAPER Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3:Article 3 (2004). DOI: 10.2202/1544-6115.1027
PAPER Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7):e47 (2015). DOI: 10.1093/nar/gkv007
PAPER Stephens M. False discovery rates: a new deal. Biostatistics 18(2):275–294 (2017). (ashr) DOI: 10.1093/biostatistics/kxw041
📚 Efron, Large-Scale Inference (2010) · BDA3 Ch. 5

Step 14

🚦 Multiple Testing and FDR

PAPER Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B 57(1):289–300 (1995). DOI: 10.1111/j.2517-6161.1995.tb02031.x
PAPER Storey JD. A direct approach to false discovery rates. JRSS B 64(3):479–498 (2002). DOI: 10.1111/1467-9868.00346
PAPER Storey JD, Tibshirani R. Statistical significance for genomewide studies. PNAS 100(16):9440–9445 (2003). DOI: 10.1073/pnas.1530509100
PAPER Ignatiadis N, Klaus B, Zaugg JB, Huber W. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods 13:577–580 (2016). (IHW) DOI: 10.1038/nmeth.3885
DOC R p.adjust, qvalue, IHW (Bioconductor)

Step 15

🧬 Core DE Statistics

PAPER Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15:R29 (2014). DOI: 10.1186/gb-2014-15-2-r29
PAPER McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. NAR 40(10):4288–4297 (2012). (edgeR GLM) DOI: 10.1093/nar/gks042
PAPER Zhu A, Ibrahim JG, Love MI. Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics 35(12):2084–2092 (2019). (apeglm) DOI: 10.1093/bioinformatics/bty895
PAPER Bourgon R, Gentleman R, Huber W. Independent filtering increases detection power for high-throughput experiments. PNAS 107(21):9546–9551 (2010). DOI: 10.1073/pnas.0914005107
BENCH Soneson C, Robinson MD. Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods 15:255–261 (2018). DOI: 10.1038/nmeth.4612
DOC DESeq2, edgeR, limma vignettes

Step 16

🔄 Permutation / Bootstrap / GSEA

PAPER Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102(43):15545–15550 (2005). DOI: 10.1073/pnas.0506580102
PAPER Wu D, Smyth GK. Camera: a competitive gene set test accounting for inter-gene correlation. NAR 40(17):e133 (2012). DOI: 10.1093/nar/gks461
PAPER Korotkevich G, Sukhov V, Budin N, Shpak B, Artyomov MN, Sergushichev A. Fast gene set enrichment analysis. bioRxiv 060012 (2021). (fgsea) DOI: 10.1101/060012
PAPER Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23(8):980–987 (2007). DOI: 10.1093/bioinformatics/btm051
📚 Efron & Tibshirani, An Introduction to the Bootstrap (1993)
DOC R fgsea, clusterProfiler, limma::camera/roast · Python gseapy

Step 17

⏳ Survival Analysis

PAPER Cox DR. Regression models and life-tables. JRSS B 34(2):187–220 (1972). DOI: 10.1111/j.2517-6161.1972.tb00899.x
PAPER Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. JASA 53(282):457–481 (1958). DOI: 10.1080/01621459.1958.10501452
PAPER Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. JASA 94(446):496–509 (1999). DOI: 10.1080/01621459.1999.10474144
📚 Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. Springer (2000).
📚 Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data, 2nd ed. Springer (2003).
DOC R survival, survminer; Python lifelines, scikit-survival: lifelines.readthedocs.io

Teaching Notes

📌 Teaching Notes and Details

The notes below collect points that readers often overlook but that materially affect how Bayesian and frequentist results are interpreted. They are extended discussion items: many of the underlying references already appear in the Step sections above, and gathering the notes here is meant to make later look-up easier.

E1 · Welch's t as the New Default (Delacre, Lakens, Leys 2017, IRSP)

Traditional teaching introduces Student's t-test (which assumes equal variances) first and then uses Levene's test to decide whether to switch to Welch's t-test. Delacre, Lakens & Leys 2017 (Int Rev Soc Psychol; 10.5334/irsp.82) showed via Monte Carlo simulation that this two-stage workflow is inferior, for three reasons: (1) Levene's test is itself subject to type II error and can miss real variance heterogeneity; (2) when the variances are in fact equal, Welch's t-test loses only a small amount of power (its df is slightly smaller); (3) Welch's t-test is robust to combinations of unequal n and unequal variance. Conclusion: Welch's t-test should be the *default* for two-group comparisons, with no preliminary Levene or F-test. It is already the default of R's `t.test()` (`var.equal = FALSE`); in Python it must be requested with `scipy.stats.ttest_ind(..., equal_var=False)`, because SciPy's default is `equal_var=True`. Ruxton 2006 (Behav Ecol) made an earlier version of the same argument. Note that the Welch-Satterthwaite df is generally non-integer; that is correct behavior. For paired designs, use the paired t-test (a one-sample t on the differences). Under severe non-normality with small n, a rank-based test (Wilcoxon signed-rank for paired data, Mann-Whitney for two independent groups) or a permutation test is preferable.
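
The Welch statistic and its non-integer df can be checked by hand. The following is a minimal sketch, assuming NumPy and SciPy are available; `welch_t` is a helper written for this note (not a library function), and the two groups are simulated with arbitrary means, variances, and sizes.

```python
import numpy as np
from scipy import stats

def welch_t(x, y):
    """Welch's t statistic, Welch-Satterthwaite df, and two-sided p-value."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx = x.var(ddof=1) / x.size   # squared standard error of mean(x)
    vy = y.var(ddof=1) / y.size   # squared standard error of mean(y)
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=12)   # small group, small variance (simulated)
b = rng.normal(0.5, 3.0, size=40)   # larger group, larger variance (simulated)

t_manual, df_manual, p_manual = welch_t(a, b)
res = stats.ttest_ind(a, b, equal_var=False)   # Welch must be requested explicitly

print(f"manual: t={t_manual:.3f}, df={df_manual:.2f}, p={p_manual:.4f}")
print(f"scipy : t={res.statistic:.3f}, p={res.pvalue:.4f}")
```

The hand-computed statistic and p-value should agree with SciPy's, and the df falls between min(n₁, n₂) − 1 and n₁ + n₂ − 2.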

E2 · MCMC Convergence ≠ Truth (Vehtari et al. 2021, Bayesian Analysis)

MCMC diagnostics (Gelman-Rubin R̂, effective sample size (ESS), trace plots) indicate whether the chains appear to be sampling from a common stationary distribution, that is, *computational convergence*. Convergence does not imply that the posterior reflects reality: if the model is misspecified, the prior dominates, or the likelihood is coded incorrectly, MCMC can still converge to the wrong posterior. Vehtari et al. 2021 (Bayesian Analysis 16(2):667-718) propose rank-normalized split-R̂ and folded split-R̂ as improved diagnostics: (1) split each chain in half and compare within-chain and between-chain variance, which also catches drift inside a chain; (2) rank-normalize the draws to handle heavy-tailed or non-Gaussian posteriors; (3) compute the folded version on absolute deviations from the median to detect differences in scale rather than location. They recommend R̂ < 1.01 (the older 1.1 threshold is too lax) and ESS > 400 for each quantity of interest, with bulk-ESS and tail-ESS examined separately (tail-ESS is the relevant one for quantile estimates). Complement these with prior predictive checks (Gelman 2017) to validate the prior, posterior predictive checks (PPC; Gabry 2019, JRSS A) to confirm that the model can generate data similar to the observations, and leave-one-out cross-validation (LOO; Vehtari 2017, Stats Comput) to compare predictive performance. Implementation: ArviZ `az.summary()`, `az.plot_trace()`, `az.loo()`; bayesplot `mcmc_trace()`, `ppc_dens_overlay()`.
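
To make the three ingredients (splitting, rank-normalization, folding) concrete, here is a simplified sketch assuming NumPy and SciPy. The function names are invented for this note, the "chains" are independent normal draws rather than real MCMC output, and production work should rely on ArviZ or the posterior package instead of this code.

```python
import numpy as np
from scipy import stats

def split_rhat(draws):
    """Split-R-hat for an array of shape (n_chains, n_draws)."""
    half = draws.shape[1] // 2
    # Splitting each chain in half turns within-chain drift into between-chain variance.
    halves = np.concatenate([draws[:, :half], draws[:, half:2 * half]], axis=0)
    n = halves.shape[1]
    within = halves.var(axis=1, ddof=1).mean()
    between = n * halves.mean(axis=1).var(ddof=1)
    var_plus = (n - 1) / n * within + between / n
    return float(np.sqrt(var_plus / within))

def rank_normalize(draws):
    """Replace the pooled draws by normal scores of their ranks."""
    ranks = stats.rankdata(draws.ravel()).reshape(draws.shape)
    return stats.norm.ppf((ranks - 0.375) / (draws.size + 0.25))

def rank_normalized_rhat(draws):
    """Maximum of the bulk and folded rank-normalized split-R-hat."""
    bulk = split_rhat(rank_normalize(draws))
    folded = split_rhat(rank_normalize(np.abs(draws - np.median(draws))))
    return max(bulk, folded)

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 1000))          # four well-mixed chains (simulated)
shifted = good.copy()
shifted[0] += 1.0                          # one chain stuck at a different location
scaled = good.copy()
scaled[3] *= 3.0                           # one chain with a different scale

rhat_good = rank_normalized_rhat(good)
rhat_shift = rank_normalized_rhat(shifted)
rhat_scale = rank_normalized_rhat(scaled)
print(rhat_good, rhat_shift, rhat_scale)
```

The scale-only case is the one that the folded statistic is designed to flag; a location-only R̂ can sit near 1 there because the chain means agree.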

E3 · BIC vs AIC — Asymptotic Limits

AIC = −2·log L + 2k; BIC = −2·log L + k·log(n), where L is the maximized likelihood, k the number of estimated parameters, and n the sample size. The forms look similar, but the asymptotic justifications differ completely. AIC (Akaike 1973) estimates *out-of-sample predictive accuracy* (expected Kullback-Leibler divergence) and does not require the "true model" to be in the candidate set; it selects the best approximation. BIC (Schwarz 1978) approximates the *posterior model probability* (through the log marginal likelihood) under the assumption that the "true model" *is* in the candidate set. BIC is consistent as n → ∞ (the probability of selecting the true model → 1), but because log(n) > 2 once n ≥ 8, its heavier penalty can *under-fit* at small to moderate n. Practical guidance: (1) prediction → AIC or LOO-CV (Vehtari 2017); (2) variable selection assuming the true model is in the set → BIC; (3) hierarchical / random-effects models → DIC has largely been superseded, use WAIC or PSIS-LOO; (4) AIC and BIC values are not on the same scale, so compare AIC with AIC and BIC with BIC across models fitted to the same data. Burnham & Anderson 2002 is the standard reference for frequentist multi-model inference. For Bayesian models, prefer LOO-CV (Vehtari 2017, Stats Comput 27:1413) with ELPD as the primary tool, because it uses the full posterior rather than a point estimate. With Stan or PyMC, ArviZ `az.compare()` provides standardized output.
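
As a worked illustration of the two formulas, the sketch below computes AIC and BIC for polynomial least-squares fits with Gaussian errors. It assumes NumPy; `gaussian_ic` is a helper written for this note, the data are simulated from a linear model, and k counts the regression coefficients plus the error variance.

```python
import numpy as np

def gaussian_ic(y, X):
    """AIC and BIC for an ordinary least-squares fit with Gaussian errors."""
    n = y.size
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / n                                   # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                                 # coefficients plus the variance
    return {"k": k,
            "aic": -2 * loglik + 2 * k,
            "bic": -2 * loglik + k * np.log(n)}

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=n)           # simulated: the truth is linear

fits = {}
for degree in (1, 2, 5):
    X = np.vander(x, degree + 1, increasing=True)      # columns 1, x, ..., x^degree
    fits[degree] = gaussian_ic(y, X)
    print(degree, fits[degree])
```

For every model the two criteria differ by exactly k·(log n − 2), so moving from degree 1 to degree 5 costs 4·(log 200 − 2) ≈ 13 more units under BIC than under AIC. That gap is why BIC tends to pick smaller models.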

E4 · Stan vs PyMC — NUTS / brms / JAX Backends

The two mainstream probabilistic programming platforms occupy different niches. (1) Stan (Carpenter et al. 2017, JSS 76(1):1-32) implements HMC with NUTS (the No-U-Turn Sampler, Hoffman & Gelman 2014) in C++, with R interfaces (rstan, rstanarm, brms; Bürkner 2017, JSS 80(1):1-28) and Python interfaces (PyStan, cmdstanpy). Pros: mature (since 2012), active community, comprehensive warning system (divergent transitions, E-BFMI). Cons: models must be compiled (cold start of roughly 30 s). (2) PyMC (v5+; Salvatier 2016, PeerJ CS) is written in Python on top of PyTensor, which can compile models to C, Numba, or JAX; JAX-based NUTS is selected with `pm.sample(nuts_sampler="numpyro")` or `nuts_sampler="blackjax"`. Pros: (a) the JAX backend gives GPU/TPU acceleration; (b) variational inference (ADVI; Pathfinder through an add-on package) and an SMC sampler are available; (c) it integrates with deep-learning ecosystems (Flax, Haiku). NumPyro (Phan 2019, Uber) is a JAX-native PPL that is faster but exposes a lower-level API. BlackJAX is a JAX-based sampler library. Choosing: (i) hierarchical model with random slopes → brms (Stan), whose syntax is closest to lme4; (ii) deep generative model / amortized inference → NumPyro or Pyro; (iii) large dataset + GPU → PyMC with JAX, or NumPyro. All of these platforms can hand their output to ArviZ as a common diagnostics interface (Kumar 2019).

E5 · p-value vs Posterior — Jeffreys-Lindley Paradox

At large sample sizes, a frequentist test and a Bayesian model comparison can reach opposite conclusions; this is the Jeffreys-Lindley paradox (Lindley 1957, Biometrika). Classic example: H₀: μ = 0 vs H₁: μ ≠ 0 with n = 10⁶, known σ = 0.5, and sample mean 0.001, so z = 2.0 and p = 0.046 (reject H₀ at the 5% level). With the prior μ ~ N(0, 1) under H₁, however, the Bayes factor is BF₁₀ ≈ 0.004, which is very strong evidence *for* H₀. Reason: as n grows, the p-value becomes sensitive to any deviation ε ≠ 0, while the Bayes factor penalizes alternatives that spread their predictive mass over a wide region (Occam's razor). Recommendations: (1) in large-n, point-null settings p-values are easy to misread; use a region of practical equivalence (ROPE; Kruschke 2018) or a smallest effect size of interest (SESOI). (2) Reporting an effect size with a 95% CI/CrI is more informative than a single p-value or BF. (3) Bayes factors are *sensitive* to the prior on the alternative (e.g., Cauchy(0, r = √2/2), the BayesFactor-package default; Rouder 2009), so perform a sensitivity analysis. (4) ASA 2016/2019 and Greenland 2016 (Eur J Epidemiol) are the canonical references on p-value misinterpretations. Berger & Sellke 1987 (JASA) showed that p ≈ 0.05 corresponds to a BF₁₀ of at most about 2.5 over broad classes of priors, which is below Jeffreys' "substantial" threshold of 3.
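
The paradox can be reproduced with the closed-form Bayes factor for a normal mean. This is a sketch under stated assumptions: known σ (taken as 0.5 here so that a sample mean of 0.001 at n = 10⁶ gives z = 2), a N(0, τ²) prior on μ under H₁ with τ = 1, and function names invented for this note. It uses only the standard library.

```python
import math

def bf10_point_null(xbar, n, sigma, tau):
    """BF10 for H1: mu ~ N(0, tau^2) against H0: mu = 0, with known sigma."""
    se2 = sigma ** 2 / n                      # sampling variance of the mean
    z2 = xbar ** 2 / se2                      # squared z statistic
    shrink = tau ** 2 / (tau ** 2 + se2)
    # Ratio of the marginal density of xbar under H1 to its density under H0.
    log_bf10 = -0.5 * math.log1p(tau ** 2 / se2) + 0.5 * z2 * shrink
    return math.exp(log_bf10)

def two_sided_p(z):
    """Two-sided normal p-value."""
    return math.erfc(abs(z) / math.sqrt(2))

sigma, tau = 0.5, 1.0                         # assumed values, see the lead-in
p_value = two_sided_p(2.0)

# Same z = 2 at two sample sizes: only the Bayes factor changes.
bf_large = bf10_point_null(xbar=0.001, n=10**6, sigma=sigma, tau=tau)
bf_small = bf10_point_null(xbar=0.1, n=100, sigma=sigma, tau=tau)

print(f"p = {p_value:.4f}")
print(f"BF10 at n = 1e6: {bf_large:.4f}")
print(f"BF10 at n = 100: {bf_small:.4f}")
```

Holding z fixed, the p-value is identical at both sample sizes while BF₁₀ shrinks roughly like 1/√n. The result also depends strongly on τ, which is the prior sensitivity flagged in recommendation (3).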

E6 · Prior Choice Sensitivity (Gabry et al. 2019, JRSS A)

"Bayesian results depend on the prior" is a frequent frequentist critique. More precisely: (1) in well-identified, large-sample settings the posterior is dominated by the likelihood and the prior matters little; (2) with small n, sparse data, or complex hierarchical models the prior can substantially affect the posterior, and a prior sensitivity analysis is then required. Gabry et al. 2019 (JRSS A 182(2):389-402) and Schad et al. 2021 (Psychol Methods) propose a Bayesian workflow: (i) prior predictive check (simulate prior-implied data before seeing the actual data and verify that the range is plausible); (ii) fit the model; (iii) MCMC diagnostics (E2); (iv) posterior predictive check (PPC; compare simulated with observed data); (v) leave-one-out cross-validation (LOO; Vehtari 2017) to assess predictive performance; (vi) prior sensitivity (refit with weakly informative vs informative priors and compare). Default priors: (a) for fixed effects, use weakly informative priors on standardized predictors, such as the Cauchy(0, 2.5) proposed for logistic regression by Gelman 2008 (Ann Appl Stat) or a Normal with scale 2.5; (b) for variance components, avoid inv-gamma(ε, ε) and prefer half-normal / half-Cauchy / half-t (Gelman 2006, Bayesian Anal); (c) brms uses student-t(3, ...) defaults for the intercept and group-level standard deviations (regression coefficients get flat priors unless specified), which is more stable than a "vague" N(0, 100). The Stan Wiki "Prior Choice Recommendations" is a practical cheat sheet.

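Step (i) of the workflow, the prior predictive check, can be sketched without any sampler. The example below assumes NumPy and a logistic regression with one standardized predictor; the function names, the simulated predictor, and the 0.01 cut-off for "extreme" probabilities are illustrative choices made for this note.

```python
import numpy as np

def prior_predictive_probs(prior_sd, n_sims=4000, seed=0):
    """Draw intercept and slope from Normal(0, prior_sd); return implied P(y = 1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=50)                           # standardized predictor (simulated)
    alpha = rng.normal(0, prior_sd, size=(n_sims, 1))
    beta = rng.normal(0, prior_sd, size=(n_sims, 1))
    logits = alpha + beta * x                         # shape (n_sims, 50)
    return 0.5 * (1.0 + np.tanh(logits / 2.0))        # numerically stable logistic

def share_extreme(p, eps=0.01):
    """Fraction of prior-implied probabilities that are essentially 0 or 1."""
    return float(np.mean((p < eps) | (p > 1 - eps)))

vague = share_extreme(prior_predictive_probs(prior_sd=100.0))
weak = share_extreme(prior_predictive_probs(prior_sd=2.5))

print(f"N(0, 100) prior: {vague:.2%} of implied probabilities are below 0.01 or above 0.99")
print(f"N(0, 2.5) prior: {weak:.2%}")
```

A supposedly "vague" N(0, 100) prior on the logit scale is in fact highly informative on the probability scale: it asserts that nearly every observation is almost certainly 0 or almost certainly 1. The weakly informative prior leaves most of its mass on plausible probabilities.
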
E7 · FDR vs FWER — Conceptual Distinction

FDR and FWER control different errors. (1) FWER (family-wise error rate) = P(at least one false positive) ≤ α. It is appropriate when any single error is serious (multi-endpoint clinical trials, drug approval). Methods: Bonferroni, Holm 1979, Hochberg 1988, single-step / step-down Šidák. (2) FDR (false discovery rate) = E[V/R | R>0]·P(R>0) ≤ q, where V is the number of false rejections and R the total number of rejections: the expected proportion of false positives among the results declared significant. It is appropriate for large-scale screening (transcriptomics, GWAS, proteomics). Methods: Benjamini-Hochberg 1995, Storey 2002 q-value, Ignatiadis 2016 IHW (data-driven hypothesis weighting). FDR control is *more permissive* than FWER control: with m = 10000 tests, if BH at q = 0.05 declares about 500 discoveries, roughly 25 of them are expected to be false positives, whereas Bonferroni tests each hypothesis at α = 0.05/10000 = 5×10⁻⁶ and has far less power. Notes: (a) FDR makes no guarantee about any single hypothesis, only about the *batch* of results; (b) BH assumes p-values are uniform under H₀ and either independent or PRDS (positive regression dependence on a subset; Benjamini-Yekutieli 2001); under arbitrary dependence, including negatively correlated tests, use BY; (c) for discrete tests (Fisher exact), BH is over-conservative; use group-FDR or adjusted-BH. Local FDR (Efron 2010, Large-Scale Inference) is the posterior probability that a single test is null and connects FDR to the Bayesian framework. In RNA-seq, IHW can increase detection power, and apeglm (Zhu 2019) complements it by shrinking the effect sizes used for ranking.
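
The contrast between BH and Bonferroni can be simulated directly. The sketch below assumes NumPy and SciPy; `bh_adjust` is written for this note and follows the same step-up convention as R's `p.adjust(method = "BH")`. The simulation settings (10,000 independent z-tests, 500 of them with a true shift of 4) are arbitrary.

```python
import numpy as np
from scipy import stats

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

rng = np.random.default_rng(7)
m, m_alt = 10_000, 500
z = rng.normal(size=m)
z[:m_alt] += 4.0                          # first 500 hypotheses carry a real effect
pvals = 2 * stats.norm.sf(np.abs(z))      # two-sided p-values

bh_hits = bh_adjust(pvals) <= 0.05        # FDR control at q = 0.05
bonf_hits = pvals <= 0.05 / m             # FWER control at alpha = 0.05

false_bh = int(bh_hits[m_alt:].sum())
fdp = false_bh / max(int(bh_hits.sum()), 1)
print(f"BH: {bh_hits.sum()} discoveries, {false_bh} false (FDP = {fdp:.3f})")
print(f"Bonferroni: {bonf_hits.sum()} discoveries, {int(bonf_hits[m_alt:].sum())} false")
```

Every Bonferroni rejection is also a BH rejection, so BH can only find more. The price is a controlled share of false discoveries in the list rather than a near-zero chance of any.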

E8 · James-Stein / Shrinkage — limma / ashr / DESeq2 Connection

James-Stein 1961 (Berkeley Symp) is one of the most counter-intuitive results in 20th-century statistics: when estimating *three or more* independent normal means simultaneously (with common known variance), a shrinkage estimator that pulls every sample mean a fraction of the way toward a common point, such as the origin (or the grand mean, which requires four or more means), strictly dominates the MLE (the individual sample means) in *total MSE*, regardless of the true means. The intuition is "borrowing strength": even when the means have *no* direct relationship, pooled information still improves overall estimation. Modern genomic statistics applies this idea extensively. (1) limma (Smyth 2004; Ritchie 2015) uses empirical Bayes to shrink each gene's variance estimate toward a global variance prior, addressing the small per-gene n in microarray and RNA-seq studies; the moderated t is considerably more stable than the ordinary t. (2) DESeq2 (Love 2014, Genome Biol) shrinks dispersion estimates toward a mean-dispersion trend; apeglm (Zhu 2019, Bioinformatics) uses a heavy-tailed (Cauchy) prior to shrink log2 fold changes, preventing noisy low-count genes from dominating rankings. (3) edgeR (Robinson 2010) combines common and tagwise dispersion through a weighted likelihood. (4) ashr (Stephens 2017, Biostatistics 18(2):275) reformulates the FDR framework itself as adaptive shrinkage: it jointly estimates local FDR and effect size under a unimodal-prior assumption and reports the LFSR (local false sign rate) in place of the q-value. (5) Bayesian hierarchical models are a more general shrinkage framework; group-level random effects borrow strength automatically. Efron 2010 (Large-Scale Inference) is the complete textbook on this topic. In practice: in small-n, many-hypothesis, high-variance settings, generally prefer shrinkage estimators to per-test MLEs.
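
The dominance result is easy to check by simulation. The sketch below assumes NumPy and uses the positive-part variant of the James-Stein estimator, shrinking toward the origin with known unit variance. This is a common textbook modification rather than the exact 1961 estimator, and the ten true means are arbitrary draws.

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein estimate shrinking toward the origin (needs p >= 3)."""
    p = x.shape[-1]
    norm2 = np.sum(x ** 2, axis=-1, keepdims=True)
    factor = np.maximum(0.0, 1.0 - (p - 2) * sigma2 / norm2)
    return factor * x

rng = np.random.default_rng(42)
p, n_rep = 10, 20_000
theta = rng.normal(0, 1, size=p)              # arbitrary fixed true means
x = theta + rng.normal(size=(n_rep, p))       # one N(theta_i, 1) observation per mean

# Total squared error over the p means, averaged across replications.
mse_mle = float(np.mean(np.sum((x - theta) ** 2, axis=1)))
mse_js = float(np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1)))

print(f"total MSE, MLE:         {mse_mle:.2f}")
print(f"total MSE, James-Stein: {mse_js:.2f}")
```

The MLE's total risk equals p (here 10), and the shrinkage estimate comes in below it. The gain is in the *sum* of squared errors: an individual mean far from the shrinkage target can be estimated worse, which is the same trade-off behind moderated statistics in limma and DESeq2.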

📖

Lookup guide: All DOI links point to the publisher's page. To download a PDF, look for an open-access version via PubMed, PMC, arXiv, or bioRxiv.