How to use this page
For every concept, test, package, and algorithm mentioned in the tutorial, this page lists the canonical source. Tag legend:
- Book: textbooks and free online books
- Paper: original papers (with DOI / PubMed)
- Doc: official docs, vignettes, tutorials
- Best Practice: reviews and community recommendations
- Benchmark: independent benchmarking studies
⭐ Authoritative textbooks used throughout the tutorial
These books cover the theoretical foundation of nearly every topic in this tutorial and are the best starting point for going deeper.
- 📚 BOOK All of Statistics: A Concise Course in Statistical Inference. Springer (2004).
- 📚 BOOK Statistical Inference, 2nd ed. Duxbury (2002).
- 📚 BOOK Computer Age Statistical Inference: Algorithms, Evidence and Data Science. Cambridge UP (2016).
- 📚 BOOK Bayesian Data Analysis (BDA3), 3rd ed. CRC (2013).
- 📚 BOOK Statistical Rethinking: A Bayesian Course with Examples in R and Stan, 2nd ed. CRC (2020).
- 📚 BOOK Probabilistic Machine Learning: An Introduction. MIT Press (2022).
- 📚 BOOK Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge UP (2010).
- 📚 BOOK Regression Modeling Strategies, 2nd ed. Springer (2015).
🎲 Probability and distributions
- PAPER Differential expression analysis for sequence count data. Genome Biology 11:R106 (2010).
- 📚 Wasserman Ch. 1–3 · Casella & Berger Ch. 2–4 · Efron & Hastie Ch. 1
- DOC R Documentation: `Distributions`
🎯 Point estimation and sampling distributions
- 📚 Casella & Berger Ch. 7 · Wasserman Ch. 9
- PAPER Estimation with quadratic loss. Proc 4th Berkeley Symp 1:361–379 (1961).
- PAPER Bootstrap methods: another look at the jackknife. Annals of Statistics 7(1):1–26 (1979).
- DOC `MASS::fitdistr`
📏 Confidence intervals and the bootstrap
- 📚 An Introduction to the Bootstrap. CRC (1993).
- 📚 Bootstrap Methods and Their Application. Cambridge UP (1997).
- PAPER Bootstrap confidence intervals. Statistical Science 11(3):189–228 (1996).
- DOC R `boot` package
⚖️ Hypothesis testing and errors
- ⭐ ASA The ASA's statement on p-values: context, process, and purpose. The American Statistician 70(2):129–133 (2016).
- ⭐ Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31:337–350 (2016).
- 📚 Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Routledge (1988).
- DOC R `pwr` package · statsmodels `stats.power`
🧪 Common tests
- PAPER The generalization of "Student's" problem when several different population variances are involved. Biometrika 34:28–35 (1947).
- 📚 Nonparametric Statistical Methods, 3rd ed. Wiley (2014).
- 📚 Categorical Data Analysis, 3rd ed. Wiley (2013).
- DOC R `stats::t.test`, `wilcox.test`, `fisher.test`, `kruskal.test` · scipy `ttest_ind`, `mannwhitneyu`, `chi2_contingency`
📈 Linear regression and diagnostics
- 📚 Linear Models with R, 2nd ed. CRC (2014).
- 📚 An R Companion to Applied Regression, 3rd ed. SAGE (2019).
- 📚 Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley (1980).
- DOC R `stats::lm`, `plot.lm`; `car::vif`, `influencePlot`; `sandwich::vcovHC`; `lmtest::bptest`
🔀 ANOVA / mixed effects
- PAPER Fitting linear mixed-effects models using `lme4`. Journal of Statistical Software 67(1):1–48 (2015).
- PAPER `lmerTest` package: tests in linear mixed effects models. JSS 82(13):1–26 (2017).
- 📚 Mixed-Effects Models in S and S-PLUS. Springer (2000).
- 📚 Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge UP (2007).
- DOC `emmeans` vignettes (Lenth)
🎚️ Logistic regression and odds ratios
- 📚 Applied Logistic Regression, 3rd ed. Wiley (2013).
- PAPER Bias reduction of maximum likelihood estimates. Biometrika 80(1):27–38 (1993).
- PAPER Logistic regression in rare events data. Political Analysis 9:137–163 (2001).
- DOC R `glm(family=binomial)`; `logistf`; `brglm2`; `pROC`
🔢 GLM count models
- 📚 Generalized Linear Models, 2nd ed. CRC (1989).
- 📚 Negative Binomial Regression, 2nd ed. Cambridge UP (2011).
- PAPER `edgeR`: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140 (2010).
- PAPER Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550 (2014).
- DOC R `MASS::glm.nb`, `pscl::zeroinfl`, `AER::dispersiontest`
🧠 Bayes' theorem / prior / posterior
- 📚 Gelman et al., BDA3 Ch. 1–3 · McElreath, Statistical Rethinking 2e Ch. 1–4
- 📚 Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer (1985).
- DOC `rstanarm`
⛓️ MCMC / HMC / diagnostics
- PAPER A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint 1701.02434 (2017).
- PAPER Stan: A probabilistic programming language. JSS 76(1):1–32 (2017).
- PAPER Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC. Bayesian Analysis 16(2):667–718 (2021).
- PAPER `brms`: An R package for Bayesian multilevel models using Stan. JSS 80(1):1–28 (2017).
- DOC ArviZ; `bayesplot`
📐 Posterior summaries / BF / PPC
- PAPER Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27:1413–1432 (2017).
- PAPER Visualization in Bayesian workflow. JRSS A 182(2):389–402 (2019).
- 📚 Doing Bayesian Data Analysis, 2nd ed. Academic Press (2015).
- DOC R `loo`, `bayestestR`, `tidybayes`
🪢 Hierarchical / empirical Bayes
- PAPER Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3:Article 3 (2004).
- PAPER `limma` powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7):e47 (2015).
- PAPER False discovery rates: a new deal. Biostatistics 18(2):275–294 (2017). (`ashr`)
- 📚 Efron, Large-Scale Inference (2010) · BDA3 Ch. 5
🚦 Multiple testing and FDR
- PAPER Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B 57(1):289–300 (1995).
- PAPER A direct approach to false discovery rates. JRSS B 64(3):479–498 (2002).
- PAPER Statistical significance for genomewide studies. PNAS 100(16):9440–9445 (2003).
- PAPER Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods 13:577–580 (2016).
- DOC R `p.adjust`, `qvalue`, `IHW` (Bioconductor)
🧬 Core statistics for differential expression (DE)
- PAPER `voom`: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15:R29 (2014).
- PAPER Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. NAR 40(10):4288–4297 (2012). (`edgeR` QL)
- PAPER Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics 35(12):2084–2092 (2019). (`apeglm`)
- PAPER Independent filtering increases detection power for high-throughput experiments. PNAS 107(21):9546–9551 (2010).
- BENCH Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods 15:255–261 (2018).
- DOC DESeq2, edgeR, limma vignettes
🔄 Permutation / bootstrap / GSEA
- PAPER Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102(43):15545–15550 (2005).
- PAPER Camera: a competitive gene set test accounting for inter-gene correlation. NAR 40(17):e133 (2012).
- PAPER Fast gene set enrichment analysis. bioRxiv 060012 (2021). (`fgsea`)
- PAPER Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23(8):980–987 (2007).
- 📚 Efron & Tibshirani, An Introduction to the Bootstrap (1993)
- DOC R `fgsea`, `clusterProfiler`, `limma::camera` / `roast` · Python `gseapy`
⏳ Survival analysis
- PAPER Regression models and life-tables. JRSS B 34(2):187–220 (1972).
- PAPER Nonparametric estimation from incomplete observations. JASA 53(282):457–481 (1958).
- PAPER A proportional hazards model for the subdistribution of a competing risk. JASA 94(446):496–509 (1999).
- 📚 Modeling Survival Data: Extending the Cox Model. Springer (2000).
- 📚 Survival Analysis: Techniques for Censored and Truncated Data, 2nd ed. Springer (2003).
- DOC R `survival`, `survminer`; Python `lifelines`, `scikit-survival`
📌 Teaching notes and details
The notes below gather points that readers often overlook but that materially affect how Bayesian and frequentist inference is read. They are extended discussion items: the underlying references already appear in the Step sections above, and collecting them here makes later look-up easier.
E1 · Welch's t as the New Default (Delacre, Lakens, Leys 2017, IRSP)
Traditional teaching introduces Student's t-test (which assumes equal variances) first and then uses Levene's test to decide whether to switch to Welch's t-test. Delacre, Lakens & Leys 2017 (Int Rev Soc Psychol; 10.5334/irsp.82) showed via Monte Carlo simulation that this two-step workflow is inferior, because (1) Levene's test has its own type II error and can miss real variance heterogeneity; (2) Welch's t-test loses only a small amount of power (slightly smaller df) when the variances are in fact equal; (3) Welch's t-test stays robust when unequal n is combined with unequal variances. Conclusion: Welch's t-test (the default of R's `t.test()`; in Python, `scipy.stats.ttest_ind(equal_var=False)`) should be the *default* for two-group comparisons, with no preliminary Levene/F-test. Ruxton 2006 (Behav Ecol) made an earlier version of the same argument. Note that the Welch-Satterthwaite df is generally non-integer; that is correct behavior. For paired designs, use the paired t-test (a one-sample t on the differences). Under severe non-normality with small n, a rank-based test (Wilcoxon signed-rank for paired data, Mann-Whitney for independent groups) or a permutation test is preferable.
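As a rough illustration of where the non-integer df comes from, the sketch below computes the Welch t statistic and the Welch-Satterthwaite df with the standard library only. The function name `welch_t` and the two data vectors are hypothetical, chosen for illustration; in practice `scipy.stats.ttest_ind(..., equal_var=False)` also returns the p-value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)      # sample variances (n - 1 denominator)
    se2_a, se2_b = va / na, vb / nb        # squared standard error of each mean
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df

# Hypothetical data: unequal group sizes and clearly unequal spread
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2, 5.5]
group_b = [6.9, 8.4, 5.2, 9.1, 7.7]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The df always lies between min(n₁, n₂) − 1 and n₁ + n₂ − 2; here it sits close to the lower end because the smaller group has the larger variance.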
E2 · MCMC Convergence ≠ Truth (Vehtari et al. 2021, Bayesian Analysis)
MCMC diagnostics (Gelman-Rubin R̂, effective sample size ESS, trace plots) tell you whether the chains are sampling from the stationary distribution, i.e., *computational convergence*. Convergence does not imply that the posterior reflects reality: if the model is misspecified, the prior dominates, or the likelihood is coded incorrectly, MCMC can still converge to the wrong posterior. Vehtari et al. 2021 (Bayesian Analysis 16(2):667-718) propose rank-normalized split-R̂ plus folded R̂ as improved diagnostics: (1) split each chain in half and compare within- and between-chain variance; (2) rank-normalize the draws to handle heavy-tailed or non-Gaussian posteriors; (3) use the folded version to detect differences in scale rather than location. They recommend R̂ < 1.01 (the old 1.1 threshold is too lax) and ESS > 400 per parameter, with bulk-ESS and tail-ESS examined separately when quantiles are of interest. Complement these with prior predictive checks (Gelman 2017) to validate the prior, posterior predictive checks (PPC; Gabry 2019, JRSS A) to confirm that the model can generate data resembling the observations, and leave-one-out cross-validation (LOO; Vehtari 2017, Stats Comput) to compare predictive performance. Implementation: ArviZ `az.summary()`, `az.plot_trace()`, `az.loo()`; bayesplot `mcmc_trace()`, `ppc_dens_overlay()`.
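To make the within/between comparison concrete, here is a minimal sketch of the classic split-R̂. It deliberately omits the rank normalization and folding that Vehtari et al. add, so it is not their estimator; the function name `split_rhat` and the simulated chains are assumptions made for illustration. For real analyses, ArviZ's `az.summary()` reports the improved version.

```python
import random
from statistics import mean, variance

def split_rhat(chains):
    """Classic split-R-hat (no rank normalization) for equal-length chains."""
    half = len(chains[0]) // 2
    # Split every chain in two so that drift within a chain also shows up
    halves = [c[:half] for c in chains] + [c[half:2 * half] for c in chains]
    n = half
    W = mean([variance(h) for h in halves])        # within-chain variance
    B = n * variance([mean(h) for h in halves])    # between-chain variance
    var_plus = (n - 1) / n * W + B / n             # pooled variance estimate
    return (var_plus / W) ** 0.5

rng = random.Random(0)
# Four hypothetical chains that all draw from N(0, 1)
good = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Same draws, but one chain sits around a different location
bad = [list(c) for c in good]
bad[0] = [x + 3.0 for x in bad[0]]

print(f"well-mixed chains: R-hat = {split_rhat(good):.3f}")
print(f"one shifted chain: R-hat = {split_rhat(bad):.3f}")
```

The shifted chain inflates the between-chain variance, pushing R̂ well above the 1.01 threshold, while the well-mixed set stays essentially at 1. Neither number says anything about whether the model itself is right.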
E3 · BIC vs AIC — Asymptotic Limits
AIC = −2·log L + 2k; BIC = −2·log L + k·log(n), where L is the maximized likelihood, k the number of estimated parameters, and n the sample size. The forms look similar, but the asymptotic justifications are entirely different. AIC (Akaike 1973) estimates *out-of-sample predictive accuracy* (expected Kullback-Leibler divergence) and is usually motivated by the setting where the "true model" is *not* in the candidate set: it selects the best approximation. BIC (Schwarz 1978) approximates the *posterior model probability* under the assumption that the "true model" *is* in the candidate set. BIC is consistent as n → ∞ (the probability of selecting the true model → 1), at the cost of *under-fitting* at moderate n because its penalty is heavier (log(n) > 2 once n ≥ 8). Practical guidance: (1) prediction → AIC or LOO-CV (Vehtari 2017); (2) variable selection assuming the true model is in the set → BIC; (3) hierarchical / random-effects models → DIC is deprecated, use WAIC or PSIS-LOO; (4) do not compare an AIC value against a BIC value, since they are not on the same scale; compare models within one criterion. Burnham & Anderson 2002 is the standard reference for frequentist multi-model inference. For Bayesian models, prefer LOO-CV (Vehtari 2017, Stats Comput 27:1413) with ELPD as the primary tool, because it uses the full posterior rather than a point estimate. With Stan or PyMC, ArviZ `az.compare()` provides standardized output.
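The two formulas can be compared directly on a toy case. The sketch below uses hypothetical maximized log-likelihoods for two nested models (the numbers are invented for illustration, not taken from any dataset) to show how the heavier BIC penalty can reverse the AIC ranking.

```python
from math import log

def aic(loglik, k):
    """AIC = -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """BIC = -2 log L + k log(n)."""
    return -2.0 * loglik + k * log(n)

# Hypothetical fits to n = 200 observations: the larger model adds 3 parameters
n = 200
small = {"loglik": -310.0, "k": 3}
large = {"loglik": -306.5, "k": 6}

for name, m in (("small", small), ("large", large)):
    a = aic(m["loglik"], m["k"])
    b = bic(m["loglik"], m["k"], n)
    print(f"{name}: AIC = {a:.2f}, BIC = {b:.2f}")
```

With these numbers AIC slightly prefers the larger model (625 vs 626), while BIC prefers the smaller one, because each extra parameter costs log(200) ≈ 5.3 under BIC instead of 2 under AIC.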
E4 · Stan vs PyMC — NUTS / brms / JAX Backends
The two mainstream probabilistic programming platforms occupy different niches. (1) Stan (Carpenter et al. 2017, JSS 76(1):1-32) implements HMC with NUTS (No-U-Turn Sampler, Hoffman & Gelman 2014) in C++, with R interfaces (rstan, rstanarm, brms; Bürkner 2017, JSS 80(1):1-28) and Python interfaces (PyStan, cmdstanpy). Pros: mature (since 2012), active community, comprehensive warning system (divergent transitions, E-BFMI). Cons: models must be compiled (cold start ~30s). (2) PyMC (v5+; Salvatier 2016, PeerJ CS) is built on Python + PyTensor, whose computation graphs can be compiled to different backends (C, Numba, or JAX); JAX-based NUTS samplers are selected with, for example, `pm.sample(nuts_sampler="numpyro")`. Pros: (a) the JAX backend gives GPU/TPU acceleration; (b) variational inference (ADVI, Pathfinder) and an SMC sampler are available; (c) it integrates with deep-learning ecosystems (Flax, Haiku). NumPyro (Phan 2019, Uber) is a JAX-native PPL: faster, but with a lower-level API. BlackJAX is a JAX-based sampler library. Choosing: (i) hierarchical model with random slopes → brms (Stan), whose syntax is closest to lme4; (ii) deep generative model / amortized inference → NumPyro/Pyro; (iii) large dataset + GPU → PyMC + JAX or NumPyro. All of these platforms can feed ArviZ as a common diagnostics interface (Kumar 2019).
E5 · p-value vs Posterior — Jeffreys-Lindley Paradox
At large sample sizes, a frequentist test and a Bayesian model comparison can reach opposite conclusions: the Jeffreys-Lindley paradox (Lindley 1957, Biometrika). Classic example: H₀: μ = 0 vs H₁: μ ≠ 0 with n = 10⁶ and sample mean = 0.001, giving z = 2.0 and p = 0.046, so H₀ is rejected at α = 0.05 (these numbers imply σ = 0.5, since z = x̄·√n/σ). With the prior μ ~ N(0, 1) under H₁, however, the Bayes factor is BF₁₀ ≈ 0.004 (BF₀₁ ≈ 270), i.e., strong evidence *for* H₀. Reason: as n grows, the p-value becomes sensitive to any deviation ε ≠ 0, while the Bayes factor penalizes alternatives that "spread predictive mass over a wide region" (Occam's razor). Recommendations: (1) in large-n, point-null settings p-values are easy to misread; use a region of practical equivalence (ROPE; Kruschke 2018) or a smallest effect size of interest (SESOI). (2) Reporting an effect size with a 95% CI/CrI is more informative than a single p-value or BF. (3) Bayes factors are *sensitive* to the prior on the alternative (e.g., Cauchy(0, r=√2/2), the BayesFactor-package default; Rouder 2009), so perform a sensitivity analysis. (4) ASA 2016/2019 and Greenland 2016 (Eur J Epidemiol) are the canonical references on p-value misinterpretations. Berger & Sellke 1987 (JASA) showed that p ≈ 0.05 corresponds to a *maximum* BF₁₀ of only about 2.5 (an upper bound over broad classes of priors), which is below Jeffreys' "substantial" threshold of about 3.
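The example can be checked numerically. The sketch below assumes a normal likelihood with known σ, a point null at 0, and a N(0, τ²) prior on μ under H₁, so that BF₁₀ is the ratio of the two marginal densities of the sample mean. The function name `lindley_example` is hypothetical, and σ = 0.5 is an assumption chosen so that a sample mean of 0.001 at n = 10⁶ gives z = 2.

```python
from math import erfc, exp, sqrt

def lindley_example(xbar, sigma, n, tau):
    """Two-sided z-test p-value and BF10 for H0: mu = 0 vs H1: mu ~ N(0, tau^2)."""
    se = sigma / sqrt(n)
    z = xbar / se
    p_value = erfc(abs(z) / sqrt(2.0))     # two-sided normal tail area
    r = (tau / se) ** 2                    # prior variance relative to SE^2
    # Marginal of xbar is N(0, tau^2 + se^2) under H1 and N(0, se^2) under H0
    bf10 = exp(0.5 * z * z * r / (1.0 + r)) / sqrt(1.0 + r)
    return z, p_value, bf10

# sigma = 0.5 is an assumption (see lead-in); tau = 1 matches the N(0, 1) prior
z, p, bf10 = lindley_example(xbar=0.001, sigma=0.5, n=10**6, tau=1.0)
print(f"z = {z:.2f}, p = {p:.4f}, BF10 = {bf10:.4f}, BF01 = {1 / bf10:.0f}")
```

Re-running with a tighter prior (say τ = 0.001) moves BF₁₀ back above 1, which is the prior sensitivity that recommendation (3) warns about.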
E6 · Prior Choice Sensitivity (Gabry et al. 2019, JRSS A)
"Bayesian results depend on the prior" is a frequent frequentist critique. The more precise statement is: (1) in well-identified, large-sample settings the posterior is dominated by the likelihood and the prior matters little; (2) with small n, sparse data, or complex hierarchical models the prior can substantially affect the posterior, and a prior sensitivity analysis is then required. Gabry et al. 2019 (JRSS A 182(2):389-402) and Schad et al. 2021 (Psychol Methods) propose a Bayesian workflow: (i) prior predictive check (simulate prior-implied data before seeing the actual data and verify that the range is plausible); (ii) fit the model; (iii) MCMC diagnostics (E2); (iv) posterior predictive check (PPC; compare simulated with observed data); (v) leave-one-out cross-validation (LOO; Vehtari 2017) to assess predictive performance; (vi) prior sensitivity (re-run with weakly informative vs informative priors and compare). Default priors: (a) for fixed effects, weakly informative priors such as Normal(0, 2.5) on the slopes of standardized predictors (Gelman 2008, Ann Appl Stat, which proposed a Cauchy(0, 2.5) default for logistic regression); (b) for variance components, avoid inv-gamma(ε, ε) and prefer half-normal / half-Cauchy / half-t (Gelman 2006, Bayesian Anal); (c) brms defaults to student-t(3, ...) priors for the intercept and standard-deviation parameters, which are more stable than the "vague" N(0, 100). The Stan Wiki "Prior Choice Recommendations" is a practical cheat sheet.
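Points (1) and (2) can be seen in the simplest conjugate case. The sketch below assumes a normal likelihood with known σ and a normal prior on the mean, so the posterior is available in closed form; the prior labels, their scales (2.5 and 0.3), and the function names are hypothetical choices for illustration, not recommendations from the cited papers.

```python
def normal_posterior(xbar, sigma, n, prior_mean, prior_sd):
    """Posterior mean and sd of mu for a Normal(mu, sigma^2) likelihood, sigma known."""
    prior_prec = 1.0 / prior_sd ** 2       # precision contributed by the prior
    data_prec = n / sigma ** 2             # precision contributed by the data
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * xbar)
    return post_mean, post_var ** 0.5

# Hypothetical priors to compare: (prior mean, prior sd)
priors = {"weakly informative": (0.0, 2.5), "informative": (0.0, 0.3)}

def sensitivity(xbar, sigma, n):
    """Posterior mean of mu under each prior, for the same data summary."""
    return {name: normal_posterior(xbar, sigma, n, m, s)[0]
            for name, (m, s) in priors.items()}

small_n = sensitivity(xbar=1.0, sigma=1.0, n=5)
large_n = sensitivity(xbar=1.0, sigma=1.0, n=5000)
print("n = 5:   ", {k: round(v, 3) for k, v in small_n.items()})
print("n = 5000:", {k: round(v, 3) for k, v in large_n.items()})
```

At n = 5 the two priors give clearly different posterior means, while at n = 5000 they agree to about two decimal places. The same re-fit-and-compare pattern carries over to MCMC-based models, where each fit is a separate sampler run.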
E7 · FDR vs FWER — Conceptual Distinction
FDR and FWER control different errors. (1) FWER (family-wise error rate) = P(at least one false positive) ≤ α. It is appropriate when "any single error is serious" (multi-endpoint clinical trials, drug approval). Methods: Bonferroni, Holm 1979, Hochberg 1988, single-step / step-down Šidák. (2) FDR (false discovery rate) = E[V/R | R>0]·P(R>0) ≤ q, where V is the number of false rejections and R the total number of rejections: the expected proportion of false positives among the results declared significant. It is appropriate for large-scale screening (transcriptomics, GWAS, proteomics). Methods: Benjamini-Hochberg 1995, Storey 2002 q-value, Ignatiadis 2016 IHW (data-driven hypothesis weighting). FDR is *more permissive* than FWER: if BH at q = 0.05 declares ~500 discoveries out of m = 10000 tests, about 25 of them are expected to be false positives, whereas Bonferroni at α = 0.05/10000 per test has very little power. Notes: (a) FDR makes no guarantee about a single hypothesis, only about the *batch* of results; (b) BH assumes p-values are uniform under H₀ and PRDS (positive regression dependence on a subset; Benjamini-Yekutieli 2001); use BY under arbitrary dependence, including negatively correlated tests; (c) for discrete tests (Fisher exact), BH is over-conservative; use group-FDR or adjusted-BH. Local FDR (Efron 2010, Large-Scale Inference) is the posterior FDR for a single test and connects to the Bayesian framework. In RNA-seq, IHW is commonly paired with apeglm (Zhu 2019): IHW increases detection power, while apeglm stabilizes the reported log fold changes.
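The difference between the two criteria is easy to see on a small batch of p-values. The sketch below implements Bonferroni and the Benjamini-Hochberg step-up rule from their textbook definitions; the function names and the ten p-values are hypothetical. In practice R's `p.adjust` covers both methods.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the FWER)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: reject the k smallest p-values, k = max{i : p_(i) <= i*q/m}."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            k = rank                       # keep the largest rank that passes
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject

# Hypothetical p-values from ten tests
pvals = [0.001, 0.008, 0.012, 0.021, 0.030, 0.20, 0.35, 0.50, 0.74, 0.91]
print("Bonferroni rejections:", sum(bonferroni(pvals)))
print("BH rejections:        ", sum(benjamini_hochberg(pvals)))
```

On these inputs Bonferroni keeps only the smallest p-value, while BH keeps the three smallest: the per-rank threshold i·q/m grows with the rank, which is where the extra power comes from.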
E8 · James-Stein / Shrinkage — limma / ashr / DESeq2 Connection
James-Stein 1961 (Berkeley Symp) is one of the most counter-intuitive results in 20th-century statistics: when *three or more* independent normal means are estimated simultaneously, a shrinkage estimator (shrinking every sample mean by a common fraction toward a fixed point such as the origin, or toward the overall mean when there are at least four means) strictly dominates the MLE (the independent sample means) in *total MSE*, regardless of the true means. The intuition is "borrowing strength": even when the means have *no* direct relationship, pooled information still improves the overall estimate. Modern genomic statistics applies this idea extensively. (1) limma (Smyth 2004; Ritchie 2015) uses empirical Bayes to shrink each gene's variance estimate toward a global variance prior, which addresses the small per-gene n in microarray/RNA-seq data; the moderated t is considerably more stable than the ordinary t. (2) DESeq2 (Love 2014, Genome Biol) shrinks dispersion estimates toward a mean-dispersion trend; apeglm (Zhu 2019, Bioinformatics) uses a heavy-tailed (Cauchy-like) prior to shrink log2 fold changes, so that noise in low-count genes does not dominate the ranking. (3) edgeR (Robinson 2010) combines common and tagwise dispersion through a weighted likelihood. (4) ashr (Stephens 2017, Biostatistics 18(2):275) reformulates the FDR framework itself as adaptive shrinkage: it jointly estimates local FDR and effect size under a unimodal-prior assumption and reports the LFSR (local false sign rate) in place of the q-value. (5) Bayesian hierarchical models are a more general shrinkage framework, in which group-level random effects borrow strength automatically. Efron 2010 (Large-Scale Inference) is the complete textbook on this topic. In practice: in small-n, many-hypothesis, high-variance settings, prefer shrinkage estimators over per-test MLE.
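The dominance result can be checked by simulation. The sketch below uses the positive-part James-Stein estimator shrinking toward the origin, with one unit-variance observation per mean. The dimension (50), the number of replications, the seed, and the way the true means are drawn are all assumptions made for illustration.

```python
import random

def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein estimate shrinking x toward the origin (len(x) >= 3)."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / norm2)   # common shrinkage factor
    return [shrink * v for v in x]

def total_sq_error(est, truth):
    """Sum of squared errors across all coordinates."""
    return sum((e - t) ** 2 for e, t in zip(est, truth))

rng = random.Random(1)
p, reps = 50, 500
theta = [rng.gauss(0.0, 1.0) for _ in range(p)]   # hypothetical true means, held fixed

mle_err = js_err = 0.0
for _ in range(reps):
    x = [t + rng.gauss(0.0, 1.0) for t in theta]  # one noisy observation per mean
    mle_err += total_sq_error(x, theta)           # MLE is the observation itself
    js_err += total_sq_error(james_stein(x), theta)

print(f"average total squared error: MLE = {mle_err / reps:.1f}, JS = {js_err / reps:.1f}")
```

The MLE's average total squared error sits near p = 50, while the shrunken estimate is markedly lower. The gain shrinks as the true means move far from the shrinkage target, but it never turns into a loss in total risk.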