References — Biostatistics Tutorial

Overview

How to use this list

All DOI links point to publisher pages. The classic textbooks (Rosner / Altman / Kleinbaum) are the first choice for both introductory and advanced study; the reporting guidelines (CONSORT / STROBE / TRIPOD) are required reading for writing papers; the ASA statements catalogue common misuses of the p-value. The "Tutorial notes" section at the bottom records methodological controversies and updates to keep in mind while reading the tutorial.

Textbooks

📖 Classic textbooks

BOOK · Rosner B. Fundamentals of Biostatistics (8th ed., 2015). Cengage. Harvard public-health staple; an introduction covering descriptive statistics through regression and survival analysis.
BOOK · Altman DG. Practical Statistics for Medical Research (1991). Chapman & Hall. Clinically oriented; emphasises confidence intervals and reporting standards.
BOOK · Kleinbaum DG, Klein M. Survival Analysis: A Self-Learning Text (3rd ed., 2012). Springer. Among the most readable survival textbooks; thorough treatment of Cox PH and Kaplan-Meier.
BOOK · Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression (3rd ed., 2013). Wiley. The classic on logistic regression, by the originators of the Hosmer-Lemeshow goodness-of-fit test.
BOOK · Pinheiro JC, Bates DM. Mixed-Effects Models in S and S-PLUS (2000). Springer. Authoritative mixed-models text by the nlme authors (Bates also co-authored lme4); theory plus S / S-PLUS code that carries over to nlme in R.
BOOK · Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. Regression Methods in Biostatistics (2nd ed., 2012). Springer. UCSF biostatistics text covering regression, GLMs, survival, and longitudinal data in one volume.
BOOK · Casella G, Berger RL. Statistical Inference (2nd ed., 2002). Duxbury. Graduate-level classic on statistical theory; full coverage of MLE, sufficiency, and hypothesis-testing theory.
BOOK · James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning (2nd ed., 2021). Springer. Known as ISLR; free PDF in R and Python editions; covers regression, classification, tree-based models, and SVMs.
BOOK · Harrell FE. Regression Modeling Strategies (2nd ed., 2015). Springer. DOI: 10.1007/978-3-319-19425-7 · Authoritative on clinical prediction models, by the author of the rms package; detailed on splines, validation, and calibration.
BOOK · Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data (2nd ed., 2003). Springer. Mathematically deeper survival text; competing risks, left truncation, counting-process theory.
BOOK · Hernán MA, Robins JM. Causal Inference: What If (2020). Chapman & Hall · Free PDF · Modern causal-inference textbook; DAGs, target trial emulation, g-methods.
BOOK · Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models (2006). Cambridge University Press. Application-oriented multilevel-modelling classic; presents Bayesian and frequentist approaches side by side.
BOOK · Snijders TAB, Bosker RJ. Multilevel Analysis (2nd ed., 2012). Sage. Methodological multilevel-analysis text; detailed on ICC, cross-level interactions, and centering.
BOOK · Tukey JW. Exploratory Data Analysis (1977). Addison-Wesley. The origin of the term "EDA"; introduced the box plot and the stem-and-leaf display.
BOOK · Efron B, Hastie T. Computer Age Statistical Inference (2016). Cambridge University Press. DOI: 10.1017/CBO9781316576533 · Survey of modern statistics from the bootstrap to deep learning.
BOOK · Agresti A. Categorical Data Analysis (3rd ed., 2012). Wiley. Authoritative categorical-data text; covers 2×2 tables, logistic and log-linear models, and GEE.
BOOK · Cohen J. Statistical Power Analysis for the Behavioral Sciences (2nd ed., 1988). Routledge. The original source for the small / medium / large effect-size conventions and for power analysis.

Official statements

📜 ASA statements

STATEMENT · Wasserstein RL, Lazar NA. The ASA's Statement on p-Values: Context, Process, and Purpose. Am Stat 70(2):129-133 (2016). DOI: 10.1080/00031305.2016.1154108 · Six official principles addressing common misuses of the p-value.
STATEMENT · Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond "p < 0.05". Am Stat 73(sup1):1-19 (2019). DOI: 10.1080/00031305.2019.1583913 · Special-issue editorial making the further case against a bright-line "statistical significance" threshold.
PAPER · Amrhein V, Greenland S, McShane B. Retire statistical significance. Nature 567:305-307 (2019). DOI: 10.1038/d41586-019-00857-9 · Nature comment co-signed by 850 scientists.

Reporting guidelines

📋 Reporting guidelines

GUIDELINE · Schulz KF, Altman DG, Moher D, CONSORT Group. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ 340:c332 (2010). DOI: 10.1136/bmj.c332 · Gold standard for reporting randomised controlled trials (RCTs).
GUIDELINE · von Elm E, Altman DG, Egger M, et al. STROBE: Strengthening the Reporting of Observational Studies in Epidemiology. Lancet 370:1453-1457 (2007). DOI: 10.1016/S0140-6736(07)61602-X · Reporting standard for observational studies (cohort / case-control / cross-sectional).
GUIDELINE · Collins GS, Reitsma JB, Altman DG, Moons KGM. TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis. BMJ 350:g7594 (2015). DOI: 10.1136/bmj.g7594 · Reporting standard for prediction models; TRIPOD+AI was released in 2024.
GUIDELINE · Percie du Sert N, Hurst V, Ahluwalia A, et al. The ARRIVE guidelines 2.0: Updated guidelines for reporting animal research. PLoS Biol 18(7):e3000410 (2020). DOI: 10.1371/journal.pbio.3000410 · 21-item checklist for reporting animal studies.

Key papers

🔑 Key papers by chapter

PAPER · Delacre M, Lakens D, Leys C. Why Psychologists Should by Default Use Welch's t-test Instead of Student's t-test. Int Rev Soc Psychol 30(1):92-101 (2017). DOI: 10.5334/irsp.82 · Make Welch's t-test, not Student's, the default.
PAPER · Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. JRSS-B 57(1):289-300 (1995). DOI: 10.1111/j.2517-6161.1995.tb02031.x · Original BH-FDR paper; the de facto standard for high-throughput data.
PAPER · Storey JD. A direct approach to false discovery rates. JRSS-B 64(3):479-498 (2002). DOI: 10.1111/1467-9868.00346 · Storey's q-value adapts to π₀ for higher power.
PAPER · Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat 55(1):19-24 (2001). DOI: 10.1198/000313001300339897 · Why post-hoc power should not be computed.
PAPER · Cohen J. A power primer. Psychol Bull 112(1):155-159 (1992). DOI: 10.1037/0033-2909.112.1.155 · The canonical Cohen's d benchmarks: small 0.2, medium 0.5, large 0.8.
PAPER · Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49(12):1373-1379 (1996). DOI: 10.1016/S0895-4356(96)00236-3 · Origin of the EPV ≥ 10 rule of thumb for logistic regression.
PAPER · Grambsch PM, Therneau TM. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81(3):515-526 (1994). DOI: 10.1093/biomet/81.3.515 · Schoenfeld-residual test of the Cox PH assumption.
PAPER · Anscombe FJ. Graphs in statistical analysis. Am Stat 27(1):17-21 (1973). DOI: 10.1080/00031305.1973.10478966 · Anscombe's quartet: identical summary statistics, very different distributions.
PAPER · Matejka J, Fitzmaurice G. Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. CHI 2017. DOI: 10.1145/3025453.3025912 · Datasaurus Dozen, a modern demonstration of Anscombe's point.
PAPER · Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. J Stat Softw 67(1):1-48 (2015). DOI: 10.18637/jss.v067.i01 · Official lme4 paper; the de facto standard mixed-models toolkit.
PAPER · Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 73(1):13-22 (1986). DOI: 10.1093/biomet/73.1.13 · Original GEE paper; marginal models for clustered, correlated data.
PAPER · Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 54(2):187-211 (1984). DOI: 10.2307/1942661 · Founding paper on the concept of pseudo-replication.
PAPER · Lazic SE. The problem of pseudoreplication in neuroscientific studies. BMC Neurosci 11:5 (2010). DOI: 10.1186/1471-2202-11-5 · Quantitative assessment of how widespread pseudo-replication is in neuroscience.
PAPER · Aarts E, Verhage M, Veenvliet JV, Dolan CV, van der Sluis S. A solution to dependency: using multilevel analysis to accommodate nested data. Nat Neurosci 17(4):491-496 (2014). DOI: 10.1038/nn.3648 · Standard reference for multilevel analysis of nested neuroscience data.
PAPER · Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ 368:m441 (2020). DOI: 10.1136/bmj.m441 · Modern standard for prediction-model sample-size calculation (supersedes the EPV 10 rule).
PAPER · Vickers AJ, Elkin EB. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med Decis Making 26(6):565-574 (2006). DOI: 10.1177/0272989X06295361 · Original DCA paper; brings clinical utility into model evaluation.
PAPER · Royston P, Parmar MKB. Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Med Res Methodol 13:152 (2013). DOI: 10.1186/1471-2288-13-152 · RMST: an alternative effect measure to the HR when PH is violated.
PAPER · Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. JASA 94(446):496-509 (1999). DOI: 10.1080/01621459.1999.10474144 · Subdistribution-hazard model for competing risks.
PAPER · Schoenfeld DA. The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika 68(1):316-319 (1981). DOI: 10.1093/biomet/68.1.316 · Source of the log-rank sample-size formula; Schoenfeld is also the namesake of Schoenfeld residuals, which he introduced in a separate 1982 paper.
PAPER · Ignatiadis N, Klaus B, Zaugg JB, Huber W. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat Methods 13(7):577-580 (2016). DOI: 10.1038/nmeth.3885 · IHW: modern covariate-weighted FDR control.
PAPER · Storey JD, Tibshirani R. Statistical significance for genomewide studies. PNAS 100(16):9440-9445 (2003). DOI: 10.1073/pnas.1530509100 · Foundational paper applying q-values to genome-wide studies.
PAPER · Westreich D, Greenland S. The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients. Am J Epidemiol 177(4):292-298 (2013). DOI: 10.1093/aje/kws412 · A multivariable regression has only one causal estimand; the other coefficients cannot be read causally as they stand.
PAPER · Lakens D. Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses. Soc Psychol Personality Sci 8(4):355-362 (2017). DOI: 10.1177/1948550617697177 · Practical primer on TOST equivalence testing.
PAPER · Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31(4):337-350 (2016). DOI: 10.1007/s10654-016-0149-3 · Authoritative catalogue of 25 common misinterpretations of p-values, CIs, and power.
PAPER · Faul F, Erdfelder E, Lang A-G, Buchner A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods 39(2):175-191 (2007). DOI: 10.3758/BF03193146 · Official G*Power paper, cited 9000+ times.
PAPER · Tibshirani R. Regression shrinkage and selection via the lasso. JRSS-B 58(1):267-288 (1996). DOI: 10.1111/j.2517-6161.1996.tb02080.x · Original LASSO paper; cornerstone of regularised high-dimensional regression.
PAPER · Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139-140 (2010). DOI: 10.1093/bioinformatics/btp616 · edgeR: NB-GLM differential-expression analysis for high-throughput count data.
PAPER · Weissgerber TL, Milic NM, Winham SJ, Garovic VD. Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128 (2015). DOI: 10.1371/journal.pbio.1002128 · Why bar plots mislead in small-sample studies.
PAPER · Lord SJ, Velle KB, Mullins RD, Fritz-Laylin LK. SuperPlots: Communicating reproducibility and variability in cell biology. J Cell Biol 219(6):e202001064 (2020). DOI: 10.1083/jcb.202001064 · SuperPlots: a visualisation standard for distinguishing biological from technical replicates in cell biology.
PAPER · Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society: the sequel. Adv Physiol Educ 32(1):14-17 (2008). DOI: 10.1152/advan.00074.2007 · Why SD, not SE, should be reported to convey variability.
PAPER · Efron B. Bootstrap Methods: Another Look at the Jackknife. Ann Stat 7(1):1-26 (1979). DOI: 10.1214/aos/1176344552 · Foundational paper on bootstrap resampling.
PAPER · Greenland S. Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values. Am Stat 73(sup1):106-114 (2019). DOI: 10.1080/00031305.2018.1529625 · The S-value (Shannon information) as a complement to p.
PAPER · Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature 483(7391):531-533 (2012). DOI: 10.1038/483531a · Landmark reproducibility-crisis paper (only 6 of 53 preclinical cancer studies could be reproduced).
PAPER · White H. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica 48(4):817-838 (1980). DOI: 10.2307/1912934 · Origin of robust ("White") standard errors.
PAPER · Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS One 10(3):e0118432 (2015). DOI: 10.1371/journal.pone.0118432 · For imbalanced data, prefer the PR curve to the ROC curve.

Tools

🛠️ R / Python tools

DOC · R · survival / survminer · Standard R toolchain for Cox PH models and KM visualisation. · cran/survival · survminer
DOC · R · lme4 + lmerTest · Linear mixed models plus p-values based on Satterthwaite degrees of freedom. · cran/lme4 · cran/lmerTest
DOC · R · pwr · Power and sample-size calculations. · cran/pwr
DOC · Python · lifelines · Industry-standard Python survival-analysis library. · lifelines.readthedocs.io
DOC · Python · statsmodels · One-stop Python toolkit for OLS / GLM / mixed models / GEE / power. · statsmodels.org
DOC · GUI · G*Power · Cross-platform sample-size GUI; cited in 9000+ papers. · gpower.hhu.de
DOC · Python · scikit-survival · scikit-learn-style Python survival analysis; supports machine-learning survival models. · scikit-survival.readthedocs.io
DOC · R · glmmTMB · R package for generalised linear mixed models (including zero-inflation and Tweedie distributions) with a TMB back end. · cran/glmmTMB
DOC · R · brms · Bayesian multilevel-model interface with a Stan back end and lme4-like formula syntax. · paul-buerkner.github.io/brms
DOC · R · pmsampsize · Sample-size calculation for clinical prediction models (following Riley 2020 BMJ). · cran/pmsampsize
DOC · Python · pingouin · Python helpers for common biomedical statistics: tests, effect sizes, and equivalence (TOST) tests. · pingouin-stats.org
DOC · R · gtsummary · Publication-ready Table 1 and regression summary tables in R. · danieldsjoberg.com/gtsummary

📖 Lookup guide: All DOI links point to publisher pages. For open-access versions, try PubMed · PMC · arXiv · bioRxiv.

Tutorial notes

📌 Tutorial notes and details

The notes below record methodological controversies and updates worth knowing while reading this tutorial. The chapter explanations in the tutorial are only an introductory framework; for practical work, defer to the latest guidelines and the current literature on these controversies.

Hypothesis: what a p-value really means

Even where the tutorial presents "p < 0.05 → reject H₀", keep in mind the six principles of the ASA statement (Wasserstein & Lazar 2016 Am Stat). Four matter most here: (1) a p-value is not the probability that a hypothesis (H₀ or H₁) is true; (2) a p-value does not measure effect size; (3) conclusions should not rest solely on whether a p-value crosses a threshold; (4) transparent, complete reporting is needed to guard against p-hacking and forking paths. Gelman & Loken 2014 (the "garden of forking paths" argument) add that even unintentional, data-dependent analysis choices inflate false positives. Better practice is preregistration plus reporting effect sizes with CIs.

Sources: Wasserstein RL, Lazar NA (2016) Am Stat 70:129. DOI: 10.1080/00031305.2016.1154108; Gelman A, Loken E (2014) Am Sci 102:460.

t-tests: Welch by default

The traditional pipeline of "test for equal variances (Levene) first, then use Student's t if they are equal and Welch's t otherwise" is outdated. Delacre, Lakens & Leys 2017 Int Rev Soc Psychol show that Welch performs almost identically to Student when variances are equal and remains robust when they are not, whereas the Levene pre-test adds a second, data-dependent testing stage. Rule: use Welch's t-test directly, with no Levene pre-test. R's t.test() defaults to var.equal=FALSE, i.e. Welch; SciPy's scipy.stats.ttest_ind defaults to equal_var=True, so pass equal_var=False explicitly.

Sources: Delacre M, Lakens D, Leys C (2017) Int Rev Soc Psychol 30:92. DOI: 10.5334/irsp.82.
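As a rough illustration of the Welch-versus-Student difference described in this note, the sketch below computes both test statistics and their degrees of freedom with the Python standard library only. The helper names (welch_t, student_t) and the toy data are invented for this example. p-values are left out because the standard library has no t distribution; in practice one would call scipy.stats.ttest_ind(a, b, equal_var=False).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard error per group
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def student_t(a, b):
    """Student's pooled-variance t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical data: a small, tight group versus a larger, noisier one.
control = [4.8, 5.0, 5.1, 5.2]
treated = [3.0, 4.5, 6.0, 7.5, 9.0, 10.5, 12.0, 13.5]

print("Welch  :", welch_t(control, treated))
print("Student:", student_t(control, treated))
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances, as in the toy data, both the statistic and the degrees of freedom diverge.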

Chi-square: switch to Fisher's exact test for small samples

The chi-square approximation behind the chi-square test needs an expected count of at least 5 in every cell (the usual reading of Cochran's 1954 rule). When any cell of a 2×2 table has an expected count below 5, use Fisher's exact test instead. The Yates continuity correction was an early attempt to correct for the discreteness of the counts; the modern criticism is that it is over-conservative, and in most cases it can be omitted (Sokal & Rohlf 2012; Camilli 1995). In practice: chisq.test(correct=FALSE) in R and scipy.stats.chi2_contingency(correction=False) in Python; for small samples go straight to fisher.test() / scipy.stats.fisher_exact.

Sources: Cochran WG (1954) Biometrics 10:417; Camilli G (1995) Psychol Bull 117:135.
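The decision rule in this note can be sketched in a few lines of standard-library Python. The function names are hypothetical, and the two-sided p-value uses one common convention (summing the probabilities of all tables that are no more likely than the observed one); for real analyses the R and SciPy functions named above are the safer choice.

```python
from math import comb


def expected_counts(table):
    """Expected cell counts of a 2x2 table under independence."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return [[r * col / n for col in cols] for r in rows]


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value: sum of the hypergeometric probabilities
    of all tables with the same margins that are no more likely than the observed one."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):  # probability that the top-left cell equals k
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def choose_test(table):
    """Apply the expected-count >= 5 rule of thumb from the text."""
    smallest = min(min(row) for row in expected_counts(table))
    return "fisher" if smallest < 5 else "chi-square"


table = [[3, 1], [1, 3]]  # hypothetical counts
print(choose_test(table), fisher_exact_two_sided(table))
```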

Post-hoc power: a practice to avoid

Reviewers occasionally ask authors to "report observed power", a practice that statisticians widely reject. Hoenig & Heisey 2001 Am Stat ("The Abuse of Power") show that observed power is a one-to-one function of the p-value (p = 0.05 corresponds to an observed power of about 0.50, and smaller p to higher power), so post-hoc power carries no information beyond the p-value; in practice it only serves to make a non-significant result look like a case of "insufficient sample size". The right approach is to report the effect size with its 95% CI and let readers judge clinical or biological relevance. To answer "what power would a repeat study have?", specify an a priori effect size (e.g. the minimal clinically important difference, MCID) and redo the calculation from that.

Sources: Hoenig JM, Heisey DM (2001) Am Stat 55:19. DOI: 10.1198/000313001300339897.
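The one-to-one relationship can be made concrete for the simplest case, a two-sided z-test. The sketch below is an illustration under that assumption (the function name observed_power is made up here); it takes nothing but the p-value and α as input, which is exactly why observed power adds no information.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Observed ("post-hoc") power of a two-sided z-test, computed from p alone."""
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)  # |z| implied by the p-value
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)   # rejection cut-off
    # Probability of rejecting if the true effect equalled the observed one.
    return _STD_NORMAL.cdf(z_obs - z_crit) + _STD_NORMAL.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

At p equal to α the first term is Φ(0) = 0.5 and the second is negligible, which is where the "p = 0.05 means observed power of about 0.50" correspondence comes from.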

FDR: BH's assumptions and the BY alternative

Benjamini-Hochberg 1995 controls the FDR under positive regression dependency on a subset (PRDS), which covers independence and many positive-dependence structures (gene-level tests usually qualify). Under arbitrary dependence (e.g. time series, certain GWAS structures), switch to Benjamini-Yekutieli 2001 Ann Stat (BY), which is more conservative than BH by a factor of about ln(m) + 0.577, where m is the number of tests. Use p.adjust(p, method="BY") in R or multipletests(method='fdr_by') in Python's statsmodels. Storey's q-value (Storey 2002 JRSS-B) adapts to π₀ (the proportion of true nulls) for higher power, but needs a large m and an estimable π₀; it is common in scRNA-seq differential expression.

Sources: Benjamini Y, Hochberg Y (1995) JRSS-B 57:289. DOI: 10.1111/j.2517-6161.1995.tb02031.x; Benjamini Y, Yekutieli D (2001) Ann Stat 29:1165.
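To show where the ln(m) + 0.577 factor enters, here is a minimal standard-library sketch of the BH and BY adjusted p-values. The function name bh_adjust and its arbitrary_dependence flag are hypothetical; the exact BY factor is the harmonic sum c(m) = 1 + 1/2 + ... + 1/m, which ln(m) + 0.577 approximates. For real work, p.adjust in R or multipletests in statsmodels should be preferred.

```python
def bh_adjust(pvalues, arbitrary_dependence=False):
    """Benjamini-Hochberg adjusted p-values; with arbitrary_dependence=True
    the Benjamini-Yekutieli factor c(m) = sum(1/i) is applied as well."""
    m = len(pvalues)
    c_m = sum(1 / i for i in range(1, m + 1)) if arbitrary_dependence else 1.0
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values
print("BH:", bh_adjust(pvals))
print("BY:", bh_adjust(pvals, arbitrary_dependence=True))
```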

Cox PH: what to do when the assumption is violated

The core assumption of the Cox model is that the hazard ratio is constant over time (proportional hazards). When the Schoenfeld-residual test (Grambsch & Therneau 1994 Biometrika) indicates a violation, the options include: (1) adding a time-dependent term (a covariate-by-time interaction), for example group : tt(time); (2) switching to a stratified Cox model (each stratum gets its own baseline hazard, but PH must still hold within strata); (3) using a parametric AFT (accelerated failure time) model such as Weibull or log-normal; (4) reporting the restricted mean survival time (RMST) instead of the HR (Royston & Parmar 2013 BMC Med Res Methodol). Rules: (a) check every Cox model with cox.zph() in R or check_assumptions() in lifelines; (b) report the result of that check, not just the HR and its CI.

Sources: Grambsch PM, Therneau TM (1994) Biometrika 81:515. DOI: 10.1093/biomet/81.3.515; Royston P, Parmar MKB (2013) BMC Med Res Methodol 13:152.
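Option (4) is easy to picture: the RMST up to a horizon τ is the area under the survival curve on [0, τ]. The sketch below is a minimal illustration using a hand-rolled Kaplan-Meier estimate; the function names, the toy data, and the choice τ = 6 are all assumptions made for this example, and no standard errors are computed.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier steps as (time, survival just after that time).
    events[i] is 1 for an observed event and 0 for a censored time."""
    survival, steps = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - deaths / at_risk
        steps.append((t, survival))
    return steps


def rmst(times, events, tau):
    """Restricted mean survival time: area under the KM curve on [0, tau]."""
    area, prev_t, surv = 0.0, 0.0, 1.0
    for t, surv_after in kaplan_meier(times, events):
        if t >= tau:
            break
        area += surv * (t - prev_t)  # rectangle up to the next drop
        prev_t, surv = t, surv_after
    return area + surv * (tau - prev_t)


# Hypothetical two-arm data (time in months, 1 = event, 0 = censored).
arm_a = ([2, 3, 5, 8, 10], [1, 1, 0, 1, 0])
arm_b = ([1, 2, 2, 4, 6], [1, 1, 1, 0, 1])
tau = 6
print("RMST difference (A - B):", rmst(*arm_a, tau) - rmst(*arm_b, tau))
```

The difference in RMST between arms is read directly in time units (months gained up to τ) and does not depend on the hazards being proportional.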

Mixed models vs pseudo-replication

One of the most common errors in biological experiments is treating several cells from the same mouse, or several sections cut from the same piece of tissue, as n independent observations. Hurlbert 1984 Ecol Monogr ("Pseudoreplication and the design of ecological field experiments") is the classic paper exposing pseudo-replication. The right approach: (1) account for the cluster structure with a random intercept, e.g. lmer(y ~ treatment + (1|mouse)); (2) at the design stage, plan the number of biological replicates (mice) versus technical replicates (cells / slices); (3) when reporting sample size, state both the number of mice and the number of cells. Fields where this is common: electrophysiology (cells per mouse), histology (sections per tissue), cell biology (wells per plate).

Sources: Hurlbert SH (1984) Ecol Monogr 54:187; Lazic SE (2010) BMC Neurosci 11:5; Aarts E et al. (2014) Nat Neurosci 17:491.
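The sketch below does not fit the mixed model itself. It illustrates a cruder, commonly used fallback (collapsing cells to one mean per animal so that the unit of analysis is the mouse) together with the two sample sizes that point (3) asks authors to report. The data and function names are hypothetical.

```python
from statistics import mean

# Hypothetical measurements: (treatment, mouse_id, value), several cells per mouse.
cells = [
    ("ctrl", "m1", 1.0), ("ctrl", "m1", 1.2), ("ctrl", "m1", 0.8),
    ("ctrl", "m2", 2.0), ("ctrl", "m2", 2.2),
    ("drug", "m3", 3.0), ("drug", "m3", 3.4), ("drug", "m3", 2.6),
    ("drug", "m4", 4.0), ("drug", "m4", 4.2),
]


def per_animal_means(rows):
    """Collapse technical replicates (cells) to one mean per biological replicate."""
    by_mouse = {}
    for treatment, mouse, value in rows:
        by_mouse.setdefault((treatment, mouse), []).append(value)
    return {key: mean(values) for key, values in by_mouse.items()}


def sample_sizes(rows):
    """Per treatment: (number of mice, number of cells)."""
    counts = {}
    for treatment, mouse, _ in rows:
        entry = counts.setdefault(treatment, {"mice": set(), "cells": 0})
        entry["mice"].add(mouse)
        entry["cells"] += 1
    return {t: (len(c["mice"]), c["cells"]) for t, c in counts.items()}


print(per_animal_means(cells))
print(sample_sizes(cells))
```

In this toy data each treatment has 5 cells but only 2 mice, so any test on the per-animal means works with n = 2 per group rather than n = 5. Averaging discards within-animal information that a random-intercept model would use, which is one reason the note recommends the mixed model.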

Regression: Table 2 Fallacy

A paper's Table 2 often lists the adjusted regression coefficients of several variables side by side (the exposure together with all the confounders), and readers easily take every row as a "causal effect". Westreich & Greenland 2013 Am J Epidemiol ("The Table 2 Fallacy") state it bluntly: a multivariable regression model has only one causal estimand (the exposure specified a priori); the other covariates are included only to deconfound that exposure, and their coefficients may themselves be biased by colliders or mediators, so they cannot be read causally as they stand. Best practice: (1) draw a separate DAG for each exposure of interest (Hernán & Robins, Causal Inference: What If); (2) derive the minimal sufficient adjustment set for each exposure; (3) report only the adjusted estimate for that exposure in the table rather than the other coefficients that come along with the fit, or label those as included for deconfounding only.

Sources: Westreich D, Greenland S (2013) Am J Epidemiol 177:292. DOI: 10.1093/aje/kws412; Hernán MA, Robins JM (2020) Causal Inference: What If.

Survival: Immortal Time Bias

A common bias in pharmacoepidemiology: follow-up time before the exposure occurs (a period during which the person must, by definition, have survived) is counted in the exposed group, making that group appear to live longer. Example: if exposure is defined as "ever received a statin prescription" but the clock starts at cohort entry, the time "not yet on the drug but necessarily alive" is immortal time, and attributing it to the exposed group manufactures a survival advantage. Lévesque LE et al. 2010 BMJ ("Problem of immortal time bias in cohort studies") is the standard reference. Fixes: (1) a time-varying covariate, so that each person counts as unexposed before the exposure starts and as exposed only afterwards; (2) target trial emulation (Hernán & Robins), which recasts the observational study as a hypothetical RCT with group assignment fixed at time zero; (3) marginal structural models with IPTW to handle time-varying confounding.

Sources: Lévesque LE, Hanley JA, Kezouh A, Suissa S (2010) BMJ 340:b5087. DOI: 10.1136/bmj.b5087; Hernán MA, Robins JM (2020) Causal Inference: What If.
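A toy person-time calculation makes the bias visible. The sketch below contrasts the naive "ever-user" classification with fix (1), where time before the first prescription counts as unexposed. The cohort, field names, and function name are invented for illustration, and time is in arbitrary person-years.

```python
# Hypothetical cohort: follow-up runs from cohort entry (time 0) to `exit`;
# `rx_start` is when the drug was first prescribed (None = never).
cohort = [
    {"rx_start": 2, "exit": 10, "died": False},
    {"rx_start": 4, "exit": 6, "died": True},
    {"rx_start": None, "exit": 3, "died": True},
    {"rx_start": None, "exit": 8, "died": False},
]


def person_time(people, time_varying):
    """Return {group: [person_time, deaths]} for 'exposed' and 'unexposed'."""
    totals = {"exposed": [0, 0], "unexposed": [0, 0]}
    for p in people:
        start = p["rx_start"]
        if start is None:
            totals["unexposed"][0] += p["exit"]
            totals["unexposed"][1] += p["died"]
        elif time_varying:
            totals["unexposed"][0] += start            # pre-prescription time
            totals["exposed"][0] += p["exit"] - start  # time actually on drug
            totals["exposed"][1] += p["died"]
        else:
            totals["exposed"][0] += p["exit"]  # biased: includes immortal time
            totals["exposed"][1] += p["died"]
    return totals


for label, flag in (("naive", False), ("time-varying", True)):
    for group, (time, deaths) in person_time(cohort, flag).items():
        print(f"{label:>12} {group:>9}: {deaths} deaths / {time} person-years")
```

In this made-up cohort the naive split gives the exposed group the lower death rate (1 death in 16 person-years versus 1 in 11), while the time-varying split reverses the comparison (1 in 10 versus 1 in 17): the apparent benefit came entirely from misattributed immortal time.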

Logistic: the myth of the 0.5 threshold

Many classifier libraries output binary predictions by thresholding predict_proba at 0.5, but 0.5 is the accuracy-maximising threshold only when (a) the two classes are equally common and (b) false positives and false negatives cost the same; in clinical practice both conditions almost never hold. Common alternatives: (1) Youden's J statistic, J = max(TPR − FPR), which picks the ROC point with the greatest vertical distance above the chance diagonal; (2) Decision Curve Analysis (Vickers & Elkin 2006 Med Decis Making), which evaluates net clinical benefit at the threshold implied by the FP:FN trade-off clinicians are willing to accept; (3) for imbalanced data, report precision-recall (Saito & Rehmsmeier 2015 PLoS One) rather than ROC-AUC. Rule: choose the threshold on the train / validation split and evaluate once on the held-out test split; never tune the threshold repeatedly on test data.

Sources: Youden WJ (1950) Cancer 3:32; Vickers AJ, Elkin EB (2006) Med Decis Making 26:565. DOI: 10.1177/0272989X06295361; Saito T, Rehmsmeier M (2015) PLoS One 10:e0118432.
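Alternatives (1) and (2) can be sketched without any library. The code below picks the Youden-optimal threshold among the observed probabilities and evaluates the decision-curve net benefit, TP/n − FP/n × pt/(1 − pt), at a chosen threshold probability pt. The function names and the validation data are hypothetical, and the sketch assumes both classes are present.

```python
def confusion(y_true, y_prob, threshold):
    """Counts (TP, FP, TN, FN) when predicting positive at prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    return tp, fp, tn, fn


def youden_threshold(y_true, y_prob):
    """Threshold (among the observed probabilities) maximising J = TPR - FPR."""
    def j_stat(threshold):
        tp, fp, tn, fn = confusion(y_true, y_prob, threshold)
        return tp / (tp + fn) - fp / (fp + tn)
    return max(sorted(set(y_prob)), key=j_stat)


def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit at threshold probability pt = threshold."""
    tp, fp, _, _ = confusion(y_true, y_prob, threshold)
    n = len(y_true)
    return tp / n - fp / n * threshold / (1 - threshold)


# Hypothetical validation-set labels and predicted probabilities.
y = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9]

best = youden_threshold(y, p)
print("Youden threshold:", best)
print("Net benefit at 0.5 :", net_benefit(y, p, 0.5))
print("Net benefit at best:", net_benefit(y, p, best))
```

Following the rule in this note, the threshold would be chosen on validation data like the arrays above and then applied once, unchanged, to the test split.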