How to use this resource
All DOI links point to the publishers' official pages. The classic textbooks (Rosner / Altman / Kleinbaum) are the first choice for both introductory and more advanced study; the reporting guidelines (CONSORT / STROBE / TRIPOD) are required reading for paper writing; the ASA statements catalogue the ways p-values are commonly misused. The "Tutorial Notes" section at the bottom records methodological controversies and updates to be aware of while reading the tutorial.
📖 Classic Textbooks
- BOOK · Fundamentals of Biostatistics (8th ed., 2015)
- BOOK · Practical Statistics for Medical Research (1991)
- BOOK · Survival Analysis: A Self-Learning Text (3rd ed., 2012)
- BOOK · Applied Logistic Regression (3rd ed., 2013)
- BOOK · Mixed-Effects Models in S and S-PLUS (2000)
- BOOK · Regression Methods in Biostatistics (2nd ed., 2012)
- BOOK · Statistical Inference (2nd ed., 2002)
- BOOK · An Introduction to Statistical Learning (2nd ed., 2021)
- BOOK · Regression Modeling Strategies (2nd ed., 2015)
- BOOK · Survival Analysis: Techniques for Censored and Truncated Data (2nd ed., 2003)
- BOOK · Causal Inference: What If (2020)
- BOOK · Data Analysis Using Regression and Multilevel/Hierarchical Models (2006)
- BOOK · Multilevel Analysis (2nd ed., 2012)
- BOOK · Exploratory Data Analysis (1977)
- BOOK · Computer Age Statistical Inference (2016)
- BOOK · Categorical Data Analysis (3rd ed., 2012)
- BOOK · Statistical Power Analysis for the Behavioral Sciences (2nd ed., 1988)
📜 ASA Statements
- STATEMENT · The ASA's Statement on p-Values: Context, Process, and Purpose. Am Stat 70(2):129-133 (2016).
- STATEMENT · Moving to a World Beyond "p < 0.05". Am Stat 73(sup1):1-19 (2019).
- PAPER · Retire statistical significance. Nature 567:305-307 (2019).
📋 Reporting Guidelines
- GUIDELINE · CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ 340:c332 (2010).
- GUIDELINE · STROBE: Strengthening the Reporting of Observational Studies in Epidemiology. Lancet 370:1453-1457 (2007).
- GUIDELINE · TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis. BMJ 350:g7594 (2015).
- GUIDELINE · The ARRIVE guidelines 2.0: Updated guidelines for reporting animal research. PLoS Biol 18(7):e3000410 (2020).
🔑 Key Papers by Chapter
- PAPER · Why Psychologists Should by Default Use Welch's t-test Instead of Student's t-test. Int Rev Soc Psychol 30(1):92-101 (2017).
- PAPER · Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. JRSS-B 57(1):289-300 (1995).
- PAPER · A direct approach to false discovery rates. JRSS-B 64(3):479-498 (2002).
- PAPER · The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat 55(1):19-24 (2001).
- PAPER · A power primer. Psychol Bull 112(1):155-159 (1992).
- PAPER · A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49(12):1373-1379 (1996).
- PAPER · Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81(3):515-526 (1994).
- PAPER · Graphs in statistical analysis. Am Stat 27(1):17-21 (1973).
- PAPER · Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. CHI 2017.
- PAPER · Fitting Linear Mixed-Effects Models Using lme4. J Stat Softw 67(1):1-48 (2015).
- PAPER · Longitudinal data analysis using generalized linear models. Biometrika 73(1):13-22 (1986).
- PAPER · Pseudoreplication and the design of ecological field experiments. Ecol Monogr 54(2):187-211 (1984).
- PAPER · The problem of pseudoreplication in neuroscientific studies. BMC Neurosci 11:5 (2010).
- PAPER · A solution to dependency: using multilevel analysis to accommodate nested data. Nat Neurosci 17(4):491-496 (2014).
- PAPER · Calculating the sample size required for developing a clinical prediction model. BMJ 368:m441 (2020).
- PAPER · Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med Decis Making 26(6):565-574 (2006).
- PAPER · Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Med Res Methodol 13:152 (2013).
- PAPER · A proportional hazards model for the subdistribution of a competing risk. JASA 94(446):496-509 (1999).
- PAPER · The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika 68(1):316-319 (1981).
- PAPER · Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat Methods 13(7):577-580 (2016).
- PAPER · Statistical significance for genomewide studies. PNAS 100(16):9440-9445 (2003).
- PAPER · The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients. Am J Epidemiol 177(4):292-298 (2013).
- PAPER · Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses. Soc Psychol Personality Sci 8(4):355-362 (2017).
- PAPER · Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31(4):337-350 (2016).
- PAPER · G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods 39(2):175-191 (2007).
- PAPER · Regression shrinkage and selection via the lasso. JRSS-B 58(1):267-288 (1996).
- PAPER · edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139-140 (2010).
- PAPER · Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128 (2015).
- PAPER · SuperPlots: Communicating reproducibility and variability in cell biology. J Cell Biol 219(6):e202001064 (2020).
- PAPER · Guidelines for reporting statistics in journals published by the American Physiological Society: the sequel. Adv Physiol Educ 32(1):14-17 (2008).
- PAPER · Bootstrap Methods: Another Look at the Jackknife. Ann Stat 7(1):1-26 (1979).
- PAPER · Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values. Am Stat 73(sup1):106-114 (2019).
- PAPER · Drug development: Raise standards for preclinical cancer research. Nature 483(7391):531-533 (2012).
- PAPER · A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica 48(4):817-838 (1980).
- PAPER · The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS One 10(3):e0118432 (2015).
🛠️ R / Python Tools
- DOC · R · survival / survminer · Standard R toolchain for Cox PH models and KM visualisation.
- DOC · R · lme4 + lmerTest · Linear mixed models with Satterthwaite-df p-values.
- DOC · R · pwr · Power and sample-size calculations.
- DOC · Python · lifelines · Widely used Python survival-analysis library.
- DOC · Python · statsmodels · One-stop Python toolkit for OLS / GLM / mixed models / GEE / power.
- DOC · GUI · G*Power · Cross-platform sample-size GUI; cited in 9000+ papers.
- DOC · Python · scikit-survival · scikit-learn-style Python survival library; supports machine-learning survival models.
- DOC · R · glmmTMB · R package for generalised linear mixed models (including zero-inflation and Tweedie distributions), backed by TMB.
- DOC · R · brms · Bayesian multilevel-model interface to Stan with lme4-like formula syntax.
- DOC · R · pmsampsize · Sample-size calculation for clinical prediction models (following Riley 2020 BMJ).
- DOC · Python · pingouin · Python helpers for common biomedical tests, effect sizes, and equivalence tests (TOST).
- DOC · R · gtsummary · Publication-ready Table 1 and regression summary tables in R.
📌 Tutorial Notes and Details
The notes below record methodological controversies and updates worth knowing while reading this tutorial. The chapter explanations are introductory frameworks only; for practical work, defer to the latest guidelines and the current literature on these debates.
Hypothesis: what a p-value really means
Even where the tutorial frames testing as "p < 0.05 → reject H₀", keep in mind the six principles of the ASA statement (Wasserstein & Lazar 2016, Am Stat). Four of them matter most here: (1) a p-value is not the probability that a hypothesis (H₀ or H₁) is true; (2) a p-value does not measure the size or importance of an effect; (3) scientific conclusions should not rest solely on whether a p-value crosses a threshold; (4) proper inference requires full and transparent reporting, which is what guards against p-hacking and forking paths. The remaining two principles state that a p-value can indicate how incompatible the data are with a specified statistical model, and that by itself it is not a good measure of evidence for a model or hypothesis. Gelman & Loken 2014 ("garden of forking paths") add that even unintentional, data-dependent analysis choices inflate false positives. The recommended practice is preregistration plus reporting effect sizes with CIs.
Sources: Wasserstein RL, Lazar NA (2016) Am Stat 70:129. DOI: 10.1080/00031305.2016.1154108; Gelman A, Loken E (2014) Am Sci 102:460.
t-tests: Welch by default
The traditional pipeline "Levene's test first → Student's t if the variances look equal, Welch's t otherwise" is outdated. Delacre, Lakens & Leys 2017 (Int Rev Soc Psychol) show that Welch's test performs almost identically to Student's when variances are equal and stays robust when they differ, whereas Student's does not. The Levene pre-test also turns the analysis into a two-stage procedure whose result depends on a preliminary test, so the overall error rate is no longer the nominal one. Rule: run Welch's t-test directly, with no Levene pre-test. R's t.test() is Welch by default (var.equal=FALSE). SciPy's scipy.stats.ttest_ind defaults to equal_var=True (Student's), so pass equal_var=False explicitly.
Sources: Delacre M, Lakens D, Leys C (2017) Int Rev Soc Psychol 30:92. DOI: 10.5334/irsp.82.
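As a rough illustration of the rule above, the following Python sketch computes Welch's statistic and the Welch-Satterthwaite degrees of freedom by hand and checks them against SciPy. The sample values and the helper name welch_t are made up for the example; they are not from the tutorial.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size  # squared standard errors
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df

# Hypothetical data: unequal group sizes and clearly unequal variances.
control = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0]
treated = [6.5, 4.0, 7.9, 3.2, 8.4, 5.6, 7.1, 4.4, 6.8, 9.0]

t_manual, df_manual = welch_t(control, treated)
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

welch = stats.ttest_ind(control, treated, equal_var=False)  # Welch
student = stats.ttest_ind(control, treated)                 # SciPy default: Student

print(f"Welch:   t = {welch.statistic:.3f}, df = {df_manual:.1f}, p = {welch.pvalue:.4f}")
print(f"Student: t = {student.statistic:.3f}, df = {len(control) + len(treated) - 2}, p = {student.pvalue:.4f}")
```

The Welch degrees of freedom come out below n1 + n2 − 2, which is how the test accounts for the extra uncertainty in the variance estimates.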
Chi-square: switch to Fisher's exact test for small samples
The chi-square approximation relies on the expected counts being large enough. The usual rule of thumb, following Cochran 1954, is an expected count ≥ 5 in every cell (for larger tables Cochran's rule is looser: no expected count below 1 and at most 20% of cells below 5). When any cell of a 2×2 table has an expected count < 5, use Fisher's exact test instead. Yates' continuity correction was an early attempt to correct for approximating a discrete distribution with a continuous one; the modern view is that it is over-conservative and can usually be omitted (Sokal & Rohlf 2012; Camilli 1995). In practice: chisq.test(correct=FALSE) in R or scipy.stats.chi2_contingency(correction=False) in Python; for small samples go straight to fisher.test() / scipy.stats.fisher_exact.
Sources: Cochran WG (1954) Biometrics 10:417; Camilli G (1995) Psychol Bull 117:135.
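The decision rule above can be sketched in Python as follows. The helper name choose_2x2_test and the two example tables are hypothetical; the cut-off of 5 is the rule of thumb from the text, not a SciPy default.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def choose_2x2_test(table, min_expected=5.0):
    """Pearson chi-square without Yates correction, or Fisher's exact test
    when any expected count falls below `min_expected`."""
    table = np.asarray(table)
    chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)
    if expected.min() < min_expected:
        _, p_fisher = fisher_exact(table)
        return "fisher", p_fisher
    return "chi2", p_chi2

small = [[3, 1], [1, 3]]        # every expected count is 2
large = [[30, 20], [20, 30]]    # every expected count is 25

method_small, p_small = choose_2x2_test(small)
method_large, p_large = choose_2x2_test(large)
print(method_small, round(p_small, 4))
print(method_large, round(p_large, 4))
```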
Post-hoc power: do not report it
Reviewers occasionally ask authors to "report the observed power". Statisticians widely reject this practice. Hoenig & Heisey 2001 (Am Stat, "The Abuse of Power") show that, for a given test and α, observed power is a one-to-one function of the p-value: p = 0.05 corresponds to an observed power of about 0.50, so p < 0.05 ⇔ power > 0.50. Post-hoc power therefore carries no information beyond the p-value; in practice it only serves to make a non-significant result look like a sample-size problem. Instead, report the effect size with its 95% CI and let readers judge clinical or biological relevance. To answer "what power would a new study have?", specify an a priori effect size (for example the minimal clinically important difference, MCID) and compute the power from that.
Sources: Hoenig JM, Heisey DM (2001) Am Stat 55:19. DOI: 10.1198/000313001300339897.
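The one-to-one relationship can be checked numerically. The sketch below assumes a two-sided z-test (an assumption made here for simplicity; other tests give a different but still one-to-one curve) and uses only the Python standard library. The function name observed_power is illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test, computed from the p-value alone."""
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)    # rejection cut-off
    # Plug the observed effect in as if it were the true effect.
    return _STD_NORMAL.cdf(z_obs - z_crit) + _STD_NORMAL.cdf(-z_obs - z_crit)

for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

Because the function takes nothing but the p-value (and α) as input, the output cannot contain information that the p-value does not already carry.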
FDR: assumptions of BH and the BY alternative
The Benjamini-Hochberg (1995) procedure controls the FDR under independence and, more generally, under positive regression dependency on a subset (PRDS), which covers many positive-dependence structures (gene-level tests are usually taken to qualify). Under arbitrary dependence (for example time series or certain GWAS structures), use Benjamini-Yekutieli 2001 (Ann Stat; BY) instead. BY multiplies the BH-adjusted p-values by Σᵢ 1/i ≈ ln(m) + 0.577 for m tests, so it is more conservative by that factor. Use p.adjust(p, method="BY") in R or statsmodels' multipletests(p, method='fdr_by') in Python. The Storey q-value (Storey 2002, JRSS-B) adapts to π₀, the proportion of true null hypotheses, which gives higher power but requires m to be large enough for π₀ to be estimated; it is common in scRNA-seq differential expression.
Sources: Benjamini Y, Hochberg Y (1995) JRSS-B 57:289. DOI: 10.1111/j.2517-6161.1995.tb02031.x; Benjamini Y, Yekutieli D (2001) Ann Stat 29:1165.
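To make the BH/BY relationship concrete, here is a minimal NumPy sketch of both adjustments. It is a hand-written illustration (the function name fdr_adjust and the p-values are made up), not the statsmodels or R implementation, although it follows the same step-up logic.

```python
import numpy as np

def fdr_adjust(pvals, method="bh"):
    """BH or BY adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, float)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    scaled = p[order] * m / ranks
    if method == "by":
        scaled = scaled * np.sum(1.0 / ranks)   # harmonic factor, about ln(m) + 0.577
    # Step-up: running minimum from the largest p-value down, capped at 1.
    scaled = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = scaled
    return adjusted

pvals = [0.041, 0.001, 0.205, 0.008, 0.039, 0.042, 0.060, 0.074]
bh = fdr_adjust(pvals, "bh")
by = fdr_adjust(pvals, "by")
harmonic = np.sum(1.0 / np.arange(1, len(pvals) + 1))

print("BH:", np.round(bh, 4))
print("BY:", np.round(by, 4))
print("BY/BH factor:", round(harmonic, 4))
```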
Cox PH: what to do when the assumption is violated
The Cox model assumes that hazard ratios are constant over time (proportional hazards). When a test based on scaled Schoenfeld residuals (Grambsch & Therneau 1994, Biometrika) indicates a violation, the options include: (1) let the effect vary with time by adding a covariate-by-time interaction, for example through the tt() mechanism of coxph() in R; (2) fit a stratified Cox model, stratifying on the offending variable, so that each stratum has its own baseline hazard (PH is still assumed for the remaining covariates within strata); (3) use a parametric accelerated failure time (AFT) model such as Weibull or log-normal; (4) report the restricted mean survival time (RMST) instead of the HR (Royston & Parmar 2013, BMC Med Res Methodol). Rules: (a) check every Cox model with cox.zph() in R or lifelines' check_assumptions(); (b) report the result of that check, not only the HR and its CI.
Sources: Grambsch PM, Therneau TM (1994) Biometrika 81:515. DOI: 10.1093/biomet/81.3.515; Royston P, Parmar MKB (2013) BMC Med Res Methodol 13:152.
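For option (4), the RMST is the area under the Kaplan-Meier curve up to a chosen horizon τ. The sketch below is a hand-rolled illustration with made-up data and a hypothetical function name (km_rmst); for real analyses a survival package would normally be used instead.

```python
import numpy as np

def km_rmst(time, event, tau):
    """Kaplan-Meier survival at tau and restricted mean survival time up to tau."""
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    event_times = np.unique(time[(event == 1) & (time <= tau)])
    surv, rmst, prev_t = 1.0, 0.0, 0.0
    for t in event_times:
        rmst += surv * (t - prev_t)              # area of the flat step before t
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & (event == 1))
        surv *= 1.0 - deaths / at_risk           # KM drop at t
        prev_t = t
    rmst += surv * (tau - prev_t)                # final step up to tau
    return surv, rmst

# Hypothetical follow-up times (months); event = 1 is death, 0 is censoring.
time = [2, 3, 5, 7]
event = [1, 0, 1, 1]
surv_tau, rmst_tau = km_rmst(time, event, tau=6)
print(f"S(6) = {surv_tau:.3f}, RMST(6) = {rmst_tau:.3f} months")
```

The difference in RMST between two arms is read as "months of life gained within the first τ months", which stays interpretable even when hazards are not proportional.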
Mixed models vs pseudoreplication
One of the most common mistakes in biological experiments is to treat multiple cells from the same mouse, or multiple sections from the same piece of tissue, as independent observations. Hurlbert 1984 (Ecol Monogr, "Pseudoreplication and the design of ecological field experiments") is the classic paper exposing this error. The right approach: (1) model the cluster structure with a random intercept, e.g. lmer(y ~ treatment + (1|mouse)); (2) at the design stage, plan the number of biological replicates (mice) separately from the number of technical replicates (cells / slices); (3) when reporting sample size, state both the number of mice and the number of cells. Fields where this comes up often: electrophysiology (cells per mouse), histology (sections per tissue), cell biology (wells per plate).
Sources: Hurlbert SH (1984) Ecol Monogr 54:187; Lazic SE (2010) BMC Neurosci 11:5; Aarts E et al. (2014) Nat Neurosci 17:491.
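A small simulation shows why this matters. The sketch below generates data with no treatment effect at all and compares a naive cell-level t-test with a test on per-mouse means. Note the swap: averaging within mouse is a simpler summary-statistic alternative to the random-intercept model named above, used here only to keep the example dependency-light. All parameter values are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

def false_positive_rates(n_sim=2000, mice=4, cells=20,
                         sd_mouse=1.0, sd_cell=1.0, alpha=0.05, seed=0):
    """Type I error of a cell-level test vs a mouse-level test under the null."""
    rng = np.random.default_rng(seed)
    # Axes: simulation, group (2), mouse, cell. No true group difference.
    mouse_effect = rng.normal(0.0, sd_mouse, size=(n_sim, 2, mice, 1))
    y = mouse_effect + rng.normal(0.0, sd_cell, size=(n_sim, 2, mice, cells))

    # Naive: pool all cells and pretend n = mice * cells per group.
    pooled = y.reshape(n_sim, 2, mice * cells)
    p_naive = stats.ttest_ind(pooled[:, 0], pooled[:, 1],
                              axis=1, equal_var=False).pvalue

    # Mouse-level: one mean per mouse, n = mice per group.
    means = y.mean(axis=3)
    p_mouse = stats.ttest_ind(means[:, 0], means[:, 1],
                              axis=1, equal_var=False).pvalue
    return float((p_naive < alpha).mean()), float((p_mouse < alpha).mean())

naive_rate, mouse_rate = false_positive_rates()
print(f"cell-level false-positive rate:  {naive_rate:.3f}")
print(f"mouse-level false-positive rate: {mouse_rate:.3f}")
```

With these settings the cell-level test rejects far more often than the nominal 5%, while the mouse-level test stays close to it.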
Regression: the Table 2 fallacy
Table 2 of a manuscript often lists the adjusted regression coefficients of the exposure side by side with those of every confounder, which invites readers to read each row as a causal effect. Westreich & Greenland 2013 (Am J Epidemiol, "The Table 2 Fallacy") are blunt about this: a multivariable regression model is built for one causal estimand, the effect of the exposure specified a priori. The other covariates are included only to deconfound that exposure. Their own coefficients may be distorted by confounding, collider or mediator bias, and often correspond to a direct rather than a total effect, so they cannot be read causally as they stand. Best practice: (1) draw a separate DAG for each exposure of interest (Hernán & Robins, Causal Inference: What If); (2) derive the minimal sufficient adjustment set for each exposure; (3) report only the adjusted estimate for the target exposure, or label the remaining coefficients as "for deconfounding only".
Sources: Westreich D, Greenland S (2013) Am J Epidemiol 177:292. DOI: 10.1093/aje/kws412; Hernán MA, Robins JM (2020) Causal Inference: What If.
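A simulated example may help. The sketch below assumes a simple linear DAG, C → X → Y plus C → Y, with made-up effect sizes. In the model Y ~ X + C the coefficient of X estimates the total effect of X, but the coefficient of C estimates only the direct effect of C, because the path through X is blocked by conditioning on X.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
c = rng.normal(size=n)                        # confounder
x = 0.8 * c + rng.normal(size=n)              # exposure, partly caused by C
y = 1.0 * x + 0.5 * c + rng.normal(size=n)    # outcome

def ols(response, *predictors):
    """Least-squares slopes (intercept fitted but not returned)."""
    design = np.column_stack([np.ones(len(response)), *predictors])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef[1:]

b_x, b_c = ols(y, x, c)      # the "Table 2" model
(total_c,) = ols(y, c)       # model targeting the total effect of C

print(f"Y ~ X + C: coef(X) = {b_x:.2f} (total effect of X, true value 1.0)")
print(f"Y ~ X + C: coef(C) = {b_c:.2f} (direct effect of C only, true value 0.5)")
print(f"Y ~ C:     coef(C) = {total_c:.2f} (total effect of C, true value 0.5 + 0.8 * 1.0 = 1.3)")
```

Each exposure needs its own adjustment set: the model that is right for X is the wrong one for the total effect of C.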
Survival: immortal time bias
A common bias in pharmacoepidemiology: follow-up time accrued before exposure begins, during which the person is necessarily alive, is attributed to the exposed group and makes that group appear to live longer. Example: "ever received a statin prescription" is used as the exposure, but the clock starts at cohort entry. The interval before the first prescription is immortal time, and counting it as exposed time manufactures a survival advantage. Lévesque LE et al. 2010 (BMJ, "Problem of immortal time bias in cohort studies") is the standard reference. Fixes: (1) treat exposure as a time-varying covariate, so that each person counts as unexposed until exposure starts and as exposed afterwards; (2) target trial emulation (Hernán & Robins), which recasts the observational study as a hypothetical RCT with group assignment aligned at time zero; (3) marginal structural models with IPTW to handle time-varying confounding.
Sources: Lévesque LE, Hanley JA, Kezouh A, Suissa S (2010) BMJ 340:b5087. DOI: 10.1136/bmj.b5087; Hernán MA, Robins JM (2020) Causal Inference: What If.
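The effect of fix (1) can be seen with simple person-time arithmetic. The six subjects below are invented for illustration, and the function name event_rates is hypothetical; each tuple holds the follow-up time in years, the time at which exposure started (None if never exposed), and whether the person died at the end of follow-up.

```python
def event_rates(subjects, split_person_time):
    """Deaths per person-year by exposure group.

    split_person_time=False reproduces the biased "ever exposed" analysis;
    True counts pre-exposure time as unexposed (time-varying exposure).
    """
    person_time = {"exposed": 0.0, "unexposed": 0.0}
    deaths = {"exposed": 0, "unexposed": 0}
    for follow_up, exposure_start, died in subjects:
        if exposure_start is None:
            person_time["unexposed"] += follow_up
            group = "unexposed"
        elif split_person_time:
            person_time["unexposed"] += exposure_start          # immortal time
            person_time["exposed"] += follow_up - exposure_start
            group = "exposed"
        else:
            person_time["exposed"] += follow_up                 # misclassified
            group = "exposed"
        deaths[group] += int(died)
    return {g: deaths[g] / person_time[g] for g in person_time}

subjects = [
    (10, 4, False), (8, 6, True), (10, 2, False),   # eventually exposed
    (3, None, True), (5, None, True), (2, None, True),
]

naive = event_rates(subjects, split_person_time=False)
split = event_rates(subjects, split_person_time=True)
naive_ratio = naive["exposed"] / naive["unexposed"]
split_ratio = split["exposed"] / split["unexposed"]
print(f"rate ratio, ever-exposed analysis:  {naive_ratio:.2f}")
print(f"rate ratio, time-varying exposure: {split_ratio:.2f}")
```

Moving the pre-exposure years into the unexposed denominator shrinks the apparent protective effect considerably, even though the data are unchanged.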
Logistic: the myth of the 0.5 threshold
Many classifier libraries turn predict_proba into a binary prediction by cutting at 0.5. That cut-off maximises expected accuracy only when the predicted probabilities are well calibrated, and accuracy itself is a sensible target only when (a) the classes are roughly balanced and (b) false positives and false negatives cost the same. These conditions almost never hold in clinical practice. Common alternatives: (1) Youden's J statistic, J = TPR − FPR, choosing the threshold that maximises J (the point on the ROC curve farthest above the diagonal); (2) Decision Curve Analysis (Vickers & Elkin 2006, Med Decis Making), which sets the threshold probability from the clinically acceptable FP:FN trade-off and compares net benefit at that threshold; (3) for imbalanced data, report precision-recall (Saito & Rehmsmeier 2015, PLoS One) rather than ROC-AUC. Rule: choose the threshold on the train/validation split and evaluate only on the held-out test split; never try out thresholds repeatedly on the test data.
Sources: Youden WJ (1950) Cancer 3:32; Vickers AJ, Elkin EB (2006) Med Decis Making 26:565. DOI: 10.1177/0272989X06295361; Saito T, Rehmsmeier M (2015) PLoS One 10:e0118432.
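Alternatives (1) and (2) can be sketched in a few lines of NumPy. The labels and scores below are a made-up validation set, the function names are illustrative, and the net-benefit formula assumes the scores are predicted probabilities, with the threshold probability encoding the accepted FP:FN trade-off.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Threshold maximising J = TPR - FPR over the observed scores."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    thresholds = np.unique(scores)
    pred = scores[None, :] >= thresholds[:, None]      # one row per threshold
    tpr = (pred & (y_true == 1)).sum(axis=1) / (y_true == 1).sum()
    fpr = (pred & (y_true == 0)).sum(axis=1) / (y_true == 0).sum()
    j = tpr - fpr
    best = int(np.argmax(j))
    return float(thresholds[best]), float(j[best])

def net_benefit(y_true, scores, threshold):
    """Decision-curve net benefit: TP/n - FP/n * pt / (1 - pt)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    n = y_true.size
    return float(tp / n - fp / n * threshold / (1 - threshold))

# Hypothetical validation-split labels and predicted probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]

best_t, best_j = youden_threshold(y_val, p_val)
nb_model = net_benefit(y_val, p_val, 0.4)
nb_treat_all = net_benefit(y_val, np.ones(len(y_val)), 0.4)
print(f"Youden threshold = {best_t:.2f}, J = {best_j:.2f}")
print(f"net benefit at pt = 0.4: model {nb_model:.3f}, treat-all {nb_treat_all:.3f}")
```

Whichever criterion is used, the threshold is fixed on the validation split as here, and the test split is touched once for the final evaluation.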