📚 References

Classic textbooks, official statements, reporting guidelines, and current debates for all 13 chapters of the biostatistics tutorial.

How to use this resource?

All DOI links point to publisher pages. The classic textbooks (Rosner / Altman / Kleinbaum) are the first choice for both introductory and advanced reading; the reporting guidelines (CONSORT / STROBE / TRIPOD) are required reading for paper writing; the ASA statements catalogue the common ways p-values are misused. The "Tutorial Notes" section at the bottom records methodological controversies and updates to be aware of while reading the tutorial.

📖 Classic textbooks

📜 ASA statements

📋 Reporting guidelines

🔑 Key papers by chapter

🛠️ R / Python tools

📖 Lookup guide: All DOI links point to publisher pages. For open-access versions, try PubMed · PMC · arXiv · bioRxiv.

📌 Tutorial Notes and Details

The notes below record methodological controversies and current updates worth knowing while reading this tutorial. The chapter explanations are introductory frameworks only; practical applications should defer to the latest guidelines and the current literature.

Hypothesis: what a p-value really means

Even though the tutorial frames testing as "p < 0.05 → reject H₀", keep in mind the six principles of the ASA statement (Wasserstein & Lazar 2016, Am Stat), of which the tutorial highlights four: (1) a p-value is not the probability that a hypothesis is true, neither H₀ nor H₁; (2) a p-value does not measure the size of an effect; (3) conclusions should not rest solely on whether a p-value crosses a threshold; (4) proper inference requires transparent and complete reporting, which guards against p-hacking and forking paths. Gelman & Loken 2014 ("Garden of Forking Paths") further argue that even unintentional, data-dependent analysis choices inflate false positives. The recommended practice is preregistration plus reporting effect sizes with confidence intervals.

Sources: Wasserstein RL, Lazar NA (2016) Am Stat 70:129. DOI: 10.1080/00031305.2016.1154108; Gelman A, Loken E (2014) Am Sci 102:460.

t-tests: Welch by default

The traditional pipeline "test equal variances first (Levene) → Student if equal, Welch otherwise" is outdated. Delacre, Lakens & Leys 2017 (Int Rev Soc Psychol) show that Welch performs almost identically to Student when variances are equal and is far more robust when they are not; the Levene pre-test, by contrast, turns the analysis into a two-stage testing procedure (a multiple-testing problem). Rule: use Welch's t-test directly, with no Levene pre-test. R's t.test() defaults to var.equal=FALSE, which is Welch. In Python, scipy.stats.ttest_ind defaults to equal_var=True (Student), so pass equal_var=False explicitly to get Welch.

Sources: Delacre M, Lakens D, Leys C (2017) Int Rev Soc Psychol 30:92. DOI: 10.5334/irsp.82.
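To make the Welch rule above concrete, here is a minimal sketch. The two samples are hypothetical, and `welch_by_hand` is an illustrative helper (not a library function) that recomputes the statistic and the Welch-Satterthwaite degrees of freedom so the scipy result can be cross-checked.

```python
import math
from scipy import stats

# Hypothetical measurements: group b is smaller and far more variable than group a.
a = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]
b = [6.5, 3.0, 8.0, 4.5]

def welch_by_hand(x, y):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se2 = vx / nx + vy / ny                        # squared standard error, no pooling
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

welch = stats.ttest_ind(a, b, equal_var=False)  # Welch: must be requested explicitly
student = stats.ttest_ind(a, b)                 # scipy default is equal_var=True (Student)
t_manual, df_manual = welch_by_hand(a, b)
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

print(f"Welch   t={welch.statistic:.3f}, p={welch.pvalue:.3f}, df={df_manual:.2f}")
print(f"Student t={student.statistic:.3f}, p={student.pvalue:.3f}, df={len(a) + len(b) - 2}")
```

With unequal group sizes and variances like these, the Welch degrees of freedom fall well below the pooled n₁ + n₂ − 2, which is where the two tests part ways.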

Chi-square: switch to Fisher's for small samples

The chi-square approximation requires an expected (not observed) count ≥ 5 in every cell (the rule of thumb attributed to Cochran 1954). When any cell of a 2×2 table has an expected count < 5, use Fisher's exact test instead. The Yates continuity correction was an early attempt to correct for approximating a discrete distribution with a continuous one; the modern criticism is that it is over-conservative, so in most cases it can be omitted (Sokal & Rohlf 2012; Camilli 1995). Both R's chisq.test() and scipy.stats.chi2_contingency() apply the correction to 2×2 tables by default, so turn it off explicitly: chisq.test(correct=FALSE) in R, scipy.stats.chi2_contingency(correction=False) in Python. For small samples, go straight to fisher.test() / scipy.stats.fisher_exact().

Sources: Cochran WG (1954) Biometrics 10:417; Camilli G (1995) Psychol Bull 117:135.
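The decision rule in this note can be sketched as a small helper. The function name `analyze_2x2` and the example tables are hypothetical; the cut-off of 5 is the rule of thumb quoted above, not a hard boundary.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq

def analyze_2x2(table, min_expected=5.0):
    """Use Fisher's exact test if any expected count is small, else uncorrected chi-square.

    `min_expected` encodes the expected-count >= 5 rule of thumb from the text.
    Returns (method_name, p_value).
    """
    table = np.asarray(table)
    expected = expected_freq(table)  # counts expected under independence
    if (expected < min_expected).any():
        _, p = stats.fisher_exact(table)  # two-sided by default
        return "fisher", p
    # correction=False switches off the Yates continuity correction.
    _, p, _, _ = stats.chi2_contingency(table, correction=False)
    return "chi2", p

small = [[3, 1], [1, 3]]      # every expected count is 2
large = [[30, 10], [10, 30]]  # every expected count is 20

print(analyze_2x2(small))
print(analyze_2x2(large))
```

Note that the check is on expected counts: a table can contain an observed cell below 5 and still satisfy the rule.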

Post-hoc power: a practice to avoid

Reviewers occasionally ask authors to "report observed power", a practice statisticians widely reject. Hoenig & Heisey 2001 (Am Stat, "The Abuse of Power") show that observed power is a one-to-one function of the p-value (for a test at α = 0.05, p < 0.05 ⇔ observed power above roughly 0.50), so post-hoc power carries no information beyond the p-value itself; in practice it mainly serves to make a non-significant result look as if "the sample size was too small". The right approach is to report the effect size with its 95% CI and let readers judge clinical or biological relevance. To answer "what power would a repeat study have?", specify an a priori effect size (for example the minimal clinically important difference, MCID) and recompute from that.

Sources: Hoenig JM, Heisey DM (2001) Am Stat 55:19. DOI: 10.1198/000313001300339897.
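The one-to-one relationship described above can be checked numerically. This is a sketch for the simplest case only, a two-sided z-test, which is an assumption made here for illustration; the function name `observed_power` is also illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Observed (post-hoc) power of a two-sided z-test, computed from its p-value alone."""
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)  # |z| implied by the p-value
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)   # critical value, about 1.96
    # Plug the observed effect in as if it were the true effect.
    return _STD_NORMAL.cdf(z_obs - z_crit) + _STD_NORMAL.cdf(-z_obs - z_crit)

for p in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:<5}  observed power = {observed_power(p):.3f}")
```

The function takes nothing but the p-value and α as input, which is the point: observed power is a relabelling of the p-value, and it crosses 0.50 almost exactly where p crosses α.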

FDR: BH assumptions and the BY alternative

Benjamini-Hochberg 1995 controls the FDR under independence and, more generally, under positive regression dependency (PRDS), which covers many positive-dependence structures (gene-level tests usually qualify). Under arbitrary dependence (for example time series or certain GWAS structures), switch to Benjamini-Yekutieli 2001 (Ann Stat; BY), which is more conservative than BH by a factor of 1 + 1/2 + … + 1/m ≈ ln(m) + 0.577. Use p.adjust(p, method="BY") in R or statsmodels' multipletests(pvals, method='fdr_by') in Python. The Storey q-value (Storey 2002, JRSS-B) adapts to π₀, the proportion of true null hypotheses, which gives higher power but requires a large m and an estimable π₀; it is common in scRNA-seq differential expression.

Sources: Benjamini Y, Hochberg Y (1995) JRSS-B 57:289. DOI: 10.1111/j.2517-6161.1995.tb02031.x; Benjamini Y, Yekutieli D (2001) Ann Stat 29:1165.
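To show where the ln(m) + 0.577 factor comes from, here is a small from-scratch sketch of both adjustments. The function `adjust_pvalues` and the p-values are illustrative; in real analyses the library routines named above are the safer choice.

```python
import math

def adjust_pvalues(pvals, method="BH"):
    """Step-up adjusted p-values for BH or BY, returned in the input order."""
    m = len(pvals)
    # BY scales every BH term by c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1)) if method == "BY" else 1.0
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]  # hypothetical
bh = adjust_pvalues(pvals, "BH")
by = adjust_pvalues(pvals, "BY")
c_m = sum(1.0 / i for i in range(1, len(pvals) + 1))

print(f"c(m) = {c_m:.3f}  vs  ln(m) + 0.577 = {math.log(len(pvals)) + 0.577:.3f}")
print("BH discoveries at 0.05:", sum(q <= 0.05 for q in bh))
print("BY discoveries at 0.05:", sum(q <= 0.05 for q in by))
```

The logarithmic expression is an asymptotic approximation of the harmonic sum, so for small m the exact factor is slightly larger than ln(m) + 0.577.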

Handling Cox PH violations

The Cox model rests on the hazard ratio being constant over time (proportional hazards, PH). When the Schoenfeld-residual test (Grambsch & Therneau 1994, Biometrika) indicates a violation, the options include: (1) let the effect vary with time by adding an interaction between the covariate and a function of time, e.g. group × f(time) (in R's coxph() this is done through the tt() mechanism); (2) switch to a stratified Cox model (each stratum gets its own baseline hazard, but PH must still hold within strata); (3) use a parametric accelerated failure time (AFT) model such as Weibull or log-normal; (4) report the restricted mean survival time (RMST) instead of the HR (Royston & Parmar 2013, Stat Med). Rules: (a) check every Cox model with cox.zph() in R or lifelines' check_assumptions() in Python; (b) report the result of that check, not just the HR and its CI.

Sources: Grambsch PM, Therneau TM (1994) Biometrika 81:515. DOI: 10.1093/biomet/81.3.515; Royston P, Parmar MKB (2013) Stat Med 32:1259.
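Option (4) above is easy to illustrate, because RMST is just the area under the Kaplan-Meier curve up to a chosen horizon τ. The sketch below is a minimal non-parametric version written for this note (the name `km_rmst` and the data are hypothetical), not the estimator of any particular package. It assumes the curve stays flat after the last observation if that observation is censored before τ.

```python
def km_rmst(times, events, tau):
    """Restricted mean survival time: area under the Kaplan-Meier curve on [0, tau].

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, last_t, area = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        # Collect everyone whose follow-up ends exactly at time t.
        deaths = leaving = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        area += surv * (t - last_t)         # rectangle under the current step
        if deaths:
            surv *= 1.0 - deaths / at_risk  # Kaplan-Meier step down
        at_risk -= leaving
        last_t = t
    area += surv * (tau - last_t)           # final piece up to tau
    return area

# Hypothetical arms; in the second arm the subject at t=4 is censored.
arm_a = km_rmst([1, 2, 3, 4], [1, 1, 1, 1], tau=8)
arm_b = km_rmst([2, 4, 6, 8], [1, 0, 1, 1], tau=8)
print(f"RMST A = {arm_a:.2f}, RMST B = {arm_b:.2f}, difference = {arm_b - arm_a:.2f}")
```

The difference in RMST reads as "average event-free time gained within τ", which stays interpretable even when the hazards are not proportional. The choice of τ should be fixed in advance.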

Mixed models vs pseudo-replication

One of the most common mistakes in experimental biology is treating multiple cells from the same mouse as "independent samples", or multiple sections cut from the same piece of tissue as n observations. Hurlbert 1984 (Ecol Monogr, "Pseudoreplication and the design of ecological field experiments") is the classic paper exposing pseudo-replication. The right approach: (1) model the cluster structure with a random intercept, e.g. lmer(y ~ treatment + (1|mouse)) from R's lme4; (2) at the design stage, plan the number of biological replicates (mice) versus technical replicates (cells / sections); (3) when reporting sample size, state both the number of mice and the number of cells. Fields where this commonly arises: electrophysiology (cells per mouse), histology (sections per tissue), cell biology (wells per plate).

Sources: Hurlbert SH (1984) Ecol Monogr 54:187; Lazic SE (2010) BMC Neurosci 11:5; Aarts E et al. (2014) Nat Neurosci 17:491.
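The reporting rule in point (3) can be illustrated with a simpler alternative to the mixed model: collapsing the technical replicates to one mean per mouse, so that the unit of analysis is the biological replicate. This is a sketch with hypothetical data and names; it does not replace the random-intercept model, which also uses the within-mouse variation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical recordings: (mouse_id, treatment, measurement from one cell).
cells = [
    ("m1", "ctrl", 10.1), ("m1", "ctrl", 10.3), ("m1", "ctrl", 9.9),
    ("m2", "ctrl", 11.0), ("m2", "ctrl", 11.2), ("m2", "ctrl", 10.8),
    ("m3", "drug", 12.0), ("m3", "drug", 12.4), ("m3", "drug", 11.6),
    ("m4", "drug", 10.5), ("m4", "drug", 10.7), ("m4", "drug", 10.3),
]

def mouse_level_means(rows):
    """Collapse technical replicates (cells) to one mean per biological replicate (mouse)."""
    by_mouse = defaultdict(list)
    treatment_of = {}
    for mouse, treatment, value in rows:
        by_mouse[mouse].append(value)
        treatment_of[mouse] = treatment
    return {m: (treatment_of[m], mean(v)) for m, v in by_mouse.items()}

per_mouse = mouse_level_means(cells)

# Count both levels so the report can state mice and cells separately.
n_cells = defaultdict(int)
n_mice = defaultdict(int)
for _, treatment, _ in cells:
    n_cells[treatment] += 1
for treatment, _ in per_mouse.values():
    n_mice[treatment] += 1

for treatment in sorted(n_mice):
    print(f"{treatment}: n = {n_mice[treatment]} mice ({n_cells[treatment]} cells)")
```

Any test run on `per_mouse` has n = 2 per group here, not n = 6, which is the sample size the design actually supports.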

Regression: the Table 2 Fallacy

Table 2 of a manuscript often lists the adjusted regression coefficients for several variables side by side (the exposure together with all its confounders), which invites readers to read every row as a "causal effect". Westreich & Greenland 2013 (Am J Epidemiol, "The Table 2 Fallacy") put it bluntly: a multivariable regression model is built around a single causal estimand, the effect of the exposure specified a priori. The other covariates are included only to deconfound that exposure, and their own coefficients may be distorted by collider or mediator bias, so they cannot be given a direct causal interpretation. Best practice: (1) draw a separate DAG for each exposure of interest (Hernán & Robins, Causal Inference: What If); (2) derive the minimal sufficient adjustment set for each exposure; (3) in the table, report only the adjusted estimate for the target exposure rather than every coefficient from the fitted model, or label the remaining coefficients as "for deconfounding only".

Sources: Westreich D, Greenland S (2013) Am J Epidemiol 177:292. DOI: 10.1093/aje/kws412; Hernán MA, Robins JM (2020) Causal Inference: What If.

Survival: Immortal Time Bias

A common bias in pharmacoepidemiology: follow-up time that accrues before the exposure occurs (a period during which the individual must, by definition, have survived) is counted in the exposed group, making that group appear to live longer. Example: exposure is defined as "ever received a statin prescription", but the clock starts at cohort entry. The interval in which the person has not yet received the drug but is necessarily still alive is immortal time, and attributing it to the exposed group manufactures a spurious survival advantage. Lévesque LE et al. 2010 (BMJ, "Problem of immortal time bias in cohort studies") is the standard reference. Fixes: (1) treat exposure as a time-varying covariate, so that each individual counts as unexposed until exposure begins and as exposed only afterwards; (2) target trial emulation (Hernán & Robins), which recasts the observational study as a hypothetical RCT with group assignment aligned at time zero; (3) marginal structural models with IPTW to handle time-varying confounders.

Sources: Lévesque LE, Hanley JA, Kezouh A, Suissa S (2010) BMJ 340:b5087. DOI: 10.1136/bmj.b5087; Hernán MA, Robins JM (2020) Causal Inference: What If.
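Fix (1) amounts to splitting each person's follow-up at the moment exposure begins. The sketch below shows that bookkeeping in counting-process style; the function names and the single example patient are hypothetical, and the resulting rows would still have to be passed to a survival model that accepts (start, stop] intervals.

```python
def split_follow_up(entry, exposure_start, exit_time, event):
    """Split one person's follow-up into unexposed and exposed intervals.

    Returns rows of (start, stop, exposed, event).
    exposure_start is None for people who are never exposed.
    """
    if exposure_start is None or exposure_start >= exit_time:
        return [(entry, exit_time, 0, event)]
    if exposure_start <= entry:
        return [(entry, exit_time, 1, event)]
    return [
        (entry, exposure_start, 0, 0),          # pre-exposure time: unexposed, no event
        (exposure_start, exit_time, 1, event),  # exposed time starts at first prescription
    ]

def person_time(rows, exposed):
    """Total follow-up time contributed to one exposure state."""
    return sum(stop - start for start, stop, flag, _ in rows if flag == exposed)

# Hypothetical patient: enters at year 0, first statin prescription at year 3, dies at year 10.
rows = split_follow_up(entry=0, exposure_start=3, exit_time=10, event=1)
print(rows)
print("exposed years:", person_time(rows, 1), "| unexposed years:", person_time(rows, 0))
```

An "ever-user" analysis would credit all 10 years to the exposed group; the split credits 7, and the 3 immortal years go to the unexposed group where they belong.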

Logistic: the myth of the 0.5 threshold

Many classifier libraries default to predict_proba ≥ 0.5 for binary output, but 0.5 is a sensible accuracy-oriented threshold only when (a) the two classes are roughly balanced and (b) false positives and false negatives cost the same. Both conditions almost never hold in clinical reality. Common alternatives: (1) Youden's J index: choose the threshold that maximises J = TPR − FPR, i.e. the ROC point farthest above the chance diagonal; (2) Decision Curve Analysis (Vickers & Elkin 2006, Med Decis Making): set the threshold probability from the clinically acceptable FP:FN trade-off and compare strategies by their net benefit at that threshold; (3) for imbalanced data, report precision-recall (Saito & Rehmsmeier 2015, PLoS One) instead of ROC-AUC. Rule: select the threshold on the train/validation split and evaluate it only on the held-out test split. Never tune the threshold repeatedly on test data.

Sources: Youden WJ (1950) Cancer 3:32; Vickers AJ, Elkin EB (2006) Med Decis Making 26:565. DOI: 10.1177/0272989X06295361; Saito T, Rehmsmeier M (2015) PLoS One 10:e0118432.
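Alternatives (1) and (2) can be sketched in a few lines of plain Python. The labels and predicted probabilities are hypothetical validation-set values, and the function names are illustrative. The net-benefit formula used is TP/n − FP/n × pt/(1 − pt), where pt is the threshold probability.

```python
def confusion(y_true, y_prob, threshold):
    """Counts (tp, fp, fn, tn) when predicting positive for y_prob >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_prob):
        if p >= threshold:
            if y:
                tp += 1
            else:
                fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def youden_threshold(y_true, y_prob):
    """Threshold among the observed scores that maximises J = TPR - FPR."""
    best_j, best_t = -1.0, None
    for t in sorted(set(y_prob)):
        tp, fp, fn, tn = confusion(y_true, y_prob, t)
        j = tp / (tp + fn) - fp / (fp + tn)
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit at threshold probability `threshold`."""
    tp, fp, _, _ = confusion(y_true, y_prob, threshold)
    n = len(y_true)
    return tp / n - fp / n * threshold / (1 - threshold)

# Hypothetical validation split: 3 positives out of 10.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.80]

t_best, j_best = youden_threshold(y_true, y_prob)
print(f"Youden threshold = {t_best}, J = {j_best:.3f}")
print(f"Net benefit at pt = 0.2: {net_benefit(y_true, y_prob, 0.2):.3f}")
```

On this toy data the J-maximising threshold sits below 0.5, because one true positive has a predicted probability under 0.5. Whichever criterion is used, the chosen value should then be frozen before touching the test split.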