Choosing the right test = asking the right question
"Should I use t or Wilcoxon?" is the wrong question, because the two tests have different null hypotheses. The t-test asks whether the two population means are equal. The Wilcoxon rank-sum test (Mann-Whitney U) asks whether the two distributions are identical, which it detects through stochastic superiority, i.e. P(X > Y) ≠ 0.5; it does not test whether the medians are equal. Picking the wrong one is not merely a method error: it answers a different question from the one you meant to ask.
This chapter gives a three-step pipeline: (1) are the samples paired or independent? (2) are you comparing 2 groups or k groups? (3) do the parametric assumptions hold? The answers then map to t / Welch / paired t / Wilcoxon / Mann-Whitney / χ² / Fisher / ANOVA / Kruskal-Wallis.
R defaults to Welch, not Student: t.test(x, y) uses var.equal = FALSE by default, which is Welch's t and does not assume equal variances. It is more robust in almost every practical setting, so keep the default.
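SciPy's default differs from R's, which is worth keeping in mind when porting an analysis: stats.ttest_ind assumes equal variances unless equal_var=False is passed. The sketch below uses simulated toy data (the seed, sample sizes, and standard deviations are illustrative assumptions, not values from this chapter) to compare both calls and to recompute Welch's statistic by hand.

```python
import numpy as np
from scipy import stats

# Simulated toy data (assumed for illustration): same mean, unequal spread and size.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=40)
y = rng.normal(loc=0.0, scale=4.0, size=10)

student = stats.ttest_ind(x, y)                 # SciPy default: equal_var=True (Student)
welch = stats.ttest_ind(x, y, equal_var=False)  # Welch, matching R's t.test default

# Welch's t by hand: each group keeps its own variance estimate.
vx = x.var(ddof=1) / x.size
vy = y.var(ddof=1) / y.size
t_manual = (x.mean() - y.mean()) / np.sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual = (vx + vy) ** 2 / (vx**2 / (x.size - 1) + vy**2 / (y.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

print(f"Student p = {student.pvalue:.3f}, Welch p = {welch.pvalue:.3f}")
```

When the smaller group has the larger variance, as in this setup, the pooled-variance Student test tends to be the less trustworthy of the two, which is the practical argument for always passing equal_var=False in Python.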
I. Test family overview
Two groups, continuous
- Student's t: assumes normality and equal variances
- Welch's t: assumes normality but allows unequal variances (the default choice)
- Paired t: paired data (e.g. tumor-normal samples from the same patient)
- Mann-Whitney U: sensitive to P(X > Y) ≠ 0.5; it is not a median test
- Wilcoxon signed-rank: the nonparametric counterpart for paired data
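To make the "P(X > Y)" reading of Mann-Whitney U concrete, the sketch below counts, over all (x, y) pairs, how often the x value is larger, and compares that count with the U statistic. The two small arrays are made-up toy values, and the equality relies on SciPy 1.7 or later, where mannwhitneyu reports the U statistic of the first sample.

```python
import numpy as np
from scipy import stats

# Made-up toy samples (illustrative only).
x = np.array([1.2, 3.4, 2.2, 5.1, 4.8, 2.9])
y = np.array([0.7, 2.5, 1.9, 3.0, 1.1])

res = stats.mannwhitneyu(x, y, alternative="two-sided")

# Compare every x with every y: a win counts 1, a tie counts 0.5.
diff = x[:, None] - y[None, :]
pairs_won = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)

# Estimated probability that a random x exceeds a random y.
prob_superiority = pairs_won / (x.size * y.size)

print(f"U = {res.statistic}, pairs won = {pairs_won}, "
      f"P(X > Y) estimate = {prob_superiority:.2f}")
```

U divided by the number of pairs is therefore an effect size in its own right, and nothing in the computation refers to either group's median.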
Counts / contingency tables
- χ² test of independence / goodness-of-fit: rule of thumb is an expected count ≥ 5 in every cell
- Fisher's exact: small samples or sparse margins; it is conservative, so consider the mid-p variant
- McNemar: paired binary outcomes (e.g. diagnosis before vs. after)
- Cochran-Mantel-Haenszel: stratified 2×2 tables
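The "expected count ≥ 5" rule can be checked directly, because chi2_contingency returns the expected table alongside the test result. The following is a minimal sketch using the same 2×2 table as the Python reference code in this chapter; the fallback logic and the variable names are illustrative assumptions rather than a prescribed workflow.

```python
import numpy as np
from scipy import stats

table = np.array([[12, 8],
                  [3, 17]])

# For 2x2 tables SciPy applies Yates' continuity correction by default.
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

if expected.min() >= 5:
    # Every expected cell is large enough for the chi-square approximation.
    chosen, p_value = "chi-square", p_chi2
else:
    # Sparse table: fall back to Fisher's exact test.
    _, p_value = stats.fisher_exact(table)
    chosen = "Fisher's exact"

print(f"smallest expected count = {expected.min():.1f}, "
      f"test = {chosen}, p = {p_value:.4f}")
```

Note that the rule looks at expected counts, not observed ones: the observed cell of 3 in this table does not by itself call for Fisher's exact test.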
Comparing k groups
- One-way ANOVA: equivalent to the F-test from lm(y ~ group); needs near-normal residuals and homogeneity of variance across groups
- Welch's ANOVA: the correction for unequal variances
- Kruskal-Wallis: the nonparametric version, computed on ranks
- Post-hoc: Tukey HSD (after ANOVA), Dunn (after Kruskal-Wallis)
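The equivalence between one-way ANOVA and the F-test of a group-means model can be verified by building the F statistic from sums of squares and comparing it with stats.f_oneway. The three groups below are made-up toy values chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Made-up toy data: three groups of four observations each.
g1 = np.array([4.1, 5.0, 4.6, 5.3])
g2 = np.array([5.9, 6.4, 6.1, 5.5])
g3 = np.array([7.2, 6.8, 7.9, 7.4])
groups = [g1, g2, g3]

all_y = np.concatenate(groups)
grand_mean = all_y.mean()
k, n = len(groups), all_y.size

# Between-group sum of squares: variation explained by the group means.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: residual variation around each group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.3f}, f_oneway F = {f_scipy:.3f}")
```

The residuals in ss_within are the quantities the normality and equal-variance assumptions refer to, which is why diagnostics belong on the residuals rather than on the raw pooled data.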
scRNA-seq
Seurat's FindMarkers defaults to the Wilcoxon rank-sum test: single-cell expression is far from normal (a spike at zero, many zeros, and a heavy tail), and clusters often contain thousands of cells, so a rank test has ample power. It tests whether the two clusters' distributions are identical, not whether the average log2FC is 0, which is why avg_log2FC is computed separately as a descriptive statistic.
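The separation between the rank test and the effect size can be illustrated on simulated zero-inflated counts. Everything below is an assumption made for the sketch: the negative-binomial parameters, the detection rates, and in particular the log2 fold change with a pseudocount of 1, which is a simple stand-in and not necessarily the formula Seurat uses for avg_log2FC.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_cells = 2000  # cells per cluster (assumed)

def simulate_cluster(detect_rate):
    """Zero-inflated counts: negative-binomial counts masked by random dropout."""
    counts = rng.negative_binomial(n=2, p=0.5, size=n_cells)
    detected = rng.binomial(1, detect_rate, size=n_cells)
    return counts * detected

cluster_a = simulate_cluster(0.4)  # gene detected in about 40% of cells
cluster_b = simulate_cluster(0.2)  # gene detected in about 20% of cells

# Rank-based test: do the two expression distributions differ?
res = stats.mannwhitneyu(cluster_a, cluster_b, alternative="two-sided")

# Descriptive effect size, computed independently of the test.
log2fc = np.log2((cluster_a.mean() + 1) / (cluster_b.mean() + 1))

print(f"zero fraction A = {(cluster_a == 0).mean():.2f}, "
      f"B = {(cluster_b == 0).mean():.2f}")
print(f"Mann-Whitney p = {res.pvalue:.2e}, log2FC = {log2fc:.3f}")
```

With thousands of cells the p-value becomes tiny even for a modest shift, so the log2FC is what conveys how large the difference is.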
II. Decision tree for choosing a test
🌳 From data structure to recommended test
Interactive decision tree + live test
Use the buttons to describe your data scenario. The recommended test lights up, and its statistic and p-value are computed on a built-in toy dataset, alongside a quick box / bar plot of the two or more groups.
Demo on the built-in toy dataset.
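For continuous outcomes, the three-step pipeline (paired or independent, 2 or k groups, parametric assumptions) can also be written as a plain lookup function. This is a hypothetical helper sketched for illustration: the name recommend_test and its arguments are assumptions, and paired designs with more than two groups are left out because they are not among the tests this chapter lists.

```python
def recommend_test(paired: bool, n_groups: int, parametric_ok: bool) -> str:
    """Map the three pipeline answers to a test name (continuous outcomes only)."""
    if n_groups < 2:
        raise ValueError("need at least two groups to compare")

    if n_groups == 2:
        if paired:
            # Step 1 = paired: work on within-pair differences.
            return "paired t" if parametric_ok else "Wilcoxon signed-rank"
        # Independent samples: Welch is the default parametric choice.
        return "Welch's t" if parametric_ok else "Mann-Whitney U"

    if paired:
        # Repeated-measures designs are outside this chapter's list of tests.
        raise NotImplementedError("paired k-group designs are not covered here")

    # k independent groups: omnibus test first, post-hoc afterwards.
    return "one-way ANOVA" if parametric_ok else "Kruskal-Wallis"


for scenario in [(False, 2, True), (True, 2, False), (False, 4, False)]:
    print(scenario, "->", recommend_test(*scenario))
```

The function only encodes the routing. Remember that the rank-based branches answer a different question (distributions, not means), so the choice at step 3 should follow the hypothesis of interest as well as the assumption check.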
III. Common misuses
- "Use t if normal, Wilcoxon if not" is wrong: the two tests have different null hypotheses (see above). If you want to compare means, Welch's t remains valid for non-normal data when n is large enough; use Wilcoxon only when you want to test whether the distributions are the same.
- Running all k(k−1)/2 pairwise t-tests across k groups: the familywise Type I error balloons. With k = 4 there are 6 comparisons, so at α = 0.05 the error rate reaches about 1 − 0.95^6 ≈ 0.265 (treating the tests as independent). Run an omnibus test (ANOVA / Kruskal-Wallis) first, then a controlled post-hoc procedure (Tukey, Dunn).
- Fisher's exact is too conservative: for small samples, classic Fisher under-rejects (true Type I error < α). Mid-p Fisher or Boschloo's test can recover power.
- Applying the "expected < 5 → Fisher" rule gene by gene while forgetting that it must scale to 20,000 genes: using Fisher for low-count genes trades off against run time; for differential expression, an NB GLM (DESeq2 / edgeR) is the better tool.
- Interpreting an scRNA-seq Wilcoxon result as "mean expression differs": Wilcoxon does not test means. Report logFC alongside it as the effect size.
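The 0.265 figure for k = 4 follows from a short calculation: k groups give k(k−1)/2 pairwise comparisons, and if each is run at level α the chance of at least one false positive is 1 − (1 − α)^m under the simplifying assumption that the m tests are independent. The helper name below is hypothetical.

```python
from math import comb

def familywise_error(k: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise tests of k groups.

    Assumes the comparisons are independent, which is only an approximation
    because pairwise tests share groups.
    """
    m = comb(k, 2)  # number of pairwise comparisons, k(k-1)/2
    return 1 - (1 - alpha) ** m

for k in range(2, 7):
    print(f"k = {k}: {comb(k, 2):2d} comparisons, "
          f"familywise error ≈ {familywise_error(k):.3f}")
```

The rate grows quickly because the number of comparisons grows quadratically in k, which is the motivation for an omnibus test followed by a controlled post-hoc procedure.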
Implementation reference
    # --- R --- common tests at a glance
    # Two groups, continuous
    t.test(x, y)                        # Welch (default)
    t.test(x, y, var.equal = TRUE)      # Student
    t.test(x, y, paired = TRUE)         # Paired
    wilcox.test(x, y)                   # Mann-Whitney U
    wilcox.test(x, y, paired = TRUE)    # Wilcoxon signed-rank

    # Contingency / counts
    M <- matrix(c(12, 8, 3, 17), 2, 2)
    chisq.test(M); fisher.test(M)
    mcnemar.test(M)                     # only meaningful if M cross-tabulates paired outcomes

    # Comparing k groups
    aov_fit <- aov(y ~ group, data = d)
    summary(aov_fit); TukeyHSD(aov_fit)            # ANOVA + post-hoc
    oneway.test(y ~ group, data = d)               # Welch ANOVA
    kruskal.test(y ~ group, data = d)
    library(FSA); dunnTest(y ~ group, data = d)    # Dunn after Kruskal-Wallis

    # Tidy output
    library(broom); tidy(t.test(x, y))
    # --- Python --- common tests at a glance
    from scipy import stats
    import pingouin as pg
    import scikit_posthocs as sp
    from statsmodels.stats.contingency_tables import mcnemar

    # Two groups, continuous
    stats.ttest_ind(x, y, equal_var=False)    # Welch
    stats.ttest_ind(x, y, equal_var=True)     # Student
    stats.ttest_rel(x, y)                     # Paired
    stats.mannwhitneyu(x, y, alternative="two-sided")
    stats.wilcoxon(x, y)                      # Wilcoxon signed-rank

    # Contingency / counts
    M = [[12, 8], [3, 17]]
    stats.chi2_contingency(M)
    stats.fisher_exact(M)
    mcnemar(M, exact=True)                    # only meaningful if M cross-tabulates paired outcomes

    # Comparing k groups
    stats.f_oneway(g1, g2, g3)                # ANOVA
    pg.welch_anova(data=d, dv="y", between="group")
    stats.kruskal(g1, g2, g3)
    sp.posthoc_dunn(d, val_col="y", group_col="group")
    pg.pairwise_tukey(data=d, dv="y", between="group")
📝 Self-check
1. When should you prefer Welch's t over Student's t?
2. What is the correct null hypothesis for the Mann-Whitney U test?
3. Why run an omnibus ANOVA / Kruskal-Wallis test first for a k-group comparison, rather than going straight to all pairwise t-tests?