STEP 5 / 17

Common Statistical Tests

t / Wilcoxon / χ² / Fisher / ANOVA / Kruskal-Wallis, and exactly what null hypothesis each one tests.

Choosing the right test = asking the right question

"Should I use t or Wilcoxon?" is the wrong question, because the two tests have different null hypotheses. The t-test tests "the two population means are equal". The Wilcoxon rank-sum test (Mann-Whitney U) tests "the two distributions are identical" and is sensitive to stochastic ordering, that is, to P(X > Y) ≠ 0.5. It is NOT a test of equal medians. Choosing the wrong one is not just a method error; it is asking the wrong question.

This chapter gives a three-step pipeline: (1) Are the samples paired or independent? (2) Are you comparing 2 groups or k groups? (3) Do the parametric assumptions hold? The answers then map to t / Welch / paired t / Wilcoxon / Mann-Whitney / χ² / Fisher / ANOVA / Kruskal-Wallis.

💡
R defaults to Welch, not Student: t.test(x, y) uses var.equal = FALSE by default, which is Welch's t and does not assume equal variances. It is more robust in almost every practical setting, so keep the default.
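To make the difference from Student's t concrete, here is a minimal sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom, written in plain Python. The function name `welch_t` and the sample values are illustrative assumptions, not part of the original page; in practice you would call `t.test` or `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance t-test."""
    # Per-group squared standard errors: s^2 / n (sample variance, n - 1 denominator)
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation; usually not an integer
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical samples with clearly unequal spread
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
t_stat, df = welch_t(x, y)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

Student's t would use a pooled variance and n1 + n2 − 2 = 8 degrees of freedom here; Welch's smaller, fractional degrees of freedom are what protect the test when the variances differ.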

I. Test families at a glance

📏

Two groups, continuous

  • Student's t: assumes normality + equal variances
  • Welch's t: normal but unequal variances (the default first choice)
  • Paired t: paired data (e.g. tumor-normal from the same patient)
  • Mann-Whitney U: sensitive to P(X > Y) ≠ 0.5; it is not a median test
  • Wilcoxon signed-rank: nonparametric counterpart for paired data
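The "not a median test" point can be illustrated with a small sketch. The two samples below are made-up values chosen so that their medians coincide while the quantity Mann-Whitney U responds to, the proportion of (x, y) pairs with x > y (ties counted as half), is well away from 0.5. The helper name `prob_superiority` is an assumption for this example.

```python
from statistics import median


def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X = Y) over all (x, y) pairs.

    The numerator is the Mann-Whitney U statistic for x.
    """
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    return u / (len(x) * len(y))


# Hypothetical samples: same median, different shapes
x = [4, 4.5, 5, 20, 30]
y = [1, 2, 5, 5.5, 6]

print("median x:", median(x), "median y:", median(y))
print("P(X > Y) estimate:", prob_superiority(x, y))
```

With equal sample medians, a median-based comparison sees no difference, yet x tends to exceed y in about two thirds of the pairs, which is exactly the kind of departure the rank test is built to detect.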
🧮

Counts / contingency tables

  • χ² independence / goodness-of-fit: expected count ≥ 5 in every cell
  • Fisher's exact: small samples or sparse margins; conservative, so consider the mid-p variant
  • McNemar: paired binary outcomes (pre/post diagnosis)
  • Cochran-Mantel-Haenszel: stratified 2×2 tables
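The "expected count ≥ 5" rule refers to the counts expected under independence (row total × column total / grand total), not to the observed counts. A minimal sketch of that check, assuming an illustrative 2×2 table and a hypothetical helper `expected_counts`:

```python
def expected_counts(table):
    """Expected cell counts under independence for an r x c table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]


# Illustrative 2x2 table of counts
M = [[12, 8], [3, 17]]
E = expected_counts(M)
use_chi2 = all(e >= 5 for row in E for e in row)

print("expected:", E)
print("recommended:", "chi-square" if use_chi2 else "Fisher's exact")
```

Note that one observed cell here is 3, yet every expected count is at least 5, so the rule of thumb still allows the χ² test.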
📊

Comparing k groups

  • One-way ANOVA: equivalent to the F-test from lm(y ~ group); needs near-normal residuals + homogeneity of variance
  • Welch's ANOVA: correction for unequal variances
  • Kruskal-Wallis: nonparametric version, based on ranks
  • Post-hoc: Tukey HSD (after ANOVA), Dunn (after Kruskal-Wallis)
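As a sketch of what the one-way ANOVA F-test computes, the following plain-Python function forms the ratio of between-group to within-group mean squares. The function name `one_way_f` and the three toy groups are assumptions for illustration; library routines such as `aov` or `scipy.stats.f_oneway` also return the p-value.

```python
from statistics import mean


def one_way_f(*groups):
    """F statistic for one-way ANOVA: MS_between / MS_within."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean([v for g in groups for v in g])
    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical groups with means 2, 3 and 6
f_stat = one_way_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f"F = {f_stat:.2f} on (2, 6) degrees of freedom")
```

The within-group mean square is a pooled variance estimate, which is why the classical test needs homogeneity of variance and why Welch's ANOVA exists as the unequal-variance alternative.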
🧬

scRNA-seq

Seurat's FindMarkers defaults to Wilcoxon: single-cell expression is far from normal (a spike at 0 plus a heavy tail), and clusters often contain thousands of cells, so a rank-based test has plenty of power. It tests "are the two clusters' distributions identical?", not "is the average log2FC equal to 0?", which is why avg_log2FC is computed separately and reported as a descriptive statistic.

II. Decision tree for choosing a test

🌳 From data structure to recommended test

Q1: Is the response a count / category? → Yes → Q2; otherwise (continuous) → Q3.
Q2: Are all expected cell counts ≥ 5? → Yes → χ² independence; otherwise → Fisher's exact. If the binary outcomes are paired (pre/post) → McNemar.
Q3: Are the samples paired (same subject, tumor-normal, pre/post)? → Yes → Q4; otherwise → Q5.
Q4: Are the differences approximately normal? → Yes → Paired t; otherwise → Wilcoxon signed-rank.
Q5: How many groups? 2 → Q6; ≥ 3 → Q7.
Q6: Normal + equal variances? → Yes → Student's t; normal but unequal variances → Welch's t (R default); very skewed or tiny n → Mann-Whitney U.
Q7: Normal + equal variances? → Yes → One-way ANOVA + Tukey HSD; unequal variances → Welch's ANOVA; non-normal → Kruskal-Wallis + Dunn post-hoc.
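The questions Q1 to Q7 can be sketched as a small function. This is one possible encoding, not the page's own implementation: the function name `recommend_test` and its argument names are assumptions, and the caller is expected to have already judged normality, variance homogeneity, and expected counts.

```python
def recommend_test(kind, paired=False, n_groups=2, normal=True,
                   equal_var=True, expected_ok=True):
    """Walk the Q1-Q7 decision tree and return the recommended test name.

    kind: "count" for counts/categories, "continuous" otherwise.
    """
    if kind == "count":                          # Q1 -> Q2
        if paired:
            return "McNemar"
        return "chi-square independence" if expected_ok else "Fisher's exact"
    if paired:                                   # Q3 -> Q4
        return "Paired t" if normal else "Wilcoxon signed-rank"
    if n_groups == 2:                            # Q5 -> Q6
        if not normal:
            return "Mann-Whitney U"
        return "Student's t" if equal_var else "Welch's t"
    if not normal:                               # Q5 -> Q7
        return "Kruskal-Wallis + Dunn"
    return "One-way ANOVA + Tukey HSD" if equal_var else "Welch's ANOVA"


print(recommend_test("continuous", normal=True, equal_var=False))
print(recommend_test("count", expected_ok=False))
print(recommend_test("continuous", n_groups=3, normal=False))
```

Like the tree itself, the sketch does not cover paired designs with three or more conditions, and it checks normality before variance in Q6 and Q7 so that a non-normal answer always routes to the rank-based test.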


III. Common misuses

  • "Use t if normal, Wilcoxon if not" is wrong: the two tests have different nulls (see above). If you want to compare means, Welch's t remains valid for non-normal data when n is adequate. Use Wilcoxon only if you want to test whether the distributions are the same.
  • Running all k(k−1)/2 pairwise t-tests across k groups: Type I error balloons. With k = 4 there are 6 comparisons, so at α = 0.05 the family-wise error rate is about 1 − 0.95^6 ≈ 0.265 if the tests were independent. Run an omnibus test (ANOVA / Kruskal-Wallis) first, then a controlled post-hoc (Tukey, Dunn).
  • Fisher's exact is conservative: for small samples, classic Fisher under-rejects (true Type I < α). Use mid-p Fisher or Boschloo's test to recover power.
  • Applying the "expected < 5 → Fisher" rule gene by gene and forgetting there are 20,000 genes to test: running Fisher on every low-count gene trades accuracy for compute time; a negative binomial GLM (DESeq2 / edgeR) is the better tool for differential expression.
  • scRNA-seq Wilcoxon reported as "mean expression differs": Wilcoxon does not test means. To report an effect size, report logFC alongside it.
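The inflation figure for pairwise testing follows from a short calculation. The sketch below assumes the pairwise tests are independent, which is a simplification (tests sharing a group are correlated), so the result is an approximation rather than an exact rate. The function name `pairwise_fwer` is hypothetical.

```python
from math import comb


def pairwise_fwer(k, alpha=0.05):
    """Return (number of pairwise tests, approximate family-wise error rate)."""
    m = comb(k, 2)                      # k(k-1)/2 pairwise comparisons
    return m, 1 - (1 - alpha) ** m      # P(at least one false positive)


for k in (3, 4, 6):
    m, fwer = pairwise_fwer(k)
    print(f"k = {k}: {m} tests, FWER ~ {fwer:.3f}")
```

For k = 4 this gives 6 tests and a rate near 0.265, more than five times the nominal 0.05, which is the reason for using an omnibus test followed by a post-hoc procedure that controls the family-wise rate.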

Implementation side by side

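# --- R --- common tests at a glance
# Toy data so the calls below run as written (illustrative values only)
set.seed(1)
x <- rnorm(20); y <- rnorm(20, mean = 0.5)
d <- data.frame(y = c(x, y, rnorm(20, mean = 1)),
                group = factor(rep(c("a", "b", "c"), each = 20)))

# Two groups, continuous
t.test(x, y)                                  # Welch (default)
t.test(x, y, var.equal = TRUE)                # Student
t.test(x, y, paired = TRUE)                   # Paired
wilcox.test(x, y)                             # Mann-Whitney U
wilcox.test(x, y, paired = TRUE)              # Wilcoxon signed-rank

# Contingency / counts
M <- matrix(c(12, 8, 3, 17), 2, 2)            # filled column-wise
chisq.test(M); fisher.test(M)
mcnemar.test(M)                               # reads M as a paired (before x after) table

# Comparing k groups (group must be a factor for TukeyHSD)
aov_fit <- aov(y ~ group, data = d)
summary(aov_fit); TukeyHSD(aov_fit)           # ANOVA + post-hoc
oneway.test(y ~ group, data = d)              # Welch ANOVA (var.equal = FALSE by default)
kruskal.test(y ~ group, data = d)
library(FSA); dunnTest(y ~ group, data = d)   # Dunn after Kruskal-Wallis

# Tidy output
library(broom); tidy(t.test(x, y))
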
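# --- Python --- common tests at a glance
import numpy as np
import pandas as pd
import pingouin as pg
import scikit_posthocs as sp
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

# Toy data so the calls below run as written (illustrative values only)
rng = np.random.default_rng(1)
g1, g2, g3 = rng.normal(0, 1, 20), rng.normal(0.5, 1, 20), rng.normal(1, 1, 20)
x, y = g1, g2
d = pd.DataFrame({"y": np.concatenate([g1, g2, g3]),
                  "group": np.repeat(["a", "b", "c"], 20)})

# Two groups, continuous
stats.ttest_ind(x, y, equal_var=False)           # Welch (not SciPy's default)
stats.ttest_ind(x, y, equal_var=True)            # Student (SciPy's default)
stats.ttest_rel(x, y)                            # Paired
stats.mannwhitneyu(x, y, alternative="two-sided")
stats.wilcoxon(x, y)                             # Wilcoxon signed-rank

# Contingency / counts
M = [[12, 8], [3, 17]]
stats.chi2_contingency(M)
stats.fisher_exact(M)
mcnemar(M, exact=True)                           # reads M as a paired (before x after) table

# Comparing k groups
stats.f_oneway(g1, g2, g3)                       # ANOVA
pg.welch_anova(data=d, dv="y", between="group")  # Welch ANOVA
stats.kruskal(g1, g2, g3)                        # Kruskal-Wallis
sp.posthoc_dunn(d, val_col="y", group_col="group")   # Dunn after Kruskal-Wallis
pg.pairwise_tukey(data=d, dv="y", between="group")   # Tukey HSD after ANOVA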

📝 Self-check

1. When should you prefer Welch's t over Student's t?

A. Only with small samples
B. In almost every practical setting: Welch is robust to unequal variances and is R's default
C. Only with paired data
D. Only when data are non-normal

2. What is the correct null hypothesis for the Mann-Whitney U test?

A. The two medians are equal
B. The two means are equal
C. The two groups come from the same distribution (P(X > Y) = 0.5)
D. The two variances are equal

3. Why run an omnibus ANOVA / Kruskal-Wallis before pairwise tests instead of doing all pairwise t-tests directly?

A. ANOVA is faster
B. Because t-tests cannot be used with k groups
C. Multiple pairwise tests greatly inflate Type I error; an omnibus test plus a controlled post-hoc preserves the family-wise error rate
D. ANOVA is more accurate than t