Step 5: Common Tests — Statistical Inference Tutorial

Overview

Choosing the right test = asking the right question

"Should I use t or Wilcoxon?" is the wrong question, because the two tests have different null hypotheses. The t-test's null is "the two population means are equal". The Wilcoxon rank-sum (Mann-Whitney U) test's null is "the two distributions are identical", and the test is sensitive to departures from P(X > Y) = 0.5, that is, to one group tending to produce larger values than the other. It is NOT a test of equal medians. Picking the wrong test is not just a method error: it answers a different question.

This chapter builds a three-step workflow: (1) are the samples paired or independent? (2) are you comparing 2 groups or k groups? (3) do the parametric assumptions hold? The answers map to t / Welch / paired t / Wilcoxon / Mann-Whitney / χ² / Fisher / ANOVA / Kruskal-Wallis.
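To make the Mann-Whitney target quantity concrete, here is a minimal standard-library sketch. The function names and the toy samples are made up for illustration. It shows two samples with identical medians whose estimated P(X > Y) is far from 0.5, and that the same number falls out of the rank-sum form of the U statistic.

```python
from statistics import median


def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X = Y) by comparing every (x, y) pair."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    return wins / (len(x) * len(y))


def mann_whitney_u(x, y):
    """U statistic for x: rank sum of x (with midranks for ties) minus n1*(n1+1)/2."""
    pooled = sorted(x + y)

    def midrank(v):
        # Average of the 1-based positions that value v occupies in the pooled sample.
        first = pooled.index(v) + 1
        count = pooled.count(v)
        return first + (count - 1) / 2

    n1 = len(x)
    return sum(midrank(v) for v in x) - n1 * (n1 + 1) / 2


# Hypothetical toy samples: both medians are 3, yet x tends to be larger.
x = [3, 3, 3, 8, 9]
y = [0, 1, 3, 3, 3]

print("medians:", median(x), median(y))
print("P(X > Y) estimate:", prob_superiority(x, y))
print("U / (n1 * n2):", mann_whitney_u(x, y) / (len(x) * len(y)))
```

The two printed estimates agree because U counts exactly the pairs in which x beats y (ties counted as one half), which is why the test responds to P(X > Y) rather than to the medians.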

💡

R defaults to Welch, not Student: t.test(x, y) uses var.equal = FALSE by default, which is Welch's t and does not assume equal variances. It is more robust in almost every practical setting, so keep the default.
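The difference between the two statistics can be sketched in plain Python. This is an illustrative implementation of the textbook formulas (pooled variance for Student, Welch-Satterthwaite degrees of freedom for Welch), not the internals of R's t.test; the function names and data are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def student_t(x, y):
    """Student's t: pools the two sample variances, df = n1 + n2 - 2."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(x) - mean(y)) / se, n1 + n2 - 2


def welch_t(x, y):
    """Welch's t: separate variances, Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    a, b = variance(x) / n1, variance(y) / n2
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return (mean(x) - mean(y)) / sqrt(a + b), df


# Hypothetical data: a small, tight group versus a larger, noisier group.
x = [5.1, 4.9, 5.6, 5.8, 6.0]
y = [6.2, 7.9, 4.1, 9.5, 8.8, 3.9, 7.0, 10.4, 2.7]

t_s, df_s = student_t(x, y)
t_w, df_w = welch_t(x, y)
print(f"Student: t = {t_s:.3f}, df = {df_s}")
print(f"Welch:   t = {t_w:.3f}, df = {df_w:.2f}")
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ, so little is lost by always using Welch.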

Core concepts

I. Overview of test families

📏

Two groups, continuous outcome

Student's t: assumes normality + equal variances
Welch's t: normality but unequal variances (the default first choice)
Paired t: paired data (e.g., tumor-normal from the same patient)
Mann-Whitney U: sensitive to P(X > Y) ≠ 0.5; not a median test
Wilcoxon signed-rank: nonparametric counterpart of the paired t
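Why pairing matters can be seen in a small sketch: the paired t is a one-sample t on the within-pair differences, so between-patient variation cancels out. The tumor/normal values below are invented toy numbers and the helper names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(before, after):
    """Paired t statistic: a one-sample t on the within-pair differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


def welch_t_stat(x, y):
    """Welch t statistic that ignores the pairing (shown for comparison only)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


# Hypothetical expression values for five patients (same order in both lists).
normal = [10, 20, 30, 40, 50]
tumor = [12, 23, 31, 44, 52]

t_paired, df = paired_t(normal, tumor)
t_unpaired = welch_t_stat(tumor, normal)
print(f"paired t = {t_paired:.2f} on {df} df")
print(f"unpaired (Welch) t = {t_unpaired:.2f}")
```

The mean difference is the same in both analyses; only the standard error changes. Patients differ a lot from each other here, so the unpaired statistic is swamped by that spread while the paired one is not.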

🧮

Counts / contingency tables

χ² independence / goodness-of-fit: expected count ≥ 5 in every cell
Fisher's exact: small samples or sparse margins; conservative, so consider the mid-p variant
McNemar: paired binary outcomes (e.g., diagnosis before/after)
Cochran-Mantel-Haenszel: stratified 2×2 tables
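The "expected count ≥ 5" rule refers to the counts expected under independence (row total × column total / grand total), not to the observed counts. A minimal sketch, with hypothetical helper names, that computes them and applies the rule of thumb:

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi2(table):
    """Pearson chi-square statistic WITHOUT continuity correction."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))


def chi2_or_fisher(table, threshold=5):
    """Rule of thumb only: chi-square if every expected count >= threshold."""
    smallest = min(min(row) for row in expected_counts(table))
    return "chi-square" if smallest >= threshold else "Fisher's exact"


M = [[12, 8], [3, 17]]       # a 2x2 table of counts
sparse = [[3, 1], [1, 3]]    # hypothetical sparse table

print(expected_counts(M), chi2_or_fisher(M))
print(expected_counts(sparse), chi2_or_fisher(sparse))
print("uncorrected chi-square for M:", pearson_chi2(M))
```

Note that R's chisq.test and SciPy's chi2_contingency both apply Yates' continuity correction to 2×2 tables by default, so their statistic will be smaller than the uncorrected value computed here.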

📊

Comparing k groups

One-way ANOVA: equivalent to the F-test from lm(y ~ group); requires approximately normal residuals and homogeneous group variances
Welch's ANOVA: correction for unequal variances
Kruskal-Wallis: nonparametric version based on ranks
Post-hoc: Tukey HSD (after ANOVA), Dunn (after Kruskal-Wallis)
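The one-way ANOVA F statistic is the ratio of between-group to within-group mean squares. The sketch below computes it by hand on invented toy groups; the function name is hypothetical, and obtaining a p-value would additionally require the F distribution, which is left out here.

```python
from statistics import mean


def one_way_f(*groups):
    """One-way ANOVA F statistic with its (between, within) degrees of freedom."""
    k = len(groups)
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, len(all_values) - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy groups.
g1, g2, g3 = [1, 2, 3], [2, 3, 4], [5, 6, 7]
f_stat, df_b, df_w = one_way_f(g1, g2, g3)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The within-group mean square is a pooled variance estimate, which is where the homogeneity-of-variance requirement comes from.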

🧬

scRNA-seq

Seurat's FindMarkers defaults to the Wilcoxon rank-sum test: single-cell expression is far from normal (a spike at zero plus a heavy tail), and clusters often contain thousands of cells, so a rank test has ample power. It tests "are the two clusters' distributions identical?", not "is the average log2FC zero?", which is why avg_log2FC is computed separately as a descriptive statistic.
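A deliberately extreme toy example shows that the rank-based question and the mean-based fold change are different quantities and can even point in opposite directions. The pseudocount-based log2 fold change below is an illustrative formula, not Seurat's exact avg_log2FC computation, and all names and values are made up.

```python
from math import log2
from statistics import mean


def prob_a_greater(a, b):
    """Estimate P(A > B) + 0.5 * P(A = B) over all cell pairs."""
    wins = sum(1.0 if u > v else 0.5 if u == v else 0.0 for u in a for v in b)
    return wins / (len(a) * len(b))


def log2_fold_change(a, b, pseudocount=1.0):
    """Illustrative mean-based log2 fold change (pseudocount avoids log of zero)."""
    return log2((mean(a) + pseudocount) / (mean(b) + pseudocount))


# Hypothetical counts: cluster A is mostly zeros with one very high cell,
# cluster B has low but uniform expression.
cluster_a = [0] * 9 + [40]
cluster_b = [1] * 10

print("log2FC (A vs B):", round(log2_fold_change(cluster_a, cluster_b), 3))
print("P(A > B):", prob_a_greater(cluster_a, cluster_b))
```

Here the mean-based fold change says A is higher, while a randomly chosen A cell exceeds a randomly chosen B cell only 10% of the time, which is the direction a rank test reacts to.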

Decision tree

II. Decision tree for choosing a test

🌳 From data structure to recommended test

Q1:

Is the response a count / category? → Yes → Q2; otherwise (continuous) → Q3.

Q2:

Are all expected cell counts ≥ 5? → Yes → χ² test of independence; otherwise → Fisher's exact. Paired (pre/post) → McNemar.

Q3:

Are the samples paired (same subject, tumor-normal, pre/post)? → Yes → Q4; otherwise → Q5.

Q4:

Are the differences approximately normal? → Yes → Paired t; otherwise → Wilcoxon signed-rank.

Q5:

How many groups? 2 → Q6; ≥ 3 → Q7.

Q6:

Normal + equal variances? → Yes → Student's t; normal but unequal variances → Welch's t (R default); severely skewed or very small n → Mann-Whitney U.

Q7:

Normal + equal variances? → Yes → One-way ANOVA + Tukey HSD; unequal variances → Welch's ANOVA; non-normal → Kruskal-Wallis + Dunn post-hoc.
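The tree Q1 to Q7 can be written down as a small lookup function. This is only a sketch of the flow above: the function name, argument names, and return strings are hypothetical, and the boolean inputs stand in for judgments (normality, equal variance) that you still have to make from the data.

```python
def recommend_test(outcome, paired=False, n_groups=2, normal=True,
                   equal_var=True, expected_ge_5=True):
    """Walk the Q1-Q7 decision tree and return the recommended test name."""
    if outcome not in ("categorical", "continuous"):
        raise ValueError("outcome must be 'categorical' or 'continuous'")

    # Q1 -> Q2: counts / categories
    if outcome == "categorical":
        if paired:
            return "McNemar"
        return "chi-square independence" if expected_ge_5 else "Fisher's exact"

    # Q3 -> Q4: paired continuous data (the tree only covers two paired conditions)
    if paired:
        return "Paired t" if normal else "Wilcoxon signed-rank"

    # Q5 -> Q6: two independent groups
    if n_groups == 2:
        if not normal:
            return "Mann-Whitney U"
        return "Student's t" if equal_var else "Welch's t"

    # Q7: three or more independent groups
    if not normal:
        return "Kruskal-Wallis + Dunn"
    return "One-way ANOVA + Tukey HSD" if equal_var else "Welch's ANOVA"


print(recommend_test("continuous", n_groups=2, equal_var=False))
print(recommend_test("categorical", expected_ge_5=False))
print(recommend_test("continuous", n_groups=4, normal=False))
```

Remember that the non-normal branches change the null hypothesis being tested, not just the method, so the function is a map of the tree rather than a substitute for deciding which question you are asking.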

Interactive simulation

Interactive decision tree + live test

Use the buttons below to describe your data scenario. The recommended test lights up, and its statistic and p-value are computed on a built-in toy dataset, together with a box / bar plot of the two (or more) groups.


Pitfalls and misuse

III. Common misuses

"Use t if normal, Wilcoxon if not" is wrong: the two tests have different nulls (see above). If you want to compare means, Welch's t remains valid for non-normal data when n is large enough; use Wilcoxon only when the question is whether the distributions are the same.
Running all k(k−1)/2 pairwise t-tests across k groups: the family-wise Type I error balloons (k = 4 gives 6 tests, so 1 − 0.95^6 ≈ 0.265 at α = 0.05 if the tests were independent). Run an omnibus test (ANOVA / Kruskal-Wallis) first, then a controlled post-hoc (Tukey, Dunn).
Fisher's exact is too conservative: for small samples, classic Fisher under-rejects (true Type I error < α). Use mid-p Fisher or Boschloo's test to recover power.
Applying the "expected < 5 → Fisher" rule gene by gene and forgetting feasibility across 20,000 genes: running Fisher on low-count genes costs compute time at that scale; an NB GLM (DESeq2 / edgeR) is the better tool for differential expression.
Interpreting a scRNA-seq Wilcoxon result as "mean expression differs": Wilcoxon does not test means. Report log2FC alongside it as the effect-size descriptor.
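The 0.265 figure in the pairwise t-test pitfall comes from a one-line calculation. The sketch below reproduces it under the simplifying assumption that the pairwise tests are independent (they are not exactly, since they share groups, so treat it as an approximation); the function name is hypothetical.

```python
from math import comb


def familywise_error(k, alpha=0.05):
    """Approximate chance of at least one false positive across all pairwise tests.

    Assumes the k*(k-1)/2 tests are independent, which is a simplification.
    """
    n_tests = comb(k, 2)
    return 1 - (1 - alpha) ** n_tests, n_tests


for k in (3, 4, 5, 6):
    fwer, n_tests = familywise_error(k)
    print(f"k = {k}: {n_tests} pairwise tests, family-wise error ≈ {fwer:.3f}")
```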

Code

Implementation side by side

# --- R --- common tests at a glance
# Two groups, continuous (x, y: numeric vectors)
t.test(x, y)                                  # Welch (default)
t.test(x, y, var.equal=TRUE)                  # Student
t.test(x, y, paired=TRUE)                     # Paired (x, y in the same subject order)
wilcox.test(x, y)                             # Mann-Whitney U
wilcox.test(x, y, paired=TRUE)                # Wilcoxon signed-rank

# Contingency tables / counts
M <- matrix(c(12, 8, 3, 17), 2, 2)            # 2x2 table, filled column-wise
chisq.test(M); fisher.test(M)
mcnemar.test(M)                               # only meaningful if M cross-tabulates paired outcomes

# Comparing k groups (d: data frame with numeric y and factor group)
aov_fit <- aov(y ~ group, data=d)
summary(aov_fit); TukeyHSD(aov_fit)           # ANOVA + post-hoc
oneway.test(y ~ group, data=d)                # Welch ANOVA
kruskal.test(y ~ group, data=d)
library(FSA); dunnTest(y ~ group, data=d)     # Dunn after Kruskal-Wallis

# Tidy output
library(broom); tidy(t.test(x, y))

# --- Python --- common tests at a glance
from scipy import stats
import pingouin as pg
import scikit_posthocs as sp
from statsmodels.stats.contingency_tables import mcnemar

# Two groups, continuous (x, y: 1-D arrays)
stats.ttest_ind(x, y, equal_var=False)           # Welch
stats.ttest_ind(x, y, equal_var=True)            # Student
stats.ttest_rel(x, y)                            # Paired
stats.mannwhitneyu(x, y, alternative="two-sided")
stats.wilcoxon(x, y)                             # Wilcoxon signed-rank (paired)

# Contingency tables / counts
M = [[12, 8], [3, 17]]
stats.chi2_contingency(M)
stats.fisher_exact(M)
mcnemar(M, exact=True)                           # only meaningful if M cross-tabulates paired outcomes

# Comparing k groups (d: DataFrame with columns "y" and "group")
stats.f_oneway(g1, g2, g3)                       # ANOVA
pg.welch_anova(data=d, dv="y", between="group")
stats.kruskal(g1, g2, g3)
sp.posthoc_dunn(d, val_col="y", group_col="group")
pg.pairwise_tukey(data=d, dv="y", between="group")

📝 Self-check

1. When should you prefer Welch's t over Student's t?

A. Only with small samples

B. Almost always: Welch is robust to unequal variances and is R's default

C. Only with paired data

D. Only when the data are non-normal

2. What is the correct null hypothesis for the Mann-Whitney U test?

A. The two medians are equal

B. The two means are equal

C. The two groups come from the same distribution (P(X > Y) = 0.5)

D. The two variances are equal

3. Why run an omnibus ANOVA / Kruskal-Wallis test before pairwise comparisons instead of doing all pairwise t-tests directly?

A. Because ANOVA is faster

B. Because t-tests cannot be used with k groups

C. Multiple pairwise comparisons severely inflate the Type I error; an omnibus test plus a controlled post-hoc preserves the family-wise error rate

D. Because ANOVA is more accurate than t