Why Gosset, why t, and why Welch by default?
In 1908, William Sealy Gosset was working as a chemist at the Guinness brewery in Dublin. He needed to estimate population means from very small malt samples, but the normal-theory z-test assumes σ is known, which is almost never true in a real experiment. Gosset worked out the distribution of the statistic when σ is unknown and replaced by the sample standard deviation s: it is not normal but a new, heavier-tailed distribution indexed by n − 1 degrees of freedom. Because Guinness barred its staff from publishing scientific papers (to protect trade secrets), Gosset submitted the work to Biometrika under the pen-name "Student". That is the origin of Student's t-distribution.
The t-test has since become the workhorse of experimental science. Two facts are still often missed: (1) R's t.test() runs Welch's t (no equal-variance assumption) by default, whereas SciPy's ttest_ind() defaults to equal_var=True (Student) and runs Welch only when you pass equal_var=False; Welch is the default that Delacre, Lakens & Leys (2017, Int Rev Soc Psychol) strongly recommend; (2) many textbooks still teach "pre-test the variances with Levene, then choose Student or Welch", a two-step procedure that inflates Type I error and should be retired.
I. Three flavours of the t-test
Every t-test follows the same recipe: (estimate − null value) / standard error of the estimate. The three flavours differ only in what you estimate and how you compute its SE.
① One-sample t
Sample mean vs a reference value μ₀ (e.g., a lab-calibration target, or a literature value for healthy blood glucose).
t = (x̄ − μ₀) / (s/√n), df = n − 1.
Example: measure serum Na⁺ in 12 healthy adults and test against the clinical reference of 140 mmol/L.
② Two-sample t
Tests whether the mean difference between two independent groups is 0. There are two versions:
Student (pooled variance): assumes σ₁ = σ₂ and uses s_p² = ((n₁−1)s₁² + (n₂−1)s₂²)/(n₁+n₂−2), giving t = (x̄₁ − x̄₂) / (s_p·√(1/n₁ + 1/n₂)) with df = n₁ + n₂ − 2.
Welch (unequal variances allowed): t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), with df from the Satterthwaite approximation, df ≈ (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)].
Example: drug A vs placebo, change in blood pressure.
③ Paired t
The same unit (mouse, patient, tissue block) is measured under two conditions. Treat the differences d = x_after − x_before as a one-sample test against 0.
t = d̄ / (s_d / √n), df = n − 1, where n is the number of pairs.
Example: HbA1c before vs after therapy; left vs right limb strength; baseline vs treatment in the same mouse.
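To make the shared recipe concrete, here is a minimal sketch that computes the three statistics by hand with NumPy and checks them against SciPy. The helper names (`one_sample_t`, `welch_t`, `paired_t`) are hypothetical, and the numbers are small illustrative data sets rather than real measurements.

```python
import numpy as np
from scipy import stats

def one_sample_t(x, mu0):
    """t = (mean - mu0) / (s / sqrt(n)), df = n - 1."""
    x = np.asarray(x, dtype=float)
    se = x.std(ddof=1) / np.sqrt(x.size)
    return (x.mean() - mu0) / se, x.size - 1

def welch_t(x, y):
    """Welch's t with the Satterthwaite df approximation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    vx, vy = x.var(ddof=1) / x.size, y.var(ddof=1) / y.size
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (x.size - 1) + vy**2 / (y.size - 1))
    return t, df

def paired_t(after, before):
    """Paired t is a one-sample t on the differences against 0."""
    d = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    return one_sample_t(d, 0.0)

# Illustrative data (assumed for this sketch)
na = [138, 141, 142, 139, 137, 143, 140, 144, 138, 141, 139, 142]
drug = [-12, -15, -9, -18, -11, -14, -17, -10, -13, -16]
placebo = [-2, 1, -3, 0, -1, 2, -4, 1, 3, -2]
before = [8.2, 7.9, 8.5, 9.1, 7.6, 8.8, 8.0, 9.3]
after = [7.1, 7.0, 7.6, 8.0, 6.9, 7.9, 7.2, 8.4]

print("one-sample:", one_sample_t(na, 140), stats.ttest_1samp(na, 140).statistic)
print("Welch:     ", welch_t(drug, placebo),
      stats.ttest_ind(drug, placebo, equal_var=False).statistic)
print("paired:    ", paired_t(after, before), stats.ttest_rel(after, before).statistic)
```

In each printed line the hand-computed t should agree with the SciPy statistic; only the estimate and its standard error change between the three functions.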
Mean gap × sample size → t and p
Tune the standardised gap Δ (in SD units) and the per-group n. The panel computes Welch's t (with both SDs set to 1), its df and the two-tailed p, and draws the two normal densities. Watch for three things: (1) holding Δ fixed and growing n shrinks the SE, inflates |t| and shrinks p; (2) with tiny n, p can stay non-significant even when Δ is large (compare n = 5 with n = 50); (3) Δ = 0.2 (small) at n = 30 is usually not significant, which is the essence of an underpowered design.
Blue = group 1 density · Red = group 2 density
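The arithmetic behind those three observations can be sketched as follows. This is an idealised calculation that assumes the observed gap equals Δ exactly and both sample SDs equal 1 exactly; the function name `t_and_p` is hypothetical, and how the interactive panel is actually implemented is not shown in the text.

```python
import numpy as np
from scipy import stats

def t_and_p(delta, n):
    """Two groups of size n, both SDs = 1, observed mean gap = delta (SD units)."""
    se = np.sqrt(1.0 / n + 1.0 / n)   # SE of the mean difference
    t = delta / se
    df = 2 * (n - 1)                  # Satterthwaite df when SDs and n are equal
    p = 2 * stats.t.sf(abs(t), df)    # two-tailed p
    return t, df, p

for delta, n in [(0.5, 5), (0.5, 50), (0.2, 30)]:
    t, df, p = t_and_p(delta, n)
    print(f"delta={delta}, n={n}: t={t:.2f}, df={df}, p={p:.3f}")
```

With equal SDs and equal n, the Satterthwaite formula reduces to 2(n − 1), so Welch and Student coincide here; the sketch only illustrates how n drives the SE.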
II. Why should Welch be the default?
Welch's t (1947, Biometrika) differs from Student's t in one respect: it does not assume equal group variances. It uses Satterthwaite's formula to adjust the df (typically to a non-integer value). Delacre, Lakens & Leys (2017) compared the two in large-scale simulations:
- When σ₁ = σ₂, Welch and Student have essentially identical power (Welch's df is slightly smaller; the difference is < 1%).
- When σ₁ ≠ σ₂ (especially with unequal n), Student's t loses control of α: a nominal 5% can balloon to 10–20% or shrink to 1%.
- Welch tracks the nominal α much more closely across all conditions.
The verdict: no realistic scenario makes Student's t meaningfully better than Welch, which is consistent with R's choice of Welch as the t.test() default.
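A small simulation can illustrate the α behaviour described above. It is a sketch under assumed settings (normal data, no true difference, the smaller group having the larger SD), not a replication of the published simulations; the function name `type1_rates` and all parameter values are chosen for illustration.

```python
import numpy as np
from scipy import stats

def type1_rates(n1=10, sd1=2.0, n2=40, sd2=1.0, n_sim=20000, alpha=0.05, seed=0):
    """Share of false positives for Student and Welch when both true means are 0."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, sd1, size=(n_sim, n1))   # small group, large SD
    b = rng.normal(0.0, sd2, size=(n_sim, n2))   # large group, small SD
    p_student = stats.ttest_ind(a, b, axis=1, equal_var=True).pvalue
    p_welch = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue
    return {
        "student": float(np.mean(p_student < alpha)),
        "welch": float(np.mean(p_welch < alpha)),
    }

rates = type1_rates()
print(rates)
```

Swapping the SDs (so that the larger group has the larger variance) should push Student's rate below the nominal level instead, which is the "shrinks" case in the list above.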
Paired vs independent designs
"Same mouse, baseline vs treatment" is paired. "Treatment-arm mice vs control-arm mice" is independent. Analysing a paired design as if it were independent throws away information:
The paired SE uses the SD of the difference d, whose variance is Var(after) + Var(before) − 2·Cov(after, before). When after and before are positively correlated (often true within an individual), the covariance term cancels a large share of the variance → smaller SE → much higher power.
Going the other way, treating independent samples as paired fabricates dependence and renders the test meaningless.
Equivalent reformulations: a paired t is identical to a one-sample t on d against 0, and to an intercept-only regression on d. For richer longitudinal data this naturally generalises to mixed-effects models (Step 13) or gain-score regression.
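The variance identity and the equivalence with a one-sample test can be checked numerically. The sketch below uses a small illustrative before/after data set (assumed values) and also runs Welch on the same numbers to show what ignoring the pairing costs.

```python
import numpy as np
from scipy import stats

# Illustrative paired measurements (assumed values)
before = np.array([8.2, 7.9, 8.5, 9.1, 7.6, 8.8, 8.0, 9.3])
after = np.array([7.1, 7.0, 7.6, 8.0, 6.9, 7.9, 7.2, 8.4])
d = after - before

# Var(d) = Var(after) + Var(before) - 2 * Cov(after, before)
var_d = d.var(ddof=1)
cov_ab = np.cov(after, before)[0, 1]          # sample covariance (n - 1 denominator)
var_identity = after.var(ddof=1) + before.var(ddof=1) - 2 * cov_ab

paired = stats.ttest_rel(after, before)
one_sample = stats.ttest_1samp(d, popmean=0.0)         # same test, written on d
welch = stats.ttest_ind(after, before, equal_var=False)  # pairing ignored

print("Var(d):", var_d, "identity:", var_identity)
print("paired t:", paired.statistic, "one-sample t on d:", one_sample.statistic)
print("Welch t (pairing ignored):", welch.statistic)
```

Because before and after are strongly positively correlated in this example, the paired |t| comes out far larger than the Welch |t| on the same numbers.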
Confidence-interval coverage demo
Draw 100 independent samples of n observations each from a known population (μ = 100, σ = 15) and compute a 95% CI (adjustable from 80% to 99%) for each. Each horizontal segment below is one CI; red CIs miss the true μ = 100. Lesson: "95% CI" does not mean "there is a 95% probability the true value is in this interval." It means that, over infinitely many repetitions, the long-run fraction of intervals that cover the true value is ≈ 95%. Hoekstra et al. (2014) found that more than half of researchers and students endorse the wrong interpretation.
Green = covers μ = 100 · Red = misses · Grey dashed = true μ
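The long-run reading of coverage can be reproduced with a short simulation. This is a sketch: the per-sample size n = 20 is an assumption (the text does not state it), and 10,000 replicates are used instead of 100 so that the observed fraction is stable.

```python
import numpy as np
from scipy import stats

def ci_coverage(mu=100.0, sigma=15.0, n=20, level=0.95, n_rep=10000, seed=1):
    """Fraction of t-based CIs for the mean that contain the true mu."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=(n_rep, n))
    mean = x.mean(axis=1)
    se = x.std(axis=1, ddof=1) / np.sqrt(n)
    tcrit = stats.t.ppf(0.5 + level / 2, df=n - 1)   # two-sided critical value
    lower, upper = mean - tcrit * se, mean + tcrit * se
    return float(np.mean((lower <= mu) & (mu <= upper)))

coverage = ci_coverage()
print(f"Observed coverage of nominal 95% CIs: {coverage:.3f}")
```

Any single interval either contains μ or it does not; the 95% describes the procedure across replicates, which is what the printed fraction estimates.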
III. Which one should you use?
🌳 Decision tree for choosing a t-test
IV. Comparing the six tests
| Test | Assumptions | Estimates | When to use | Risk |
|---|---|---|---|---|
| Student t (Gosset 1908) | Two independent groups, normal, equal variances | Mean difference | When σ₁ = σ₂ is genuinely known (rare) | α can go badly wrong with unequal σ + unbalanced n |
| Welch t (Welch 1947) ⭐ | Two independent groups, near-normal (equal variances not required) | Mean difference | The recommended default; R's t.test() uses it by default, SciPy's ttest_ind needs equal_var=False | df is non-integer (Satterthwaite) |
| Paired t | Differences near-normal | Mean of the differences | Pre/post on the same unit, paired design | Pairing must be real; randomisation alone isn't pairing |
| Wilcoxon rank-sum / Mann-Whitney U (Wilcoxon 1945; Mann-Whitney 1947) | Two independent groups, ordinal or continuous, similar distribution shapes | P(X > Y) or location shift | Heavy skew, small n, ordinal data | Still biased with unequal variances; not fully distribution-free |
| Wilcoxon signed-rank (paired) | Differences symmetric | Location of the differences | Small-n paired data with skewed differences | Asymmetric differences can still bias the test |
| Sign test | Only independent +/− signs | Direction of the median | When maximal robustness matters (lowest power) | Uses direction only, ignoring magnitude; wastes continuous measurements |
V. Implementing the four variants
```r
# R: t.test() defaults to Welch (var.equal = FALSE)
library(tidyverse)
library(effectsize)   # Cohen's d

## --- (1) One-sample t: serum Na vs reference 140 ---
na <- c(138,141,142,139,137,143,140,144,138,141,139,142)
t.test(na, mu = 140)

## --- (2) Two-sample Welch t: drug A vs placebo, BP change ---
drug    <- c(-12,-15,-9,-18,-11,-14,-17,-10,-13,-16)
placebo <- c(-2,1,-3,0,-1,2,-4,1,3,-2)
t.test(drug, placebo)                    # Welch by default
t.test(drug, placebo, var.equal = TRUE)  # Student (NOT recommended)
effectsize::cohens_d(drug, placebo, pooled_sd = FALSE)

## --- (3) Paired t: HbA1c before vs after therapy ---
before <- c(8.2,7.9,8.5,9.1,7.6,8.8,8.0,9.3)
after  <- c(7.1,7.0,7.6,8.0,6.9,7.9,7.2,8.4)
t.test(after, before, paired = TRUE)     # or t.test(after - before)

## --- (4) Non-parametric: Wilcoxon & Mann-Whitney ---
wilcox.test(drug, placebo)                 # Mann-Whitney U
wilcox.test(after, before, paired = TRUE)  # signed-rank

## --- CI for mean difference (Welch) ---
fit <- t.test(drug, placebo)
fit$estimate   # the two group means (their difference is the estimate)
fit$conf.int   # 95% CI for the mean difference
```
```python
# Python: scipy.stats.ttest_ind defaults to equal_var=True (Student);
# pass equal_var=False to get Welch.
import numpy as np
from scipy import stats
import pingouin as pg   # Cohen's d & CI

## --- (1) One-sample t ---
na = np.array([138,141,142,139,137,143,140,144,138,141,139,142])
stats.ttest_1samp(na, popmean=140)

## --- (2) Two-sample Welch t (equal_var=False) ---
drug    = np.array([-12,-15,-9,-18,-11,-14,-17,-10,-13,-16])
placebo = np.array([-2, 1,-3, 0,-1, 2,-4, 1, 3,-2])
stats.ttest_ind(drug, placebo, equal_var=False)   # Welch
pg.ttest(drug, placebo, correction=True)          # Welch + d + CI

## --- (3) Paired t ---
before = np.array([8.2,7.9,8.5,9.1,7.6,8.8,8.0,9.3])
after  = np.array([7.1,7.0,7.6,8.0,6.9,7.9,7.2,8.4])
stats.ttest_rel(after, before)

## --- (4) Non-parametric ---
stats.mannwhitneyu(drug, placebo, alternative="two-sided")
stats.wilcoxon(after, before)   # signed-rank

## --- 95% CI for the mean difference (Welch) ---
diff = drug.mean() - placebo.mean()
v1, v2 = drug.var(ddof=1), placebo.var(ddof=1)
n1, n2 = len(drug), len(placebo)
se = np.sqrt(v1/n1 + v2/n2)
df = se**4 / ((v1/n1)**2/(n1-1) + (v2/n2)**2/(n2-1))   # Satterthwaite df
tcrit = stats.t.ppf(0.975, df)
ci = (diff - tcrit*se, diff + tcrit*se)
```
VI. Six common mistakes
❌ Student t instead of Welch
Actively choosing the pooled-variance Student test (var.equal = TRUE in R) after 2017 is a step backwards. Welch is nearly identical to Student under equal variances and far more robust when they differ. In R, leave the default alone; in SciPy, where ttest_ind defaults to equal_var=True, pass equal_var=False.
❌ Paired data analysed as two-sample
Baseline and treatment in the same mouse are paired. Treating them as independent groups counts between-individual variance as noise, which inflates the SE and wastes the power that the within-subject correlation provides. The reverse error (a paired test on independent data) fakes dependence and renders the p-value meaningless.
❌ Misreading the CI
"There's a 95% probability the true value is in this CI" is wrong (Hoekstra 2014). Correct: "Under repeated sampling, the procedure produces an interval covering the truth about 95% of the time." Just report "95% CI [a, b]"; no probabilistic gloss is needed.
❌ Levene pre-test, then choose
"Levene first, Student if not significant, Welch if significant" is a conditional-testing error: the overall α of the combined procedure is no longer the nominal value. Fix: skip Levene and just run Welch.
❌ Reporting only the p-value
"p < 0.05" on its own is almost information-free: you can't tell how large the effect is or how precisely it was estimated. The ASA's 2016 statement on p-values and the ICMJE recommendations both call for reporting estimates together with their uncertainty, so report the mean difference, its 95% CI, and an effect size (Cohen's d or Hedges' g).
❌ Multiple t-tests across three or more groups
Comparing three groups with three pairwise t-tests pushes the family-wise α to roughly 14% (1 − 0.95³ ≈ 0.14 if the tests were independent), not 5%. For ≥ 3 groups use ANOVA (Step 7), and follow up with Tukey HSD or Bonferroni/Holm when post-hoc comparisons are needed.
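The ~14% figure comes from a one-line calculation, sketched below. It treats the m tests as independent, which pairwise comparisons sharing groups are not, so it is an approximation; the function name `familywise_alpha` is hypothetical.

```python
def familywise_alpha(m, alpha=0.05):
    """P(at least one false positive) across m tests, assuming independence."""
    return 1 - (1 - alpha) ** m

fwer_3 = familywise_alpha(3)        # three pairwise tests among three groups
bonferroni_alpha = 0.05 / 3         # per-test threshold under Bonferroni

print(f"family-wise alpha for 3 tests: {fwer_3:.4f}")            # 0.1426
print(f"Bonferroni per-test alpha:     {bonferroni_alpha:.4f}")  # 0.0167
```

Running each test at the Bonferroni threshold brings the family-wise rate back to at most 5%, at some cost in power; Holm's step-down version is uniformly less conservative.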
📝 Self-check
1. Comparing drug A vs placebo (two independent groups, continuous data), what's the recommended default?
2. The correct interpretation of a "95% CI" is:
3. Twelve mice have HbA1c measured once at baseline and once after the drug. Best analysis?
4. A colleague says "I ran Levene, p = 0.04, so I switched to Welch." Best advice?