Step 5: t-tests & Confidence Intervals

Overview

Why Gosset, why t, and why default to Welch?

In 1908, William Sealy Gosset, a chemist at the Guinness brewery in Dublin, needed to estimate population means from very small malt samples. The normal-theory z-test assumes σ is known, which is almost never true in real experiments. Gosset derived the distribution of the test statistic when σ is replaced by the sample standard deviation s: it is not normal but a heavier-tailed distribution indexed by n − 1 degrees of freedom. Because Guinness forbade staff from publishing scientific papers (to protect trade secrets), Gosset submitted to Biometrika under the pen name "Student". That is the origin of Student's t-distribution.

The t-test then became the workhorse of experimental science. Two facts are often missed today: (1) R's t.test() defaults to Welch's t (no equal-variance assumption), whereas SciPy's ttest_ind() defaults to equal_var=True (Student) and runs Welch only when called with equal_var=False; Welch is the default that Delacre, Lakens & Leys (2017, Int Rev Soc Psychol) strongly recommend; (2) many textbooks still teach "pre-test the variances with Levene, then choose Student or Welch". That two-step procedure distorts the Type I error rate and should be retired.

💡

Three-sentence summary: (1) when comparing two independent groups on a continuous variable, default to Welch's t; (2) for pre/post measurements on the same unit, use a paired t; (3) alongside the p-value, always report the mean difference, its 95% CI, and an effect size (Cohen's d), in line with the ASA (2016) statement on p-values and ICMJE reporting guidance.

Three variants

1. The three variants of the t-test

Every t-test follows the same recipe: (estimate − null value) / standard error of the estimate. The three variants differ only in what you estimate and how you compute its SE.

🎯

① One-sample t

Sample mean vs a reference value μ₀ (e.g., a lab-calibration target, or a literature value for healthy blood glucose).

t = (x̄ − μ₀) / (s/√n), df = n − 1.

Example: Measure serum Na⁺ in 12 healthy adults and test against the clinical reference of 140 mmol/L.
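As a minimal sketch of the one-sample formula, the snippet below computes t by hand and compares it with SciPy. The twelve Na⁺ values are the ones used in the code section of this step; the variable names (`mu0`, `t_manual`, `p_manual`) are illustrative choices, not part of any library.

```python
import numpy as np
from scipy import stats

# Serum Na+ (mmol/L) in 12 adults, reused from the code section of this step.
na = np.array([138, 141, 142, 139, 137, 143, 140, 144, 138, 141, 139, 142])
mu0 = 140.0  # clinical reference value

n = na.size
se = na.std(ddof=1) / np.sqrt(n)          # s / sqrt(n), with the sample SD
t_manual = (na.mean() - mu0) / se         # (estimate - null value) / SE
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p, df = n - 1

res = stats.ttest_1samp(na, popmean=mu0)  # library version of the same test
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.3f}")
print(f"scipy : t = {res.statistic:.3f}, p = {res.pvalue:.3f}")
```

The two lines should agree, which is a quick way to confirm that the library call tests the hypothesis you think it does.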

⚖️

② Two-sample t

Tests whether the mean difference between two independent groups is 0. Two versions:

Student (pooled variance): assumes σ₁ = σ₂ and uses s_p² = ((n₁−1)s₁² + (n₂−1)s₂²)/(n₁+n₂−2).

Welch (unequal variances allowed): t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), with df from the Satterthwaite approximation.

Example: Drug A vs placebo, change in blood pressure.
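For reference, the Satterthwaite approximation mentioned above can be written out as follows. The symbol ν for the approximate degrees of freedom is a notational choice made here; the rest uses the same s₁², s₂², n₁, n₂ as the Welch statistic.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
     {\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
```

When s₁ = s₂ and n₁ = n₂ = n, the expression reduces to ν = 2(n − 1), which is Student's df; otherwise ν is smaller and usually non-integer.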

🔗

③ Paired t

The same unit (mouse, patient, tissue block) is measured under two conditions. Treat the differences d = x_after − x_before as a one-sample test against 0.

t = d̄ / (s_d / √n), df = n − 1.

Example: HbA1c before vs after therapy; left vs right limb strength; baseline vs treatment in the same mouse.
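The claim that a paired t is a one-sample t on the differences can be checked directly. This sketch reuses the HbA1c values from the code section of this step; `t_manual` is an illustrative name.

```python
import numpy as np
from scipy import stats

# HbA1c (%) before and after therapy, reused from the code section of this step.
before = np.array([8.2, 7.9, 8.5, 9.1, 7.6, 8.8, 8.0, 9.3])
after = np.array([7.1, 7.0, 7.6, 8.0, 6.9, 7.9, 7.2, 8.4])

d = after - before                                   # within-unit differences
paired = stats.ttest_rel(after, before)              # paired t-test
one_sample = stats.ttest_1samp(d, popmean=0.0)       # one-sample t on d vs 0
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))  # d-bar / (s_d / sqrt(n))

print(paired.statistic, one_sample.statistic, t_manual)  # three identical t values
```

All three numbers coincide, so the choice between `ttest_rel(after, before)` and `ttest_1samp(after - before, 0)` is purely a matter of readability.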

t = (estimate − null value) / SE(estimate)
CI = estimate ± t_{α/2, df} · SE
Cohen's d = (x̄₁ − x̄₂) / s_pooled

All three t-tests share one skeleton; a complete report gives the p-value, the mean difference, its 95% CI, and an effect size. Cohen (1988): d = 0.2 small, 0.5 medium, 0.8 large, but what counts as "large" is field-specific, not a universal cut-off.
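A small sketch of Cohen's d with the pooled SD, applied to the drug A vs placebo values from the code section of this step. The helper name `cohens_d` is hypothetical (it is not a SciPy function), and it implements only the pooled-SD definition given above.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d = (mean(x) - mean(y)) / s_pooled, using sample variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = x.size, y.size
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Blood-pressure change, reused from the code section of this step.
drug = np.array([-12, -15, -9, -18, -11, -14, -17, -10, -13, -16])
placebo = np.array([-2, 1, -3, 0, -1, 2, -4, 1, 3, -2])

d = cohens_d(drug, placebo)
print(f"Cohen's d = {d:.2f}")  # negative sign: drug lowers BP more than placebo
```

With equal group sizes the pooled SD equals the root mean of the two variances, so this matches the unpooled variant for these data; with unbalanced groups the two definitions diverge.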

Interactive simulation ①

Mean difference × sample size → t and p-value

Tune the standardised gap Δ (in units of SD) and the per-group n. The panel computes Welch's t (with both SDs set to 1), the df, and the two-tailed p-value, and draws the two normal densities. Watch: (1) holding Δ fixed and growing n shrinks the SE, inflates |t|, and shrinks p; (2) with tiny n, p may not reach significance even when Δ is large (compare n = 5 with n = 50); (3) Δ = 0.2 (small) at n = 30 is usually not significant, which is the essence of an underpowered design.

Default settings: standardised mean difference Δ = 0.5; per-group sample size n = 30.

Blue = group 1 density · Red = group 2 density
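The panel's arithmetic can be approximated offline. This is a sketch under simplifying assumptions, not the page's actual implementation: the observed difference is taken to be exactly Δ and both sample SDs exactly 1, and the function name `welch_from_summary` is hypothetical.

```python
import numpy as np
from scipy import stats

def welch_from_summary(delta, n, sd=1.0):
    """t, df and two-sided p for a mean difference `delta` with n per group."""
    se = np.sqrt(sd**2 / n + sd**2 / n)   # Welch SE with equal SDs and equal n
    t = delta / se
    df = 2 * (n - 1)                      # Satterthwaite df reduces to 2(n-1) here
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

for delta, n in [(0.5, 5), (0.5, 30), (0.5, 50), (0.2, 30)]:
    t, df, p = welch_from_summary(delta, n)
    print(f"delta={delta}, n={n}: t={t:.2f}, df={df}, p={p:.3f}")
```

Running the loop shows the pattern described above: the same Δ moves from clearly non-significant to significant purely because n grows.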

In depth

2. Why should Welch be the default?

Welch's t (1947, Biometrika) differs from Student's t in one thing: it does not assume equal group variances. It uses Satterthwaite's formula to adjust the df (typically to a non-integer value). Delacre, Lakens & Leys (2017) compared the two in large-scale simulations:

When σ₁ = σ₂, Welch and Student have essentially identical power (Welch's df is slightly smaller; the power difference is under 1%).
When σ₁ ≠ σ₂ (especially with unequal n), Student's t loses control of the error rate: a nominal α of 5% can balloon to 10–20%, or shrink to 1%.
Welch tracks the nominal α much more closely across all conditions.

The verdict: no realistic scenario makes Student's t meaningfully better than Welch. That is why R defaults to Welch.
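The error-rate behaviour can be illustrated with a small Monte Carlo sketch. It is not a reproduction of Delacre et al.'s simulations: the sample sizes, SDs, seed, and number of replications below are assumptions chosen so that the smaller group has the larger variance, the case where Student's t is expected to reject too often.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim = 5000
n1, sd1 = 10, 2.0   # small group, larger SD (assumed scenario)
n2, sd2 = 40, 1.0   # large group, smaller SD

# Both groups share the same true mean, so every rejection is a Type I error.
x = rng.normal(0.0, sd1, size=(n_sim, n1))
y = rng.normal(0.0, sd2, size=(n_sim, n2))

p_student = stats.ttest_ind(x, y, axis=1, equal_var=True).pvalue
p_welch = stats.ttest_ind(x, y, axis=1, equal_var=False).pvalue

alpha_student = np.mean(p_student < 0.05)
alpha_welch = np.mean(p_welch < 0.05)
print(f"Student rejection rate under H0: {alpha_student:.3f}")
print(f"Welch   rejection rate under H0: {alpha_welch:.3f}")
```

Swapping the SDs (larger variance in the larger group) should push Student's rate below 5% instead, which is the "shrink" case mentioned above.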

Pitfall: Levene pre-test, then choose. Many older textbooks teach "first run Levene → if p > 0.05 use Student, else use Welch." This is a conditional-test error: (1) the Levene test itself can be wrong, calling unequal variances equal; (2) conditioning the choice of the second test on the result of the first means the overall α is no longer the nominal value. Zimmerman (2004) and Rasch et al. (2011) both warn against it explicitly. The fix: just run Welch every time.

Paired vs independent designs

"Same mouse, baseline vs treatment" is paired. "Treatment-arm mice vs control-arm mice" is independent. Analysing a paired design as if it were independent throws away information:

The paired SE uses the SD of the difference d, whose variance is Var(after) + Var(before) − 2·Cov(after, before). When after and before are positively correlated (often true within an individual), the covariance term cancels a large share of the variance → smaller SE → much higher power.

Going the other way, treating independent samples as paired fabricates dependence and renders the test meaningless.

Equivalent reformulations: a paired t is identical to "one-sample t on d vs 0" and to "intercept-only regression on d." For richer longitudinal data this naturally generalises to mixed-effects models (Step 13) and gain-score regression.
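To make the covariance argument concrete, this sketch checks the variance identity numerically and compares the paired SE with the SE an independent-groups analysis would use. It reuses the HbA1c values from the code section of this step; the variable names are illustrative.

```python
import numpy as np

before = np.array([8.2, 7.9, 8.5, 9.1, 7.6, 8.8, 8.0, 9.3])
after = np.array([7.1, 7.0, 7.6, 8.0, 6.9, 7.9, 7.2, 8.4])
n = before.size

d = after - before
var_d = d.var(ddof=1)

# 2x2 sample covariance matrix: [[Var(after), Cov], [Cov, Var(before)]]
c = np.cov(after, before)
var_from_identity = c[0, 0] + c[1, 1] - 2 * c[0, 1]

se_paired = np.sqrt(var_d / n)                  # SE used by the paired t
se_indep = np.sqrt(c[0, 0] / n + c[1, 1] / n)   # SE if the pairing is ignored

print(f"Var(d) direct = {var_d:.4f}, via identity = {var_from_identity:.4f}")
print(f"SE paired = {se_paired:.4f}, SE independent = {se_indep:.4f}")
```

Because before and after are strongly positively correlated in these data, the paired SE is a small fraction of the independent one, which is where the extra power comes from.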

Non-parametric alternatives: Wilcoxon and Mann-Whitney. For heavy skew, tiny n, or ordinal data, use the Wilcoxon signed-rank test (paired) or the Wilcoxon rank-sum / Mann-Whitney U test (two independent groups). These test for a shift in the location of the distribution rather than a difference in means. Caveat: when the group variances differ, Mann-Whitney is not a free pass; it is sensitive to differences in spread and shape, not only to a location shift (Hodges & Lehmann 1963; Fagerland & Sandvik 2009). A second option is a bootstrap percentile CI, which makes no assumption about distributional shape and yields an interval for the effect directly.
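A minimal sketch of the bootstrap percentile CI for a difference in means, applied to the drug A vs placebo values from the code section of this step. The function name `bootstrap_diff_ci`, the number of resamples, and the seed are assumptions made for illustration; with samples this small, percentile intervals can be somewhat too narrow.

```python
import numpy as np

def bootstrap_diff_ci(x, y, n_boot=10_000, level=0.95, rng=None):
    """Percentile bootstrap CI for mean(x) - mean(y), resampling each group separately."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Each row is one bootstrap resample (drawn with replacement) of a group.
    bx = rng.choice(x, size=(n_boot, x.size), replace=True).mean(axis=1)
    by = rng.choice(y, size=(n_boot, y.size), replace=True).mean(axis=1)
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(bx - by, [tail, 100 - tail])
    return lo, hi

drug = np.array([-12, -15, -9, -18, -11, -14, -17, -10, -13, -16])
placebo = np.array([-2, 1, -3, 0, -1, 2, -4, 1, 3, -2])

lo, hi = bootstrap_diff_ci(drug, placebo, rng=np.random.default_rng(42))
print(f"95% bootstrap CI for the mean difference: [{lo:.2f}, {hi:.2f}]")
```

The same function works for other statistics if the two `.mean(axis=1)` calls are replaced, for example by medians.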

Interactive simulation ②

Confidence-interval coverage demo

Draw 100 independent samples of n observations each from a known population (μ = 100, σ = 15) and compute a 95% CI (adjustable from 80% to 99%) for each. Each horizontal segment below is one CI; red CIs miss the true μ = 100. Lesson: "95% CI" does not mean "there is a 95% probability the true value is in this interval." It means that, over infinitely many repetitions, the long-run fraction of intervals that cover the true value is ≈ 95%. Hoekstra et al. (2014) found that more than half of researchers and students endorse the wrong interpretation.

Default settings: sample size n = 20; confidence level 95%.

Green = covers μ = 100 · Red = misses · Grey dashed = true μ
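The coverage demo can be imitated in a few lines. This is a sketch, not the page's code: the function name `ci_coverage` and the seed are assumptions, while μ = 100, σ = 15, n = 20 and the 95% level mirror the defaults above.

```python
import numpy as np
from scipy import stats

def ci_coverage(n_rep, n=20, level=0.95, mu=100.0, sigma=15.0, seed=0):
    """Fraction of t-based CIs for the mean that contain the true mu."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu, sigma, size=(n_rep, n))
    means = samples.mean(axis=1)
    ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
    tcrit = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)
    lower, upper = means - tcrit * ses, means + tcrit * ses
    return np.mean((lower <= mu) & (mu <= upper))

print("100 repetitions   :", ci_coverage(100))      # noisy, like the panel
print("20000 repetitions :", ci_coverage(20_000))   # settles near the nominal level
```

With only 100 intervals the observed fraction wanders by a few percentage points from run to run; the 95% refers to the long-run behaviour of the procedure, not to any single interval.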

⚠️

Common CI misreadings (Hoekstra et al. 2014, Psychon Bull Rev): (1) ❌ "There is a 95% probability the true value lies in this interval": wrong, the true value is fixed; (2) ❌ "On replication, 95% of estimates will fall in this interval": wrong, that confuses a CI with a prediction interval; (3) ✅ "Under repeated sampling, the CI-construction procedure produces intervals containing the true value 95% of the time": correct; (4) preferred wording in reports: "The 95% CI for the mean difference was [a, b]", not "There is a 95% chance...".

Decision guide

3. Which one should you use?

🌳 Decision tree for choosing a t-test

Q1:

Pre/post on the same unit, or two conditions on the same subject? → Yes → Paired t (or Wilcoxon signed-rank if n is small and the differences are skewed).

Q2:

Comparing to a known reference value? → Yes → One-sample t (Wilcoxon signed-rank vs μ₀ if n is small and the data are skewed).

Q3:

Two independent groups, continuous, roughly normal (or n > 30 and not heavily skewed)? → Yes → Welch t (default).

Q4:

Heavy skew / extreme outliers / very small n? → Yes → Mann-Whitney U, or log-transform then Welch t, or bootstrap percentile CI.

Q5:

Three or more groups? → Yes → Go to Step 7 ANOVA; do not run repeated pairwise t-tests.

Q6:

Baseline covariates, blocking, repeated measures? → Yes → Go to Step 13 Mixed models.

Comparison of variants

4. Six tests compared

Test	Assumptions	Estimand	When to use	Risk
Student t (Gosset 1908)	Two independent groups, normal, equal variances	Mean difference	When σ₁ = σ₂ is genuinely known (rare)	α is badly distorted with unequal σ and unbalanced n
Welch t (Welch 1947) ⭐	Two independent groups, near-normal (no equal-variance assumption)	Mean difference	The recommended default (R's t.test() default; equal_var=False in SciPy)	df is non-integer (Satterthwaite)
Paired t	Differences near-normal	Mean of differences	Pre/post on the same unit, paired design	Pairing must be real; randomisation alone is not pairing
Wilcoxon rank-sum / Mann-Whitney U (Wilcoxon 1945; Mann-Whitney 1947)	Two independent groups, ordinal or continuous, similar distribution shapes	P(X > Y) or location shift	Heavy skew, small n, ordinal data	Still biased with unequal variances; not fully distribution-free
Wilcoxon signed-rank (paired)	Differences symmetric	Location of differences	Small-n paired design with skewed differences	Asymmetric differences can still bias the test
Sign test	Only independent +/− signs	Direction of the median	When extreme robustness is needed; lowest power	Uses only direction and ignores magnitude, wasting continuous measurement

Code

5. Implementing the four variants

# R: t.test() defaults to Welch (var.equal = FALSE)
library(effectsize)   # Cohen's d

## --- (1) One-sample t: serum Na vs reference 140 ---
na <- c(138,141,142,139,137,143,140,144,138,141,139,142)
t.test(na, mu = 140)

## --- (2) Two-sample Welch t: drug A vs placebo, BP change ---
drug    <- c(-12,-15,-9,-18,-11,-14,-17,-10,-13,-16)
placebo <- c(-2,1,-3,0,-1,2,-4,1,3,-2)
t.test(drug, placebo)                # Welch by default
t.test(drug, placebo, var.equal=TRUE)# Student (NOT recommended)
effectsize::cohens_d(drug, placebo, pooled_sd = FALSE)

## --- (3) Paired t: HbA1c before vs after therapy ---
before <- c(8.2,7.9,8.5,9.1,7.6,8.8,8.0,9.3)
after  <- c(7.1,7.0,7.6,8.0,6.9,7.9,7.2,8.4)
t.test(after, before, paired = TRUE)   # or t.test(after - before)

## --- (4) Non-parametric: Wilcoxon & Mann-Whitney ---
wilcox.test(drug, placebo)            # Mann-Whitney U
wilcox.test(after, before, paired = TRUE) # signed-rank

## --- CI for mean difference (Welch) ---
fit <- t.test(drug, placebo)
fit$estimate; fit$conf.int            # mean diff + 95% CI

# Python: SciPy's ttest_ind() defaults to equal_var=True (Student); pass equal_var=False for Welch
import numpy as np
from scipy import stats
import pingouin as pg     # Cohen's d & CI

## --- (1) One-sample t ---
na = np.array([138,141,142,139,137,143,140,144,138,141,139,142])
stats.ttest_1samp(na, popmean=140)

## --- (2) Two-sample Welch t (equal_var=False) ---
drug    = np.array([-12,-15,-9,-18,-11,-14,-17,-10,-13,-16])
placebo = np.array([-2, 1,-3, 0,-1, 2,-4, 1, 3,-2])
stats.ttest_ind(drug, placebo, equal_var=False)   # Welch
pg.ttest(drug, placebo, correction=True)         # Welch + d + CI

## --- (3) Paired t ---
before = np.array([8.2,7.9,8.5,9.1,7.6,8.8,8.0,9.3])
after  = np.array([7.1,7.0,7.6,8.0,6.9,7.9,7.2,8.4])
stats.ttest_rel(after, before)

## --- (4) Non-parametric ---
stats.mannwhitneyu(drug, placebo, alternative="two-sided")
stats.wilcoxon(after, before)            # signed-rank

## --- 95% CI for the mean difference (Welch) ---
diff = drug.mean() - placebo.mean()
v1, v2 = drug.var(ddof=1), placebo.var(ddof=1)
n1, n2 = len(drug), len(placebo)
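se = np.sqrt(v1/n1 + v2/n2)                            # Welch standard error
df = se**4 / ((v1/n1)**2/(n1-1) + (v2/n2)**2/(n2-1))   # Welch-Satterthwaite df
tcrit = stats.t.ppf(0.975, df)                         # two-sided 95% critical value
ci = (diff - tcrit*se, diff + tcrit*se)
print(diff, ci)                                        # mean diff + 95% CI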

💡

Sample write-up (ICMJE/CONSORT style), using the drug A vs placebo example data from the code above: "Compared with placebo, drug A reduced systolic BP by 13.0 mmHg (95% CI 10.5 to 15.5; Welch's t = 10.9, df = 16.7, p < 0.001; Cohen's d = 4.9)." Note: never report p alone; the effect size and CI matter just as much.

Common pitfalls

6. Six common mistakes

❌ Student's t instead of Welch

Manually setting var.equal = TRUE without a specific reason is a step backwards after Delacre et al. (2017). Welch is nearly identical to Student under equal variances and far more robust when they differ. In R, leave the default alone; in SciPy, pass equal_var=False.

❌ Treating paired data as two independent samples

Baseline and treatment in the same mouse are paired. Treating them as independent inflates the SE by counting inter-individual variance as noise, wasting the power that the within-subject correlation provides. The reverse error (a paired test on independent data) fakes dependence and renders p meaningless.

❌ Misreading the CI

"There's a 95% probability the true value is in this CI" is wrong (Hoekstra et al. 2014). Correct: "Under repeated sampling, the procedure produces an interval covering the truth 95% of the time." Just report "95% CI [a, b]"; no probabilistic gloss is needed.

❌ Levene pre-test, then choose

"Levene first, Student if not significant, Welch if significant" is a conditional-test error; the actual α is no longer the nominal value. Fix: skip Levene and just run Welch.

❌ Reporting only the p-value

"p < 0.05" alone is almost information-free: it says nothing about how large the effect is or how precisely it was estimated. In line with the ASA (2016) statement and ICMJE guidance, report the mean difference, its 95% CI, and an effect size (Cohen's d or Hedges' g) alongside it.

❌ Multiple t-tests across three or more groups

Three groups, three pairwise t-tests → family-wise α ≈ 14%, not 5%. For ≥ 3 groups use ANOVA (Step 7); follow up with Tukey HSD or Bonferroni/Holm when post-hoc comparisons are needed.
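The 14% figure follows from a short calculation, treating the three tests as independent (an approximation, since pairwise comparisons among three groups share data):

```latex
\alpha_{\text{FW}} = 1 - (1 - \alpha)^{m}
                   = 1 - (1 - 0.05)^{3}
                   = 1 - 0.857375
                   \approx 0.143
```

Here m = 3 is the number of pairwise comparisons. A Bonferroni correction would instead test each comparison at α/m = 0.05/3 ≈ 0.0167, which keeps the family-wise rate at or below 5%.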

📝 Self-check

1. Comparing drug A vs placebo (two independent continuous groups), what's the recommended default?

A. Levene first; Student if p > 0.05, else Welch

B. Student t (pooled variance)

C. Welch t (no equal-variance assumption; the default in R's t.test())

D. z-test

2. The correct interpretation of a "95% CI" is:

A. There is a 95% probability the true value lies in this interval

B. 95% of the data fall in this interval

C. Under repeated sampling, the procedure produces intervals covering the truth 95% of the time

D. 95% of sample means fall in this interval

3. Twelve mice are measured for HbA1c at baseline and again after the drug. Best analysis?

A. Two-sample Welch t (baseline vs post)

B. Paired t on d = after − before

C. Three-group ANOVA

D. Chi-square test

4. A colleague says "Levene p = 0.04, so I switched to Welch." Best advice?

A. Great, that's the right flow

B. Switch to Mann-Whitney instead

C. Apply a Bonferroni correction

D. Conditional testing inflates Type I error; skip Levene and just run Welch every time