STEP 5 / 13

t-tests and confidence intervals (CIs)

From a pseudonymous paper by a brewery chemist to a standard routine in R and SciPy: how the t-test became biomedicine's most-used and most-misused inferential tool.

Why Gosset, why t, and why default to Welch?

In 1908, William Sealy Gosset, a chemist at the Guinness brewery in Dublin, needed to estimate population means from very small malt samples. The normal-theory (Gaussian) z-test assumed σ was known, which is almost never true in real experiments. Gosset derived the distribution of the statistic when σ is replaced by the sample standard deviation s and found that it is not normal: it depends on n − 1 degrees of freedom and has heavier tails. Because Guinness forbade staff from publishing scientific papers (to protect trade secrets), Gosset submitted to Biometrika under the pen name "Student." That is the origin of Student's t-distribution.

The t-test then became the workhorse of experimental science. Two facts are often missed today: (1) R's t.test() defaults to Welch's t (no equal-variance assumption), whereas SciPy's ttest_ind() defaults to Student's t (equal_var=True) and gives Welch's t only when called with equal_var=False; Delacre, Lakens & Leys (2017, Int Rev Soc Psychol) strongly recommend Welch as the default; (2) many textbooks still teach "pre-test the variances with Levene, then choose Student or Welch"; that two-step procedure inflates the Type I error rate and should be retired.

💡
Three-sentence summary: (1) when comparing a continuous outcome between two independent groups, default to Welch's t; (2) for pre/post measurements on the same unit, use a paired t; (3) alongside the p-value, always report the mean difference, its 95% CI, and an effect size (Cohen's d), as the ASA (2016) statement and the ICMJE both expect.

1. Three variants of the t-test

Every t-test follows the same recipe: (estimate − null value) / standard error of the estimate. The three variants differ only in what you estimate and how you compute its SE.

🎯

① One-sample t

Sample mean vs a reference value μ₀ (e.g., a lab-calibration target, or a literature value for healthy blood glucose).

t = (x̄ − μ₀) / (s/√n), df = n − 1.

Example: measure serum Na⁺ in 12 healthy adults and test whether the mean differs from the clinical reference of 140 mmol/L.
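The one-sample formula can be traced by hand. The sketch below is a minimal illustration using only the Python standard library and the serum Na⁺ values from the implementation section of this step; the variable names are illustrative choices, and the p-value is left to t.test() or scipy.stats.ttest_1samp.

```python
import math
import statistics

# Serum Na+ (mmol/L) in 12 healthy adults (values from this step's example data).
na = [138, 141, 142, 139, 137, 143, 140, 144, 138, 141, 139, 142]
mu0 = 140  # clinical reference value

n = len(na)
xbar = statistics.mean(na)     # sample mean
s = statistics.stdev(na)       # sample SD (n - 1 denominator)
se = s / math.sqrt(n)          # standard error of the mean
t_stat = (xbar - mu0) / se     # one-sample t statistic
df = n - 1                     # degrees of freedom

print(f"mean = {xbar:.3f}, s = {s:.3f}, SE = {se:.3f}, t = {t_stat:.3f}, df = {df}")
```

With these values the sample mean sits only about a third of a mmol/L above the reference, so the t statistic is small (roughly 0.5 on 11 df).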

⚖️

② Two-sample t

Tests whether the mean difference between two independent groups is 0. Two versions:

Student (pooled variance): assumes σ₁ = σ₂ and uses s_p² = ((n₁−1)s₁² + (n₂−1)s₂²)/(n₁+n₂−2), with t = (x̄₁ − x̄₂) / (s_p·√(1/n₁ + 1/n₂)) and df = n₁ + n₂ − 2.

Welch (unequal variances allowed): t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), with df from the Satterthwaite approximation: df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)].

Example: drug A vs placebo, change in blood pressure.
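To see how the two versions differ in practice, here is a small standard-library sketch that computes both statistics for the drug/placebo blood-pressure changes used in this step's example data. The variable names are illustrative; the Satterthwaite expression is the standard approximation, written out explicitly.

```python
import math
import statistics

# Change in blood pressure (values from this step's example data).
drug = [-12, -15, -9, -18, -11, -14, -17, -10, -13, -16]
placebo = [-2, 1, -3, 0, -1, 2, -4, 1, 3, -2]

n1, n2 = len(drug), len(placebo)
m1, m2 = statistics.mean(drug), statistics.mean(placebo)
v1, v2 = statistics.variance(drug), statistics.variance(placebo)  # sample variances

# Student: pooled variance, df = n1 + n2 - 2
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_student = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df_student = n1 + n2 - 2

# Welch: separate variances, Satterthwaite df (usually non-integer)
se_welch = math.sqrt(v1 / n1 + v2 / n2)
t_welch = (m1 - m2) / se_welch
df_welch = se_welch**4 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

print(f"Student: t = {t_student:.3f}, df = {df_student}")
print(f"Welch:   t = {t_welch:.3f}, df = {df_welch:.2f}")
```

Because the two groups here have equal n, the two t statistics coincide and only the degrees of freedom differ; with unbalanced groups the statistics themselves diverge.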

🔗

③ Paired t

The same unit (the same mouse, patient, or tissue block) measured under two conditions. Treat the differences d = x_after − x_before as a one-sample test against 0.

t = d̄ / (s_d / √n), df = n − 1.

Example: HbA1c before vs after therapy; left vs right limb strength; baseline vs treatment in the same mouse.
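The claim that a paired t is just a one-sample t on the differences can be checked numerically. The sketch below assumes NumPy and SciPy are installed and uses the HbA1c values from this step's example data; it is an illustration, not the only way to run the test.

```python
import numpy as np
from scipy import stats

# HbA1c (%) before and after therapy in the same 8 patients (example data).
before = np.array([8.2, 7.9, 8.5, 9.1, 7.6, 8.8, 8.0, 9.3])
after = np.array([7.1, 7.0, 7.6, 8.0, 6.9, 7.9, 7.2, 8.4])

d = after - before                                # within-patient differences

paired = stats.ttest_rel(after, before)           # paired t-test
one_sample = stats.ttest_1samp(d, popmean=0.0)    # one-sample t on d vs 0
manual_t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))  # formula by hand

print(f"paired t     = {paired.statistic:.3f}")
print(f"one-sample t = {one_sample.statistic:.3f}")
print(f"manual t     = {manual_t:.3f}")
```

All three routes give the same statistic, which is why the paired test inherits df = n − 1 from the one-sample case.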

t = (estimate − null value) / SE(estimate)   ·   CI = estimate ± t_{α/2, df} · SE   ·   Cohen's d = (x̄₁ − x̄₂) / s_pooled

All three t-tests share one skeleton; a report needs the p-value, the mean difference, the 95% CI, and the effect size together. Cohen (1988): d = 0.2 small, 0.5 medium, 0.8 large, but what counts as "large" depends on the field and is not a universal cut-off.

Mean difference × sample size → t and p

Tune the standardised mean difference Δ (in SD units) and the per-group sample size n. The panel computes Welch's t (with both SDs set to 1), df, and the two-tailed p-value, and draws the two normal densities. Watch for: (1) holding Δ fixed while n grows shrinks the SE, inflates |t|, and shrinks p; (2) with tiny n, p can stay non-significant even when Δ is large (compare n = 5 with n = 50); (3) Δ = 0.2 (small) at n = 30 is usually not significant, which is the essence of an underpowered design.

Blue = group 1 density · Red = group 2 density
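The panel's arithmetic can be reproduced in a few lines. The sketch below is an idealised version (it treats the observed gap as exactly Δ and both sample SDs as exactly 1, which is an assumption made for illustration); welch_panel is a hypothetical helper name, and SciPy is assumed to be available for the t distribution.

```python
import math
from scipy import stats

def welch_panel(delta, n):
    """Idealised two-group comparison: both SDs = 1, n per group, observed gap = delta."""
    se = math.sqrt(1.0 / n + 1.0 / n)       # SE of the mean difference
    t_stat = delta / se
    df = 2 * (n - 1)                        # Satterthwaite df when SDs and n are equal
    p = 2 * stats.t.sf(abs(t_stat), df)     # two-tailed p-value
    return t_stat, df, p

for delta, n in [(0.2, 30), (0.8, 5), (0.8, 50)]:
    t_stat, df, p = welch_panel(delta, n)
    print(f"delta = {delta}, n = {n}: t = {t_stat:.2f}, df = {df}, p = {p:.4f}")
```

In this idealised setting, Δ = 0.2 at n = 30 and Δ = 0.8 at n = 5 both stay above p = 0.05, while Δ = 0.8 at n = 50 falls far below it.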

2. Why should Welch be the default?

Welch's t (1947, Biometrika) differs from Student's t in one respect: it does not assume equal group variances. It uses Satterthwaite's formula to adjust the df (typically to a non-integer value). Delacre, Lakens & Leys (2017) compared the two in large-scale simulations:

  • When σ₁ = σ₂, Welch and Student have essentially identical power (Welch's df is slightly smaller; the difference is < 1%).
  • When σ₁ ≠ σ₂ (especially with unequal n), Student's t loses control of the error rate: a nominal α of 5% can balloon to 10–20% (when the smaller group has the larger variance) or shrink to 1% (when the larger group does).
  • Welch tracks the nominal α much more closely across all conditions.

The verdict: no realistic scenario makes Student's t clearly better than Welch's, which is why R's t.test() defaults to Welch.
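A small simulation makes the loss of α control concrete. This is a minimal sketch, not a reproduction of the Delacre et al. study: the group sizes, SDs, seed, and number of simulations are illustrative assumptions, and NumPy and SciPy are assumed to be installed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, alpha = 4000, 0.05
n1, sd1 = 10, 2.0   # small group with the larger SD (assumed scenario)
n2, sd2 = 40, 1.0   # large group with the smaller SD

# Both groups share the same true mean, so every rejection is a false positive.
a = rng.normal(0.0, sd1, size=(n_sim, n1))
b = rng.normal(0.0, sd2, size=(n_sim, n2))

p_student = stats.ttest_ind(a, b, axis=1, equal_var=True).pvalue
p_welch = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue

student_rate = float(np.mean(p_student < alpha))
welch_rate = float(np.mean(p_welch < alpha))

print(f"Empirical Type I error, Student: {student_rate:.3f}")
print(f"Empirical Type I error, Welch:   {welch_rate:.3f}")
```

In this scenario Student's false-positive rate lands well above the nominal 5%, while Welch's stays close to it. Swapping the SDs so that the larger group is the more variable one pushes Student's rate below 5% instead.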

Pitfall: pre-testing with Levene, then choosing. Many older textbooks teach "first run Levene → if p > 0.05 use Student, otherwise use Welch." This is a conditional-test error: (1) the Levene test itself can err, for example by failing to detect variances that really are unequal; (2) conditioning the choice of the second test on the outcome of the first means the overall α no longer matches its nominal value. Zimmerman (2004) and Rasch et al. (2011) both warn against it explicitly. The fix: run Welch every time.

Paired vs independent designs

"Same mouse, baseline vs treatment" is paired; "treatment-arm mice vs control-arm mice" is independent. Analysing a paired design as two independent samples throws away information:

The paired SE uses the SD of the difference d, whose variance is Var(after) + Var(before) − 2·Cov(after, before). When after and before are positively correlated (as is usual within an individual), the covariance term cancels a large share of the variance → smaller SE → much higher power.

The reverse mistake, treating independent samples as paired, fabricates dependence and renders the test meaningless.

Equivalent formulations: a paired t is identical to a one-sample t on d against 0, and to an intercept-only regression on d. For richer longitudinal data this generalises naturally to mixed-effects models (Step 13) or gain-score regression.
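The variance identity behind the paired design's power can be verified directly. The sketch below assumes NumPy is available and reuses the HbA1c before/after values from this step's example data; the comparison with an "independent" SE is purely illustrative, showing what would be lost by ignoring the pairing.

```python
import numpy as np

# HbA1c (%) before and after therapy in the same 8 patients (example data).
before = np.array([8.2, 7.9, 8.5, 9.1, 7.6, 8.8, 8.0, 9.3])
after = np.array([7.1, 7.0, 7.6, 8.0, 6.9, 7.9, 7.2, 8.4])
n = len(before)

cov = np.cov(after, before)   # 2x2 sample covariance matrix (n - 1 denominator)
var_after, var_before, cov_ab = cov[0, 0], cov[1, 1], cov[0, 1]

# Var(d) = Var(after) + Var(before) - 2 Cov(after, before)
var_d_identity = var_after + var_before - 2 * cov_ab
var_d_direct = np.var(after - before, ddof=1)

se_paired = np.sqrt(var_d_direct / n)                 # SE used by the paired t
se_indep = np.sqrt(var_after / n + var_before / n)    # SE if pairing were ignored
r = cov_ab / np.sqrt(var_after * var_before)          # within-patient correlation

print(f"Var(d): identity = {var_d_identity:.4f}, direct = {var_d_direct:.4f}")
print(f"SE paired = {se_paired:.4f}, SE independent = {se_indep:.4f}, r = {r:.3f}")
```

With a within-patient correlation this high, the covariance term removes most of the variance, so the paired SE is several times smaller than the SE obtained by treating the two columns as independent groups.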

Non-parametric alternatives: Wilcoxon and Mann-Whitney. For heavily skewed data, very small samples, or ordinal measurements, use the Wilcoxon signed-rank test (paired) or the Wilcoxon rank-sum / Mann-Whitney U test (two independent groups). These test for a shift in the location of the distribution rather than a difference in means. Caveat: Mann-Whitney is not a free pass when within-group variances differ; it responds to stochastic dominance rather than to a pure location shift, so unequal spreads can still distort it (Hodges & Lehmann 1963; Fagerland & Sandvik 2009). A second option is a bootstrap percentile CI, which assumes no particular distributional shape and yields an interval for the effect directly.
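As an illustration of the bootstrap option, the sketch below builds a percentile CI for the difference in means using the drug/placebo values from this step's example data. It assumes NumPy is installed; the seed and the number of resamples are arbitrary choices, and with only 10 observations per group the interval should be read as a rough guide rather than an exact 95% interval.

```python
import numpy as np

rng = np.random.default_rng(42)

# Change in blood pressure (values from this step's example data).
drug = np.array([-12, -15, -9, -18, -11, -14, -17, -10, -13, -16], dtype=float)
placebo = np.array([-2, 1, -3, 0, -1, 2, -4, 1, 3, -2], dtype=float)

n_boot = 10_000
# Resample each group independently, with replacement, and recompute its mean.
boot_drug = rng.choice(drug, size=(n_boot, drug.size), replace=True).mean(axis=1)
boot_placebo = rng.choice(placebo, size=(n_boot, placebo.size), replace=True).mean(axis=1)
boot_diff = boot_drug - boot_placebo

observed = drug.mean() - placebo.mean()
ci_low, ci_high = np.percentile(boot_diff, [2.5, 97.5])   # percentile interval

print(f"observed difference = {observed:.1f}")
print(f"bootstrap 95% percentile CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling within each group (rather than from the pooled data) keeps the two-independent-groups structure of the design intact.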

Confidence-interval coverage demo

Draw 100 independent samples of n observations from a known population (μ = 100, σ = 15) and compute a 95% CI (adjustable from 80% to 99%) for each. Each horizontal segment in the figure is one CI; red segments miss the true μ = 100. The key lesson: "95% CI" does not mean "there is a 95% probability that the true value is in this interval." It means that, over infinitely many repetitions of the experiment, the long-run fraction of intervals that cover the true value is ≈ 95%. Hoekstra et al. (2014) reported that more than half of researchers and students endorsed incorrect interpretations.

Green = covers μ = 100 · Red = misses · Grey dashed line = true μ
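The long-run coverage idea can be simulated directly. The sketch below assumes NumPy and SciPy are installed; the per-sample size n = 25 is an assumed value (the demo leaves n adjustable), and it uses many more repetitions than the demo's 100 so that the estimated coverage is stable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma = 100.0, 15.0   # known population, as in the demo
n = 25                    # assumed per-sample size
n_rep = 10_000            # repetitions (the demo draws 100)
level = 0.95

samples = rng.normal(mu, sigma, size=(n_rep, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
tcrit = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)   # t critical value

lower = means - tcrit * ses
upper = means + tcrit * ses
coverage = float(np.mean((lower <= mu) & (mu <= upper)))

print(f"Fraction of {n_rep} intervals covering mu: {coverage:.3f}")
```

Each individual interval either contains μ or it does not; the 95% describes the procedure, which is why the fraction across repetitions settles near the chosen level.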

⚠️
Common CI misreadings (Hoekstra 2014, Psychon Bull Rev): (1) ❌ "There is a 95% probability that the true value lies in this interval": wrong, because the true value is fixed; (2) ❌ "On replication, 95% of estimates will fall in this interval": wrong, because it confuses a CI with a prediction interval; (3) ✅ "Under repeated sampling, the CI-construction procedure produces intervals that contain the true value 95% of the time": correct; (4) preferred wording when reporting: "The 95% CI for the mean difference was [a, b]", not "There is a 95% chance...".

3. Which one should you use?

🌳 Decision tree for choosing a t-test

Q1:
Pre/post on the same unit, or two conditions on the same subject? → Yes → Paired t (or Wilcoxon signed-rank if n is small and the data are skewed).
Q2:
Comparing against a known reference value? → Yes → One-sample t (Wilcoxon signed-rank vs μ₀ if n is small and the data are skewed).
Q3:
Two independent groups, continuous data, roughly normal (or n > 30 and not heavily skewed)? → Yes → Welch t (the recommended default).
Q4:
Heavy skew / extreme outliers / very small n? → Yes → Mann-Whitney U, or log-transform then Welch t, or a bootstrap percentile CI.
Q5:
Three or more groups? → Yes → Go to Step 7 (ANOVA); do not run repeated pairwise t-tests.
Q6:
Baseline covariates, randomised blocks, or repeated measures? → Yes → Go to Step 13 (Mixed models).

4. Comparing the tests

Test | Assumptions | Estimates | When to use | Risk
Student t (Gosset 1908) | Two independent groups, normal, equal variances | Mean difference | When σ₁ = σ₂ is genuinely known (rare) | α is badly distorted with unequal σ and unbalanced n
Welch t (Welch 1947) ⭐ | Two independent groups, near-normal (equal variances not required) | Mean difference | The recommended default; R's t.test() uses it by default, SciPy's ttest_ind() needs equal_var=False | df is non-integer (Satterthwaite)
Paired t | Differences near-normal | Mean of the differences | Pre/post on the same unit, paired designs | The pairing must be real; randomisation alone is not pairing
Wilcoxon rank-sum / Mann-Whitney U (Wilcoxon 1945; Mann-Whitney 1947) | Two independent groups, ordinal or continuous, similar distribution shapes | P(X > Y) or location shift | Heavy skew, small n, ordinal data | Still biased with unequal variances; not fully distribution-free
Wilcoxon signed-rank (paired) | Differences symmetric | Location of the differences | Small-n paired data with non-normal differences | Asymmetric differences can still bias the test
Sign test | Only the +/− signs, independent | Direction of the median | Extremely robust, but lowest power | Uses direction only and wastes the continuous measurement

5. Implementing the four variants

# R: t.test() defaults to Welch (var.equal = FALSE)
library(tidyverse)
library(effectsize)   # Cohen's d

## --- (1) One-sample t: serum Na vs reference 140 ---
na <- c(138,141,142,139,137,143,140,144,138,141,139,142)
t.test(na, mu = 140)

## --- (2) Two-sample Welch t: drug A vs placebo, BP change ---
drug    <- c(-12,-15,-9,-18,-11,-14,-17,-10,-13,-16)
placebo <- c(-2,1,-3,0,-1,2,-4,1,3,-2)
t.test(drug, placebo)                # Welch by default
t.test(drug, placebo, var.equal=TRUE)# Student (NOT recommended)
effectsize::cohens_d(drug, placebo, pooled_sd = FALSE)

## --- (3) Paired t: HbA1c before vs after therapy ---
before <- c(8.2,7.9,8.5,9.1,7.6,8.8,8.0,9.3)
after  <- c(7.1,7.0,7.6,8.0,6.9,7.9,7.2,8.4)
t.test(after, before, paired = TRUE)   # or t.test(after - before)

## --- (4) Non-parametric: Wilcoxon & Mann-Whitney ---
wilcox.test(drug, placebo)            # Mann-Whitney U
wilcox.test(after, before, paired = TRUE) # signed-rank

## --- CI for mean difference (Welch) ---
fit <- t.test(drug, placebo)
fit$estimate; fit$conf.int            # mean diff + 95% CI
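
# Python: scipy.stats.ttest_ind() defaults to Student (equal_var=True); pass equal_var=False for Welch
import numpy as np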
from scipy import stats
import pingouin as pg     # Cohen's d & CI

## --- (1) One-sample t ---
na = np.array([138,141,142,139,137,143,140,144,138,141,139,142])
stats.ttest_1samp(na, popmean=140)

## --- (2) Two-sample Welch t (equal_var=False) ---
drug    = np.array([-12,-15,-9,-18,-11,-14,-17,-10,-13,-16])
placebo = np.array([-2, 1,-3, 0,-1, 2,-4, 1, 3,-2])
stats.ttest_ind(drug, placebo, equal_var=False)   # Welch
pg.ttest(drug, placebo, correction=True)         # Welch + d + CI

## --- (3) Paired t ---
before = np.array([8.2,7.9,8.5,9.1,7.6,8.8,8.0,9.3])
after  = np.array([7.1,7.0,7.6,8.0,6.9,7.9,7.2,8.4])
stats.ttest_rel(after, before)

## --- (4) Non-parametric ---
stats.mannwhitneyu(drug, placebo, alternative="two-sided")
stats.wilcoxon(after, before)            # signed-rank

## --- 95% CI for the mean difference (Welch) ---
diff = drug.mean() - placebo.mean()
v1, v2 = drug.var(ddof=1), placebo.var(ddof=1)
n1, n2 = len(drug), len(placebo)
se = np.sqrt(v1/n1 + v2/n2)
df = se**4 / ((v1/n1)**2/(n1-1) + (v2/n2)**2/(n2-1))
tcrit = stats.t.ppf(0.975, df)
ci = (diff - tcrit*se, diff + tcrit*se)
💡
Sample write-up (ICMJE/CONSORT style; the numbers are illustrative and are not computed from the example data above): "Compared with placebo, drug A reduced systolic BP by 13.0 mmHg (95% CI 9.6 to 16.4; Welch's t = 8.2, df = 14.1, p < 0.001; Cohen's d = 3.6)." Note: never report the p-value alone; the effect size and the CI are equally important.

6. Six common mistakes

Student's t instead of Welch's

Deliberately setting var.equal = TRUE in R (or leaving SciPy's equal_var at its default of True) after 2017 is a step backwards. Welch is nearly identical to Student under equal variances and far more robust when they differ. Unless you have a specific reason, keep R's default and pass equal_var=False in SciPy.

Analysing paired data as two independent samples

Baseline and treatment in the same mouse are paired. Treating them as independent groups inflates the SE by counting between-individual variance as noise, wasting the power that the within-subject correlation provides. The reverse error (a paired test on independent data) fabricates dependence and makes the p-value meaningless.

Misinterpreting the CI

"There is a 95% probability that the true value is in this CI" is wrong (Hoekstra 2014). Correct: "Under repeated sampling, the procedure produces an interval covering the truth about 95% of the time." It is enough to report "95% CI [a, b]" without a probabilistic gloss.

Pre-testing with Levene, then choosing

"Levene first, Student if not significant, Welch if significant" is a conditional-test error; the actual α is no longer the nominal one. Fix: skip Levene and run Welch directly.

Reporting only the p-value

"p < 0.05" on its own carries almost no information: it tells you neither how large the effect is nor how precisely it was estimated. The ASA (2016) statement and the ICMJE both call for reporting the mean difference, its 95% CI, and an effect size (Cohen's d or Hedges' g) alongside it.

Running multiple t-tests across three or more groups

Three groups compared with three pairwise t-tests → family-wise α ≈ 14%, not 5%. For three or more groups use ANOVA (Step 7), and follow up with Tukey HSD or Bonferroni/Holm when post-hoc comparisons are needed.
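The ≈ 14% figure follows from 1 − (1 − α)^k with k = 3 tests. The sketch below works through the arithmetic with the standard library; familywise_alpha is a hypothetical helper name, and the formula treats the tests as independent, which is a simplifying assumption because pairwise comparisons share groups.

```python
from math import comb

def familywise_alpha(k_tests, alpha=0.05):
    """Family-wise error rate for k tests at level alpha, assuming independent tests."""
    return 1 - (1 - alpha) ** k_tests

groups = 3
k = comb(groups, 2)              # number of pairwise comparisons among the groups
fwer = familywise_alpha(k)       # chance of at least one false positive
bonferroni_alpha = 0.05 / k      # per-test level that caps the family-wise rate at 5%

print(f"{groups} groups -> {k} pairwise tests, family-wise alpha = {fwer:.4f}")
print(f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")
```

The same function shows how quickly the problem grows: five groups already imply ten pairwise tests.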

📝 Self-check

1. Comparing drug A vs placebo (two independent groups, continuous data), what is the recommended default?

A. Levene first; Student if p > 0.05, otherwise Welch
B. Student t (pooled variance)
C. Welch t (no equal-variance assumption; the default in R's t.test(), equal_var=False in SciPy)
D. z-test

2. The most accurate interpretation of a "95% CI" is:

A. There is a 95% probability that the true value lies in this interval
B. 95% of the data points fall in this interval
C. Under repeated sampling, the procedure produces intervals that cover the truth about 95% of the time
D. 95% of sample means fall in this interval

3. Twelve mice each have HbA1c measured once at baseline and once after the drug. Best analysis?

A. Two-sample Welch t (baseline group vs post-drug group)
B. Paired t on the differences d = after − before
C. Three-group ANOVA
D. Chi-square test

4. A colleague says, "I ran Levene and got p = 0.04, so I switched to Welch." Best advice?

A. Great, that is the right workflow
B. Switch to Mann-Whitney instead
C. Apply a Bonferroni correction
D. Conditional testing inflates the Type I error rate; skip Levene and run Welch every time