Step 7: ANOVA — Biostatistics Tutorial

Overview

Why do we need ANOVA?

With three or more groups of continuous data to compare (for example blood pressure under placebo / low / medium / high dose), the instinct is "run all pairwise t-tests". That is a mistake. K groups need C(K,2) t-tests, and the overall Type I error inflates quickly. Treating the tests as independent for a rough estimate: 3 groups, 3 tests → 1 − 0.95³ ≈ 14%; 4 groups, 6 tests → ≈ 26%; 6 groups, 15 tests → ≈ 54%. This is the classic multiple-comparisons problem.
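The percentages above can be reproduced in a few lines. This is a minimal sketch that assumes the C(K,2) tests are independent, which pairwise t-tests sharing groups are not, so the figures are an approximation; the function name `familywise_error` is ours, not from any library.

```python
from math import comb

def familywise_error(k_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise tests,
    treating the C(K, 2) tests as independent (an approximation)."""
    m = comb(k_groups, 2)          # number of pairwise comparisons
    return 1 - (1 - alpha) ** m

for k in (3, 4, 6):
    print(f"K={k}: {comb(k, 2)} tests, FWER ~ {familywise_error(k):.1%}")
```

For K = 3, 4 and 6 this gives roughly 14%, 26% and 54%, matching the numbers quoted above.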

Ronald A. Fisher's 1925 Statistical Methods for Research Workers introduced ANOVA: a single test of the null hypothesis that all group means are equal, built by partitioning total variability into a between-group piece and a within-group piece. Under that null, the ratio F = MS_between / MS_within follows the Fisher–Snedecor F-distribution; a sufficiently large F rejects it, meaning at least one mean differs.

Deeper insight: ANOVA is linear regression with dummy variables. aov(y ~ group) fits the same model as lm(y ~ factor(group)), which is why modern statistics treats ANOVA as a special case of the general linear model, itself the Gaussian, identity-link case of the generalized linear model (McCullagh & Nelder 1989).

💡

Historical note: Fisher developed ANOVA at Rothamsted Experimental Station while analysing wheat yields across fertilizer treatments, the prototypical multi-group comparison. The "F" in F-distribution honours Fisher; the name was coined by George W. Snedecor in 1934.

Core formulas

1. What the F statistic really is

Let y_ij be observation j in group i, with group mean ȳ_i· and grand mean ȳ_··. The core ANOVA identity is SS_total = SS_between + SS_within, where

$$SS_{between} = \sum_i n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2, \qquad SS_{within} = \sum_i \sum_j (y_{ij} - \bar{y}_{i\cdot})^2$$

$$F = \frac{MS_{between}}{MS_{within}} = \frac{SS_{between}/(K-1)}{SS_{within}/(N-K)}$$

Here df_between = K−1 (K groups) and df_within = N−K (N total observations). When all μᵢ are equal, E[MS_between] = E[MS_within] = σ², so F ≈ 1; if a real difference exists, E[MS_between] > σ² and F grows.
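To make the decomposition concrete, here is a small sketch that computes SS_between, SS_within and F directly from the definitions and checks the identity SS_total = SS_between + SS_within. The data are made up for illustration and the helper name `one_way_anova` is ours.

```python
def one_way_anova(groups):
    """Return (SS_between, SS_within, F) for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return ss_between, ss_within, f_stat

# Toy data (made up): three groups of three observations
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_b, ss_w, f_stat = one_way_anova(groups)

# SS_total computed directly, to check SS_total = SS_between + SS_within
values = [y for g in groups for y in g]
grand_mean = sum(values) / len(values)
ss_total = sum((y - grand_mean) ** 2 for y in values)

print(ss_b, ss_w, ss_total, f_stat)
```

With these numbers SS_between is 26, SS_within is 6, SS_total is 32 and F is 13 on (2, 6) degrees of freedom, up to floating-point rounding.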

🔢

One-way

One categorical predictor (e.g., four treatments); tests the null "all group means are equal". The most common form. Equivalent to lm(y ~ group); the omnibus F equals the overall regression F.

🔀

Two-way

Two categorical predictors plus their interaction. Example: drug × dose. Read the main effects and ask "does the drug effect vary with dose?". A significant interaction means the effects are not additive.

🔁

RM-ANOVA

The same subject measured repeatedly (baseline / week 2 / week 4). Accounts for within-subject correlation and assumes sphericity; when Mauchly's test indicates a violation, apply the Greenhouse–Geisser correction. Now largely superseded by mixed-effects models (see Step 13).

⚖️

Welch / B-F

Welch (1951) and Brown & Forsythe (1974): when group variances are unequal, classical ANOVA no longer holds its nominal Type I error rate. Welch ANOVA does not assume equal variances and should be the default, mirroring the Welch t-test.

🪜

Kruskal–Wallis

Kruskal & Wallis (1952, JASA): a rank-based non-parametric alternative with no normality assumption. Use it when n is small and the data are clearly skewed. Follow up with Dunn's test or pairwise Wilcoxon tests with BH adjustment.

📐

ANOVA = regression

K groups → K−1 dummy variables (reference coding). The F-test from lm(y ~ group) equals aov's F; each β is the mean difference between that group and the reference group. Grasping this leads directly to ANCOVA (add covariates).
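The regression view can be checked by hand. The sketch below builds an intercept plus K−1 dummy columns for a toy data set, solves the normal equations with a small hand-written solver, and computes the regression F. Everything here (data, `solve`, variable names) is illustrative and standard-library only; real analyses would use lm() or statsmodels as in the code section.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Toy data (made up): group "a" is the reference level
data = {"a": [1, 2, 3], "b": [2, 3, 4], "c": [5, 6, 7]}
levels = list(data)

# Design matrix: intercept + K-1 dummy columns (reference coding)
X, y = [], []
for g, ys in data.items():
    for v in ys:
        X.append([1.0] + [1.0 if g == lev else 0.0 for lev in levels[1:]])
        y.append(float(v))

# Normal equations (X'X) beta = X'y
p = len(X[0])
xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(p)]
beta = solve(xtx, xty)

# Regression F: (model SS / (p - 1)) / (residual SS / (N - p))
fitted = [sum(b * x for b, x in zip(beta, r)) for r in X]
y_bar = sum(y) / len(y)
ss_model = sum((f - y_bar) ** 2 for f in fitted)
ss_resid = sum((v - f) ** 2 for v, f in zip(y, fitted))
f_stat = (ss_model / (p - 1)) / (ss_resid / (len(y) - p))

print(beta, f_stat)
```

The intercept comes out as the reference-group mean (2), the two slopes as the mean differences b − a (1) and c − a (4), and the regression F (13) is exactly the one-way ANOVA F for the same data.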

Interactive simulation ①

Three-group F-statistic playground

Tune the three means μ₁, μ₂, μ₃, the common within-group SD, and the per-group n, then watch F and p change. Intuition: bigger group separation → larger SS_between → larger F; bigger within-group SD (more noise) → larger SS_within → smaller F; larger n → F is more sensitive to small real differences. η² is the effect size (see below).

Default settings: μ₁ (Group A) = 50, μ₂ (Group B) = 52, μ₃ (Group C) = 55, within-group SD = 5, n = 20 per group.

Figure: three-group dot plot of the simulated data (each point = one observation, bar = group mean).
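Without the interactive widget, the same intuition can be read off the expected mean squares. For equal group sizes, E[MS_between] = σ² + n·Σ(μᵢ − μ̄)²/(K−1) and E[MS_within] = σ². The sketch below evaluates their ratio at the default settings and at three variations. Note that this ratio of expectations is only a rough guide to the size of F, not its exact expected value, and the function names are ours.

```python
def expected_mean_squares(mus, sd, n):
    """Population-level E[MS_between] and E[MS_within] for equal group sizes n."""
    k = len(mus)
    mu_bar = sum(mus) / k
    ems_within = sd ** 2
    ems_between = sd ** 2 + n * sum((m - mu_bar) ** 2 for m in mus) / (k - 1)
    return ems_between, ems_within

def ratio(mus, sd, n):
    """E[MS_between] / E[MS_within]: equals 1 when all means are equal."""
    between, within = expected_mean_squares(mus, sd, n)
    return between / within

base = ratio([50, 52, 55], sd=5, n=20)        # the default settings
wider = ratio([50, 55, 60], sd=5, n=20)       # larger separation between means
noisier = ratio([50, 52, 55], sd=10, n=20)    # more within-group noise
bigger_n = ratio([50, 52, 55], sd=5, n=80)    # more observations per group
null = ratio([50, 50, 50], sd=5, n=20)        # no true difference

print(base, wider, noisier, bigger_n, null)
```

At the defaults the ratio is about 6.07; widening the means or increasing n pushes it up, doubling the SD pulls it down, and equal means give exactly 1.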

Assumption checks

2. The three assumptions and how to check them

1️⃣ Residual normality

The requirement is not that the raw data in each group look normal, but that the model residuals e_ij = y_ij − ȳ_i· are approximately normal. Check with a residual QQ plot (most informative) and Shapiro–Wilk (n < 50). Thanks to the central limit theorem, ANOVA is fairly robust to mild violations when n is large (Glass et al. 1972).

2️⃣ Homogeneity of variance

Equal σ² across groups. Check with Levene's test (the median-centred version is more robust) or Bartlett's test (sensitive to non-normality). The same critique as in the t-test chapter applies: testing first and then choosing classical vs Welch is a data-driven decision that inflates Type I error (Zimmerman 2004). Defaulting to Welch ANOVA is the cleaner choice.

3️⃣ Independence

Observations are independent of one another. This is the hardest assumption to check and the most damaging to violate: multiple slices from one mouse, multiple wells per dish, and repeated blood draws from one patient all break it (pseudoreplication, Hurlbert 1984). When it is violated, switch to RM-ANOVA or a mixed model (Step 13).

⚠️

Common mistake: running Shapiro–Wilk on each group separately. With small n and many groups, each test has low power, and it checks the wrong thing: ANOVA's normality assumption is stated for the model residuals, not for the raw data in each group. The right way: fit the model first, then QQ-plot resid(fit).
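For the equal-variance check in assumption 2, it may help to see that the median-centred Levene test is nothing more than a one-way ANOVA F computed on the absolute deviations from each group's median. The sketch below uses made-up data and our own helper names; in practice car::leveneTest or scipy.stats.levene (both used in the code section) also return the p-value from the F(K−1, N−K) distribution.

```python
from statistics import median

def f_statistic(groups):
    """One-way ANOVA F for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_b = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_w = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ss_b / (k - 1)) / (ss_w / (n_total - k))

def levene_median(groups):
    """Median-centred Levene statistic: ANOVA F on |y - group median|."""
    abs_dev = [[abs(y - median(g)) for y in g] for g in groups]
    return f_statistic(abs_dev)

# Toy data (made up): same centre, increasing spread
groups = [[1, 2, 3], [0, 2, 4], [-3, 2, 7]]
w = levene_median(groups)
print(round(w, 4))
```

A large statistic signals that the typical spread differs between groups; with only three points per group, as here, the test has very little power, which is one more reason to default to Welch ANOVA rather than pre-test.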

Post-hoc comparisons

3. Post-hoc: who differs from whom?

A significant omnibus F tells you "at least one pair differs" but not which pair. Answering that requires a post-hoc test. Post-hoc methods control either the family-wise error rate (FWER) or the false discovery rate (FDR); the five workhorses below all control FWER, while FDR adjustments such as Benjamini–Hochberg (BH) appear later alongside the non-parametric follow-ups.

Method	Scope	Controls	Conservativeness	When to use
Tukey HSD	All K(K−1)/2 pairs	FWER	Medium	Default; optimal with equal n (Tukey 1949)
Bonferroni	Any m tests	FWER	High (most conservative)	Few comparisons; simple and transparent; underpowered for large m
Holm	Any m tests	FWER	Medium (never more conservative than Bonferroni)	Step-down version of Bonferroni, higher power (Holm 1979)
Dunnett	K−1 (each group vs a single control)	FWER	Medium	Dose-response trials; drug vs placebo (Dunnett 1955)
Scheffé	All possible linear contrasts	FWER	Highest (very conservative)	Post-hoc complex contrasts such as (A+B)/2 vs C (Scheffé 1959)

Practical rules: (1) all pairs → Tukey HSD; (2) each group vs control → Dunnett (higher power than Tukey because there are fewer comparisons); (3) a few pre-planned contrasts → Holm; (4) exploratory complex contrasts thought of after the fact → Scheffé. Don't look at the data first and then decide which pairs to test (cherry-picking): that is p-hacking.
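To see why Holm is never more conservative than Bonferroni, the following sketch computes adjusted p-values for both methods from a list of raw p-values. The raw values are hypothetical and the function names are ours; statistical packages (for example R's p.adjust) provide the same adjustments.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each p by m, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[i])    # multiplier m, m-1, ..., 1
        running_max = max(running_max, candidate)      # keep adjusted p monotone
        adjusted[i] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]      # hypothetical raw p-values
print(bonferroni(raw))
print(holm(raw))
```

For these inputs Bonferroni gives about 0.04, 0.16, 0.12 and 0.02, while Holm gives about 0.03, 0.06, 0.06 and 0.02: the smallest p-value is treated identically and every other one is penalised less.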

Interactive simulation ②

How post-hoc adjustments change with K

Slide K (the number of groups) and watch how the per-test effective α shrinks under each method as K grows. Bonferroni: α/m, shrinking in inverse proportion to m = C(K,2). Tukey HSD: uses the studentized range distribution; a gentler adjustment. Holm step-down: sequential thresholds α/m, α/(m−1), …, α, so it is less conservative than Bonferroni.

Default setting: number of groups K = 5.

Figure: y-axis = effective per-test α threshold (lower = more conservative).
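The Bonferroni and Holm curves in this simulation follow directly from their definitions, as the sketch below shows. Tukey HSD is left out because it needs the studentized range distribution, which the standard library does not provide. Function names are ours.

```python
from math import comb

def bonferroni_threshold(k_groups, alpha=0.05):
    """Single per-test threshold alpha / m, with m = C(K, 2) pairwise tests."""
    return alpha / comb(k_groups, 2)

def holm_thresholds(k_groups, alpha=0.05):
    """Holm step-down thresholds for the ordered p-values p_(1) <= ... <= p_(m)."""
    m = comb(k_groups, 2)
    return [alpha / (m - i) for i in range(m)]

for k in (3, 5, 8):
    h = holm_thresholds(k)
    print(k, comb(k, 2), bonferroni_threshold(k), h[0], h[-1])
```

Holm's strictest threshold equals the Bonferroni one (α/m), and its thresholds then relax step by step up to α for the largest p-value; with K = 5 (m = 10) that is a range from 0.005 to 0.05.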

Decision guide

4. Decision tree

🌳 ANOVA decision tree

Q1: Clearly non-normal with small n (< 15/group)? → Yes → Kruskal–Wallis + Dunn or pairwise Wilcoxon + BH.

Q2: Same subjects measured across timepoints? → Yes → Prefer a linear mixed model (Step 13); classical RM-ANOVA needs Mauchly's test first, plus the Greenhouse–Geisser correction if sphericity fails.

Q3: Two categorical factors with a possible interaction? → Yes → Two-way ANOVA; for unbalanced designs use Type II/III SS (car::Anova).

Q4: Levene's test significant or visibly unequal spread in the boxplots? → Yes → Welch ANOVA (oneway.test); pair with Games–Howell post-hoc. (Given the pre-testing critique above, simply defaulting to Welch is also defensible.)

Q5: One factor, roughly equal variances, residuals near-normal? → Yes → One-way ANOVA. Post-hoc: all pairs → Tukey HSD; vs control → Dunnett.

Q6: Need to adjust for covariates (age, baseline value)? → Yes → ANCOVA (lm(y ~ group + covariate)), which is simply regression.
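The decision tree can be written down as a small function, which makes the order of the questions explicit. This is only one reading of the tree: the function name, argument names and returned strings are hypothetical, and the covariate question (Q6) is checked before falling through to the plain one-way case (Q5).

```python
def choose_anova(*, clearly_non_normal=False, n_per_group=30,
                 repeated_measures=False, two_factors=False,
                 unequal_variance=False, covariates=False):
    """Walk questions Q1-Q6 in order and return the suggested analysis."""
    if clearly_non_normal and n_per_group < 15:                      # Q1
        return "Kruskal-Wallis + Dunn (or pairwise Wilcoxon + BH)"
    if repeated_measures:                                            # Q2
        return "linear mixed model (or RM-ANOVA with Greenhouse-Geisser)"
    if two_factors:                                                  # Q3
        return "two-way ANOVA (Type II/III SS if unbalanced)"
    if unequal_variance:                                             # Q4
        return "Welch ANOVA + Games-Howell"
    if covariates:                                                   # Q6
        return "ANCOVA: lm(y ~ group + covariate)"
    return "one-way ANOVA + Tukey HSD or Dunnett"                    # Q5

print(choose_anova(unequal_variance=True))
print(choose_anova(clearly_non_normal=True, n_per_group=8))
```

Real designs often answer "yes" to several questions at once (for example repeated measures with covariates), in which case a single mixed or regression model usually covers them together; the function is a reading aid, not a substitute for that judgement.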

In depth

5. Effect sizes and unbalanced designs

Effect size

η² (eta squared) = SS_between / SS_total: the proportion of total variance explained by group membership. Cohen 1988 benchmarks: 0.01 small, 0.06 medium, 0.14 large.

ω² (omega squared) is less biased and preferable for small samples. Partial η², used in multi-factor designs, takes only "this effect + residual" as its denominator, so each main effect can be read on its own. Report p, F, df, η²/ω² and a 95% CI together.
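Both effect sizes follow from the ANOVA table. The sketch below uses the usual one-way formula ω² = (SS_between − (K−1)·MS_within) / (SS_total + MS_within); the sums of squares are hypothetical and the function name is ours.

```python
def effect_sizes(ss_between, ss_within, k_groups, n_total):
    """Eta squared and omega squared for a one-way design."""
    ss_total = ss_between + ss_within
    ms_within = ss_within / (n_total - k_groups)
    eta_sq = ss_between / ss_total
    omega_sq = (ss_between - (k_groups - 1) * ms_within) / (ss_total + ms_within)
    return eta_sq, omega_sq

# Hypothetical sums of squares: 4 groups, 80 observations in total
eta_sq, omega_sq = effect_sizes(ss_between=900.0, ss_within=4100.0,
                                k_groups=4, n_total=80)
print(round(eta_sq, 3), round(omega_sq, 3))
```

Here η² is 0.18 and ω² about 0.146: ω² subtracts the between-group variation expected from noise alone, which is why it comes out smaller and is the less biased of the two.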

Type I / II / III SS

With unequal group sizes, the way SS for main effects is computed changes the result. Type I (sequential): what R's anova() reports by default; sensitive to the order in which terms enter the model. Type II: each main effect is adjusted for the other main effects but not for the interaction (recommended when the interaction is non-significant). Type III: the SAS default, available as car::Anova(fit, type=3); each effect is adjusted for all other terms, and the main-effect tests are only meaningful with sum-to-zero contrasts (contr.sum).

Langsrud 2003 (Statistics and Computing) walks through the differences. With balanced designs (equal n) all three agree; in unbalanced designs, state explicitly in the Methods which type you used.

🚨

"My anova() and car::Anova() give different answers!" This is almost always an unbalanced design combined with different SS types. R defaults to Type I (sequential); SAS and most textbooks default to Type III. When writing a paper, don't just paste R's default output; state the method, for example: "Type III sums of squares were computed using car::Anova with sum-to-zero contrasts."

Code

6. Worked examples

# R: classical, Welch, Kruskal–Wallis, post-hoc, two-way
library(tidyverse); library(car); library(emmeans); library(rstatix)

set.seed(1)                          # reproducible simulated data

# Drug dose-response: placebo / low / medium / high (n=20 each)
df <- tibble(
  dose  = factor(rep(c("placebo","low","med","high"), each=20),
                levels=c("placebo","low","med","high")),
  bp    = c(rnorm(20,140,8), rnorm(20,135,8),
           rnorm(20,130,8), rnorm(20,122,8)))

# --- 1. Classical one-way ANOVA (assumes equal variance) ---
fit <- aov(bp ~ dose, data = df)
summary(fit)                       # F, df, p
summary(lm(bp ~ dose, data = df))   # identical F: ANOVA = regression

# --- 2. Assumption checks on RESIDUALS ---
plot(fit, which = 2)                # residual QQ plot
shapiro.test(resid(fit))           # residual normality
car::leveneTest(bp ~ dose, data = df, center = median)

# --- 3. Welch ANOVA (does not assume equal variances) ---
oneway.test(bp ~ dose, data = df, var.equal = FALSE)

# --- 4. Post-hoc ---
TukeyHSD(fit)                       # all pairs, FWER
emmeans::emmeans(fit, pairwise ~ dose, adjust = "tukey")
# Dunnett: each vs placebo (control)
emmeans::emmeans(fit, trt.vs.ctrl ~ dose, ref = "placebo")
# Games–Howell for unequal variance
rstatix::games_howell_test(df, bp ~ dose)

# --- 5. Kruskal–Wallis non-parametric ---
kruskal.test(bp ~ dose, data = df)
rstatix::dunn_test(df, bp ~ dose, p.adjust.method = "BH")

# --- 6. Two-way ANOVA + interaction ---
# df2 is simulated here purely for illustration (2 drugs x 2 doses, n=10 per cell)
df2 <- expand.grid(drug = c("A","B"), dose = c("low","high"), rep = 1:10)
df2$bp <- rnorm(nrow(df2), 130, 8)
# Type III SS needs sum-to-zero contrasts (see the Type I / II / III SS section)
fit2 <- lm(bp ~ drug * dose, data = df2,
           contrasts = list(drug = contr.sum, dose = contr.sum))
car::Anova(fit2, type = 3)         # Type III SS

# --- 7. Effect size ---
rstatix::anova_summary(fit, effect.size = "pes")   # partial η²
effectsize::omega_squared(fit)              # ω²

# Python: the same workflow with scipy, statsmodels, pingouin, scikit-posthocs
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pingouin as pg          # clean ANOVA + effect size
import scikit_posthocs as sp   # Dunn's test

rng = np.random.default_rng(1)
df = pd.DataFrame({
  "dose": np.repeat(["placebo","low","med","high"], 20),
  "bp":   np.concatenate([rng.normal(m,8,20) for m in [140,135,130,122]])
})
# One array of bp values per dose group, reused below
groups = [df.loc[df.dose == g, "bp"] for g in df.dose.unique()]

# --- 1. Classical one-way ANOVA ---
fit = ols("bp ~ C(dose)", data=df).fit()
sm.stats.anova_lm(fit, typ=2)
stats.f_oneway(*groups)

# pingouin gives F, df, p and partial η² in one line
pg.anova(data=df, dv="bp", between="dose", effsize="np2")

# --- 2. Assumption checks on residuals ---
stats.shapiro(fit.resid)
stats.levene(*groups, center="median")

# --- 3. Welch ANOVA ---
pg.welch_anova(data=df, dv="bp", between="dose")

# --- 4. Post-hoc ---
pairwise_tukeyhsd(df.bp, df.dose)              # Tukey HSD
pg.pairwise_gameshowell(data=df, dv="bp", between="dose")
pg.pairwise_tests(data=df, dv="bp", between="dose",
                   parametric=True, padjust="holm")

# --- 5. Kruskal–Wallis + Dunn ---
stats.kruskal(*groups)
sp.posthoc_dunn(df, val_col="bp", group_col="dose", p_adjust="fdr_bh")

# --- 6. Two-way + interaction, Type III SS ---
# df2 is simulated here purely for illustration (2 drugs x 2 doses, n=10 per cell)
df2 = pd.DataFrame({
  "drug": np.tile(np.repeat(["A","B"], 10), 2),
  "dose": np.repeat(["low","high"], 20),
  "bp":   rng.normal(130, 8, 40)
})
# Type III SS needs sum-to-zero coding, hence C(..., Sum)
fit2 = ols("bp ~ C(drug, Sum) * C(dose, Sum)", data=df2).fit()
sm.stats.anova_lm(fit2, typ=3)

💡

Minimal recommended reporting: "One-way ANOVA showed a significant effect of dose on systolic BP (F(3,76)=18.4, p<0.001, ω²=0.39). Tukey HSD: high vs placebo Δ=−18 mmHg, 95% CI [−24, −12], p<0.001; med vs placebo Δ=−10 mmHg, 95% CI [−16, −4], p=0.001."

Common pitfalls

7. Six pitfalls

❌ Omnibus ≠ specific contrast

A significant F means "at least one pair differs"; it does not by itself show that "treatment vs placebo works". Run a post-hoc test before claiming a specific pair. Conversely, a pre-planned contrast can still be significant even when the omnibus test is not (Hsu 1996).

❌ Pseudoreplication

3 mice × 4 slices ≠ n=12. Hurlbert's classic 1984 paper identifies this as a widespread error in ecological and biological experiments. The independent unit is the mouse; slices are within-subject replicates, so fit a mixed model with mouse as a random effect.

❌ No multiplicity adjustment

"Omnibus p=0.04 ✓; now run a few t-tests to find which pair is significant" is a textbook garden-of-forking-paths route. When you compare m pairs, report adjusted p-values (Tukey/Bonferroni/Holm) and name the method in the Methods section.

❌ Unbalanced design, SS type unstated

Type I/II/III results diverge on unbalanced data. Following Langsrud 2003, a paper should state the choice explicitly, e.g. "Type III SS via car::Anova with contr.sum"; otherwise the analysis cannot be reproduced. R's default (Type I) is a frequent trap.

❌ Outliers poison F

SS uses squared distances, so a single extreme point can inflate MS_within and push F below significance. Inspect boxplots and QQ plots first; in severe cases switch to Kruskal–Wallis or a robust ANOVA (Wilcox 2017; WRS2::t1way uses trimmed means).

❌ p > 0.05 ≠ no difference

Failing to reject the null does not imply the groups are equal: a small n or large variance can hide a real difference (Type II error). Report the effect size with its 95% CI to tell "truly no difference" from "underpowered to detect one". For a strong claim of no difference, run an equivalence test.
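The outlier pitfall above is easy to demonstrate numerically. In this sketch, with made-up data and our own helper name, changing a single observation inflates SS_within enough to collapse F.

```python
def f_statistic(groups):
    """One-way ANOVA F for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_b = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_w = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ss_b / (k - 1)) / (ss_w / (n_total - k))

clean = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
contaminated = [[1, 2, 30], [2, 3, 4], [5, 6, 7]]   # one value replaced by an outlier

f_clean = f_statistic(clean)
f_contaminated = f_statistic(contaminated)
print(f_clean, f_contaminated)
```

F falls from 13 to about 0.54 after one value changes from 3 to 30, because SS_within jumps from 6 to 546 while SS_between grows far less. A rank-based or trimmed-mean method would be much less affected.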

📝 Self-check

1. You compare cell viability across 4 drug concentrations (n=15/group). Levene's test gives p=0.02 and the residual QQ plot is roughly linear. Which primary test fits best?

A. Just run six pairwise t-tests

B. Classical one-way ANOVA (aov)

C. Welch ANOVA (oneway.test, var.equal=FALSE) + Games–Howell post-hoc

D. Kruskal–Wallis

2. A paper writes "ANOVA significant (F=4.2, p=0.01), so treatment beats control". What is the main problem?

A. It should use a t-test instead of ANOVA

B. F is too small to trust

C. p should be <0.001 to count

D. The omnibus F only says "at least one pair differs"; the specific treatment-vs-control claim needs a post-hoc test (e.g., Tukey / Dunnett)

3. You compare 6 cytokine treatments against a single untreated control. Which post-hoc is most efficient?

A. Tukey HSD (all 15 pairs)

B. Dunnett (only 6 tests vs control; higher power)

C. Bonferroni on all 15 pairs

D. Scheffé

4. Five brain slices are taken from each of three mice, and all 3 × 5 = 15 observations are fed into ANOVA. What is the problem?

A. n=15 is too small

B. Bonferroni adjustment should be applied

C. Pseudoreplication: slices are not independent observations; the independent unit is the mouse, so fit a mixed model with mouse as a random effect

D. Welch ANOVA should be used