Why do we need ANOVA?
With three or more groups of continuous data to compare on their means (e.g., blood pressure under placebo / low / medium / high dose), the instinct is "do all pairwise t-tests". Wrong. K groups need C(K,2) t-tests, and the family-wise Type I error rate explodes. Treating the tests as independent at α = 0.05: 3 groups, 3 tests → 1 − 0.95³ ≈ 14%; 4 groups, 6 tests → ≈ 26%; 6 groups, 15 tests → ≈ 54%. That is the classic multiple-comparisons problem.
Ronald A. Fisher's 1925 Statistical Methods for Research Workers introduced ANOVA: a single test of the null "all group means are equal" against "at least one mean differs", built by partitioning total variability into a between-group piece and a within-group piece. Under the null, the ratio F = MSbetween / MSwithin follows the Fisher–Snedecor F-distribution; a sufficiently large F rejects the null that all means are equal.
Deeper insight: ANOVA is just linear regression with dummy variables. aov(y ~ group) is mathematically identical to lm(y ~ factor(group)), which is why modern statistics treats ANOVA as a special case of the linear model, and hence of the generalized linear model (GLM; McCullagh & Nelder 1989).
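The inflation figures quoted above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the pairwise tests are independent (they are not exactly so in practice, since they share groups); the function name `familywise_error` is introduced here for illustration only.

```python
from math import comb

def familywise_error(k_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise tests,
    under the simplifying assumption that the tests are independent."""
    m = comb(k_groups, 2)            # number of pairwise comparisons, C(K, 2)
    return 1 - (1 - alpha) ** m

for k in (3, 4, 6):
    # prints: groups, number of tests, family-wise error rate
    print(k, comb(k, 2), round(familywise_error(k), 3))
```

For K = 3, 4 and 6 this gives 0.143, 0.265 and 0.537, matching the rounded percentages in the text.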
1. The nature of the F statistic
Let yij be observation j in group i, with group mean ȳi· and grand mean ȳ··. The core ANOVA identity is SS_total = SS_between + SS_within:
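Written out in the notation above, the standard decomposition and the resulting F ratio read as follows. The symbols n_i (observations in group i) and N (total observations, the sum of the n_i) are assumptions added here for illustration; K is the number of groups.

```latex
\begin{aligned}
\underbrace{\sum_{i=1}^{K}\sum_{j=1}^{n_i} \left(y_{ij}-\bar{y}_{\cdot\cdot}\right)^2}_{SS_{\text{total}}}
&= \underbrace{\sum_{i=1}^{K} n_i\left(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}\right)^2}_{SS_{\text{between}}}
 + \underbrace{\sum_{i=1}^{K}\sum_{j=1}^{n_i} \left(y_{ij}-\bar{y}_{i\cdot}\right)^2}_{SS_{\text{within}}} \\[4pt]
F &= \frac{MS_{\text{between}}}{MS_{\text{within}}}
   = \frac{SS_{\text{between}}/(K-1)}{SS_{\text{within}}/(N-K)}
\end{aligned}
```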
One-way
One categorical predictor (e.g., four treatments); tests "all group means equal". The most common form. Equivalent to lm(y ~ group); the omnibus F equals the regression F.
Two-way
Two categorical predictors + an interaction. Example: drug × dose. Read main effects and "does the drug effect vary with dose?". A significant interaction means non-additivity.
RM-ANOVA
The same subject measured repeatedly (baseline / week 2 / week 4). Accounts for within-subject correlation; assumes sphericity — when Mauchly's test fails, apply Greenhouse–Geisser. Now largely superseded by mixed-effects models (see Step 13).
Welch / B-F
Welch (1951) and Brown & Forsythe (1974): when group variances are unequal, classical ANOVA no longer holds its Type I error at the nominal α. Welch ANOVA does not assume equal variances and should be the default, mirroring the Welch t-test.
Kruskal–Wallis
Kruskal & Wallis 1952 JASA: rank-based non-parametric alternative, no normality assumption. Use when n is small with clear skew. Follow up with Dunn's test or pairwise Wilcoxon + BH.
ANOVA = regression
K groups → K−1 dummy variables (reference coding). The F-test from lm(y ~ group) equals aov's F; each β is "group vs reference mean difference". Grasping this unlocks ANCOVA (add covariates).
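The equivalence can be checked numerically. The sketch below builds the K−1 dummy columns by hand, fits ordinary least squares with NumPy, and compares the result with `scipy.stats.f_oneway`. The simulated data and all variable names are hypothetical, chosen only to illustrate the point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: K = 3 groups, 10 observations each
groups = [rng.normal(mu, 2.0, 10) for mu in (10.0, 12.0, 15.0)]
y = np.concatenate(groups)
labels = np.repeat([0, 1, 2], 10)

# Design matrix: intercept + K-1 dummies, with group 0 as the reference
X = np.column_stack([np.ones_like(y), labels == 1, labels == 2]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Regression F-test of "all dummy coefficients are zero"
resid = y - X @ beta
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
df_model, df_resid = X.shape[1] - 1, len(y) - X.shape[1]
f_reg = ((ss_tot - ss_res) / df_model) / (ss_res / df_resid)

f_anova, p_anova = stats.f_oneway(*groups)
mean_diffs = [groups[1].mean() - groups[0].mean(),
              groups[2].mean() - groups[0].mean()]

print(beta[1:], mean_diffs)   # each slope equals a mean difference vs. the reference
print(f_reg, f_anova)         # the same F from both routes
```

The intercept is the reference group's mean, so adding a covariate column to `X` is all it takes to move from this layout to ANCOVA.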
Three-group F-statistic playground
Tune the three means μ₁, μ₂, μ₃, the common within-group SD, and per-group n. Watch F and p change. Intuition: bigger group separation → larger SS_between → larger F; bigger within-group SD (more noise) → larger SS_within → smaller F; larger n → F is more sensitive to small real differences. η² is the effect size (see below).
Three-group dotplot of simulated data (each point = one observation, bar = group mean)
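For readers without the interactive widget, the same experiment can be approximated offline. The sketch below is not the page's own implementation: `simulate_f` is a hypothetical helper that draws normal data for the chosen means, SD and n, then computes F, p and η² directly from the sums of squares.

```python
import numpy as np
from scipy import stats

def simulate_f(mus, sd, n, seed=0):
    """Draw n observations per group and return (F, p, eta_squared)."""
    rng = np.random.default_rng(seed)
    groups = [rng.normal(mu, sd, n) for mu in mus]
    grand = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    k, n_total = len(groups), sum(len(g) for g in groups)
    f = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    p = stats.f.sf(f, k - 1, n_total - k)      # upper-tail probability of F
    return f, p, ss_between / (ss_between + ss_within)

print(simulate_f((10, 11, 12), sd=2.0, n=20))   # baseline
print(simulate_f((10, 13, 16), sd=2.0, n=20))   # wider separation of the means
print(simulate_f((10, 11, 12), sd=6.0, n=20))   # noisier groups
print(simulate_f((10, 11, 12), sd=2.0, n=200))  # same means, larger n
```

Because the seed is fixed, reruns give the same numbers; changing `seed` shows how much F moves from sample to sample at small n.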
2. The three assumptions and how to check them
1️⃣ Normality of residuals
Not "data are normal in each group" — the model residuals eij = yij − ȳi· should be approximately normal. Check via residual QQ plot (most informative) and Shapiro–Wilk (n < 50). With CLT support, ANOVA is fairly robust to mild violations when n is large (Glass et al. 1972).
2️⃣ Homogeneity of variance
Equal σ² across groups. Levene's test (median version is robust); Bartlett's test (sensitive to non-normality). Same critique as the t-test chapter: testing first and then choosing classical vs Welch is data-driven and inflates Type I error (Zimmerman 2004). Defaulting to Welch ANOVA is the cleaner choice.
3️⃣ Independence
Observations are independent. The hardest to check and the most dangerous — multiple slices from one mouse, multiple wells per dish, repeated draws from one patient all violate this (pseudoreplication, Hurlbert 1984). Switch to RM-ANOVA or a mixed model (Step 13).
Common mistake: running Shapiro–Wilk on each group separately. With small n and many groups, this gives low power per test and checks the wrong thing — ANOVA's normality assumption is on the residuals, not raw group data. Right way: fit the model, then QQ-plot resid(fit).
3. Post-hoc: which groups differ?
A significant omnibus F tells you "at least one pair differs" but not which pair. That requires a post-hoc test, and each option controls a different error rate — family-wise (FWER) or false discovery rate (FDR). The five workhorses:
| Method | Scope | Controls | Conservativeness | When to use |
|---|---|---|---|---|
| Tukey HSD | All K(K−1)/2 pairs | FWER | Medium | Default; optimal with equal n (Tukey 1949) |
| Bonferroni | Any m tests | FWER | High | Few comparisons; transparent; underpowered for large m |
| Holm | Any m tests | FWER | Medium (never more conservative than Bonferroni) | Step-down version of Bonferroni, higher power (Holm 1979) |
| Dunnett | K−1 (each group vs a single control) | FWER | Medium | Dose-response; drug vs placebo (Dunnett 1955) |
| Scheffé | All linear contrasts | FWER | Highest (very conservative) | Post-hoc complex contrasts like (A+B)/2 vs C (Scheffé 1959) |
How post-hoc adjustments scale with K
Slide K (number of groups). Watch how the per-test effective α shrinks under each method as K grows. Bonferroni: α/m, inversely proportional to m = C(K,2), so it shrinks roughly as 1/K². Tukey HSD: uses the studentized range distribution; gentler adjustment. Holm step-down: sequential thresholds α/m, α/(m−1), …, α, so only the smallest p-value faces the full Bonferroni cut and the procedure is never more conservative than Bonferroni.
y-axis = effective per-test α threshold (lower = more conservative)
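The Bonferroni and Holm thresholds are simple enough to compute by hand. The following is a minimal pure-Python sketch; the function names and the three raw p-values are hypothetical, and Tukey's studentized-range adjustment is left out because it needs the group sizes and residual degrees of freedom.

```python
from math import comb

def bonferroni_threshold(k_groups, alpha=0.05):
    """Per-test alpha when all C(K, 2) pairs are tested under Bonferroni."""
    return alpha / comb(k_groups, 2)

def holm_thresholds(m, alpha=0.05):
    """Holm step-down thresholds for the smallest, 2nd smallest, ... p-value."""
    return [alpha / (m - i) for i in range(m)]

def holm_reject(p_values, alpha=0.05):
    """Reject flag per p-value (in the original order) under Holm's procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] > alpha / (m - rank):
            break                      # stop at the first non-rejection
        reject[idx] = True
    return reject

for k in (3, 4, 6, 10):
    print(k, comb(k, 2), bonferroni_threshold(k))

p = [0.010, 0.020, 0.030]                   # hypothetical raw p-values, m = 3
print(holm_reject(p))                       # Holm decisions
print([pi <= 0.05 / len(p) for pi in p])    # Bonferroni decisions
```

With these three p-values Holm rejects all three hypotheses while Bonferroni rejects only the first, which is the power gain the text describes.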
4. Decision tree
🌳 ANOVA decision tree
[…] car::Anova).
[…] oneway.test); pair with Games–Howell post-hoc.
[…] lm(y ~ group + covariate)), equivalent to regression.
5. Effect size and unbalanced designs
Effect size
η² (eta squared) = SS_between / SS_total — proportion of total variance explained by groups. Cohen 1988 benchmarks: 0.01 small, 0.06 medium, 0.14 large.
ω² (omega squared): less biased, better for small samples. partial η² for multi-factor designs uses only "this effect + residual" as denominator, so each main effect reads independently. Report p, F, df, η²/ω², 95% CI together.
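Both effect sizes follow directly from the sums of squares. The sketch below uses the common one-way formula ω² = (SS_between − (K−1)·MS_within) / (SS_total + MS_within), which the text does not spell out and is stated here as an assumption; the helper name and toy data are hypothetical.

```python
import numpy as np

def effect_sizes(groups):
    """Return (eta_squared, omega_squared) for a one-way layout."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_y = np.concatenate(groups)
    k, n_total = len(groups), all_y.size
    ss_between = sum(g.size * (g.mean() - all_y.mean()) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ss_total = ss_between + ss_within
    ms_within = ss_within / (n_total - k)
    eta2 = ss_between / ss_total
    # omega squared subtracts the part of SS_between expected from noise alone
    omega2 = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    return eta2, omega2

# Hypothetical toy data: three groups of three observations
print(effect_sizes([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

For this toy data SS_between = SS_within = 6, so η² = 0.5 while ω² = 4/13 ≈ 0.31, which shows how much η² can overstate the effect at small n.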
Type I / II / III SS
With unequal group sizes the order in which SS for main effects is computed matters. Type I (sequential): R's default anova(), order-sensitive; Type II: main effects ignoring interaction (recommended when interaction is non-significant); Type III: SAS default, car::Anova(fit, type=3) — requires orthogonal contrasts (contr.sum) to be valid.
Langsrud 2003 Statistics & Computing walks through the differences. Safest: with balanced designs all three agree; in unbalanced designs state explicitly which type you used.
6. Worked examples
```r
# R: classical, Welch, Kruskal–Wallis, post-hoc, two-way
library(tidyverse); library(car); library(emmeans); library(rstatix)

set.seed(1)  # reproducible simulation

# Drug dose-response: placebo / low / medium / high (n=20 each)
df <- tibble(
  dose = factor(rep(c("placebo","low","med","high"), each = 20),
                levels = c("placebo","low","med","high")),
  bp   = c(rnorm(20,140,8), rnorm(20,135,8), rnorm(20,130,8), rnorm(20,122,8)))

# --- 1. Classical one-way ANOVA (assumes equal variance) ---
fit <- aov(bp ~ dose, data = df)
summary(fit)                          # F, df, p
summary(lm(bp ~ dose, data = df))     # identical F: ANOVA = regression

# --- 2. Assumption checks on RESIDUALS ---
plot(fit, which = 2)                  # residual QQ plot
shapiro.test(resid(fit))              # residual normality
car::leveneTest(bp ~ dose, data = df, center = median)

# --- 3. Welch ANOVA (default if variances unequal) ---
oneway.test(bp ~ dose, data = df, var.equal = FALSE)

# --- 4. Post-hoc ---
TukeyHSD(fit)                                             # all pairs, FWER
emmeans::emmeans(fit, pairwise ~ dose, adjust = "tukey")
# Dunnett: each vs placebo (control)
emmeans::emmeans(fit, trt.vs.ctrl ~ dose, ref = "placebo")
# Games–Howell for unequal variance
rstatix::games_howell_test(df, bp ~ dose)

# --- 5. Kruskal–Wallis non-parametric ---
kruskal.test(bp ~ dose, data = df)
rstatix::dunn_test(df, bp ~ dose, p.adjust.method = "BH")

# --- 6. Two-way ANOVA + interaction ---
# Hypothetical 2 x 2 drug-by-dose data (simulated; df2 is an assumption for illustration)
df2 <- expand_grid(drug = c("A","B"), dose = c("low","high"), rep = 1:10) %>%
  mutate(across(c(drug, dose), factor), bp = rnorm(n(), 130, 8))
# Type III SS is only meaningful with sum-to-zero contrasts
fit2 <- lm(bp ~ drug * dose, data = df2,
           contrasts = list(drug = "contr.sum", dose = "contr.sum"))
car::Anova(fit2, type = 3)            # Type III SS (unbalanced safe)

# --- 7. Effect size ---
rstatix::anova_summary(fit, effect.size = "pes")   # partial η²
effectsize::omega_squared(fit)                      # ω²
```
```python
# Python: classical, Welch, Kruskal–Wallis, post-hoc, two-way
import numpy as np, pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pingouin as pg               # clean ANOVA + effect size
import scikit_posthocs as sp        # Dunn's test

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "dose": np.repeat(["placebo","low","med","high"], 20),
    "bp": np.concatenate([rng.normal(m, 8, 20) for m in [140,135,130,122]])
})
samples = [df.loc[df.dose == g, "bp"] for g in df.dose.unique()]  # one Series per group

# --- 1. Classical one-way ANOVA ---
fit = ols("bp ~ C(dose)", data=df).fit()
sm.stats.anova_lm(fit, typ=2)
stats.f_oneway(*samples)
# pingouin gives F, df, p and partial η² in one line
pg.anova(data=df, dv="bp", between="dose", effsize="np2")

# --- 2. Assumption checks on residuals ---
stats.shapiro(fit.resid)
stats.levene(*samples, center="median")

# --- 3. Welch ANOVA ---
pg.welch_anova(data=df, dv="bp", between="dose")

# --- 4. Post-hoc ---
pairwise_tukeyhsd(df.bp, df.dose)     # Tukey HSD
pg.pairwise_gameshowell(data=df, dv="bp", between="dose")
pg.pairwise_tests(data=df, dv="bp", between="dose",
                  parametric=True, padjust="holm")

# --- 5. Kruskal–Wallis + Dunn ---
stats.kruskal(*samples)
sp.posthoc_dunn(df, val_col="bp", group_col="dose", p_adjust="fdr_bh")

# --- 6. Two-way + interaction, Type III SS ---
# Hypothetical 2 x 2 drug-by-dose data (simulated; df2 is an assumption for illustration)
df2 = pd.DataFrame({
    "drug": np.tile(np.repeat(["A","B"], 10), 2),
    "dose": np.repeat(["low","high"], 20),
    "bp": rng.normal(130, 8, 40)})
# Sum (effect) coding is needed for Type III SS to be meaningful
fit2 = ols("bp ~ C(drug, Sum) * C(dose, Sum)", data=df2).fit()
sm.stats.anova_lm(fit2, typ=3)
```
7. Six pitfalls
❌ Omnibus ≠ specific contrast
A significant F means "at least one pair differs" — it does not by itself prove "treatment vs placebo works". Run a post-hoc to claim a specific pair; and a pre-planned contrast can still be significant even when the omnibus isn't (Hsu 1996).
❌ Pseudoreplication
3 mice × 4 slices ≠ n=12. Hurlbert 1984's classic paper documents this as the most common error in biological and ecological studies. The independent unit is the mouse; slices are within-subject replicates — fit a mixed model with mouse as a random effect.
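The effect on the sample size is easy to see in a simulation. The sketch below does not fit the mixed model recommended above; it swaps in a simpler remedy, collapsing the slices to one mean per mouse before running the one-way test. The design (2 conditions × 3 mice × 4 slices), the variance settings and all names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical design: 2 conditions x 3 mice x 4 slices per mouse
records = []  # (condition, mouse_id, value)
for cond, mu in (("control", 10.0), ("treated", 12.0)):
    for mouse in range(3):
        mouse_offset = rng.normal(0, 2.0)            # animal-to-animal variation
        for _ in range(4):
            value = mu + mouse_offset + rng.normal(0, 0.5)  # slice-level noise
            records.append((cond, f"{cond}_{mouse}", value))

def by_condition(rows):
    """Group the values of (condition, mouse, value) rows by condition."""
    out = {}
    for cond, _, value in rows:
        out.setdefault(cond, []).append(value)
    return list(out.values())

# Wrong: every slice treated as an independent observation (n = 12 per condition)
f_slices, p_slices = stats.f_oneway(*by_condition(records))

# Simple remedy: one mean per mouse, so n = 3 per condition
mouse_values = {}
for cond, mouse, value in records:
    mouse_values.setdefault((cond, mouse), []).append(value)
mouse_means = [(cond, mouse, float(np.mean(v)))
               for (cond, mouse), v in mouse_values.items()]
f_mice, p_mice = stats.f_oneway(*by_condition(mouse_means))

print(len(records), len(mouse_means))   # 24 slices, but only 6 independent animals
print(p_slices, p_mice)
```

Averaging per animal discards the within-mouse information that a mixed model would use, but it keeps the degrees of freedom honest and is a reasonable first check.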
❌ No multiplicity adjustment after the omnibus
"Omnibus p=0.04 ✓; now run a few t-tests to find which pair is significant" — textbook garden-of-forking-paths. With m pairs, report adjusted p (Tukey/Bonferroni/Holm) and state the method in Methods.
❌ Unbalanced design, SS type unstated
Type I/II/III diverge on unbalanced data (Langsrud 2003). Papers should state the type and how it was computed, e.g. "Type III SS via car::Anova with contr.sum"; otherwise the analysis cannot be reproduced. R's default (Type I) is a frequent trap.
❌ Outliers poison F
SS uses squared distances, so a single extreme point inflates MS_within and kills F. Inspect boxplots and QQ plots; if needed, switch to Kruskal–Wallis or robust ANOVA (Wilcox 2017, WRS2::t1way uses trimmed means).
❌ p>0.05 ≠ no difference
Failing to reject doesn't imply equality — small n or large variance (Type II error) can hide a real difference. Report effect size + 95% CI to tell "truly equal" from "underpowered". For a strong claim of no difference, run an equivalence test.
📝 Self-check
1. You compare cell viability across 4 drug concentrations (n=15/group). Levene's test p=0.02, residual QQ plot roughly linear. The best primary test?
2. A paper writes "ANOVA significant (F=4.2, p=0.01), so treatment beats control". What's the main problem?
3. You compare 6 cytokine treatments against a single untreated control. Most efficient post-hoc?
4. Five brain slices from each of three mice (3×5=15) are fed into ANOVA. The problem is?