ANOVA 是線性模型的特例
「為什麼分開教 ANOVA?」其實 ANOVA = 用類別變數做 X 的線性迴歸。F 檢定把總變異拆解成「組間 + 組內」,問「組間變異是否顯著大於組內」。一旦理解這個身分,ANCOVA(加入連續共變數)、repeated measures、mixed models都是這個框架的自然延伸。
真正能讓 bioinformatician 與其他人拉開差距的是 混合效應模型 (mixed models):當資料有批次、subject、複製等多層結構時,固定效應模型要嘛假裝它們不存在(錯)、要嘛吃掉所有自由度(浪費),而混合模型用隨機效應把這些變異來源建模成分布,達成部分匯聚 (partial pooling)。
"Why teach ANOVA separately?" — in fact ANOVA = linear regression with categorical X. The F-test partitions variance into "between + within" and asks if between exceeds within. Once you see this identity, ANCOVA (add continuous covariates), repeated measures, and mixed models all become natural extensions.
What separates bioinformaticians from the pack is mixed-effects modeling: when data have batch, subject, replicate structure, a purely fixed-effects model either ignores them (wrong) or burns df on them (wasteful). Mixed models treat these variance sources as random effects drawn from a distribution — yielding partial pooling.
一、從 ANOVA 到混合模型
ANOVA 與 SS 類型
- F = MSbetween / MSwithin
- Type I(序列):依輸入順序,結果隨順序變
- Type II(邊際):適合無交互作用
- Type III(part. SS):失衡資料的標準選擇 (
car::Anova(., type=3))
- F = MSbetween / MSwithin
- Type I (sequential): order-dependent
- Type II (marginal): for no-interaction models
- Type III (partial SS): standard for unbalanced designs (
car::Anova(., type=3))
ANCOVA:調整後比較
lm(y ~ group + covariate):在共變數下比較組均值- 關鍵:先檢查 group × covariate 交互是否顯著(平行斜率假設)
- 批次、性別、年齡、library size 都是 ANCOVA 的常見共變數
lm(y ~ group + covariate): compare group means at common covariate value- Always check group × covariate interaction first (parallel-slopes assumption)
- Batch, sex, age, library size — typical ANCOVA covariates
隨機 vs 固定
- 固定:估計每個 level;隨機:估計 levels 的變異 σ²
- 隨機允許「未來新 level」的推廣
- 常見隨機效應:batch、subject、site、family
- 公式:
(1|batch)隨機截距;(slope|subject)隨機斜率
- Fixed: estimate each level; random: estimate variance σ² of levels
- Random enables generalization to new levels
- Typical random effects: batch, subject, site, family
- Formula:
(1|batch)random intercept;(slope|subject)random slope
部分匯聚
- No pooling = 每組獨立估計(雜訊大)
- Complete pooling = 全體共用均值(偏誤大)
- Partial pooling = 各組均值朝整體均值收縮,收縮量依各組樣本數與群間變異決定
- 這就是 BLUP(Best Linear Unbiased Predictor)
- No pooling = each group estimated independently (high variance)
- Complete pooling = one grand mean (high bias)
- Partial pooling = group means shrunk toward the grand mean, by amount depending on n and between-group variance
- This is the BLUP (Best Linear Unbiased Predictor)
lme4::lmer 故意不給 p-value——分母自由度在不平衡資料下沒有閉式解。解法:用 lmerTest (Satterthwaite) 或 pbkrtest (Kenward-Roger)。③ singular fit 警告:某個變異成分被估在 0 邊界——簡化隨機結構(移除隨機斜率或互動)。lme4::lmer deliberately withholds p-values — there is no closed-form denominator df in unbalanced designs. Fix: lmerTest (Satterthwaite) or pbkrtest (Kenward-Roger). ③ Singular fit warning: a variance component is estimated at zero — simplify the random structure (drop random slopes or correlations).三種匯聚方式視覺化
同一筆 6 組資料(每組 n=5)用三種方式估計組均值:灰色=資料;紅=各組獨立均值(no pooling);綠=整體均值(complete pooling);藍=混合模型 BLUP(partial pooling)。拖動滑桿改變群間 vs 群內變異比 τ²/σ²——當 τ²/σ² 大(組真的不同),BLUP 接近 no-pooling;當 τ²/σ² 小(組差異是雜訊),BLUP 強烈收縮到整體均值。
Same 6 groups (n=5 each) estimated three ways: gray=data; red=independent group means (no pooling); green=grand mean (complete pooling); blue=mixed-model BLUP (partial pooling). The slider controls between/within variance ratio τ²/σ²: when τ²/σ² is large (groups truly differ), BLUP nears no-pooling; when small (group differences are noise), BLUP shrinks hard toward the grand mean.
X: 組別;Y: 估計均值
二、生物資訊中的應用
| 情境 | 模型 | 工具 | |||
|---|---|---|---|---|---|
| RNA-seq 技術複製(同一樣本多 lane) | (1|sample) | limma::duplicateCorrelation | RNA-seq tech replicates (sample × lane) | (1|sample) | limma::duplicateCorrelation |
| 多批次 RNA-seq | ~ treatment + (1|batch) | variancePartition::dream | Multi-batch RNA-seq | ~ treatment + (1|batch) | variancePartition::dream |
| scRNA-seq 多 donor(細胞嵌套 donor) | (1|donor) | pseudobulk + DESeq2; muscat; NEBULA | scRNA-seq multi-donor (cells nested in donor) | (1|donor) | pseudobulk + DESeq2; muscat; NEBULA |
| 縱貫資料(同 subject 多時間點) | (time|subject) | lme4; nlme | Longitudinal (subjects × time) | (time|subject) | lme4; nlme |
| GWAS 親緣矩陣 | linear mixed model + kinship K | GEMMA; SAIGE | GWAS with kinship | LMM + kinship K | GEMMA; SAIGE |
實作:ANOVA、ANCOVA、lmer
# --- R --- ANOVA / ANCOVA / mixed model 全套 library(car); library(lme4); library(lmerTest); library(emmeans); library(performance) # 1) 單因子 ANOVA(與 lm 等價) fit_a <- aov(y ~ group, data = df) summary(fit_a) # 不平衡 → Type III SS car::Anova(lm(y ~ group * sex, data = df), type = "III") # 2) ANCOVA:加共變數,先檢查交互 anova(lm(y ~ group * cov, data=df), lm(y ~ group + cov, data=df)) fit_c <- lm(y ~ group + cov, data = df) # 平行斜率成立 → 用此 # 3) Mixed model:固定 treatment、隨機 batch + subject m <- lme4::lmer(y ~ treatment + (1|batch) + (1|subject), data = df, REML = TRUE) summary(m) # lmerTest 提供 Satterthwaite df 與 p anova(m_null, m) # LR 檢定 (改 REML=FALSE) # 4) 估計邊際均值 (EMM) + Tukey 比較 emmeans::emmeans(m, pairwise ~ treatment, adjust = "tukey") # 5) 模型診斷 performance::check_model(m) # 6 張全面診斷圖
# --- Python --- import statsmodels.formula.api as smf from statsmodels.stats.anova import anova_lm from statsmodels.regression.mixed_linear_model import MixedLM import pingouin as pg # 1) Type III ANOVA fit = smf.ols("y ~ C(group) * C(sex)", data=df).fit() anova_lm(fit, typ=3) # 2) ANCOVA fit_c = smf.ols("y ~ C(group) + cov", data=df).fit() # 3) Mixed model — random intercept on batch m = MixedLM.from_formula("y ~ treatment", groups="batch", re_formula="~time", data=df).fit(reml=True) print(m.summary()) # 4) 重複測量 ANOVA pg.mixed_anova(data=df, dv="y", within="time", between="group", subject="id") # 5) 跨平台:pymer4 把 lme4 帶到 Python # from pymer4.models import Lmer # Lmer("y ~ treatment + (1|batch)", data=df).fit()
(time|subject) 但只有 2 個時間點——隨機斜率無法識別。
Common mistakes: ① Treating batch as fixed — wastes k−1 df, and conclusions don't generalize to new batches. ② Ignoring nesting (multiple measurements per subject as independent) — pseudoreplication; p-values vastly anti-conservative; a silent killer of omics studies. ③ Writing (time|subject) with only 2 time points — random slope not identifiable.
📝 自我檢測
1. 你的研究有 8 個 RNA-seq batch,希望結論推廣到「未來再做的 batch」。batch 應怎麼建模?
1. Your study has 8 RNA-seq batches and you want conclusions to generalize to future batches. How should batch enter the model?
2. 部分匯聚(partial pooling)解決了什麼問題?
2. What problem does partial pooling solve?
3. 為什麼 lme4::lmer 預設不給 p-value?
3. Why does lme4::lmer by default refuse to print p-values?