Step 13: Mixed-Effects Models — Biostatistics Tutorial

總覽

為什麼「生物 vs 技術重複」是一個統計問題？

實驗室白板上常寫著「n = 60」。但仔細追問：是 60 隻獨立小鼠？還是 3 隻小鼠 × 20 顆細胞？這兩者在統計上不是同一件事。前者的自由度接近 60；後者真正獨立的單位只有 3——做 t 檢定假裝 n = 60，就是 Hurlbert (1984) Ecological Monographs 54:187 所定義的 pseudoreplication（假性重複）。

Hurlbert 1984 那篇論文檢視 156 篇生態學論文，發現近一半存在 pseudoreplication；後續神經科學（Lazic 2010 BMC Neurosci 11:5；Aarts et al. 2014 Nat Neurosci 17:491）、細胞生物學（Lord et al. 2020 J Cell Biol 219:e202001064 SuperPlots）也重複驗證了同一個問題。混合效應模型（mixed-effects models）就是處理這類「巢狀／群聚（nested / clustered）」資料的標準工具：透過 random effect 把 cluster 結構顯式建模，讓 SE 與 p 值反映真正的資訊量。

The lab whiteboard says "n = 60". Ask twice: sixty independent mice, or 3 mice × 20 cells? Statistically these are not the same thing. The first has roughly 60 degrees of freedom; the second has only 3 truly independent units — running a t-test as if n = 60 is the textbook case of pseudoreplication, the term coined by Hurlbert (1984) Ecological Monographs 54:187.

Hurlbert reviewed 156 ecology papers and found nearly half contained pseudoreplication; the same problem has since been documented in neuroscience (Lazic 2010 BMC Neurosci 11:5; Aarts et al. 2014 Nat Neurosci 17:491) and cell biology (Lord et al. 2020 J Cell Biol 219:e202001064, SuperPlots). Mixed-effects models are the standard tool for nested / clustered data: they make the cluster structure explicit through random effects, so SEs and p-values reflect the actual amount of independent information.

💡

核心訊息：當你的資料有「結構」——同一個病人多次測量、同一隻小鼠多顆神經元、同一片組織多個視野、同一塊培養盤多孔位、同一家醫院多病人——觀察值不獨立，必須用混合模型。OLS／普通 t 檢定是「單層模型」，它假設每個觀察值都是 i.i.d.，但生物學幾乎從不滿足這個假設。 The headline: when your data have structure — repeated measures on the same patient, many neurons per mouse, many fields per slice, many wells per plate, many patients per hospital — the observations are not independent, and you need a mixed model. OLS / a vanilla t-test is a "flat" model that assumes i.i.d. observations; biology almost never delivers i.i.d.

核心觀念

一、Fixed vs Random、Intercept vs Slope、ICC

混合模型由三件事組成：固定效應（fixed effects）是你關心、想要做 inference 的參數；隨機效應（random effects）是分群的變異成分，你假設群組是來自某個更大母體的「樣本」；以及殘差變異（residual）。模型形式：y_ij = X_ij β + Z_ij b_i + ε_ij，其中 b_i ~ N(0, G)、ε_ij ~ N(0, σ²)。

A mixed model has three ingredients: fixed effects — the parameters you care about and want to infer; random effects — variance components for grouping variables (you treat the groups as a sample from a wider population); and the residual. The form: y_ij = X_ij β + Z_ij b_i + ε_ij, with b_i ~ N(0, G) and ε_ij ~ N(0, σ²).

🎯

Fixed Effects

你關心的效應，所有 level 都觀察到、想要做點估計與假設檢定。
例：treatment（藥 vs 安慰劑）、time、sex、genotype。
類比：你想知道「這個藥對所有人是否有效」。

The effects you care about — all levels observed, want a point estimate plus a hypothesis test.
e.g. treatment (drug vs placebo), time, sex, genotype.
Analogy: "does the drug work, on average, across people?"

🎲

Random Effects

群組是母體的樣本，你不關心特定 level 的效應，只關心群組之間有多少變異。
例：mouse_id、patient_id、plate_id、clinic、cage。
類比：你不想知道「小鼠 7 號比 8 號高多少」，只想知道「小鼠之間有多少基線差異」。

Groups are a sample from a population; you do not care about individual level effects, only how much they vary.
e.g. mouse_id, patient_id, plate_id, clinic, cage.
Analogy: not "how much does mouse #7 differ from #8", but "how much do mice vary?"

📐

ICC

Intraclass Correlation Coefficient：
ICC = σ²_between / (σ²_between + σ²_within)
同 cluster 內兩觀察值的相關係數。ICC = 0 表完全獨立，ICC → 1 表 cluster 內幾乎相同。神經元 / 細胞層級研究常見 ICC = 0.1–0.5。

Intraclass Correlation Coefficient:
ICC = σ²_between / (σ²_between + σ²_within)
The correlation between two observations from the same cluster. ICC = 0 → fully independent; ICC → 1 → within-cluster nearly identical. Neuron / cell studies often show ICC ≈ 0.1–0.5.

隨機截距 Random Intercept

公式：y ~ x + (1 | mouse)
每隻小鼠有自己的基線（intercept），但反應斜率相同。
適用：個體基線差異大（基礎血壓、基礎表達量），但對處理的反應方向類似。

Formula: y ~ x + (1 | mouse)
Each mouse has its own baseline, but a shared slope.
Use when: baselines differ a lot (baseline BP, baseline expression), but the response to treatment is similar across subjects.

隨機斜率 Random Slope

公式：y ~ time + (1 + time | mouse)
每隻小鼠不僅基線不同，對 time 的反應強度也不同。
適用：個體間反應差異大（drug responder vs non-responder、成長曲線斜率不一）。
限制：每群至少要有多筆觀察（建議 ≥ 5），否則無法估計群組層級的斜率。

Formula: y ~ time + (1 + time | mouse)
Each mouse has its own baseline and its own response strength to time.
Use when: subjects respond very differently (drug responders vs non-responders, varying growth-curve slopes).
Constraint: each group needs multiple observations (≥ 5 is a healthy minimum) — otherwise group-level slopes cannot be identified.

⌜ y_ij = β₀ + β₁·x_ij + b_0i + b_1i·x_ij + ε_ij , b_i ~ N(0, G) , ε ~ N(0, σ²) ⌝ b_0i = 群 i 的截距偏移；b_1i = 群 i 的斜率偏移。當只有 b_0i 時 = 隨機截距；同時有 b_0i 與 b_1i 時 = 隨機斜率。Laird-Ware (1982 Biometrics 38:963) 是現代混合模型的奠基公式。 ⌜ y_ij = β₀ + β₁·x_ij + b_0i + b_1i·x_ij + ε_ij , b_i ~ N(0, G) , ε ~ N(0, σ²) ⌝ b_0i = group-i intercept offset; b_1i = group-i slope offset. With b_0i alone: random intercept. With both b_0i and b_1i: random slope. Laird & Ware (1982 Biometrics 38:963) is the foundational formulation of modern mixed models.

互動模擬 ①

巢狀資料模擬器：OLS vs 混合模型

下面模擬「小鼠 × 細胞」實驗：產生 n_mice 隻小鼠，每隻測 m 顆細胞，真實處理效應為 β。比較兩種分析：(A) OLS（pseudoreplication）把每顆細胞當獨立觀察，會嚴重低估 SE；(B) 混合模型把 mouse_id 加為隨機截距，SE 反映真實的小鼠數。看看當 between-mouse SD 變大、ICC 變高時，兩種分析的差異會放多大。

Below: a "mouse × cell" experiment. Generate n_mice mice, m cells per mouse, with a true treatment effect β. Compare: (A) OLS (pseudoreplication) — treats every cell as independent, severely underestimates SE; (B) mixed model — adds mouse_id as a random intercept, SE reflects the real number of mice. Increase the between-mouse SD or the ICC and watch the gap grow.

小鼠數 n_mice 6

每隻小鼠細胞數 m 20

細胞層 SD 1.0

小鼠層 SD 1.2

真實 β 0.6

每隻小鼠一種顏色，左 = 對照，右 = 處理。橫線 = 該小鼠的細胞 mean。One colour per mouse; left = control, right = treated. Bar = that mouse's cell-level mean.

⚠️

觀察：把 between-mouse SD 從 0.2 拉到 2.5，OLS 的 p 值幾乎不變，但混合模型的 SE 會明顯放大。這就是為什麼用 OLS 分析巢狀資料會給你「假陽性」。極端例子：n_mice = 3、m = 100 時，OLS 「n = 300」聲稱有顯著差異，混合模型可能告訴你 p > 0.3。 What to watch: push between-mouse SD from 0.2 to 2.5: the OLS p barely budges, but the mixed-model SE inflates a lot. That is why OLS on nested data gives you false positives. Extreme case: n_mice = 3, m = 100 — OLS claims "n = 300" and a huge significance; the mixed model can give p > 0.3.

深入討論

二、lme4、REML、df 修正、BLUP、GLMM、GEE、Bayesian

📦 lme4

R 的事實標準。lmer(y ~ treat + (1|mouse))。基於 sparse Cholesky 與 PIRLS 的 ECM 演算法，可以處理交叉（crossed）與巢狀（nested）隨機效應，速度比 nlme 快數十倍。但 lme4 預設不給 p 值——Doug Bates 的立場是：適當的 reference distribution 在混合模型中沒有單一正確答案。

The de facto R standard. lmer(y ~ treat + (1|mouse)). Sparse Cholesky + PIRLS ECM; handles crossed and nested random effects far faster than nlme. By design lme4 does not return p-values — Doug Bates argues there is no single correct reference distribution for mixed models.

📦 nlme

R 的「老牌」混合模型套件，lme() 與 nlme()。比 lme4 慢，但支援相關結構（correlation structures）——AR(1)、exchangeable、compound symmetry，這對縱貫資料 (longitudinal) 非常重要。教科書 Pinheiro & Bates 2000《Mixed-Effects Models in S and S-PLUS》仍是經典。

The classic R package: lme() and nlme(). Slower than lme4 but supports correlation structures — AR(1), exchangeable, compound symmetry — which matter for longitudinal data. The textbook Pinheiro & Bates 2000 Mixed-Effects Models in S and S-PLUS remains the canonical reference.

🧪 REML vs ML

REML (Restricted Maximum Likelihood)：預設，估變異成分時無偏（unbiased），是推論變異 / ICC 的正確方法。
ML (Maximum Likelihood)：估變異會略微低估，但當你要比較固定效應結構不同的兩個模型（用 likelihood ratio test, LRT），必須改成 ML。
實務：lmer(..., REML=TRUE) 報告變異；REML=FALSE 跑 LRT 比模型。

REML: default; produces unbiased variance estimates and is the right choice for inferring variance components / ICC.
ML: slightly biased variance, but required when comparing two models that differ in fixed effects via a likelihood ratio test.
Rule: lmer(..., REML=TRUE) to report; REML=FALSE for LRTs on fixed effects.

🔢 Satterthwaite vs Kenward-Roger

lmer 不直接給 p 值，因為自由度 (df) 在混合模型下沒有清晰定義。近似方案：(1) lmerTest 套件用 Satterthwaite 近似（速度快、適合中大樣本）；(2) pbkrtest 套件用 Kenward-Roger（小樣本最準確，是 Luke 2017 J Memory Lang 強力推薦的方法）。小 n 一律用 KR。

lmer doesn't print p-values because df is not cleanly defined for mixed models. Approximations: (1) lmerTest uses Satterthwaite (fast, fine for medium/large samples); (2) pbkrtest uses Kenward-Roger (most accurate at small n, recommended by Luke 2017 J Memory Lang). Always prefer KR when n is small.

🎁 BLUPs

Best Linear Unbiased Predictor：每個隨機效應 level 的「最佳預測值」。ranef(fit) 取出每隻小鼠的截距偏移。BLUPs 比 group-wise mean 更穩定，因為它向總平均「收縮」（shrinkage）——這是 James-Stein 估計子的近親，也是 Bayesian 視角下的後驗 mode。

Best Linear Unbiased Predictor: the best predicted value for each random-effect level. ranef(fit) returns per-mouse intercept offsets. BLUPs are more stable than per-group means because they shrink toward the grand mean — a cousin of the James-Stein estimator, and the posterior mode in a Bayesian view.

🔁 GLMM

當反應變數不是常態（binary outcome、count）時，用 glmer(y ~ x + (1|cluster), family=binomial)。例：cluster-randomised trial 的二分結果、神經元 spike count（Poisson）、過度離散計數（negative binomial）。算法用 Laplace 近似或 adaptive Gauss-Hermite quadrature。

When the response is not Gaussian (binary, count), use glmer(y ~ x + (1|cluster), family=binomial). Examples: binary outcomes in cluster-RCTs, neuron spike counts (Poisson), overdispersed counts (negative binomial). Fitted with Laplace approximation or adaptive Gauss-Hermite quadrature.

🌐 GEE

Generalized Estimating Equations（Liang & Zeger 1986 Biometrika 73:13）。提供母體層級（population-average / marginal）推論——你想知道「整個母體的平均效應」，而不是「給定 cluster 的條件效應」。使用 sandwich estimator，對 working correlation 的設定有 robust 性質。GLMM 給 subject-specific 估計、GEE 給 population-average，兩者在非線性連結（logit）下會不同。

Generalized Estimating Equations (Liang & Zeger 1986 Biometrika 73:13). Provides population-average (marginal) inference — "the average effect over the whole population", not the conditional effect given a cluster. Uses a sandwich variance estimator, robust to misspecified working correlation. GLMM gives subject-specific, GEE gives population-average estimates; under nonlinear links (logit) the two differ.

🛠 glmmTMB

建立在 Template Model Builder 之上，速度快、支援 lme4 無法處理的家族：zero-inflated、negative binomial（NB1/NB2）、beta、Tweedie、ordered beta、Conway-Maxwell-Poisson。當 lme4 收斂困難或需要 ZINB 時，第一個替代品就是 glmmTMB。

Built on TMB; fast and supports families lme4 cannot: zero-inflated, negative binomial (NB1/NB2), beta, Tweedie, ordered beta, Conway-Maxwell-Poisson. When lme4 will not converge or you need zero-inflated NB, glmmTMB is the first place to look.

🧠 brms / MCMCglmm

brms（Bürkner 2017 J Stat Softw 80:1）是 R 對 Stan 的封裝，公式語法和 lme4 幾乎一致：brm(y ~ x + (1|mouse))，背後是 NUTS-HMC。當隨機效應 level 很少（< 5），lme4 容易給 singular fit / 邊界估計，這時 weakly informative prior 的 Bayesian 模型（brms / MCMCglmm Hadfield 2010）給的後驗區間更可靠。也支援 missing data 與複雜似然。

brms (Bürkner 2017 J Stat Softw 80:1) wraps Stan in lme4-style syntax: brm(y ~ x + (1|mouse)), fitted with NUTS-HMC. When the number of random-effect levels is small (< 5), lme4 often returns singular fits / boundary estimates; a Bayesian fit (brms / MCMCglmm, Hadfield 2010) with weakly informative priors gives more reliable posterior intervals. Also handles missing data and complex likelihoods.

「Three-of-them」實務口訣：(1) lmer + lmerTest + Kenward-Roger 是 80% 場合的最佳組合；(2) 二分／計數結果改 glmer 或 glmmTMB；(3) 隨機效應 level < 5 或 singular fit → 直接上 brms。記住 Gelman-Hill 2006 的名言：「If you have fewer than 5 levels, partial pooling will help you most.」 The "three-of-them" rule: (1) lmer + lmerTest with Kenward-Roger covers 80% of cases; (2) binary / count outcomes → glmer or glmmTMB; (3) fewer than 5 random-effect levels, or a singular fit → go straight to brms. Remember Gelman & Hill 2006: "If you have fewer than 5 levels, partial pooling helps you the most."

互動模擬 ②

ICC 與設計效應：cluster 越大、ICC 越高，樣本需求飆升

在 cluster-randomised trial（醫院 / 班級 / 動物欄）中，實際所需樣本 = 標稱 n × Design Effect，其中 DE = 1 + (m − 1) · ICC，m = 每 cluster 觀察數。當 ICC = 0.05、m = 20，DE ≈ 1.95——你需要兩倍樣本才能達到原本獨立樣本的 power。Donner & Klar 2000 是 cluster-RCT 設計的標準參考。

In a cluster-randomised trial (hospital, classroom, animal-pen unit), required sample = nominal n × Design Effect, where DE = 1 + (m − 1) · ICC and m = observations per cluster. With ICC = 0.05 and m = 20, DE ≈ 1.95 — you need twice the sample for the same power as an independent design. Donner & Klar 2000 is the standard reference for cluster-RCT design.

ICC 0.05

每 cluster 觀察數 m 20

標稱 n 200

🚨

給 clinical trial 設計者：計算 cluster-RCT 樣本時，必須先估計 ICC。常見值：學校層級教育介入 ICC ≈ 0.01–0.10；醫院層級照護品質研究 ICC ≈ 0.01–0.05（Adams et al. 2004）；動物欄 cage effect ICC 可達 0.1–0.3。若忽略 cluster 結構，研究會嚴重 underpowered。 For trialists: when sizing a cluster-RCT, estimate ICC first. Typical values: school-level education interventions ICC ≈ 0.01–0.10; hospital-level quality studies ICC ≈ 0.01–0.05 (Adams et al. 2004); animal-cage effects ICC ≈ 0.1–0.3. Ignoring the cluster structure leaves your study badly underpowered.

決策引導

三、Fixed / Random / GEE / Bayesian——怎麼選？

🌳 混合模型方法選擇樹

Q1:

觀察值彼此獨立嗎？→ 是 → 用普通 OLS / t-test / GLM 即可，不需要混合模型。

Q2:

分群變數的 level 是研究中關心的全部（如 2 種藥）嗎？→ 是 → fixed effect。若 level 是更大母體的樣本（病人 ID、實驗批次）→ random effect。

Q3:

反應變數是常態？→ 是 → lmer（lme4）+ lmerTest / pbkrtest 取得 p 值。→ 否 → glmer / glmmTMB。

Q4:

你關心「母體平均效應」而非「條件效應」（如公衛政策評估）？→ 是 → GEE（geepack / statsmodels GEE）+ exchangeable 或 AR(1) working correlation。

Q5:

隨機效應 level < 5、或 lmer 報 singular fit、或樣本極小？→ 是 → Bayesian（brms / MCMCglmm），用 weakly informative prior 穩定變異成分。

Q6:

縱貫資料，且時間點間相關隨距離衰減？→ 是 → 用 nlme::lme 加 correlation = corAR1()，或 glmmTMB 的 ar1() 結構。

Q1:

Are observations independent? → Yes → plain OLS / t-test / GLM — no mixed model needed.

Q2:

Are the levels of a grouping variable all of interest (e.g. 2 drugs)? → Yes → fixed effect. If levels are a sample from a larger population (patient ID, batch) → random effect.

Q3:

Gaussian response? → Yes → lmer (lme4) + lmerTest / pbkrtest for p-values. → No → glmer / glmmTMB.

Q4:

You want the population-average effect rather than conditional (e.g. public-health policy)? → Yes → GEE (geepack / statsmodels GEE) with exchangeable or AR(1) working correlation.

Q5:

Fewer than 5 random-effect levels, singular fit, or very small n? → Yes → Bayesian (brms / MCMCglmm) with weakly informative priors.

Q6:

Longitudinal data with time-distance-decaying correlation? → Yes → use nlme::lme with correlation = corAR1(), or glmmTMB's ar1() structure.

工具比較

四、五大工具對照

套件	語言	強項	弱項	典型場景
lme4 Bates 2015	R	速度、交叉隨機效應、生態系大	不直接給 p 值；非常態家族有限	linear / logistic / Poisson 混合模型主力	Fast, crossed random effects, huge ecosystem	No p-values out of the box; limited non-Gaussian	Workhorse for linear / logistic / Poisson mixed
nlme Pinheiro-Bates 2000	R	支援 correlation structures（AR1、CS、UN）	速度慢；不支援交叉隨機效應	縱貫資料、藥動力學 (PK/PD)	Supports correlation structures (AR1, CS, UN)	Slow; no crossed random effects	Longitudinal, PK/PD modelling
statsmodels MixedLM	Python	Python 原生、整合 pandas	功能比 lme4 少；速度較慢；GLMM 較弱	Python pipeline 簡單混合模型	Native Python, pandas integration	Fewer features than lme4; slower; weak GLMM	Simple mixed models in a Python pipeline
glmmTMB Brooks 2017	R	Zero-inflated、NB、beta、Tweedie；速度快	學習曲線略陡；診斷工具較少	過度離散計數、零膨脹資料	Zero-inflated, NB, beta, Tweedie; fast	Steeper learning curve; fewer diagnostics	Overdispersed counts, zero-inflated data
brms Bürkner 2017	R (Stan)	任何 likelihood、完整不確定性、小 n 穩定	慢（MCMC）；需要 prior 思考	小樣本、複雜模型、報告 credible interval	Any likelihood, full uncertainty, robust at small n	Slow (MCMC); requires prior thinking	Small samples, complex models, credible intervals

程式碼

五、實作範例

# R: lme4 + lmerTest + pbkrtest + geepack + brms
library(lme4); library(lmerTest)      # Satterthwaite df
library(pbkrtest)                  # Kenward-Roger df (small n)
library(geepack); library(brms)

# --- Linear mixed model: random intercept ---
fit1 <- lmer(expr ~ treat + time + (1 | mouse), data = df,
             REML = TRUE)                    # REML for variance reporting
summary(fit1)                            # lmerTest gives p via Satterthwaite
anova(fit1, ddf = "Kenward-Roger")        # KR df — preferred when n small

# --- Random slope + intercept ---
fit2 <- lmer(expr ~ time + (1 + time | mouse), data = df)
VarCorr(fit2)                            # variance components
ranef(fit2)$mouse                       # BLUPs per mouse
fixef(fit2)                             # fixed-effect estimates

# --- Compare fixed-effect structures: USE ML, not REML ---
m_null <- lmer(expr ~ 1     + (1|mouse), data = df, REML = FALSE)
m_alt  <- lmer(expr ~ treat + (1|mouse), data = df, REML = FALSE)
anova(m_null, m_alt)                     # likelihood ratio test

# --- GLMM: binary outcome with random clinic ---
g <- glmer(recovered ~ drug + age + (1|clinic),
            data = trial, family = binomial)

# --- GEE: population-average estimate ---
gee <- geeglm(recovered ~ drug + age, id = clinic, data = trial,
              family = binomial, corstr = "exchangeable")

# --- ICC from a fitted lmer ---
v <- as.data.frame(VarCorr(fit1))
icc <- v$vcov[1] / sum(v$vcov)

# --- Bayesian fallback for small n / singular fits ---
fit_b <- brm(expr ~ treat + (1|mouse), data = df,
             prior = prior(student_t(3, 0, 2.5), "sd"),
             chains = 4, cores = 4)

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.families import Binomial

# --- Linear mixed model: random intercept ---
m = smf.mixedlm('expr ~ treat + time', df, groups=df['mouse'])
fit = m.fit(reml=True)
print(fit.summary())

# --- Random slope: vc_formula or re_formula ---
m2 = smf.mixedlm('expr ~ time', df, groups=df['mouse'],
                   re_formula='~time')
fit2 = m2.fit()
print(fit2.random_effects)         # BLUPs per mouse
print(fit2.fe_params)              # fixed effects

# --- ICC ---
sig2_b = fit.cov_re.iloc[0, 0]
sig2_w = fit.scale
icc = sig2_b / (sig2_b + sig2_w)

# --- GLMM: BinomialBayesMixedGLM (limited) ---
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM
glmm = BinomialBayesMixedGLM.from_formula(
    'recovered ~ drug + age', {'clinic': '0 + C(clinic)'}, trial)
res = glmm.fit_vb()

# --- GEE: population-average ---
gee = GEE.from_formula('recovered ~ drug + age', 'clinic', trial,
                       family=Binomial(),
                       cov_struct=sm.cov_struct.Exchangeable())
gee_fit = gee.fit()
print(gee_fit.summary())

💡

實作小撇步：(1) Always plot residuals(fit) vs fitted——看是否還有殘留結構；(2) ranef 取出 BLUPs 後畫 caterpillar plot（dotplot(ranef(fit))），檢查極端 outlier group；(3) performance::icc() 一行算 conditional 與 marginal ICC（Nakagawa-Schielzeth 2013 Methods Ecol Evol 4:133）。 Practical tips: (1) Always plot residuals(fit) vs fitted to check for leftover structure; (2) pull BLUPs with ranef and draw a caterpillar plot (dotplot(ranef(fit))) to spot outlier groups; (3) performance::icc() gives conditional and marginal ICC in one line (Nakagawa-Schielzeth 2013 Methods Ecol Evol 4:133).

常見陷阱

六、六大陷阱

❌ Pseudoreplication

Hurlbert 1984「假裝相關資料是獨立」是這個領域最古老、最普遍的錯誤。電生理（多細胞 / 單鼠）、組織學（多切片 / 單樣本）、影像（多視野 / 單皿）皆是高風險區。解方：cluster 變數一律放成 random effect。Lazic 2010 用一篇 BMC Neurosci 整篇文章證明：神經科學論文中此錯誤普及率高得令人擔憂。

Hurlbert 1984's "treating correlated data as independent" remains the oldest and most common offence. High-risk zones: electrophysiology (many cells / one mouse), histology (many slices / one block), imaging (many fields / one dish). Fix: any clustering variable becomes a random effect. Lazic 2010 BMC Neurosci showed the problem is widespread in neuroscience papers.

❌ Singular fit

lme4 警告 boundary (singular) fit: see ?isSingular，通常表示某個變異成分被估到 0、或隨機斜率與截距完美相關。常見原因：random-effect level 太少（< 5）、模型過度複雜（random slope 但每群觀察太少）、資料的真實 ICC 確實接近 0。解方：簡化隨機結構，或改用 brms（Bayesian 自然處理邊界）。

lme4 warns boundary (singular) fit: see ?isSingular — usually a variance component estimated at zero or an intercept-slope correlation pinned at ±1. Causes: too few random-effect levels (< 5), an over-rich random structure (random slope with few obs per group), or a genuinely tiny ICC. Fix: simplify the random structure, or switch to brms (Bayesian handles boundaries naturally).

❌ REML 比較固定效應

REML 的 likelihood 與固定效應設定有關，用 REML 跑 LRT 比較不同固定效應結構是錯的。改用 REML = FALSE（ML）來比較固定效應；要報告變異成分時再切回 REML。anova(m_null, m_alt) 在 lme4 中會自動 refit 到 ML，但別忽視這個細節。

The REML likelihood depends on the fixed-effects specification, so using REML for a LRT across different fixed-effect models is invalid. Switch to REML = FALSE (ML) for fixed-effects comparisons; switch back to REML when reporting variance components. anova(m_null, m_alt) in lme4 quietly refits to ML — don't ignore the detail.

❌ 「lmer 不給 p」

原生 lme4 的 summary 不印 p 值——但很多論文卻寫了「p < 0.001」。必須附上方法：Satterthwaite (lmerTest) 或 Kenward-Roger (pbkrtest)，或用 LRT、parametric bootstrap、profile CI。Luke 2017 J Memory Lang 強烈推薦 KR 為小樣本黃金標準。

The native summary(lmer) prints no p-values — yet papers cheerfully report "p < 0.001". State the method: Satterthwaite (lmerTest), Kenward-Roger (pbkrtest), LRT, parametric bootstrap, or profile CI. Luke 2017 J Memory Lang strongly recommends KR as the small-sample gold standard.

❌ 收斂警告被忽視

Model failed to converge、maxfun=2*10^4 警告經常被當「擺著就好」。但這通常代表優化器沒走到 ML/REML 真正的極值，估計值不可靠。處置：(1) 換 optimizer：control = lmerControl(optimizer = "bobyqa")；(2) 中心化／標準化 predictor；(3) 簡化隨機結構；(4) 改 brms / glmmTMB。

Model failed to converge and maxfun=2*10^4 warnings are routinely "left in place". They usually mean the optimiser did not reach the actual ML/REML optimum, so estimates are unreliable. What to do: (1) try another optimiser: control = lmerControl(optimizer = "bobyqa"); (2) centre / scale predictors; (3) simplify the random structure; (4) switch to brms / glmmTMB.

❌ 忽略交叉隨機效應

當「受試者」與「項目（item / stimulus）」彼此都重複（如 psycholinguistic 實驗），必須同時放 random effect：y ~ x + (1|subject) + (1|item)。Baayen-Davidson-Bates 2008 J Memory Lang 是這個觀念的標誌。多個生物實驗（trial × cell line、reader × image）也適用。lme4 對交叉結構支援良好，nlme 不行。

When "subjects" and "items" both repeat (e.g. psycholinguistics, reader × image, trial × cell line) you must include both as random effects: y ~ x + (1|subject) + (1|item). Baayen, Davidson & Bates 2008 J Memory Lang is the landmark paper. lme4 supports crossed random effects natively; nlme does not.

陷阱：random effect 不是萬靈丹「我加了 random effect，所以小 n 沒問題」——錯。隨機效應可以正確處理結構，但無法憑空產生資訊。當你只有 3 隻小鼠，混合模型誠實地告訴你「自由度只有 ~2」，這正是它的價值。若連 2 自由度都不夠做你想做的 inference，真正的答案是再做幾隻小鼠，不是換另一個模型。 "I added a random effect, so n=3 is fine" — no. Random effects handle structure correctly, but they cannot manufacture information. With three mice, the mixed model honestly tells you "df ≈ 2" — that is its value. If 2 df is not enough for your inference, the real answer is more mice, not a different model.

真實案例

七、四個典型場景

🧬 縱貫病人測量

50 位糖尿病患者，每位每月測 HbA1c 共 12 個月。模型：HbA1c ~ time + treat + time:treat + (1 + time | patient)。time:treat 是「處理改變斜率」的 interaction；random slope 容許每位患者自己的血糖控制軌跡。

50 diabetes patients, HbA1c measured monthly for 12 months. Model: HbA1c ~ time + treat + time:treat + (1 + time | patient). The time:treat interaction captures the treatment effect on the slope; the random slope lets each patient have their own glycaemic trajectory.

🏥 Cluster RCT

20 家診所被隨機分配「新照護模式 vs 標準」，每家 100 位病人。模型：outcome ~ arm + age + (1 | clinic)；二分結果用 glmer binomial 或 GEE exchangeable。樣本量計算必須乘以 design effect（見模擬 ②）。

20 clinics randomised to "new care model vs standard", 100 patients per clinic. Model: outcome ~ arm + age + (1 | clinic); binary outcome with glmer binomial or GEE exchangeable. Sample-size calculation must multiply by the design effect (see Sim ②).

🧠 多電極神經記錄

5 隻大鼠，每隻 32 通道，每通道記錄 ~200 神經元 spike rate。三層巢狀：rate ~ stim + (1 | rat / channel / unit)。完美對應 Aarts 2014 Nat Neurosci 的警告：別把 32 × 200 當 6400 個獨立樣本。

5 rats, 32 channels each, ~200 neuron spike rates per channel. Three-level nesting: rate ~ stim + (1 | rat / channel / unit). Exactly the scenario Aarts et al. 2014 Nat Neurosci warn about — don't treat 32 × 200 as 6,400 independent samples.

🌐 多地點基因體研究

5 個中心，每個中心 200 個樣本，做 RNA-seq 表達量分析。模型：expression ~ disease + age + sex + (1 | center)，或把 center 與 batch 雙隨機效應 + (1 | batch)。這也是消除批次效應 (batch effect) 的補充工具，跟 ComBat / sva 互補。

5 centres, 200 RNA-seq samples each. Model: expression ~ disease + age + sex + (1 | center), or include both centre and batch + (1 | batch). Complements ComBat / sva for batch-effect handling.

📝 自我檢測

1. 你用 5 隻小鼠，每隻取 20 顆神經元做電生理。最合適的分析是？

1. You record 20 neurons each from 5 mice in electrophysiology. The right analysis is?

A. 把 100 顆神經元當獨立樣本做 t 檢定（n = 100）A. Pool 100 neurons as independent and run a t-test (n = 100)

B. 把神經元按 mouse 平均後做 t 檢定（n = 5），其他全部丟掉B. Average to per-mouse means and run a t-test (n = 5), discarding the rest

C. 用混合模型 y ~ treat + (1 | mouse)，把 mouse 設為隨機效應C. Fit a mixed model y ~ treat + (1 | mouse) with mouse as a random effect

D. 改用非參數檢定就好D. Just switch to a non-parametric test

2. 同事用 lmer 跑 y ~ x1 + (1|group) 與 y ~ x1 + x2 + (1|group)，皆用預設 REML，然後 anova() 比較。正確嗎？

2. A colleague fits y ~ x1 + (1|group) and y ~ x1 + x2 + (1|group) in lmer with default REML, then compares them with anova(). Correct?

A. 完全正確，REML 是預設A. Completely correct, REML is the default

B. 不對。比較固定效應應該用 ML（REML = FALSE）；lme4 的 anova 會自動 refit 但要明白為什麼B. Wrong. Compare fixed effects under ML (REML = FALSE); lme4's anova refits automatically but you should know why

C. 應該用 Bayesian factorC. Should use a Bayes factor

D. 應該改成 GEED. Should switch to GEE

3. ICC = 0.1，每 cluster 21 人。Design effect 與「為了相同 power 所需樣本」的關係是？

3. ICC = 0.1, 21 per cluster. What's the design effect and its impact on required sample size?

A. DE = 1.0，跟獨立樣本一樣A. DE = 1.0, same as independent

B. DE = 1.21，輕微增加B. DE = 1.21, slight increase

C. DE = 2.10，但對 power 沒影響C. DE = 2.10, but no impact on power

D. DE = 1 + 20×0.1 = 3.0，需要 3 倍樣本D. DE = 1 + 20×0.1 = 3.0; need 3× the sample

4. 在分析 cluster-RCT 的二分結果時，你想要「整體母體的處理效應」而非「條件於 clinic 的效應」。最合適的方法？

4. Analysing a cluster-RCT binary outcome and you want the "population-average" effect rather than the conditional-on-clinic effect. Best method?

A. 普通 logistic regression，忽略 clinicA. Plain logistic regression, ignoring clinic

B. glmer（GLMM）binomial + 隨機截距B. glmer (GLMM) binomial with random intercept

C. GEE，exchangeable working correlation + sandwich SEC. GEE with exchangeable working correlation + sandwich SE

D. 用 chi-square 檢定即可D. A chi-square test will do