為什麼線性迴歸仍是脊椎?
線性迴歸 y = Xβ + ε 是 t 檢定、ANOVA、ANCOVA 的統一框架;limma 的 lmFit 對每個基因擬合的就是一個線性模型。不是「過時的方法」,而是 GLM、mixed model、Bayesian regression 的基底——所有複雜模型都是它的推廣。
OLS 之所以強大,是因為在 Gauss-Markov 條件下(線性、外生性、同質變異、無多重共線性),最小平方估計量是 BLUE(Best Linear Unbiased Estimator)——在所有線性無偏估計中變異最小。但當假設破壞時,估計仍可能無偏,標準誤卻會錯——這正是為什麼診斷比點估計重要。
Linear regression y = Xβ + ε is the unifying frame for t-test, ANOVA, and ANCOVA; limma's lmFit fits exactly such a model per gene. It is not an obsolete method but the substrate on which GLM, mixed, and Bayesian regression are built.
OLS is powerful because under the Gauss-Markov conditions (linearity, exogeneity, homoscedasticity, no perfect collinearity), least squares is BLUE — minimum variance among all linear unbiased estimators. When assumptions fail, point estimates often remain unbiased but standard errors get wrong — which is why diagnostics matter more than coefficients themselves.
一、OLS、診斷與穩健化
OLS 與 Gauss-Markov
- 正規方程式:
β̂ = (XᵀX)⁻¹ Xᵀy - BLUE 條件:線性、獨立、同質變異、外生性
- 常態誤差 → t / F 分布;無常態 → 仍可用 CLT 近似
- R² 是樣本內擬合度——不等於預測力
- Normal equations:
β̂ = (XᵀX)⁻¹ Xᵀy - BLUE conditions: linearity, independence, homoscedasticity, exogeneity
- Normal errors → t / F distributions; non-normal → CLT-based approximation
- R² is in-sample fit — not predictive power
四張殘差診斷圖
- Residuals vs Fitted:曲線樣式 → 非線性
- QQ plot:尾部偏離 → 非常態 / 重尾
- Scale-Location:開口形 → 異質變異
- Residuals vs Leverage:Cook's 距離 → 影響點
- Residuals vs Fitted: curvature → non-linearity
- QQ plot: tail deviation → non-normal / heavy tails
- Scale-Location: funnel shape → heteroscedasticity
- Residuals vs Leverage: Cook's distance → influential points
多重共線性與 VIF
- VIFj = 1/(1−R²j);VIF > 5–10 警示
- 不偏係數,但膨脹標準誤 → t 變小、p 變大
- 解法:刪除冗餘變數、合併、PCA 或 ridge 正則化
- VIFj = 1/(1−R²j); VIF > 5–10 is a red flag
- Does not bias coefficients, but inflates SE → smaller t, larger p
- Fix: drop redundancy, combine, PCA, or ridge regularization
影響點
- Leverage hii:X 空間中的「邊角」位置
- Cook's D = (殘差²) × (hii/(1-hii)²)
- DFBETA:個別點對單一係數的影響
- 規則:Cook > 4/n 或 > 1 警示
- Leverage hii: corner position in X-space
- Cook's D combines residual size × hii/(1-hii)²
- DFBETA: point's influence on a single coefficient
- Rule of thumb: Cook > 4/n or > 1 flags
bptest(Breusch-Pagan)拒絕同質變異,不要盲目對數轉換 y——這會改變模型解釋。先試 HC3 (White / sandwich) 穩健標準誤:係數不變,但 SE / CI / p-value 都對異質變異穩健。若異質變異與結構有關(如 RNA-seq 計數的 mean-variance),就應走 GLM(NB、Gamma)而非 OLS+轉換。bptest rejects homoscedasticity, don't reflexively log-transform y — that changes the model. First try HC3 (White / sandwich) robust SE: coefficients unchanged, but SE / CI / p-values now valid under heteroscedasticity. If the variance structure is intrinsic (e.g. RNA-seq mean-variance), move to a GLM (NB, Gamma) rather than OLS-plus-transform.影響點如何扭曲迴歸線
下圖中有一個紅色可拖曳點——用滑鼠拖動它(或調滑桿改變其 x / y 位置),觀察迴歸線斜率、R²、與 Cook's 距離如何即時變化。當該點同時具有高 leverage(離 x̄ 遠)與大殘差時,Cook's D 飆升——這就是影響點的數學定義。
The plot below has a red draggable point — move it via the sliders (or click to set) and watch the regression line, R², and Cook's distance update live. When the point has both high leverage (far from x̄) AND large residual, Cook's D spikes — the mathematical definition of an influential point.
灰點:固定資料;紅點:可拖曳;藍線:含紅點的 OLS 擬合
二、殘差診斷流程
🌳 OLS 診斷與修正路徑
| 違反 | 徵兆 | 修正 | |||
|---|---|---|---|---|---|
| 非線性 | 曲線殘差 | spline / GAM | Non-linearity | curved residuals | spline / GAM |
| 異質變異 | 喇叭殘差 | HC3 / WLS / GLM | Heteroscedasticity | funnel residuals | HC3 / WLS / GLM |
| 多重共線性 | VIF 高、SE 膨脹 | drop / ridge | Multicollinearity | high VIF, inflated SE | drop / ridge |
| 影響點 | Cook 飆升 | rlm / 檢查資料 | Influential point | Cook spike | rlm / inspect data |
| 相關殘差 | Durbin-Watson、批次 | mixed model | Correlated errors | Durbin-Watson, batch | mixed model |
實作:擬合、診斷、穩健 SE
# --- R --- OLS 全流程 library(broom); library(car); library(lmtest); library(sandwich) # 1) 擬合與整理 fit <- lm(y ~ x1 + x2 + x3, data = df) summary(fit) broom::tidy(fit, conf.int = TRUE) # 整潔的係數表 broom::glance(fit) # R², AIC, p, df # 2) 四張診斷圖(plot.lm 一鍵呼叫) par(mfrow = c(2, 2)); plot(fit) # 3) 多重共線性 car::vif(fit) # > 5–10 警示 # 4) 影響點:Cook、DFBETA、leverage car::influencePlot(fit) # 互動式氣泡圖 cooks.distance(fit) |> which(. > 4/nrow(df)) # 5) 異質變異 → HC3 穩健 SE lmtest::bptest(fit) # Breusch-Pagan lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC3"))
# --- Python --- import statsmodels.formula.api as smf import statsmodels.api as sm from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence # 1) 擬合 fit = smf.ols("y ~ x1 + x2 + x3", data = df).fit() print(fit.summary()) # 2) 殘差診斷圖 import statsmodels.graphics.regressionplots as rp rp.plot_regress_exog(fit, "x1") # 3) VIF X = fit.model.exog [variance_inflation_factor(X, i) for i in range(X.shape[1])] # 4) Cook、leverage infl = fit.get_influence() cooks_d, _ = infl.cooks_distance hii = infl.hat_matrix_diag # 5) HC3 穩健 SE robust = fit.get_robustcov_results(cov_type="HC3") print(robust.summary())
step() 做變數選擇後,再用同一筆資料報告 p-value——係數與 p 都被嚴重偏估(post-selection inference)。使用懲罰式方法(lasso、ridge)或事前指定模型。② 在 genomics 中看到 R² > 0.9——多半是樣本洩漏或批次混雜,不是「擬合很好」。
Common mistakes: ① Using step() then reporting p-values on the same data — both coefficients and p-values are heavily biased (post-selection inference). Use penalized methods (lasso, ridge) or pre-specified models. ② Seeing R² > 0.9 in genomics — usually sample leakage or batch confounding, not "great fit".
三、limma = 百萬個線性模型
limma 對 每個基因 擬合 lm(expression ~ design),再做 empirical Bayes 把所有基因的殘差變異 shrunk 到共同先驗——這就是 moderated t-statistic。所以本章的所有診斷(殘差、共線性、影響點)都直接適用於 limma,只不過從一張圖變成「每個基因都要關心」的尺度問題。實務上看 limma::plotSA(mean-variance)與 plotMD(每基因擬合)取代 plot.lm。
limma fits lm(expression ~ design) for each gene, then empirical-Bayes shrinks residual variances to a common prior — yielding the moderated t-statistic. Every diagnostic in this chapter (residuals, collinearity, influence) applies directly to limma — just at scale. Use limma::plotSA (mean-variance) and plotMD (per-gene fit) as the genome-wide counterparts of plot.lm.
📝 自我檢測
1. 異質變異違反了哪條 OLS 假設?以下哪個是最佳第一線修正?
1. Which OLS assumption does heteroscedasticity violate, and what is the best first-line fix?
2. 關於高 VIF(多重共線性),下列何者正確?
2. About high VIF (multicollinearity), which is correct?
3. Cook's 距離大表示什麼?
3. A large Cook's distance indicates what?