STEP 6 / 17

線性迴歸與診斷

OLS 是所有統計建模的母體——理解它的假設、診斷與穩健化,是進入 limma、GLM、混合模型的必要條件。

OLS is the parent of all statistical models — mastering its assumptions, diagnostics, and robust variants is the gateway to limma, GLMs, and mixed models.

為什麼線性迴歸仍是脊椎?

線性迴歸 y = Xβ + ε 是 t 檢定、ANOVA、ANCOVA 的統一框架;limma 的 lmFit 對每個基因擬合的就是一個線性模型。不是「過時的方法」,而是 GLM、mixed model、Bayesian regression 的基底——所有複雜模型都是它的推廣。

OLS 之所以強大,是因為在 Gauss-Markov 條件下(線性、外生性、同質變異、無多重共線性),最小平方估計量是 BLUE(Best Linear Unbiased Estimator)——在所有線性無偏估計中變異最小。但當假設破壞時,估計仍可能無偏,標準誤卻會錯——這正是為什麼診斷比點估計重要。

Linear regression y = Xβ + ε is the unifying frame for t-test, ANOVA, and ANCOVA; limma's lmFit fits exactly such a model per gene. It is not an obsolete method but the substrate on which GLM, mixed, and Bayesian regression are built.

OLS is powerful because under the Gauss-Markov conditions (linearity, exogeneity, homoscedasticity, no perfect collinearity), least squares is BLUE — minimum variance among all linear unbiased estimators. When assumptions fail, point estimates often remain unbiased but standard errors get wrong — which is why diagnostics matter more than coefficients themselves.

💡
核心原則:「檢查 y 是否常態」是常見誤解——OLS 不需要 y 常態,只在做推論(CI、p-value)時需要殘差近似常態,而且 CLT 在 n 夠大時會自動處理。真正必檢的是:殘差 vs 擬合的系統性樣式同質變異、與影響點 Core principle: "Check if y is normal" is a textbook misconception. OLS does not require y to be normal — only the residuals should be approximately normal for inference (CIs, p-values), and CLT handles that when n is reasonable. The real diagnostics are: systematic patterns in residuals vs fitted, homoscedasticity, and influence.

一、OLS、診斷與穩健化

📐

OLS 與 Gauss-Markov

  • 正規方程式:β̂ = (XᵀX)⁻¹ Xᵀy
  • BLUE 條件:線性、獨立、同質變異、外生性
  • 常態誤差 → t / F 分布;無常態 → 仍可用 CLT 近似
  • R² 是樣本內擬合度——不等於預測力
  • Normal equations: β̂ = (XᵀX)⁻¹ Xᵀy
  • BLUE conditions: linearity, independence, homoscedasticity, exogeneity
  • Normal errors → t / F distributions; non-normal → CLT-based approximation
  • R² is in-sample fit — not predictive power
📊

四張殘差診斷圖

  • Residuals vs Fitted:曲線樣式 → 非線性
  • QQ plot:尾部偏離 → 非常態 / 重尾
  • Scale-Location:開口形 → 異質變異
  • Residuals vs Leverage:Cook's 距離 → 影響點
  • Residuals vs Fitted: curvature → non-linearity
  • QQ plot: tail deviation → non-normal / heavy tails
  • Scale-Location: funnel shape → heteroscedasticity
  • Residuals vs Leverage: Cook's distance → influential points
🔗

多重共線性與 VIF

  • VIFj = 1/(1−R²j);VIF > 5–10 警示
  • 不偏係數,但膨脹標準誤 → t 變小、p 變大
  • 解法:刪除冗餘變數、合併、PCA 或 ridge 正則化
  • VIFj = 1/(1−R²j); VIF > 5–10 is a red flag
  • Does not bias coefficients, but inflates SE → smaller t, larger p
  • Fix: drop redundancy, combine, PCA, or ridge regularization
⚖️

影響點

  • Leverage hii:X 空間中的「邊角」位置
  • Cook's D = (殘差²) × (hii/(1-hii)²)
  • DFBETA:個別點對單一係數的影響
  • 規則:Cook > 4/n 或 > 1 警示
  • Leverage hii: corner position in X-space
  • Cook's D combines residual size × hii/(1-hii
  • DFBETA: point's influence on a single coefficient
  • Rule of thumb: Cook > 4/n or > 1 flags
⚠️
異質變異與穩健 SE:bptest(Breusch-Pagan)拒絕同質變異,不要盲目對數轉換 y——這會改變模型解釋。先試 HC3 (White / sandwich) 穩健標準誤:係數不變,但 SE / CI / p-value 都對異質變異穩健。若異質變異與結構有關(如 RNA-seq 計數的 mean-variance),就應走 GLM(NB、Gamma)而非 OLS+轉換。
Heteroscedasticity & robust SE: if Breusch-Pagan bptest rejects homoscedasticity, don't reflexively log-transform y — that changes the model. First try HC3 (White / sandwich) robust SE: coefficients unchanged, but SE / CI / p-values now valid under heteroscedasticity. If the variance structure is intrinsic (e.g. RNA-seq mean-variance), move to a GLM (NB, Gamma) rather than OLS-plus-transform.

影響點如何扭曲迴歸線

下圖中有一個紅色可拖曳點——用滑鼠拖動它(或調滑桿改變其 x / y 位置),觀察迴歸線斜率、R²、與 Cook's 距離如何即時變化。當該點同時具有高 leverage(離 x̄ 遠)與大殘差時,Cook's D 飆升——這就是影響點的數學定義。

The plot below has a red draggable point — move it via the sliders (or click to set) and watch the regression line, R², and Cook's distance update live. When the point has both high leverage (far from x̄) AND large residual, Cook's D spikes — the mathematical definition of an influential point.

灰點:固定資料;紅點:可拖曳;藍線:含紅點的 OLS 擬合

二、殘差診斷流程

🌳 OLS 診斷與修正路徑

Q1:
殘差 vs 擬合圖呈曲線樣式?→ 是 → 非線性。加入多項式 / 樣條 (spline) / GAM,或改 log 等轉換。
Q2:
Scale-Location 呈開口/喇叭樣式?→ 是 → 異質變異。先 HC3 sandwich SE;若結構性 → GLS 加權或改 GLM
Q3:
任一 VIF > 5–10→ 是 → 多重共線。刪除冗餘 / 合併指標 / PCA / ridge
Q4:
Cook's D > 4/n(或 > 1)?→ 是 → 影響點。不要直接刪——先檢查資料錯誤 / 異常生物學意義;若必須保留 → 穩健迴歸 (rlm, MM-estimator)
Q5:
QQ plot 嚴重重尾?→ 是 → 多半是 Q1/Q2/Q4 的次生現象。修完前四項後重看;若仍重尾 → t-likelihood / quantile regression
Q1:
Residuals-vs-fitted shows curvature? → Yes → non-linearity. Add polynomial / spline / GAM, or transform.
Q2:
Scale-Location shows a funnel? → Yes → heteroscedasticity. First HC3 sandwich SE; if structural → WLS / GLS or switch to a GLM.
Q3:
Any VIF > 5–10? → Yes → multicollinearity. Drop / combine / PCA / ridge.
Q4:
Cook's D > 4/n (or > 1)? → Yes → influence. Don't just delete — check for data errors / biological meaning; if you must keep it → robust regression (rlm, MM-estimator).
Q5:
QQ plot heavy tails? → Yes → usually downstream of Q1/Q2/Q4. Re-check after fixing those; if still heavy → t-likelihood / quantile regression.
違反 徵兆 修正
非線性曲線殘差spline / GAMNon-linearitycurved residualsspline / GAM
異質變異喇叭殘差HC3 / WLS / GLMHeteroscedasticityfunnel residualsHC3 / WLS / GLM
多重共線性VIF 高、SE 膨脹drop / ridgeMulticollinearityhigh VIF, inflated SEdrop / ridge
影響點Cook 飆升rlm / 檢查資料Influential pointCook spikerlm / inspect data
相關殘差Durbin-Watson、批次mixed modelCorrelated errorsDurbin-Watson, batchmixed model

實作:擬合、診斷、穩健 SE

# --- R --- OLS 全流程
library(broom); library(car); library(lmtest); library(sandwich)

# 1) 擬合與整理
fit <- lm(y ~ x1 + x2 + x3, data = df)
summary(fit)
broom::tidy(fit, conf.int = TRUE)        # 整潔的係數表
broom::glance(fit)                          # R², AIC, p, df

# 2) 四張診斷圖(plot.lm 一鍵呼叫)
par(mfrow = c(2, 2)); plot(fit)

# 3) 多重共線性
car::vif(fit)                              # > 5–10 警示

# 4) 影響點:Cook、DFBETA、leverage
car::influencePlot(fit)                    # 互動式氣泡圖
cooks.distance(fit) |> which(. > 4/nrow(df))

# 5) 異質變異 → HC3 穩健 SE
lmtest::bptest(fit)                          # Breusch-Pagan
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC3"))
# --- Python ---
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence

# 1) 擬合
fit = smf.ols("y ~ x1 + x2 + x3", data = df).fit()
print(fit.summary())

# 2) 殘差診斷圖
import statsmodels.graphics.regressionplots as rp
rp.plot_regress_exog(fit, "x1")

# 3) VIF
X = fit.model.exog
[variance_inflation_factor(X, i) for i in range(X.shape[1])]

# 4) Cook、leverage
infl = fit.get_influence()
cooks_d, _ = infl.cooks_distance
hii = infl.hat_matrix_diag

# 5) HC3 穩健 SE
robust = fit.get_robustcov_results(cov_type="HC3")
print(robust.summary())
🚫
常見錯誤:① 用 step() 做變數選擇後,再用同一筆資料報告 p-value——係數與 p 都被嚴重偏估(post-selection inference)。使用懲罰式方法(lasso、ridge)或事前指定模型。② 在 genomics 中看到 R² > 0.9——多半是樣本洩漏或批次混雜,不是「擬合很好」。 Common mistakes: ① Using step() then reporting p-values on the same data — both coefficients and p-values are heavily biased (post-selection inference). Use penalized methods (lasso, ridge) or pre-specified models. ② Seeing R² > 0.9 in genomics — usually sample leakage or batch confounding, not "great fit".

三、limma = 百萬個線性模型

limma 對 每個基因 擬合 lm(expression ~ design),再做 empirical Bayes 把所有基因的殘差變異 shrunk 到共同先驗——這就是 moderated t-statistic。所以本章的所有診斷(殘差、共線性、影響點)都直接適用於 limma,只不過從一張圖變成「每個基因都要關心」的尺度問題。實務上看 limma::plotSA(mean-variance)與 plotMD(每基因擬合)取代 plot.lm

limma fits lm(expression ~ design) for each gene, then empirical-Bayes shrinks residual variances to a common prior — yielding the moderated t-statistic. Every diagnostic in this chapter (residuals, collinearity, influence) applies directly to limma — just at scale. Use limma::plotSA (mean-variance) and plotMD (per-gene fit) as the genome-wide counterparts of plot.lm.

📝 自我檢測

1. 異質變異違反了哪條 OLS 假設?以下哪個是最佳第一線修正?

1. Which OLS assumption does heteroscedasticity violate, and what is the best first-line fix?

A. 違反獨立性;用 mixed modelA. Violates independence; use mixed model
B. 違反同質變異 (Gauss-Markov);用 HC3 sandwich 穩健 SEB. Violates homoscedasticity; use HC3 sandwich robust SE
C. 違反線性;用 polynomialC. Violates linearity; use polynomial
D. 違反常態;對 y 取 logD. Violates normality; log y

2. 關於高 VIF(多重共線性),下列何者正確?

2. About high VIF (multicollinearity), which is correct?

A. 會使係數估計變偏(biased)A. It biases coefficient estimates
B. 會使預測值變差B. It worsens predictions
C. 不偏係數,但膨脹 SE → t 變小、p 變大;對純預測無大影響C. Doesn't bias coefficients, but inflates SE → smaller t, larger p; little effect on pure prediction
D. 與 leverage 同義D. Same as leverage

3. Cook's 距離大表示什麼?

3. A large Cook's distance indicates what?

A. 殘差大但 leverage 小A. Large residual but low leverage
B. Leverage 大但殘差小B. High leverage but small residual
C. R² 很高C. High R²
D. 兼具大殘差高 leverage,整體擬合對該點高度敏感D. Both large residual and high leverage — the fit is highly sensitive to that point