Step 6: Linear Regression & Diagnostics — Statistical Inference Tutorial

概覽

為什麼線性迴歸仍是脊椎？

線性迴歸 y = Xβ + ε 是 t 檢定、ANOVA、ANCOVA 的統一框架；limma 的 lmFit 對每個基因擬合的就是一個線性模型。不是「過時的方法」，而是 GLM、mixed model、Bayesian regression 的基底——所有複雜模型都是它的推廣。

OLS 之所以強大，是因為在 Gauss-Markov 條件下（線性、外生性、同質變異、無多重共線性），最小平方估計量是 BLUE（Best Linear Unbiased Estimator）——在所有線性無偏估計中變異最小。但當假設破壞時，估計仍可能無偏，標準誤卻會錯——這正是為什麼診斷比點估計重要。

Linear regression y = Xβ + ε is the unifying frame for t-test, ANOVA, and ANCOVA; limma's lmFit fits exactly such a model per gene. It is not an obsolete method but the substrate on which GLM, mixed, and Bayesian regression are built.

OLS is powerful because under the Gauss-Markov conditions (linearity, exogeneity, homoscedasticity, no perfect collinearity), least squares is BLUE — minimum variance among all linear unbiased estimators. When assumptions fail, point estimates often remain unbiased but standard errors get wrong — which is why diagnostics matter more than coefficients themselves.

💡

核心原則：「檢查 y 是否常態」是常見誤解——OLS 不需要 y 常態，只在做推論（CI、p-value）時需要殘差近似常態，而且 CLT 在 n 夠大時會自動處理。真正必檢的是：殘差 vs 擬合的系統性樣式、同質變異、與影響點。 Core principle: "Check if y is normal" is a textbook misconception. OLS does not require y to be normal — only the residuals should be approximately normal for inference (CIs, p-values), and CLT handles that when n is reasonable. The real diagnostics are: systematic patterns in residuals vs fitted, homoscedasticity, and influence.

核心概念

一、OLS、診斷與穩健化

📐

OLS 與 Gauss-Markov

正規方程式：β̂ = (XᵀX)⁻¹ Xᵀy
BLUE 條件：線性、獨立、同質變異、外生性
常態誤差 → t / F 分布；無常態 → 仍可用 CLT 近似
R² 是樣本內擬合度——不等於預測力

Normal equations: β̂ = (XᵀX)⁻¹ Xᵀy
BLUE conditions: linearity, independence, homoscedasticity, exogeneity
Normal errors → t / F distributions; non-normal → CLT-based approximation
R² is in-sample fit — not predictive power

📊

四張殘差診斷圖

Residuals vs Fitted：曲線樣式 → 非線性
QQ plot：尾部偏離 → 非常態 / 重尾
Scale-Location：開口形 → 異質變異
Residuals vs Leverage：Cook's 距離 → 影響點

Residuals vs Fitted: curvature → non-linearity
QQ plot: tail deviation → non-normal / heavy tails
Scale-Location: funnel shape → heteroscedasticity
Residuals vs Leverage: Cook's distance → influential points

🔗

多重共線性與 VIF

VIF_j = 1/(1−R²_j)；VIF > 5–10 警示
不偏係數，但膨脹標準誤 → t 變小、p 變大
解法：刪除冗餘變數、合併、PCA 或 ridge 正則化

VIF_j = 1/(1−R²_j); VIF > 5–10 is a red flag
Does not bias coefficients, but inflates SE → smaller t, larger p
Fix: drop redundancy, combine, PCA, or ridge regularization

⚖️

影響點

Leverage h_ii：X 空間中的「邊角」位置
Cook's D = (殘差²) × (h_ii/(1-h_ii)²)
DFBETA：個別點對單一係數的影響
規則：Cook > 4/n 或 > 1 警示

Leverage h_ii: corner position in X-space
Cook's D combines residual size × h_ii/(1-h_ii)²
DFBETA: point's influence on a single coefficient
Rule of thumb: Cook > 4/n or > 1 flags

⚠️

異質變異與穩健 SE：若 bptest（Breusch-Pagan）拒絕同質變異，不要盲目對數轉換 y——這會改變模型解釋。先試 HC3 (White / sandwich) 穩健標準誤：係數不變，但 SE / CI / p-value 都對異質變異穩健。若異質變異與結構有關（如 RNA-seq 計數的 mean-variance），就應走 GLM（NB、Gamma）而非 OLS+轉換。

Heteroscedasticity & robust SE: if Breusch-Pagan bptest rejects homoscedasticity, don't reflexively log-transform y — that changes the model. First try HC3 (White / sandwich) robust SE: coefficients unchanged, but SE / CI / p-values now valid under heteroscedasticity. If the variance structure is intrinsic (e.g. RNA-seq mean-variance), move to a GLM (NB, Gamma) rather than OLS-plus-transform.

互動模擬

影響點如何扭曲迴歸線

下圖中有一個紅色可拖曳點——用滑鼠拖動它（或調滑桿改變其 x / y 位置），觀察迴歸線斜率、R²、與 Cook's 距離如何即時變化。當該點同時具有高 leverage（離 x̄ 遠）與大殘差時，Cook's D 飆升——這就是影響點的數學定義。

The plot below has a red draggable point — move it via the sliders (or click to set) and watch the regression line, R², and Cook's distance update live. When the point has both high leverage (far from x̄) AND large residual, Cook's D spikes — the mathematical definition of an influential point.

— — —

紅點 x 8.0

紅點 y 10.0

灰點：固定資料；紅點：可拖曳；藍線：含紅點的 OLS 擬合

決策引導

二、殘差診斷流程

🌳 OLS 診斷與修正路徑

Q1:

殘差 vs 擬合圖呈曲線樣式？→ 是 → 非線性。加入多項式 / 樣條 (spline) / GAM，或改 log 等轉換。

Q2:

Scale-Location 呈開口/喇叭樣式？→ 是 → 異質變異。先 HC3 sandwich SE；若結構性 → GLS 加權或改 GLM。

Q3:

任一 VIF > 5–10？→ 是 → 多重共線。刪除冗餘 / 合併指標 / PCA / ridge。

Q4:

Cook's D > 4/n（或 > 1）？→ 是 → 影響點。不要直接刪——先檢查資料錯誤 / 異常生物學意義；若必須保留 → 穩健迴歸 (rlm, MM-estimator)。

Q5:

QQ plot 嚴重重尾？→ 是 → 多半是 Q1/Q2/Q4 的次生現象。修完前四項後重看；若仍重尾 → t-likelihood / quantile regression。

Q1:

Residuals-vs-fitted shows curvature? → Yes → non-linearity. Add polynomial / spline / GAM, or transform.

Q2:

Scale-Location shows a funnel? → Yes → heteroscedasticity. First HC3 sandwich SE; if structural → WLS / GLS or switch to a GLM.

Q3:

Any VIF > 5–10? → Yes → multicollinearity. Drop / combine / PCA / ridge.

Q4:

Cook's D > 4/n (or > 1)? → Yes → influence. Don't just delete — check for data errors / biological meaning; if you must keep it → robust regression (rlm, MM-estimator).

Q5:

QQ plot heavy tails? → Yes → usually downstream of Q1/Q2/Q4. Re-check after fixing those; if still heavy → t-likelihood / quantile regression.

違反	徵兆	修正
非線性	曲線殘差	spline / GAM	Non-linearity	curved residuals	spline / GAM
異質變異	喇叭殘差	HC3 / WLS / GLM	Heteroscedasticity	funnel residuals	HC3 / WLS / GLM
多重共線性	VIF 高、SE 膨脹	drop / ridge	Multicollinearity	high VIF, inflated SE	drop / ridge
影響點	Cook 飆升	rlm / 檢查資料	Influential point	Cook spike	rlm / inspect data
相關殘差	Durbin-Watson、批次	mixed model	Correlated errors	Durbin-Watson, batch	mixed model

程式碼

實作：擬合、診斷、穩健 SE

# --- R --- OLS 全流程
library(broom); library(car); library(lmtest); library(sandwich)

# 1) 擬合與整理
fit <- lm(y ~ x1 + x2 + x3, data = df)
summary(fit)
broom::tidy(fit, conf.int = TRUE)        # 整潔的係數表
broom::glance(fit)                          # R², AIC, p, df

# 2) 四張診斷圖（plot.lm 一鍵呼叫）
par(mfrow = c(2, 2)); plot(fit)

# 3) 多重共線性
car::vif(fit)                              # > 5–10 警示

# 4) 影響點：Cook、DFBETA、leverage
car::influencePlot(fit)                    # 互動式氣泡圖
cooks.distance(fit) |> which(. > 4/nrow(df))

# 5) 異質變異 → HC3 穩健 SE
lmtest::bptest(fit)                          # Breusch-Pagan
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC3"))

# --- Python ---
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence

# 1) 擬合
fit = smf.ols("y ~ x1 + x2 + x3", data = df).fit()
print(fit.summary())

# 2) 殘差診斷圖
import statsmodels.graphics.regressionplots as rp
rp.plot_regress_exog(fit, "x1")

# 3) VIF
X = fit.model.exog
[variance_inflation_factor(X, i) for i in range(X.shape[1])]

# 4) Cook、leverage
infl = fit.get_influence()
cooks_d, _ = infl.cooks_distance
hii = infl.hat_matrix_diag

# 5) HC3 穩健 SE
robust = fit.get_robustcov_results(cov_type="HC3")
print(robust.summary())

🚫

常見錯誤：① 用 step() 做變數選擇後，再用同一筆資料報告 p-value——係數與 p 都被嚴重偏估（post-selection inference）。使用懲罰式方法（lasso、ridge）或事前指定模型。② 在 genomics 中看到 R² > 0.9——多半是樣本洩漏或批次混雜，不是「擬合很好」。 Common mistakes: ① Using step() then reporting p-values on the same data — both coefficients and p-values are heavily biased (post-selection inference). Use penalized methods (lasso, ridge) or pre-specified models. ② Seeing R² > 0.9 in genomics — usually sample leakage or batch confounding, not "great fit".

與 limma 的連結

三、limma = 百萬個線性模型

limma 對 每個基因 擬合 lm(expression ~ design)，再做 empirical Bayes 把所有基因的殘差變異 shrunk 到共同先驗——這就是 moderated t-statistic。所以本章的所有診斷（殘差、共線性、影響點）都直接適用於 limma，只不過從一張圖變成「每個基因都要關心」的尺度問題。實務上看 limma::plotSA（mean-variance）與 plotMD（每基因擬合）取代 plot.lm。

limma fits lm(expression ~ design) for each gene, then empirical-Bayes shrinks residual variances to a common prior — yielding the moderated t-statistic. Every diagnostic in this chapter (residuals, collinearity, influence) applies directly to limma — just at scale. Use limma::plotSA (mean-variance) and plotMD (per-gene fit) as the genome-wide counterparts of plot.lm.

📝 自我檢測

1. 異質變異違反了哪條 OLS 假設？以下哪個是最佳第一線修正？

1. Which OLS assumption does heteroscedasticity violate, and what is the best first-line fix?

A. 違反獨立性；用 mixed modelA. Violates independence; use mixed model

B. 違反同質變異 (Gauss-Markov)；用 HC3 sandwich 穩健 SEB. Violates homoscedasticity; use HC3 sandwich robust SE

C. 違反線性；用 polynomialC. Violates linearity; use polynomial

D. 違反常態；對 y 取 logD. Violates normality; log y

2. 關於高 VIF（多重共線性），下列何者正確？

2. About high VIF (multicollinearity), which is correct?

A. 會使係數估計變偏（biased）A. It biases coefficient estimates

B. 會使預測值變差B. It worsens predictions

C. 不偏係數，但膨脹 SE → t 變小、p 變大；對純預測無大影響C. Doesn't bias coefficients, but inflates SE → smaller t, larger p; little effect on pure prediction

D. 與 leverage 同義D. Same as leverage

3. Cook's 距離大表示什麼？

3. A large Cook's distance indicates what?

A. 殘差大但 leverage 小A. Large residual but low leverage

B. Leverage 大但殘差小B. High leverage but small residual

C. R² 很高C. High R²

D. 兼具大殘差與高 leverage，整體擬合對該點高度敏感D. Both large residual and high leverage — the fit is highly sensitive to that point