Step 8: Linear Regression — Biostatistics Tutorial

總覽

為什麼線性迴歸是生物統計的「瑞士刀」？

從 Galton（1886）用「regression toward mediocrity」描述父子身高、到今天 RNA-seq 的 DESeq2 與 limma-voom 內核仍是 GLM——線性迴歸 (linear regression) 是現代統計最重要的一個模型。t 檢定是它的 1 變數版本、ANOVA 是它的類別自變數版本、ANCOVA 是它的混合版、GLM 是它的延伸、混合模型 (mixed models) 是它加上隨機效應。學會線性迴歸的「思維方式」（建模 → 估計 → 診斷 → 推論），就等於拿到 80% 生物統計分析的鑰匙。

但「容易上手」與「容易誤用」往往同義。Westreich & Greenland（2013 American Journal of Epidemiology）發現：絕大多數流行病學論文把 Table 2 裡所有的迴歸係數當成因果效應來解讀——這就是著名的「Table 2 fallacy」。其他常見災難：把 R² 當「成功指標」、忽略 VIF > 10 的共線性、不畫殘差圖、外推到資料範圍外⋯⋯本章把這些坑一次踩給你看。

From Galton's 1886 "regression toward mediocrity" used to describe father-son heights, to today's DESeq2 and limma-voom in RNA-seq still running on a GLM kernel — linear regression is the single most important model in modern statistics. The t-test is its one-variable case, ANOVA is its categorical-predictor case, ANCOVA mixes them, GLMs extend the link function, mixed models add random effects. Master the regression mindset (model → estimate → diagnose → infer) and you hold the keys to roughly 80% of biostatistical analysis.

But "easy to start" and "easy to misuse" are synonyms here. Westreich & Greenland (2013 AJE) found that the vast majority of epidemiology papers interpret every coefficient in Table 2 as a causal effect — the now-famous "Table 2 fallacy". Other regular disasters: R²-as-trophy, ignoring VIF > 10, skipping residual plots, extrapolation beyond the data range… we walk into each pit so you don't have to.

💡

歷史 × 一句話：Gauss（1809，《Theoria motus》）為了拼合星體軌道資料發明了最小平方法；Legendre 同年獨立發表；Fisher（1925）把它擺進現代統計推論框架；Nelder & Wedderburn（1972）把它推廣為 GLM。「線性」指的是對參數線性，不是對 x 線性——所以 y = β₀ + β₁·x + β₂·x² 仍是線性迴歸。 One-line history: Gauss invented least squares in 1809 (Theoria motus) for fitting planetary orbits; Legendre published independently the same year; Fisher (1925) embedded it in modern inferential statistics; Nelder & Wedderburn (1972) generalised it into GLMs. "Linear" means linear in parameters, not linear in x — so y = β₀ + β₁·x + β₂·x² is still linear regression.

核心概念

一、OLS、β 係數、R²、殘差

線性迴歸的整個世界由四個概念撐起：(1) 損失函數 OLS 告訴你「最好」是什麼意思；(2) β 係數告訴你 x 變化 1 單位 y 變化多少；(3) R² 告訴你模型解釋多少變異；(4) 殘差是診斷的線索。

The whole world of linear regression rests on four ideas: (1) the loss function OLS defines what "best" means; (2) the β coefficients tell you how much y moves per unit of x; (3) R² tells you the share of variance explained; (4) the residuals are the diagnostic trail.

📐

OLS：最小平方準則

最小化殘差平方和 Σ(yᵢ − ŷᵢ)²。閉合解 β̂ = (X'X)⁻¹X'y。幾何意義：把 y 向量正交投影到 X 的欄空間。

幾何直覺：殘差 e = y − ŷ 永遠垂直於擬合值 ŷ。這就是 OLS「最佳」的幾何定義——沒有更近的點了。

Minimize the residual sum of squares Σ(yᵢ − ŷᵢ)². Closed form: β̂ = (X'X)⁻¹X'y. Geometrically, project y orthogonally onto the column space of X.

Intuition: residuals e = y − ŷ are perpendicular to fitted values ŷ. That orthogonality is what "best" means in OLS — no closer point exists.

📈

β 係數的解讀

β₀ (intercept)：所有 x = 0 時的預期 y。生物資料中常無意義（年齡 0、體重 0）——中心化 (centering) 後 β₀ 變成「平均人」。

βⱼ：保持其他變數不變時，xⱼ 增加 1 單位，y 平均變化 βⱼ 單位（這個「保持其他不變」就是條件詮釋——也是 Table 2 fallacy 的根源）。

標準化 β：先 z-score 所有變數再迴歸，β 之間可比較「強度」，但失去原始單位。

β₀ (intercept): expected y when all x = 0. Often meaningless biologically (age 0, weight 0) — centering the predictors makes β₀ "the average person".

βⱼ: holding all other variables fixed, a 1-unit increase in xⱼ shifts y by βⱼ units on average (the "holding fixed" caveat is the conditional interpretation — and the seed of the Table 2 fallacy).

Standardized β: z-score everything before fitting; betas become comparable in "strength" but lose original units.

🎯

R² 與 adjusted R²

R² = 1 − SS_res/SS_tot，模型解釋的變異比例 (0–1)。

R²_adj = 1 − (1−R²)(n−1)/(n−p−1)，懲罰多餘變數。R² 永遠隨變數數量增加而上升，所以單看 R² 比模型大小是錯的——比 adjusted R² 或 AIC/BIC。

⚠️ R² = 0.9 不代表模型好，R² = 0.1 也不代表模型差——生物學中行為遺傳學常 R² ≈ 0.05 仍是重要發現。

R² = 1 − SS_res/SS_tot: share of variance explained (0–1).

R²_adj = 1 − (1−R²)(n−1)/(n−p−1) penalises extra predictors. Plain R² rises whenever you add predictors, so comparing model sizes by R² is wrong — use adjusted R² or AIC/BIC.

⚠️ R² = 0.9 isn't automatically good and R² = 0.1 isn't automatically bad — in behaviour genetics R² ≈ 0.05 can be a major finding.

🔍

殘差：診斷的金線

殘差是「資料告訴模型還沒抓到什麼」的訊號。四個診斷圖（下方第 5 節詳述）：residuals vs fitted、QQ plot、scale-location、leverage vs residuals。

不畫殘差圖就報告迴歸，等於不看 histogram 就算 mean ± SD。R 的 plot(model) 一行就有四圖。

Residuals are the signal of "what the model didn't catch". The four diagnostic plots (Section 5 below): residuals vs fitted, QQ plot, scale-location, leverage vs residuals.

Reporting a regression without residual plots is like reporting mean ± SD without a histogram. In R, plot(model) gives all four in one line.

⌜ ŷ = Xβ̂ · β̂ = (X'X)⁻¹X'y · SS_tot = Σ(yᵢ−ȳ)² · SS_res = Σ(yᵢ−ŷᵢ)² ⌝ Gauss-Markov 定理（1821）：在線性、獨立、同變異數的假設下，OLS 估計量是所有線性無偏估計中變異最小的（BLUE — Best Linear Unbiased Estimator）。注意「線性無偏」是個強烈限制——脫離這個族群，bias 換 variance 的 ridge / lasso 可以贏 OLS。 ⌜ ŷ = Xβ̂ · β̂ = (X'X)⁻¹X'y · SS_tot = Σ(yᵢ−ȳ)² · SS_res = Σ(yᵢ−ŷᵢ)² ⌝ Gauss-Markov (1821): under linearity, independence, and homoscedasticity, OLS is the minimum-variance among all linear unbiased estimators (BLUE — Best Linear Unbiased Estimator). The qualifier "linear unbiased" is strong — step outside that class and bias-traded estimators like ridge/lasso can beat OLS.

互動模擬 ①

迴歸遊樂場：滑動真實 β 與雜訊

下方資料由 y = β₀ + β₁·x + ε, ε ~ N(0, σ²) 模擬產生。你可以調整真實截距、斜率與雜訊標準差，觀察 OLS 重新估計的 β̂ 與真實值的差異。試試：(1) 雜訊極小時 R² 接近 1，β̂ 幾乎等於真值；(2) 雜訊變大 R² 下降但 β̂ 仍接近真值（OLS unbiased！）；(3) n 變小時 β̂ 的變異增大（標準誤 ∝ 1/√n）。

The points below are simulated from y = β₀ + β₁·x + ε, ε ~ N(0, σ²). Adjust the true intercept, slope, and noise SD, and watch how the OLS-estimated β̂ tracks the truth. Try: (1) tiny noise → R² ≈ 1, β̂ ≈ truth; (2) raise noise → R² drops but β̂ still ≈ truth (OLS is unbiased!); (3) shrink n → β̂ wiggles more (SE ∝ 1/√n).

真實 β₀ 2

真實 β₁ 1.5

雜訊 σ 1.5

樣本數 n 40

藍點 = 資料 · 紅線 = OLS 擬合 · 灰虛線 = 真實關係Blue = data · Red = OLS fit · Grey dashed = ground truth

OLS 假設

二、四個假設與診斷圖

OLS 推論（CI、p 值）的有效性建立在四個假設上，記憶口訣 LINE：Linearity、Independence、Normality of residuals、Equal variance（同變異數，homoscedasticity）。違反不同假設後果不一樣：非線性 → β 整個錯；異變異數 → β 仍 unbiased 但 SE 錯（CI 與 p 不可信）；殘差非常態 → 小樣本 CI 不可信，大樣本因 CLT 影響小；觀察值不獨立 → SE 嚴重低估（最嚴重）。

The validity of OLS inference (CIs, p-values) rests on four assumptions, mnemonic LINE: Linearity, Independence, Normality of residuals, Equal variance (homoscedasticity). Different violations bite differently: non-linearity → β estimates are wrong; heteroscedasticity → β still unbiased but SEs are wrong (CIs/p untrustworthy); non-normal residuals → small-n CIs unreliable, large-n rescued by CLT; non-independence → SEs grossly underestimated (the worst of the four).

診斷圖	檢查的假設	健康的樣子	病徵
Residuals vs Fitted	Linearity, homoscedasticity	隨機散布、無弧形、無漏斗	U 形 → 非線性；漏斗 → 異變異數	Linearity, homoscedasticity	Random scatter, no curve, no funnel	U-shape → non-linearity; funnel → heteroscedasticity
QQ Plot	Normality of residuals	點落在 y=x 對角線	S 形 → 重尾；翹尾 → outlier	Normality of residuals	Points along y=x line	S-shape → heavy tails; bent tail → outliers
Scale-Location	Homoscedasticity (專用)	√\|residuals\| 水平	上升 → 變異隨 fitted 增大	Homoscedasticity (focused)	√\|residuals\| flat	Rising trend → variance grows with fit
Residuals vs Leverage	Influence (Cook's distance)	無點落在 Cook 等高線外	右上角點 + 高 Cook's D → 單點主導擬合	Influence (Cook's distance)	No points outside Cook contours	Top-right + high Cook's D → one point drives the fit

⚠️

影響力 (influence) ≠ 離群值 (outlier) ≠ 高槓桿 (leverage)：離群值 = 殘差大；高槓桿 = x 偏離資料中心；影響力 = 兩者結合（Cook's D 同時看殘差與槓桿）。判定門檻：Cook's D > 4/n 提示審查，leverage h_ii > 2(p+1)/n 視為高槓桿。實務上：先看圖，再用 Cook's D，最後用敏感度分析（保留/移除該點看 β 變化）。 Influence ≠ outlier ≠ leverage: an outlier has a big residual; high leverage means x is far from the centre of x; influence combines both (Cook's D mixes residual with leverage). Common thresholds: Cook's D > 4/n flags for review, leverage h_ii > 2(p+1)/n is "high leverage". Practical recipe: plot first, then Cook's D, then sensitivity analysis (compare β with vs without the point).

互動模擬 ②

診斷圖實驗室

選擇下方情境：none（健康）、heteroscedastic（漏斗）、non-linear（彎曲）、outlier（一個極端值）。觀察每個情境下，四個診斷圖如何分別反應——尤其注意「Residuals vs Fitted」看曲線、「Scale-Location」看漏斗、「QQ」看尾巴、「Leverage」看單點主導。

Pick a scenario: none (healthy), heteroscedastic (funnel), non-linear (curve), outlier (one extreme). Watch how each of the four diagnostic plots responds — Residuals vs Fitted catches curves, Scale-Location catches funnels, QQ catches tail issues, Leverage catches single-point dominance.

情境

左上：Residuals vs Fitted · 右上：QQ Plot · 左下：Scale-Location · 右下：Residuals vs LeverageTop-left: Residuals vs Fitted · Top-right: QQ Plot · Bottom-left: Scale-Location · Bottom-right: Residuals vs Leverage

進階主題

三、共線性、交互作用、虛擬變數、多項式

🔗 多重共線性

當預測變數彼此高度相關，β 個別不可信、SE 爆炸、p 飄。診斷：VIF (Variance Inflation Factor) = 1/(1−R²ⱼ)，其中 R²ⱼ 是用其他 x 預測 xⱼ 的判定係數。

門檻：VIF > 5 輕度警告、VIF > 10 嚴重共線性（O'Brien 2007 Quality & Quantity）。對策：(1) 移除高度相關的某一變數；(2) PCA / PLS 把變數合成；(3) 用 ridge regression（L2 規則化，正是設計來救共線性）。請勿單靠成對相關係數判斷——三個變數兩兩 r = 0.4 仍可能 VIF = 8。

When predictors are mutually correlated, individual βs become unreliable, SEs explode, p-values drift. Diagnose with the Variance Inflation Factor (VIF) = 1/(1−R²ⱼ), where R²ⱼ is the R² of regressing xⱼ on the rest.

Thresholds: VIF > 5 mild concern, VIF > 10 severe (O'Brien 2007 Quality & Quantity). Remedies: (1) drop a redundant variable; (2) combine via PCA / PLS; (3) ridge regression (L2 penalty — literally invented for this). Do not rely on pairwise correlations alone — three predictors all with r ≈ 0.4 can still have VIF ≈ 8.

✖️ 交互作用

當 x₁ 對 y 的效應「取決於」x₂ 的水準，就需要交互項：y ~ x1 * x2 = y ~ x1 + x2 + x1:x2。例：藥物效果 (treatment) 取決於性別 (sex)，模型 y ~ dose * sex。

解讀陷阱：有交互項時，不要單獨解讀 main effect——dose 的係數現在是「sex = reference 時的 dose 效應」。請報告兩個亞群下各自的效應或畫 interaction plot。建議：中心化連續變數，讓 main effect 變成「平均人的條件效應」。

When the effect of x₁ on y depends on the level of x₂, add an interaction: y ~ x1 * x2 = y ~ x1 + x2 + x1:x2. Example: drug effect varies with sex → y ~ dose * sex.

Reading pitfall: with an interaction, do not interpret main effects in isolation — the dose coefficient is now "dose effect when sex = reference". Report subgroup effects or draw an interaction plot. Tip: centre continuous predictors first so the main effect becomes the conditional effect "at the average".

🏷️ 類別變數編碼

k 類的類別變數變成 k − 1 個虛擬變數（dummy variable，又稱 one-hot），剩下一類當參考組。R 的 factor() + lm 自動處理；Python pd.get_dummies(drop_first=True)。

另一種：effect coding (sum-to-zero)，類別效應對「總平均」而非「參考組」。在 ANOVA 報告 type-III SS 時尤其有用。R: options(contrasts=c("contr.sum","contr.poly"))。

A k-level categorical becomes k − 1 dummies (one-hot), with one level held out as the reference. R's factor() + lm handles this automatically; Python pd.get_dummies(drop_first=True).

Alternative: effect coding (sum-to-zero) — category effects are relative to the grand mean rather than a reference level. Especially handy for type-III SS in ANOVA. R: options(contrasts=c("contr.sum","contr.poly")).

📈 多項式與樣條

對 x 非線性、對 β 仍線性：y ~ x + I(x²) 或更穩定的 poly(x, 2)（正交多項式避免共線性）。對劑量-反應曲線更推薦 natural splines ns(x, df=3)（Harrell 2015 RMS）。

陷阱：x²、x³ 之間共線性極高；高次多項式在資料邊緣震盪嚴重——超過 3 次幾乎都應改用 spline。

Non-linear in x but still linear in β: y ~ x + I(x²), or the more stable poly(x, 2) (orthogonal polynomials avoid collinearity). For dose-response curves, prefer natural splines ns(x, df=3) (Harrell 2015 RMS).

Pitfall: x², x³ are highly collinear; high-degree polynomials swing wildly at the data edges — anything past degree 3 should usually be a spline.

思路 (Harrell 2015 Regression Modeling Strategies)：建模前決定每個連續變數允許多少自由度（總自由度 ≤ n / 15），把那個自由度用在 spline 上而非「先看線性、p 不顯著再加二次項」。後者就是 p-hacking 的迴歸版。 Harrell's framing (RMS 2015): decide before fitting how many degrees of freedom each continuous variable gets (total df ≤ n / 15), and spend them on a spline — not on "try linear, then add quadratic if p > 0.05". The latter is regression-flavoured p-hacking.

決策引導

四、選 OLS / robust / GLM / mixed 的決策樹

🌳 迴歸方法決策樹

Q1:

y 是連續且近常態？→ 是 → OLS lm()。下一步檢查診斷圖。

Q2:

y 是計數、二元、比例？→ 是 → 用 GLM（二元 logistic、計數 Poisson / NB、比例 logit-binomial），見 Step 9。

Q3:

觀察值有分層或重複（同一隻老鼠多個切片、同一病人多個時間點）？→ 是 → 用 mixed-effects model (lme4::lmer)，見 Step 13。

Q4:

診斷圖出現嚴重 outlier / 重尾？→ 是 → Robust regression（MASS::rlm，Huber 或 MM）或先 log/Box-Cox 轉換。

Q5:

異變異數但其他假設 OK？→ 是 → 用 robust sandwich SE（White 1980, R: sandwich::vcovHC）或 weighted least squares。β 不變，SE 更可信。

Q6:

變數 p > n 或共線性嚴重？→ 是 → Ridge（L2，全部係數縮）或 Lasso（L1，變數選擇）或 Elastic Net。

Q7:

關心條件分位數而非平均？（例如：「最差 10% 的人」）→ 是 → Quantile regression（Koenker 2005）。

Q1:

y is continuous and roughly normal? → Yes → OLS lm(). Next, inspect diagnostics.

Q2:

y is a count, binary, or proportion? → Yes → use a GLM (logistic, Poisson / NB, logit-binomial) — see Step 9.

Q3:

Observations are nested or repeated (same mouse, multiple slices; same patient, multiple visits)? → Yes → use a mixed-effects model (lme4::lmer) — see Step 13.

Q4:

Diagnostics show severe outliers or heavy tails? → Yes → robust regression (MASS::rlm with Huber or MM) or log/Box-Cox first.

Q5:

Heteroscedastic but the rest is OK? → Yes → use robust sandwich SEs (White 1980, R: sandwich::vcovHC) or weighted least squares. β unchanged, SEs more reliable.

Q6:

p > n or severe collinearity? → Yes → Ridge (L2, shrinks all coefficients) or Lasso (L1, variable selection) or Elastic Net.

Q7:

Care about a conditional quantile, not the mean (e.g. "the worst 10%")? → Yes → quantile regression (Koenker 2005).

方法比較表

五、五種迴歸快速對照

方法	損失 / 目標	最佳情境	代價	工具
OLS	min Σ(y−ŷ)²	假設成立、推論為主	對 outlier 敏感、共線性下變異大	`lm`, `statsmodels.OLS`	Assumptions hold, inference-focused	Sensitive to outliers; high variance under collinearity
Huber / MM (Robust)	小殘差用平方、大殘差用線性 (Huber 1964)	outlier 多或重尾	統計效率比 OLS 略低	`MASS::rlm`, `statsmodels.RLM`	Quadratic on small residuals, linear on large (Huber 1964)	Outliers or heavy tails	Slightly less efficient than OLS
Ridge (L2)	min Σ(y−ŷ)² + λΣβⱼ²	共線性嚴重、p 大、要預測	係數有偏（縮向 0）、不做變數選擇	`glmnet`, `sklearn.Ridge`	Severe collinearity, large p, prediction	Coefficients biased (shrunk to 0); no variable selection
Lasso (L1)	min Σ(y−ŷ)² + λΣ\|βⱼ\| (Tibshirani 1996)	高維、想做變數選擇	相關變數中只挑一個、係數估計仍有偏	`glmnet`, `sklearn.Lasso`	min Σ(y−ŷ)² + λΣ\|βⱼ\| (Tibshirani 1996)	High-dimensional, automatic variable selection	Picks one from a correlated group; biased estimates
Elastic Net	α·L1 + (1−α)·L2 (Zou-Hastie 2005)	同時要群體選擇與穩定性	兩個超參數需 CV	`glmnet(alpha=0.5)`	α·L1 + (1−α)·L2 (Zou-Hastie 2005)	Want group selection + stability	Two hyperparameters; needs CV
Quantile	min Σρ_τ(y−ŷ) (Koenker-Bassett 1978)	關心中位數或極端分位數	SE 由 bootstrap，較不直接	`quantreg::rq`, `statsmodels.QuantReg`	min Σρ_τ(y−ŷ) (Koenker-Bassett 1978)	Median or tail of the conditional distribution	SEs typically need bootstrap

💡

記憶卡：「L1 選變數，L2 救共線」。Lasso 把不重要的 β 壓成 0（變數選擇）；Ridge 把所有 β 都壓小一點（穩定變異）。L1 + L2 同時用就是 Elastic Net。生物資訊高維 (p ≫ n) 場景（基因表達 → 表型）常用 Lasso / Elastic Net 配 10-fold CV 選 λ。 Mnemonic: "L1 selects, L2 stabilises." Lasso shrinks unimportant βs exactly to 0 (variable selection); Ridge shrinks every β a bit (variance reduction). Combine both → Elastic Net. The high-dimensional p ≫ n setting in bioinformatics (gene expression → phenotype) typically uses Lasso / Elastic Net with 10-fold CV to pick λ.

程式碼

六、實作範例

# R: linear regression workflow
library(tidyverse); library(car); library(MASS); library(sandwich); library(lmtest); library(glmnet)

# Example: systolic BP ~ age + BMI + sex (sex factor)
df <- tibble(age=rnorm(200,50,12), bmi=rnorm(200,25,4),
             sex=factor(sample(c("F","M"),200,replace=TRUE)),
             sbp=100 + 0.5*age + 1.2*bmi + rnorm(200,0,8))

# --- Fit & summary ---
m <- lm(sbp ~ age + bmi + sex, data=df)
summary(m)              # β̂, SE, t, p, R², adj R²
confint(m)              # 95% CI for each β

# --- Four diagnostic plots in one line ---
par(mfrow=c(2,2)); plot(m); par(mfrow=c(1,1))

# --- VIF for multicollinearity ---
car::vif(m)               # > 5 mild, > 10 severe

# --- Heteroscedasticity test & sandwich SE ---
lmtest::bptest(m)               # Breusch-Pagan test
lmtest::coeftest(m, vcov.=sandwich::vcovHC(m, type="HC3"))

# --- Robust regression (Huber / MM) ---
m_rob <- MASS::rlm(sbp ~ age + bmi + sex, data=df, psi=MASS::psi.huber)

# --- Influence: Cook's distance ---
cd <- cooks.distance(m); which(cd > 4/nrow(df))

# --- Ridge / Lasso via glmnet (alpha=0 ridge, 1 lasso) ---
X <- model.matrix(sbp ~ age + bmi + sex, df)[,-1]
cv <- glmnet::cv.glmnet(X, df$sbp, alpha=1)   # lasso
coef(cv, s="lambda.min")

import numpy as np, pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import RidgeCV, LassoCV
import matplotlib.pyplot as plt

# --- Same example dataset ---
rng = np.random.default_rng(0)
df = pd.DataFrame({
  "age": rng.normal(50,12,200),
  "bmi": rng.normal(25,4,200),
  "sex": rng.choice(["F","M"],200)})
df["sbp"] = 100 + 0.5*df.age + 1.2*df.bmi + rng.normal(0,8,200)

# --- Fit & summary ---
m = smf.ols("sbp ~ age + bmi + C(sex)", data=df).fit()
print(m.summary())                  # β̂, SE, t, p, R², adj R²
print(m.conf_int())                 # 95% CI

# --- Four diagnostic plots manually ---
fitted, resid = m.fittedvalues, m.resid
fig, ax = plt.subplots(2,2, figsize=(9,7))
ax[0,0].scatter(fitted, resid); ax[0,0].set_title("Residuals vs Fitted")
sm.qqplot(resid, line="s", ax=ax[0,1]); ax[0,1].set_title("QQ Plot")
ax[1,0].scatter(fitted, np.sqrt(abs(resid))); ax[1,0].set_title("Scale-Location")
inf = m.get_influence()
ax[1,1].scatter(inf.hat_matrix_diag, resid); ax[1,1].set_title("Residuals vs Leverage")

# --- VIF ---
X = sm.add_constant(pd.get_dummies(df[["age","bmi","sex"]], drop_first=True))
[variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# --- Robust regression (Huber) ---
m_rob = sm.RLM(df.sbp, X, M=sm.robust.norms.HuberT()).fit()

# --- Heteroscedasticity-robust SE (HC3) ---
m_hc3 = smf.ols("sbp ~ age + bmi + C(sex)", data=df).fit(cov_type="HC3")

# --- Ridge / Lasso with CV ---
RidgeCV(alphas=np.logspace(-2,2,50)).fit(X, df.sbp)
LassoCV(cv=10).fit(X, df.sbp)

💡

實務 SOP：(1) 先 summary(model)；(2) plot(model) 看四圖；(3) vif() 看共線性；(4) bptest() 看異變異數，必要時改 sandwich SE；(5) 用 cooks.distance() 看影響點，做敏感度分析；(6) 報告時附 95% CI（不是 SE）。 Standard workflow: (1) summary(model); (2) plot(model) for four diagnostics; (3) vif() for collinearity; (4) bptest() for heteroscedasticity, swap to sandwich SEs if needed; (5) cooks.distance() for influence + sensitivity check; (6) report 95% CIs (not SEs).

常見陷阱

七、五個論文常見災難

❌ Table 2 fallacy

Westreich & Greenland 2013 AJE：流行病學論文 Table 2 通常把模型裡所有變數的 β都當成「該變數對 y 的因果效應」報告。但只有主要 estimand（target exposure → outcome）的 β 在因果上有意義；其他變數（混淆變數 confounders、中介變數 mediators）的 β 在條件解讀上與想像的不同。

對策：畫 DAG（Pearl 2009; Hernán-Robins 2020）、明確指定 estimand、只報告主要 exposure 的因果係數，其他變數標為「adjusted for」。

Westreich & Greenland (2013 AJE): most epidemiology Table 2s present every β as if it were the causal effect of that variable on y. But only the target estimand (exposure → outcome) carries causal meaning; coefficients for confounders/mediators have a conditional interpretation that almost never matches the implied causal story.

Fix: draw a DAG (Pearl 2009; Hernán-Robins 2020), declare the estimand, report only the target exposure's coefficient causally — list other variables as "adjusted for".

❌ 忽略共線性

在 BMI 同時放 weight 與 height（VIF 爆炸），結果 weight 的 β 不顯著卻方向反——共線性常造成「顯著翻轉」。每次跑迴歸都該 vif()。

Putting weight and height alongside BMI (VIF explodes), the weight β flips sign and is non-significant — collinearity often causes "sign flips". Always run vif().

❌ 遺漏變數偏誤

真實 y 受 z 影響，z 又與 x 相關，但模型沒放 z——x 的 β 會吸收 z 的效應。生物資料中 age、sex 常是被遺忘的共同混淆。判斷依靠 DAG，而不是 R²。

If z drives y and z correlates with x, but z isn't in the model — x's β absorbs z's effect. In biological data, age and sex are the usual forgotten common causes. Decide what to include from a DAG, not from R².

❌ 外推

資料 x 範圍 10–60，預測 x = 100 的 y——數學可以給你一個數字，但沒有任何證據那裡是線性。外推是 Challenger 太空梭 O-ring 災難（1986）的統計教訓：發射溫度 31°F 遠在歷史資料下限 (53°F) 之外。

Your x ranges over 10–60 and you predict at x = 100 — the math returns a number but you have no evidence the line continues. Extrapolation is the statistical lesson of the Challenger O-ring disaster (1986): launch temperature 31°F was well below the historical minimum (53°F).

❌ R² 至上主義

「我把 R² 從 0.6 衝到 0.9 了！」——多半是加了過多變數導致過度擬合 (overfitting)。判斷模型好壞應看：(1) adjusted R²；(2) cross-validated R²；(3) out-of-sample MSE。論文上請報告 adjusted R² 或交叉驗證表現，不是 raw R²。

"I pushed R² from 0.6 to 0.9!" — usually by adding so many predictors that you overfit. Judge models by: (1) adjusted R²; (2) cross-validated R²; (3) out-of-sample MSE. Papers should report adjusted R² or CV performance, not raw R².

❌ 不畫殘差圖

論文中迴歸結果只給「β, SE, p」表，卻沒有 residual vs fitted 與 QQ plot——這是 reviewer 應立即退件的紅旗。Fox（2016 Applied Regression）：「沒有診斷的迴歸是迷信。」

Papers reporting regressions as a β / SE / p table without residuals-vs-fitted or QQ plots — a red flag any reviewer should catch. Fox (2016 Applied Regression): "Regression without diagnostics is superstition."

生物醫學案例

八、兩個案例

案例 A：BP ~ age + BMI + sex

目標 estimand：BMI 對 SBP 的條件效應（控制 age、sex）。建模：lm(sbp ~ age + bmi + sex)。預期 β_BMI ≈ 1–1.5 mmHg per kg/m² (Linderman 2018)。診斷：殘差通常輕微右偏；考慮 log(SBP) 或 robust SE。報告：「BMI was associated with SBP (β = 1.2, 95% CI 0.9–1.5, p < 0.001), adjusting for age and sex.」不要把 age 的 β 也當成「年齡的因果效應」報告——Table 2 fallacy！

Target estimand: the conditional effect of BMI on SBP (adjusting for age, sex). Fit: lm(sbp ~ age + bmi + sex). Expect β_BMI ≈ 1–1.5 mmHg per kg/m² (Linderman 2018). Diagnostics: residuals are usually mildly right-skewed → consider log(SBP) or robust SEs. Report: "BMI was associated with SBP (β = 1.2, 95% CI 0.9–1.5, p < 0.001), adjusting for age and sex." Don't simultaneously report the age β as its causal effect — Table 2 fallacy!

案例 B：expr ~ time + treatment

RNA-seq 中對單一基因擬合 lm(log2_expr ~ time * treatment)——交互項代表「處理改變了時間趨勢」。陷阱：(1) RNA-seq 計數應該用 NB-GLM (DESeq2 / edgeR) 而非 OLS，OLS 對小 count 變異估計錯；(2) 同一隻動物多時間點 → 需要 mixed model（random intercept per animal）；(3) 跨基因檢定要 BH-FDR 校正（Step 12）。

For a single gene in RNA-seq, fit lm(log2_expr ~ time * treatment) — the interaction says "treatment alters the time trend". Caveats: (1) raw counts should use NB-GLMs (DESeq2 / edgeR), not OLS — OLS mis-estimates variance for small counts; (2) repeated measures on the same animal → mixed model with random intercept per animal; (3) testing across many genes requires BH-FDR (Step 12).

重要文獻

九、必讀清單

Galton 1886「Regression towards mediocrity in hereditary stature」J Anthropol Inst — 迴歸這個詞的起源。
Gauss 1809《Theoria motus》— OLS 的數學奠基。
Westreich & Greenland 2013 AJE「The Table 2 fallacy」— 必讀，避免最常見的因果誤解。
White 1980 Econometrica「A heteroskedasticity-consistent covariance matrix estimator」— sandwich SE 的原典。
Tibshirani 1996 JRSS-B「Regression shrinkage and selection via the lasso」— Lasso 原典。
Hoerl & Kennard 1970 Technometrics — Ridge regression 原典。
Zou & Hastie 2005 JRSS-B — Elastic Net。
James, Witten, Hastie, Tibshirani 2021 ISLR 2nd ed — 線性模型與規則化的最佳教科書（免費 PDF）。
Fox 2016 Applied Regression Analysis 3rd ed — 診斷的聖經。
Harrell 2015 Regression Modeling Strategies 2nd ed — 臨床建模實務、spline、shrinkage、validation。
Hernán & Robins 2020 Causal Inference: What If — DAG 與條件解讀的因果框架（免費 PDF）。

Galton 1886 "Regression towards mediocrity in hereditary stature", J Anthropol Inst — origin of the word.
Gauss 1809 Theoria motus — mathematical foundation of OLS.
Westreich & Greenland 2013 AJE "The Table 2 fallacy" — essential, prevents the most common causal misreading.
White 1980 Econometrica — origin of sandwich (HC) standard errors.
Tibshirani 1996 JRSS-B — the Lasso paper.
Hoerl & Kennard 1970 Technometrics — Ridge regression.
Zou & Hastie 2005 JRSS-B — Elastic Net.
James, Witten, Hastie, Tibshirani 2021 ISLR 2nd ed — best textbook for linear models & regularisation (free PDF).
Fox 2016 Applied Regression Analysis 3rd ed — the diagnostics bible.
Harrell 2015 Regression Modeling Strategies 2nd ed — clinical modelling, splines, shrinkage, validation.
Hernán & Robins 2020 Causal Inference: What If — DAGs and the conditional-interpretation framework (free PDF).

📝 自我檢測

1. Residuals vs Fitted 圖呈現「漏斗形」（殘差變異隨 fitted 值增加），這意味著違反了哪個 OLS 假設？

1. A Residuals vs Fitted plot shows a funnel (variance grows with fitted value). Which OLS assumption is violated?

A. 線性 (Linearity)A. Linearity

B. 殘差常態性 (Normality)B. Normality of residuals

C. 同變異數 (Homoscedasticity)C. Homoscedasticity (equal variance)

D. 獨立性 (Independence)D. Independence

2. 你跑迴歸後發現 VIF = 12，下列最合理的處置是？

2. After fitting your regression you find VIF = 12. The most reasonable response?

A. 不管它，p 值還是顯著的A. Ignore it — the p-values are still significant

B. 增加更多變數來「稀釋」共線性B. Add more variables to "dilute" collinearity

C. 刪除所有 p > 0.05 的變數C. Drop any variable with p > 0.05

D. 移除一個冗餘變數，或改用 ridge / PCAD. Drop a redundant variable, or switch to ridge / PCA

3. Westreich & Greenland 2013「Table 2 fallacy」警告的是什麼？

3. What does Westreich & Greenland's "Table 2 fallacy" (2013) warn against?

A. 不應該用迴歸做因果推論A. Never use regression for causal inference

B. 將模型中所有變數的 β 都當作各自的因果效應來解讀B. Interpreting every β in the model as that variable's causal effect

C. 報告 R² 而不報告 adjusted R²C. Reporting R² rather than adjusted R²

D. 同時把連續變數與類別變數放在一個模型D. Mixing continuous and categorical predictors

4. p > n 的高維基因表達資料（20,000 genes、200 samples）預測表型，下列哪個方法最合適？

4. For high-dimensional gene-expression data (20,000 genes, 200 samples) predicting phenotype, which is best?

A. 普通 OLS — 解越多越好A. Plain OLS — more variables, better fit

B. 用 stepwise selection 挑變數B. Stepwise variable selection

C. Lasso 或 Elastic Net 配 10-fold CVC. Lasso or Elastic Net with 10-fold CV

D. 只用 p 值前 100 名的基因再跑 OLSD. Pre-filter top 100 genes by p, then run OLS