為什麼線性迴歸是生物統計的「瑞士刀」?
從 Galton(1886)用「regression toward mediocrity」描述父子身高、到今天 RNA-seq 的 DESeq2 與 limma-voom 內核仍是 GLM——線性迴歸 (linear regression) 是現代統計最重要的一個模型。t 檢定是它的 1 變數版本、ANOVA 是它的類別自變數版本、ANCOVA 是它的混合版、GLM 是它的延伸、混合模型 (mixed models) 是它加上隨機效應。學會線性迴歸的「思維方式」(建模 → 估計 → 診斷 → 推論),就等於拿到 80% 生物統計分析的鑰匙。
但「容易上手」與「容易誤用」往往同義。Westreich & Greenland(2013 American Journal of Epidemiology)發現:絕大多數流行病學論文把 Table 2 裡所有的迴歸係數當成因果效應來解讀——這就是著名的「Table 2 fallacy」。其他常見災難:把 R² 當「成功指標」、忽略 VIF > 10 的共線性、不畫殘差圖、外推到資料範圍外⋯⋯本章把這些坑一次踩給你看。
From Galton's 1886 "regression toward mediocrity" used to describe father-son heights, to today's DESeq2 and limma-voom in RNA-seq still running on a GLM kernel — linear regression is the single most important model in modern statistics. The t-test is its one-variable case, ANOVA is its categorical-predictor case, ANCOVA mixes them, GLMs extend the link function, mixed models add random effects. Master the regression mindset (model → estimate → diagnose → infer) and you hold the keys to roughly 80% of biostatistical analysis.
But "easy to start" and "easy to misuse" are synonyms here. Westreich & Greenland (2013 AJE) found that the vast majority of epidemiology papers interpret every coefficient in Table 2 as a causal effect — the now-famous "Table 2 fallacy". Other regular disasters: R²-as-trophy, ignoring VIF > 10, skipping residual plots, extrapolation beyond the data range… we walk into each pit so you don't have to.
y = β₀ + β₁·x + β₂·x² 仍是線性迴歸。
One-line history: Gauss invented least squares in 1809 (Theoria motus) for fitting planetary orbits; Legendre published independently the same year; Fisher (1925) embedded it in modern inferential statistics; Nelder & Wedderburn (1972) generalised it into GLMs. "Linear" means linear in parameters, not linear in x — so y = β₀ + β₁·x + β₂·x² is still linear regression.
一、OLS、β 係數、R²、殘差
線性迴歸的整個世界由四個概念撐起:(1) 損失函數 OLS 告訴你「最好」是什麼意思;(2) β 係數告訴你 x 變化 1 單位 y 變化多少;(3) R² 告訴你模型解釋多少變異;(4) 殘差是診斷的線索。
The whole world of linear regression rests on four ideas: (1) the loss function OLS defines what "best" means; (2) the β coefficients tell you how much y moves per unit of x; (3) R² tells you the share of variance explained; (4) the residuals are the diagnostic trail.
OLS:最小平方準則
最小化殘差平方和 Σ(yᵢ − ŷᵢ)²。閉合解 β̂ = (X'X)⁻¹X'y。幾何意義:把 y 向量正交投影到 X 的欄空間。
幾何直覺:殘差 e = y − ŷ 永遠垂直於擬合值 ŷ。這就是 OLS「最佳」的幾何定義——沒有更近的點了。
Minimize the residual sum of squares Σ(yᵢ − ŷᵢ)². Closed form: β̂ = (X'X)⁻¹X'y. Geometrically, project y orthogonally onto the column space of X.
Intuition: residuals e = y − ŷ are perpendicular to fitted values ŷ. That orthogonality is what "best" means in OLS — no closer point exists.
β 係數的解讀
β₀ (intercept):所有 x = 0 時的預期 y。生物資料中常無意義(年齡 0、體重 0)——中心化 (centering) 後 β₀ 變成「平均人」。
βⱼ:保持其他變數不變時,xⱼ 增加 1 單位,y 平均變化 βⱼ 單位(這個「保持其他不變」就是條件詮釋——也是 Table 2 fallacy 的根源)。
標準化 β:先 z-score 所有變數再迴歸,β 之間可比較「強度」,但失去原始單位。
β₀ (intercept): expected y when all x = 0. Often meaningless biologically (age 0, weight 0) — centering the predictors makes β₀ "the average person".
βⱼ: holding all other variables fixed, a 1-unit increase in xⱼ shifts y by βⱼ units on average (the "holding fixed" caveat is the conditional interpretation — and the seed of the Table 2 fallacy).
Standardized β: z-score everything before fitting; betas become comparable in "strength" but lose original units.
R² 與 adjusted R²
R² = 1 − SSres/SStot,模型解釋的變異比例 (0–1)。
R²adj = 1 − (1−R²)(n−1)/(n−p−1),懲罰多餘變數。R² 永遠隨變數數量增加而上升,所以單看 R² 比模型大小是錯的——比 adjusted R² 或 AIC/BIC。
⚠️ R² = 0.9 不代表模型好,R² = 0.1 也不代表模型差——生物學中行為遺傳學常 R² ≈ 0.05 仍是重要發現。
R² = 1 − SSres/SStot: share of variance explained (0–1).
R²adj = 1 − (1−R²)(n−1)/(n−p−1) penalises extra predictors. Plain R² rises whenever you add predictors, so comparing model sizes by R² is wrong — use adjusted R² or AIC/BIC.
⚠️ R² = 0.9 isn't automatically good and R² = 0.1 isn't automatically bad — in behaviour genetics R² ≈ 0.05 can be a major finding.
殘差:診斷的金線
殘差是「資料告訴模型還沒抓到什麼」的訊號。四個診斷圖(下方第 5 節詳述):residuals vs fitted、QQ plot、scale-location、leverage vs residuals。
不畫殘差圖就報告迴歸,等於不看 histogram 就算 mean ± SD。R 的 plot(model) 一行就有四圖。
Residuals are the signal of "what the model didn't catch". The four diagnostic plots (Section 5 below): residuals vs fitted, QQ plot, scale-location, leverage vs residuals.
Reporting a regression without residual plots is like reporting mean ± SD without a histogram. In R, plot(model) gives all four in one line.
迴歸遊樂場:滑動真實 β 與雜訊
下方資料由 y = β₀ + β₁·x + ε, ε ~ N(0, σ²) 模擬產生。你可以調整真實截距、斜率與雜訊標準差,觀察 OLS 重新估計的 β̂ 與真實值的差異。試試:(1) 雜訊極小時 R² 接近 1,β̂ 幾乎等於真值;(2) 雜訊變大 R² 下降但 β̂ 仍接近真值(OLS unbiased!);(3) n 變小時 β̂ 的變異增大(標準誤 ∝ 1/√n)。
The points below are simulated from y = β₀ + β₁·x + ε, ε ~ N(0, σ²). Adjust the true intercept, slope, and noise SD, and watch how the OLS-estimated β̂ tracks the truth. Try: (1) tiny noise → R² ≈ 1, β̂ ≈ truth; (2) raise noise → R² drops but β̂ still ≈ truth (OLS is unbiased!); (3) shrink n → β̂ wiggles more (SE ∝ 1/√n).
藍點 = 資料 · 紅線 = OLS 擬合 · 灰虛線 = 真實關係Blue = data · Red = OLS fit · Grey dashed = ground truth
二、四個假設與診斷圖
OLS 推論(CI、p 值)的有效性建立在四個假設上,記憶口訣 LINE:Linearity、Independence、Normality of residuals、Equal variance(同變異數,homoscedasticity)。違反不同假設後果不一樣:非線性 → β 整個錯;異變異數 → β 仍 unbiased 但 SE 錯(CI 與 p 不可信);殘差非常態 → 小樣本 CI 不可信,大樣本因 CLT 影響小;觀察值不獨立 → SE 嚴重低估(最嚴重)。
The validity of OLS inference (CIs, p-values) rests on four assumptions, mnemonic LINE: Linearity, Independence, Normality of residuals, Equal variance (homoscedasticity). Different violations bite differently: non-linearity → β estimates are wrong; heteroscedasticity → β still unbiased but SEs are wrong (CIs/p untrustworthy); non-normal residuals → small-n CIs unreliable, large-n rescued by CLT; non-independence → SEs grossly underestimated (the worst of the four).
| 診斷圖 | 檢查的假設 | 健康的樣子 | 病徵 | |||
|---|---|---|---|---|---|---|
| Residuals vs Fitted | Linearity, homoscedasticity | 隨機散布、無弧形、無漏斗 | U 形 → 非線性;漏斗 → 異變異數 | Linearity, homoscedasticity | Random scatter, no curve, no funnel | U-shape → non-linearity; funnel → heteroscedasticity |
| QQ Plot | Normality of residuals | 點落在 y=x 對角線 | S 形 → 重尾;翹尾 → outlier | Normality of residuals | Points along y=x line | S-shape → heavy tails; bent tail → outliers |
| Scale-Location | Homoscedasticity (專用) | √|residuals| 水平 | 上升 → 變異隨 fitted 增大 | Homoscedasticity (focused) | √|residuals| flat | Rising trend → variance grows with fit |
| Residuals vs Leverage | Influence (Cook's distance) | 無點落在 Cook 等高線外 | 右上角點 + 高 Cook's D → 單點主導擬合 | Influence (Cook's distance) | No points outside Cook contours | Top-right + high Cook's D → one point drives the fit |
診斷圖實驗室
選擇下方 情境:none(健康)、heteroscedastic(漏斗)、non-linear(彎曲)、outlier(一個極端值)。觀察每個情境下,四個診斷圖如何分別反應——尤其注意「Residuals vs Fitted」看曲線、「Scale-Location」看漏斗、「QQ」看尾巴、「Leverage」看單點主導。
Pick a scenario: none (healthy), heteroscedastic (funnel), non-linear (curve), outlier (one extreme). Watch how each of the four diagnostic plots responds — Residuals vs Fitted catches curves, Scale-Location catches funnels, QQ catches tail issues, Leverage catches single-point dominance.
左上:Residuals vs Fitted · 右上:QQ Plot · 左下:Scale-Location · 右下:Residuals vs LeverageTop-left: Residuals vs Fitted · Top-right: QQ Plot · Bottom-left: Scale-Location · Bottom-right: Residuals vs Leverage
三、共線性、交互作用、虛擬變數、多項式
🔗 多重共線性
當預測變數彼此高度相關,β 個別不可信、SE 爆炸、p 飄。診斷:VIF (Variance Inflation Factor) = 1/(1−R²ⱼ),其中 R²ⱼ 是用其他 x 預測 xⱼ 的判定係數。
門檻:VIF > 5 輕度警告、VIF > 10 嚴重共線性(O'Brien 2007 Quality & Quantity)。對策:(1) 移除高度相關的某一變數;(2) PCA / PLS 把變數合成;(3) 用 ridge regression(L2 規則化,正是設計來救共線性)。請勿單靠成對相關係數判斷——三個變數兩兩 r = 0.4 仍可能 VIF = 8。
When predictors are mutually correlated, individual βs become unreliable, SEs explode, p-values drift. Diagnose with the Variance Inflation Factor (VIF) = 1/(1−R²ⱼ), where R²ⱼ is the R² of regressing xⱼ on the rest.
Thresholds: VIF > 5 mild concern, VIF > 10 severe (O'Brien 2007 Quality & Quantity). Remedies: (1) drop a redundant variable; (2) combine via PCA / PLS; (3) ridge regression (L2 penalty — literally invented for this). Do not rely on pairwise correlations alone — three predictors all with r ≈ 0.4 can still have VIF ≈ 8.
✖️ 交互作用
當 x₁ 對 y 的效應「取決於」x₂ 的水準,就需要交互項:y ~ x1 * x2 = y ~ x1 + x2 + x1:x2。例:藥物效果 (treatment) 取決於性別 (sex),模型 y ~ dose * sex。
解讀陷阱:有交互項時,不要單獨解讀 main effect——dose 的係數現在是「sex = reference 時的 dose 效應」。請報告兩個亞群下各自的效應或畫 interaction plot。建議:中心化連續變數,讓 main effect 變成「平均人的條件效應」。
When the effect of x₁ on y depends on the level of x₂, add an interaction: y ~ x1 * x2 = y ~ x1 + x2 + x1:x2. Example: drug effect varies with sex → y ~ dose * sex.
Reading pitfall: with an interaction, do not interpret main effects in isolation — the dose coefficient is now "dose effect when sex = reference". Report subgroup effects or draw an interaction plot. Tip: centre continuous predictors first so the main effect becomes the conditional effect "at the average".
🏷️ 類別變數編碼
k 類的類別變數變成 k − 1 個虛擬變數(dummy variable,又稱 one-hot),剩下一類當參考組。R 的 factor() + lm 自動處理;Python pd.get_dummies(drop_first=True)。
另一種:effect coding (sum-to-zero),類別效應對「總平均」而非「參考組」。在 ANOVA 報告 type-III SS 時尤其有用。R: options(contrasts=c("contr.sum","contr.poly"))。
A k-level categorical becomes k − 1 dummies (one-hot), with one level held out as the reference. R's factor() + lm handles this automatically; Python pd.get_dummies(drop_first=True).
Alternative: effect coding (sum-to-zero) — category effects are relative to the grand mean rather than a reference level. Especially handy for type-III SS in ANOVA. R: options(contrasts=c("contr.sum","contr.poly")).
📈 多項式與樣條
對 x 非線性、對 β 仍線性:y ~ x + I(x²) 或更穩定的 poly(x, 2)(正交多項式避免共線性)。對劑量-反應曲線更推薦 natural splines ns(x, df=3)(Harrell 2015 RMS)。
陷阱:x²、x³ 之間共線性極高;高次多項式在資料邊緣震盪嚴重——超過 3 次幾乎都應改用 spline。
Non-linear in x but still linear in β: y ~ x + I(x²), or the more stable poly(x, 2) (orthogonal polynomials avoid collinearity). For dose-response curves, prefer natural splines ns(x, df=3) (Harrell 2015 RMS).
Pitfall: x², x³ are highly collinear; high-degree polynomials swing wildly at the data edges — anything past degree 3 should usually be a spline.
四、選 OLS / robust / GLM / mixed 的決策樹
🌳 迴歸方法決策樹
sandwich::vcovHC)或 weighted least squares。β 不變,SE 更可信。sandwich::vcovHC) or weighted least squares. β unchanged, SEs more reliable.五、五種迴歸快速對照
| 方法 | 損失 / 目標 | 最佳情境 | 代價 | 工具 | ||||
|---|---|---|---|---|---|---|---|---|
| OLS | min Σ(y−ŷ)² | 假設成立、推論為主 | 對 outlier 敏感、共線性下變異大 | lm, statsmodels.OLS | Assumptions hold, inference-focused | Sensitive to outliers; high variance under collinearity | ||
| Huber / MM (Robust) | 小殘差用平方、大殘差用線性 (Huber 1964) | outlier 多或重尾 | 統計效率比 OLS 略低 | MASS::rlm, statsmodels.RLM | Quadratic on small residuals, linear on large (Huber 1964) | Outliers or heavy tails | Slightly less efficient than OLS | |
| Ridge (L2) | min Σ(y−ŷ)² + λΣβⱼ² | 共線性嚴重、p 大、要預測 | 係數有偏(縮向 0)、不做變數選擇 | glmnet, sklearn.Ridge | Severe collinearity, large p, prediction | Coefficients biased (shrunk to 0); no variable selection | ||
| Lasso (L1) | min Σ(y−ŷ)² + λΣ|βⱼ| (Tibshirani 1996) | 高維、想做變數選擇 | 相關變數中只挑一個、係數估計仍有偏 | glmnet, sklearn.Lasso | min Σ(y−ŷ)² + λΣ|βⱼ| (Tibshirani 1996) | High-dimensional, automatic variable selection | Picks one from a correlated group; biased estimates | |
| Elastic Net | α·L1 + (1−α)·L2 (Zou-Hastie 2005) | 同時要群體選擇與穩定性 | 兩個超參數需 CV | glmnet(alpha=0.5) | α·L1 + (1−α)·L2 (Zou-Hastie 2005) | Want group selection + stability | Two hyperparameters; needs CV | |
| Quantile | min Σρτ(y−ŷ) (Koenker-Bassett 1978) | 關心中位數或極端分位數 | SE 由 bootstrap,較不直接 | quantreg::rq, statsmodels.QuantReg | min Σρτ(y−ŷ) (Koenker-Bassett 1978) | Median or tail of the conditional distribution | SEs typically need bootstrap |
六、實作範例
# R: linear regression workflow library(tidyverse); library(car); library(MASS); library(sandwich); library(lmtest); library(glmnet) # Example: systolic BP ~ age + BMI + sex (sex factor) df <- tibble(age=rnorm(200,50,12), bmi=rnorm(200,25,4), sex=factor(sample(c("F","M"),200,replace=TRUE)), sbp=100 + 0.5*age + 1.2*bmi + rnorm(200,0,8)) # --- Fit & summary --- m <- lm(sbp ~ age + bmi + sex, data=df) summary(m) # β̂, SE, t, p, R², adj R² confint(m) # 95% CI for each β # --- Four diagnostic plots in one line --- par(mfrow=c(2,2)); plot(m); par(mfrow=c(1,1)) # --- VIF for multicollinearity --- car::vif(m) # > 5 mild, > 10 severe # --- Heteroscedasticity test & sandwich SE --- lmtest::bptest(m) # Breusch-Pagan test lmtest::coeftest(m, vcov.=sandwich::vcovHC(m, type="HC3")) # --- Robust regression (Huber / MM) --- m_rob <- MASS::rlm(sbp ~ age + bmi + sex, data=df, psi=MASS::psi.huber) # --- Influence: Cook's distance --- cd <- cooks.distance(m); which(cd > 4/nrow(df)) # --- Ridge / Lasso via glmnet (alpha=0 ridge, 1 lasso) --- X <- model.matrix(sbp ~ age + bmi + sex, df)[,-1] cv <- glmnet::cv.glmnet(X, df$sbp, alpha=1) # lasso coef(cv, s="lambda.min")
import numpy as np, pandas as pd import statsmodels.api as sm import statsmodels.formula.api as smf from statsmodels.stats.outliers_influence import variance_inflation_factor from sklearn.linear_model import RidgeCV, LassoCV import matplotlib.pyplot as plt # --- Same example dataset --- rng = np.random.default_rng(0) df = pd.DataFrame({ "age": rng.normal(50,12,200), "bmi": rng.normal(25,4,200), "sex": rng.choice(["F","M"],200)}) df["sbp"] = 100 + 0.5*df.age + 1.2*df.bmi + rng.normal(0,8,200) # --- Fit & summary --- m = smf.ols("sbp ~ age + bmi + C(sex)", data=df).fit() print(m.summary()) # β̂, SE, t, p, R², adj R² print(m.conf_int()) # 95% CI # --- Four diagnostic plots manually --- fitted, resid = m.fittedvalues, m.resid fig, ax = plt.subplots(2,2, figsize=(9,7)) ax[0,0].scatter(fitted, resid); ax[0,0].set_title("Residuals vs Fitted") sm.qqplot(resid, line="s", ax=ax[0,1]); ax[0,1].set_title("QQ Plot") ax[1,0].scatter(fitted, np.sqrt(abs(resid))); ax[1,0].set_title("Scale-Location") inf = m.get_influence() ax[1,1].scatter(inf.hat_matrix_diag, resid); ax[1,1].set_title("Residuals vs Leverage") # --- VIF --- X = sm.add_constant(pd.get_dummies(df[["age","bmi","sex"]], drop_first=True)) [variance_inflation_factor(X.values, i) for i in range(X.shape[1])] # --- Robust regression (Huber) --- m_rob = sm.RLM(df.sbp, X, M=sm.robust.norms.HuberT()).fit() # --- Heteroscedasticity-robust SE (HC3) --- m_hc3 = smf.ols("sbp ~ age + bmi + C(sex)", data=df).fit(cov_type="HC3") # --- Ridge / Lasso with CV --- RidgeCV(alphas=np.logspace(-2,2,50)).fit(X, df.sbp) LassoCV(cv=10).fit(X, df.sbp)
summary(model);(2) plot(model) 看四圖;(3) vif() 看共線性;(4) bptest() 看異變異數,必要時改 sandwich SE;(5) 用 cooks.distance() 看影響點,做敏感度分析;(6) 報告時附 95% CI(不是 SE)。
Standard workflow: (1) summary(model); (2) plot(model) for four diagnostics; (3) vif() for collinearity; (4) bptest() for heteroscedasticity, swap to sandwich SEs if needed; (5) cooks.distance() for influence + sensitivity check; (6) report 95% CIs (not SEs).
七、五個論文常見災難
❌ Table 2 fallacy
Westreich & Greenland 2013 AJE:流行病學論文 Table 2 通常把模型裡所有變數的 β都當成「該變數對 y 的因果效應」報告。但只有主要 estimand(target exposure → outcome)的 β 在因果上有意義;其他變數(混淆變數 confounders、中介變數 mediators)的 β 在條件解讀上與想像的不同。
對策:畫 DAG(Pearl 2009; Hernán-Robins 2020)、明確指定 estimand、只報告主要 exposure 的因果係數,其他變數標為「adjusted for」。
Westreich & Greenland (2013 AJE): most epidemiology Table 2s present every β as if it were the causal effect of that variable on y. But only the target estimand (exposure → outcome) carries causal meaning; coefficients for confounders/mediators have a conditional interpretation that almost never matches the implied causal story.
Fix: draw a DAG (Pearl 2009; Hernán-Robins 2020), declare the estimand, report only the target exposure's coefficient causally — list other variables as "adjusted for".
❌ 忽略共線性
在 BMI 同時放 weight 與 height(VIF 爆炸),結果 weight 的 β 不顯著卻方向反——共線性常造成「顯著翻轉」。每次跑迴歸都該 vif()。
Putting weight and height alongside BMI (VIF explodes), the weight β flips sign and is non-significant — collinearity often causes "sign flips". Always run vif().
❌ 遺漏變數偏誤
真實 y 受 z 影響,z 又與 x 相關,但模型沒放 z——x 的 β 會吸收 z 的效應。生物資料中 age、sex 常是被遺忘的共同混淆。判斷依靠 DAG,而不是 R²。
If z drives y and z correlates with x, but z isn't in the model — x's β absorbs z's effect. In biological data, age and sex are the usual forgotten common causes. Decide what to include from a DAG, not from R².
❌ 外推
資料 x 範圍 10–60,預測 x = 100 的 y——數學可以給你一個數字,但沒有任何證據那裡是線性。外推是 Challenger 太空梭 O-ring 災難(1986)的統計教訓:發射溫度 31°F 遠在歷史資料下限 (53°F) 之外。
Your x ranges over 10–60 and you predict at x = 100 — the math returns a number but you have no evidence the line continues. Extrapolation is the statistical lesson of the Challenger O-ring disaster (1986): launch temperature 31°F was well below the historical minimum (53°F).
❌ R² 至上主義
「我把 R² 從 0.6 衝到 0.9 了!」——多半是加了過多變數導致過度擬合 (overfitting)。判斷模型好壞應看:(1) adjusted R²;(2) cross-validated R²;(3) out-of-sample MSE。論文上請報告 adjusted R² 或交叉驗證表現,不是 raw R²。
"I pushed R² from 0.6 to 0.9!" — usually by adding so many predictors that you overfit. Judge models by: (1) adjusted R²; (2) cross-validated R²; (3) out-of-sample MSE. Papers should report adjusted R² or CV performance, not raw R².
❌ 不畫殘差圖
論文中迴歸結果只給「β, SE, p」表,卻沒有 residual vs fitted 與 QQ plot——這是 reviewer 應立即退件的紅旗。Fox(2016 Applied Regression):「沒有診斷的迴歸是迷信。」
Papers reporting regressions as a β / SE / p table without residuals-vs-fitted or QQ plots — a red flag any reviewer should catch. Fox (2016 Applied Regression): "Regression without diagnostics is superstition."
八、兩個案例
案例 A:BP ~ age + BMI + sex
目標 estimand:BMI 對 SBP 的條件效應(控制 age、sex)。建模:lm(sbp ~ age + bmi + sex)。預期 βBMI ≈ 1–1.5 mmHg per kg/m² (Linderman 2018)。診斷:殘差通常輕微右偏;考慮 log(SBP) 或 robust SE。報告:「BMI was associated with SBP (β = 1.2, 95% CI 0.9–1.5, p < 0.001), adjusting for age and sex.」不要把 age 的 β 也當成「年齡的因果效應」報告——Table 2 fallacy!
Target estimand: the conditional effect of BMI on SBP (adjusting for age, sex). Fit: lm(sbp ~ age + bmi + sex). Expect βBMI ≈ 1–1.5 mmHg per kg/m² (Linderman 2018). Diagnostics: residuals are usually mildly right-skewed → consider log(SBP) or robust SEs. Report: "BMI was associated with SBP (β = 1.2, 95% CI 0.9–1.5, p < 0.001), adjusting for age and sex." Don't simultaneously report the age β as its causal effect — Table 2 fallacy!
案例 B:expr ~ time + treatment
RNA-seq 中對單一基因擬合 lm(log2_expr ~ time * treatment)——交互項代表「處理改變了時間趨勢」。陷阱:(1) RNA-seq 計數應該用 NB-GLM (DESeq2 / edgeR) 而非 OLS,OLS 對小 count 變異估計錯;(2) 同一隻動物多時間點 → 需要 mixed model(random intercept per animal);(3) 跨基因檢定要 BH-FDR 校正(Step 12)。
For a single gene in RNA-seq, fit lm(log2_expr ~ time * treatment) — the interaction says "treatment alters the time trend". Caveats: (1) raw counts should use NB-GLMs (DESeq2 / edgeR), not OLS — OLS mis-estimates variance for small counts; (2) repeated measures on the same animal → mixed model with random intercept per animal; (3) testing across many genes requires BH-FDR (Step 12).
九、必讀清單
- Galton 1886「Regression towards mediocrity in hereditary stature」J Anthropol Inst — 迴歸這個詞的起源。
- Gauss 1809《Theoria motus》— OLS 的數學奠基。
- Westreich & Greenland 2013 AJE「The Table 2 fallacy」— 必讀,避免最常見的因果誤解。
- White 1980 Econometrica「A heteroskedasticity-consistent covariance matrix estimator」— sandwich SE 的原典。
- Tibshirani 1996 JRSS-B「Regression shrinkage and selection via the lasso」— Lasso 原典。
- Hoerl & Kennard 1970 Technometrics — Ridge regression 原典。
- Zou & Hastie 2005 JRSS-B — Elastic Net。
- James, Witten, Hastie, Tibshirani 2021 ISLR 2nd ed — 線性模型與規則化的最佳教科書(免費 PDF)。
- Fox 2016 Applied Regression Analysis 3rd ed — 診斷的聖經。
- Harrell 2015 Regression Modeling Strategies 2nd ed — 臨床建模實務、spline、shrinkage、validation。
- Hernán & Robins 2020 Causal Inference: What If — DAG 與條件解讀的因果框架(免費 PDF)。
- Galton 1886 "Regression towards mediocrity in hereditary stature", J Anthropol Inst — origin of the word.
- Gauss 1809 Theoria motus — mathematical foundation of OLS.
- Westreich & Greenland 2013 AJE "The Table 2 fallacy" — essential, prevents the most common causal misreading.
- White 1980 Econometrica — origin of sandwich (HC) standard errors.
- Tibshirani 1996 JRSS-B — the Lasso paper.
- Hoerl & Kennard 1970 Technometrics — Ridge regression.
- Zou & Hastie 2005 JRSS-B — Elastic Net.
- James, Witten, Hastie, Tibshirani 2021 ISLR 2nd ed — best textbook for linear models & regularisation (free PDF).
- Fox 2016 Applied Regression Analysis 3rd ed — the diagnostics bible.
- Harrell 2015 Regression Modeling Strategies 2nd ed — clinical modelling, splines, shrinkage, validation.
- Hernán & Robins 2020 Causal Inference: What If — DAGs and the conditional-interpretation framework (free PDF).
📝 自我檢測
1. Residuals vs Fitted 圖呈現「漏斗形」(殘差變異隨 fitted 值增加),這意味著違反了哪個 OLS 假設?
1. A Residuals vs Fitted plot shows a funnel (variance grows with fitted value). Which OLS assumption is violated?
2. 你跑迴歸後發現 VIF = 12,下列最合理的處置是?
2. After fitting your regression you find VIF = 12. The most reasonable response?
3. Westreich & Greenland 2013「Table 2 fallacy」警告的是什麼?
3. What does Westreich & Greenland's "Table 2 fallacy" (2013) warn against?
4. p > n 的高維基因表達資料(20,000 genes、200 samples)預測表型,下列哪個方法最合適?
4. For high-dimensional gene-expression data (20,000 genes, 200 samples) predicting phenotype, which is best?