Why the logit link?
Binary outcomes (case/control, mutant/wild type, responder/non-responder) are poorly served by OLS: fitted values can fall below 0 or above 1, and the constant-variance assumption fails because the variance of a binary outcome, p(1−p), depends on its mean. Logistic regression uses the logit link to map probabilities in (0, 1) onto the real line (−∞, ∞):
logit(p) = log(p/(1−p)) = β₀ + β₁x₁ + ... + βₖxₖ
Each slope coefficient β is the change in log-odds per unit increase in its covariate, and exp(β) is the corresponding odds ratio (OR). The maximum likelihood estimate (MLE) is computed by IRLS (iteratively reweighted least squares). Inference uses Wald, score, or likelihood ratio (LR) tests.
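To make the link function concrete, here is a minimal sketch of the logit and its inverse. The coefficient values are hypothetical, chosen only for illustration; the point is that the odds ratio for a one-unit step in x equals exp(β₁) regardless of where on the curve the step is taken.

```python
import math

def logit(p: float) -> float:
    """Map a probability in (0, 1) to the real line (log-odds)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta: float) -> float:
    """Map a linear predictor back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p: float) -> float:
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -2.0, 0.6

for x in (0.0, 1.0, 2.0):
    eta = beta0 + beta1 * x
    print(f"x={x:.0f}  log-odds={eta:+.2f}  p={inv_logit(eta):.3f}")

# Odds ratio for a one-unit increase in x, computed at two different starting points.
or_at_0 = odds(inv_logit(beta0 + beta1 * 1)) / odds(inv_logit(beta0 + beta1 * 0))
or_at_5 = odds(inv_logit(beta0 + beta1 * 6)) / odds(inv_logit(beta0 + beta1 * 5))
print(f"OR from x=0 to 1: {or_at_0:.4f}; from x=5 to 6: {or_at_5:.4f}; exp(beta1): {math.exp(beta1):.4f}")
```

The change in probability per unit of x differs along the curve, while the odds ratio stays constant, which is why logistic results are usually reported as ORs.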
I. Model components
Coefficients and OR
- β₁ = change in log-odds per one-unit increase in x₁, holding the other covariates fixed
- OR = exp(β₁); OR > 1 → higher odds of the outcome; OR < 1 → lower odds; OR = 1 → no association
- 95% CI: exp(β̂ ± 1.96·SE), symmetric on the log scale but not on the OR scale
- Categorical covariates: one OR per level, each relative to the reference level
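The confidence-interval rule above can be sketched in a few lines. The estimate and standard error below are hypothetical numbers, not results from any dataset; the helper name `odds_ratio_ci` is likewise made up for this example.

```python
import math

def odds_ratio_ci(beta_hat: float, se: float, z: float = 1.96):
    """Return (OR, lower, upper). The interval is built on the log-odds scale, then exponentiated."""
    lower = math.exp(beta_hat - z * se)
    upper = math.exp(beta_hat + z * se)
    return math.exp(beta_hat), lower, upper

# Hypothetical estimate and standard error, for illustration only.
beta_hat, se = 0.47, 0.15
or_hat, lo, hi = odds_ratio_ci(beta_hat, se)
print(f"OR = {or_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")

# Symmetric around beta_hat on the log scale, asymmetric around the OR itself.
print(f"log scale: {beta_hat - math.log(lo):.3f} below, {math.log(hi) - beta_hat:.3f} above")
print(f"OR scale:  {or_hat - lo:.3f} below, {hi - or_hat:.3f} above")
```

Exponentiating the endpoints, rather than computing OR ± 1.96·SE(OR), keeps the interval inside (0, ∞).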
Estimation and testing
- MLE via Newton-Raphson / IRLS
- Wald: z = β̂/SE, approximately N(0,1) under H₀ (unreliable in small samples)
- LR: −2(logL₀ − logL₁), approximately χ² with df equal to the number of added parameters (preferred)
- Deviance: D = −2·logL for ungrouped binary data; it plays the role that RSS plays in OLS
- Pseudo-R²: McFadden, Nagelkerke
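The estimation and testing steps listed above can be sketched with NumPy alone. This is a bare-bones IRLS loop on simulated data, assuming a single covariate and made-up true coefficients (−0.5, 0.8); it is meant to show the mechanics, not to replace `glm` or statsmodels. The LR p-value uses the identity P(χ²₁ > s) = erfc(√(s/2)), which holds only for 1 degree of freedom.

```python
import math
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-10):
    """Fit logistic regression by IRLS. X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)                 # IRLS weights
        z = eta + (y - p) / w             # working response
        XtW = X.T * w                     # X'W without forming the diagonal matrix
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    cov = np.linalg.inv((X.T * (p * (1.0 - p))) @ X)   # inverse Fisher information
    return beta, np.sqrt(np.diag(cov))

def log_lik(X, y, beta):
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

# Simulated data; the true coefficients are assumptions made for this sketch.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * x))))

X_full = np.column_stack([np.ones(n), x])
X_null = np.ones((n, 1))
beta_full, se_full = fit_logistic_irls(X_full, y)
beta_null, _ = fit_logistic_irls(X_null, y)

wald_z = beta_full[1] / se_full[1]
lr_stat = -2.0 * (log_lik(X_null, y, beta_null) - log_lik(X_full, y, beta_full))
lr_p = math.erfc(math.sqrt(lr_stat / 2.0))        # chi-square tail, 1 df only
deviance = -2.0 * log_lik(X_full, y, beta_full)   # ungrouped binary data

print(f"beta = {beta_full.round(3)}, SE = {se_full.round(3)}")
print(f"Wald z = {wald_z:.2f}, LR chi2 = {lr_stat:.2f} (p = {lr_p:.2g}), deviance = {deviance:.1f}")
```

With a large sample and a clear effect, the squared Wald statistic and the LR statistic are usually close; they tend to diverge in small samples or near separation, which is where the LR test is preferred.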
Separation
- When a covariate (or a combination of covariates) perfectly predicts the outcome, the likelihood has no finite maximum: β̂ → ±∞ and SE → ∞
- Common with rare outcomes, subgroup analyses, and collinear predictors in ML settings
- Fix: Firth penalization (logistf, brglm2); the Jeffreys-prior penalty keeps the estimates finite
- Alternative: Bayesian logistic regression with weakly informative priors
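A toy illustration of why separation breaks the MLE and why the Jeffreys penalty helps. The six data points are invented, and the model is reduced to a single slope with no intercept so that the Fisher information is a scalar; real Firth fits (for example with `logistf`) handle the general multi-parameter case.

```python
import numpy as np

# Invented, perfectly separated data: y = 1 exactly when x > 0.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_lik(slope: float) -> float:
    eta = slope * x
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def firth_log_lik(slope: float) -> float:
    """Log-likelihood plus the Jeffreys penalty 0.5 * log(Fisher information)."""
    p = 1.0 / (1.0 + np.exp(-slope * x))
    info = np.sum(x**2 * p * (1.0 - p))   # scalar information for a one-parameter model
    return log_lik(slope) + 0.5 * float(np.log(info))

grid = np.linspace(0.1, 20.0, 400)
plain = np.array([log_lik(b) for b in grid])
penalized = np.array([firth_log_lik(b) for b in grid])

i_plain = int(np.argmax(plain))
i_pen = int(np.argmax(penalized))
print(f"plain log-likelihood keeps rising: best slope on grid = {grid[i_plain]:.2f} (the grid edge)")
print(f"penalized log-likelihood peaks at slope = {grid[i_pen]:.2f}")
```

The unpenalized log-likelihood increases toward 0 as the slope grows, so any optimizer simply runs to its iteration limit. The penalty term goes to −∞ as the fitted probabilities approach 0 and 1, which pulls the maximum back to a finite value.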
Discrimination vs calibration
- Discrimination: can the model rank cases above non-cases? Measured by AUC / ROC
- Calibration: do predicted probabilities match observed frequencies? Assessed with a calibration plot, the Hosmer-Lemeshow test, or the Brier score
- High AUC + poor calibration: the model ranks well, but its outputs cannot be reported as probabilities (risk communication fails)
- Class imbalance: prefer PR-AUC over ROC-AUC
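The gap between discrimination and calibration can be shown with a small simulation. Everything here is assumed for illustration: the data-generating coefficients, the sample size, and the distortion `p ** 0.25`, which is a strictly increasing transform and therefore leaves the ranking (and the AUC) unchanged while inflating every predicted probability.

```python
import numpy as np

def auc_rank(y, score):
    """AUC as the probability that a random case outranks a random non-case (ties count 1/2)."""
    pos, neg = score[y == 1], score[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def brier(y, p):
    return float(np.mean((p - y) ** 2))

# Simulated data; the coefficients are assumptions made for this sketch.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * x)))
y = rng.binomial(1, p_true)

p_calibrated = p_true            # calibrated by construction
p_distorted = p_true ** 0.25     # same ordering, inflated probabilities

for name, p in [("calibrated", p_calibrated), ("distorted", p_distorted)]:
    print(f"{name:>10}: AUC = {auc_rank(y, p):.3f}, Brier = {brier(y, p):.3f}, "
          f"mean predicted = {p.mean():.3f}, observed rate = {y.mean():.3f}")
```

Both sets of predictions rank patients identically, yet only the first could be reported to a patient as a risk. AUC alone cannot detect the difference; the Brier score and the mean predicted versus observed rate can.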
Reshaping the S-curve
The dots are "true data" generated from fixed β₀ = −2, β₁ = 0.6. Move the fit sliders to change the model's β₀ (intercept) and β₁ (slope) and watch the S-curve stretch and shift; the OR, sample AUC, and mean deviance update live. The fit is best near (−2, 0.6).
Gray dots: observations; blue line: fit; orange dashed line: true curve
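For readers without the interactive plot, the slider experiment can be approximated offline. This sketch simulates data from β₀ = −2, β₁ = 0.6 and evaluates the mean deviance over a grid of candidate (β₀, β₁) pairs. The covariate range (uniform on 0 to 10), the sample size, and the grid limits are assumptions; the page does not state what the widget uses.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(0.0, 10.0, size=n)                 # assumed covariate range
p_true = 1.0 / (1.0 + np.exp(-(-2.0 + 0.6 * x)))   # true curve from the text
y = rng.binomial(1, p_true)

def mean_deviance(b0: float, b1: float) -> float:
    """Average of -2 * log-likelihood per observation for the candidate curve."""
    eta = b0 + b1 * x
    return float(-2.0 * np.mean(y * eta - np.logaddexp(0.0, eta)))

# Each grid point corresponds to one position of the two sliders.
b0_grid = np.linspace(-4.0, 0.0, 41)
b1_grid = np.linspace(0.0, 1.2, 41)
dev = np.array([[mean_deviance(b0, b1) for b1 in b1_grid] for b0 in b0_grid])

i, j = np.unravel_index(np.argmin(dev), dev.shape)
print(f"mean deviance at (-2, 0.6): {mean_deviance(-2.0, 0.6):.4f}")
print(f"grid minimum at ({b0_grid[i]:.1f}, {b1_grid[j]:.2f}): {dev[i, j]:.4f}")
print(f"OR at the grid minimum: {np.exp(b1_grid[j]):.2f}")
```

Because the data are a finite sample, the grid minimum lands close to the generating values but not necessarily exactly on them, which is why the text says the best fit sits near (−2, 0.6).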
II. Common applications
| Scenario | Model / output | Notes |
|---|---|---|
| Case-control GWAS | case ~ SNP + PC1-5 | Rare variants → Firth |
| Response biomarker | logistic + AUC / calibration | Cross-validate to avoid overfitting |
| Cell-type classification | multinomial logit / softmax | More than two classes: one-vs-rest or softmax |
| Rare-variant burden test | SKAT / Firth logistic | Wald is unreliable at small n |
| Clinical prediction model | calibration curve + DCA (decision curve analysis) | Calibration matters far more than AUC |
Implementation: fitting, OR, ROC, Firth
# --- R --- Full logistic workflow
library(broom); library(car); library(pROC); library(logistf)

# 1) Standard logistic regression (case-control)
fit <- glm(case ~ exposure + age + sex,
           family = binomial(link = "logit"), data = df)

# 2) Tidy the results and exponentiate to ORs
broom::tidy(fit, exponentiate = TRUE, conf.int = TRUE)
car::Anova(fit, type = "II")   # LR tests

# 3) ROC / AUC (in-sample)
pred <- predict(fit, type = "response")
roc1 <- pROC::roc(df$case, pred)
roc1$auc; plot(roc1)

# 4) Calibration plot
library(rms)
val.prob(pred, df$case)   # calibration plot + statistics; case must be coded 0/1

# 5) Firth penalization: handles separation and small-sample bias
fit_f <- logistf::logistf(case ~ variant + sex, data = df)
summary(fit_f)
# --- Python ---
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# 1) statsmodels: inference-oriented
fit = smf.logit("case ~ exposure + age + sex", data=df).fit()
print(fit.summary())

# 2) ORs with 95% CIs
or_tab = np.exp(fit.params).to_frame("OR").join(np.exp(fit.conf_int()))
print(or_tab)

# 3) GLM form (equivalent); assumes the covariate columns are numeric or 0/1 coded
y = df["case"]
X = df[["exposure", "age", "sex"]]
glm_fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()  # logit is the default link

# 4) sklearn: prediction-oriented (the L2 penalty shrinks coefficients, so do not read them as MLE ORs)
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
proba = clf.predict_proba(X)[:, 1]
print(roc_auc_score(y, proba))  # in-sample AUC
prob_true, prob_pred = calibration_curve(y, proba, n_bins=10)
📝 Self-check
1. When is OR a poor approximation to RR?
2. What is "separation" in logistic regression, and how is it fixed?
3. Why can a model have high AUC yet poor calibration?