STEP 9 / 13

邏輯斯迴歸 (Logistic Regression)

從二元結果到機率:logit 連結、OR、MLE 與 IRLS;EPV ≥ 10 與 Riley 2020 新規則;Wald vs LRT;ROC、AUC、calibration、Brier、決策曲線;完美分離與 Firth 罰估計。

From binary outcome to probability: logit link, OR, MLE via IRLS; EPV ≥ 10 and Riley 2020's modern rule; Wald vs LRT; ROC, AUC, calibration, Brier, decision curve analysis; perfect separation and Firth's penalised likelihood.

從二元結果到一條機率曲線

當 Y 是二元結果(病/不病、活/死、有反應/無反應、30 天再入院/未再入院),直接套線性迴歸會產生不合理的負機率或大於 1 的機率,也違反同方差與常態假設。邏輯斯迴歸(logistic regression)logit 連結把機率轉到實數線:logit(p) = log(p/(1−p)) = β₀ + β₁X₁ + … + βₚXₚ。反過來,sigmoid(S 型)把線性預測值再映射回 [0, 1] 的機率區間。

它不只是「分類器」——更是機率模型。臨床預測模型(clinical prediction model)的主流,從 Framingham 心血管風險、APACHE II 重症評分、到 TRIPOD(Collins et al. 2015 BMJ;2024 升級為 TRIPOD+AI),骨幹都是 logistic / Cox。掌握 logistic 不只是學一個 GLM,是學會建構、驗證、報告一個可被臨床採用的風險模型。

When Y is a binary outcome (disease/healthy, alive/dead, responder/non-responder, 30-day readmission yes/no), linear regression will produce nonsensical probabilities below 0 or above 1 and violate homoscedasticity and normality. Logistic regression uses a logit link to push probabilities onto the real line: logit(p) = log(p/(1−p)) = β₀ + β₁X₁ + … + βₚXₚ. The inverse — the sigmoid S-curve — maps the linear predictor back into [0, 1].

It is more than a classifier — it is a probability model. The workhorse behind clinical prediction (Framingham, APACHE II, and the TRIPOD reporting guideline of Collins et al. 2015 BMJ, now TRIPOD+AI 2024) is logistic or Cox regression. Mastering logistic regression is mastering how to build, validate, and report a risk model that a clinician can actually use.

💡
核心訊息:logistic 不是「分類」工具,是「機率估計」工具。預測出的 應該與真實事件率對齊——這叫 calibration,比 AUC 更難達成、卻常被忽視。一個 AUC=0.85 但 calibration 差的模型,臨床上可能比 AUC=0.75 但校準良好的模型更糟糕。 Core message: logistic is a probability model, not just a classifier. The predicted should track the observed event rate — that is calibration, harder to achieve than discrimination (AUC) and routinely ignored. A model with AUC=0.85 but poor calibration can be clinically worse than one with AUC=0.75 and good calibration.

一、Logit、Sigmoid、OR、MLE

🔗

Logit 連結

logit(p) = log(p/(1−p)) = β₀ + β₁X
把 [0,1] 的機率拉到 (−∞, +∞);對 X 線性。
Odds = p/(1−p);機率 0.5 → odds 1 → logit 0。

logit(p) = log(p/(1−p)) = β₀ + β₁X.
Stretches [0,1] onto (−∞, +∞); linear in X.
Odds = p/(1−p); p = 0.5 → odds 1 → logit 0.

🌊

Sigmoid 反函數

p = 1/(1 + e−η),η = β₀ + β₁X。
50% 機率出現在 X = −β₀/β₁。
β₁ 越大 → 曲線越陡。

p = 1/(1 + e−η), η = β₀ + β₁X.
50% probability at X = −β₀/β₁.
Larger |β₁| → steeper curve.

📈

Odds Ratio

X 增 1 單位 → odds 變 eβ₁ 倍。
注意:是 odds × eβ不是 probability × eβ
eβ₁ = 1 → 無關;> 1 → 風險升高。

+1 unit in X → odds multiplied by eβ₁.
Beware: it's odds × eβ, not probability × eβ.
eβ₁ = 1 → no effect; > 1 → higher odds.

⌜ logit(p) = log( p / (1 − p) ) = β₀ + β₁X   ·   p = 1 / (1 + e−(β₀ + β₁X))   ·   OR = eβ₁Logit 是 Berkson (1944) 提出,sigmoid 形式則更早可追溯到 Verhulst 1838 的人口成長曲線;現代 GLM 框架由 Nelder & Wedderburn 1972 J Roy Statist Soc A 統一。 ⌜ logit(p) = log( p / (1 − p) ) = β₀ + β₁X   ·   p = 1 / (1 + e−(β₀ + β₁X))   ·   OR = eβ₁The logit was named by Berkson (1944); the sigmoid form traces back to Verhulst's 1838 population curve. The unifying GLM framework comes from Nelder & Wedderburn (1972, JRSS-A).

估計:最大概似法(MLE)。沒有閉式解,標準做法是 IRLS(Iteratively Reweighted Least Squares,疊代加權最小平方)——每一輪用上一輪的 計算工作反應變數和權重,做加權迴歸,更新 β,直到收斂。R 的 glm() 與 Python 的 statsmodels 內部都跑 IRLS。

觀察 n 個資料 (Xᵢ, Yᵢ),對數概似為 ℓ(β) = Σ [Yᵢ log pᵢ + (1−Yᵢ) log(1−pᵢ)]。最大化 ℓ 等於最小化「log-loss / 交叉熵」——這正是機器學習裡 sigmoid 神經元的損失函數,logistic 是最簡單的單層神經網路。

Estimation: maximum likelihood (MLE). There is no closed-form solution; the standard procedure is IRLS (Iteratively Reweighted Least Squares) — each iteration uses the current to build a working response and weights, fits a weighted least-squares regression, and updates β until convergence. Both R's glm() and Python's statsmodels run IRLS under the hood.

For n observations (Xᵢ, Yᵢ) the log-likelihood is ℓ(β) = Σ [Yᵢ log pᵢ + (1−Yᵢ) log(1−pᵢ)]. Maximising ℓ equals minimising log-loss / cross-entropy — the very loss used by a sigmoid neuron in ML. Logistic regression is the simplest one-layer neural network.

⚠️
OR ≠ RR(風險比)!當事件常見(例如 incidence > 10%),OR 會明顯誇大 RR。例如 RR = 2 時,當基準率 1%,OR ≈ 2.02;當基準率 30%,OR ≈ 3.5。Zhang & Yu 1998 JAMA 提出常被引用(但有爭議)的校正公式;現代建議:常見結局直接用 log-binomial 或 Poisson with robust SE 估計 RR(Zou 2004 Am J Epidemiol)。報告時請明確寫「odds ratio」,不要混用「risk」、「likelihood」、「times more likely」等模糊詞。 OR ≠ RR (risk ratio)! When events are common (incidence > 10%), the OR materially exaggerates the RR. For RR = 2: with a 1% baseline OR ≈ 2.02, but with a 30% baseline OR ≈ 3.5. Zhang & Yu (1998 JAMA) popularised a (controversial) correction; modern practice for common outcomes is log-binomial or Poisson with robust SE to estimate RR directly (Zou 2004 Am J Epidemiol). In reporting, always write "odds ratio" — never the muddier "risk", "likelihood", or "times more likely".

Sigmoid 遊樂場

調整 β₀(截距,移動曲線左右)與 β₁(斜率,控制陡度)。30 個模擬樣本(藍/橘 = 0/1)會跟著機率曲線重新生成。觀察重點:(1) β₁ > 0 時曲線向右上升,β₁ < 0 反向;(2) 機率 50% 落在 X = −β₀/β₁;(3) OR = eβ₁;(4) β₁ → 0 時曲線變平坦——X 幾乎不影響 Y。

Adjust β₀ (intercept — shifts left/right) and β₁ (slope — controls steepness). The 30 simulated points (blue = 0, orange = 1) re-draw from the current probability curve. Notice: (1) β₁ > 0 rises right, β₁ < 0 falls; (2) p = 0.5 at X = −β₀/β₁; (3) OR = eβ₁; (4) as β₁ → 0 the curve flattens — X carries little information.

綠線 = sigmoid 機率曲線 · 灰虛線 = 50% 閾值 · 點 = 30 個模擬樣本(藍=0, 橘=1)Green = sigmoid · grey dashed = 0.5 threshold · dots = 30 simulated samples (blue = 0, orange = 1)

二、Wald / LRT / Score

對單一係數 βj 檢定「H₀: βj = 0」有三種等價的漸近方法,但在有限樣本下表現差異很大。

To test "H₀: βj = 0" we have three asymptotically equivalent tests, but their finite-sample behaviour can differ a lot.

檢定 統計量 優點 風險
Waldz = β̂ / SE(β̂)最快、軟體預設輸出小樣本或 |β| 大時不可靠(Hauck-Donner 1977);可能反而拒絕變弱 Fastest, default software outputUnreliable for small n or large |β| (Hauck-Donner 1977); power paradoxically drops
Likelihood-Ratio (LRT)2(ℓ₁ − ℓ₀) ~ χ²(df)統計學界公認最可靠;對巢狀模型比較通用需 refit 兩個模型;多了一次計算 Most reliable; works for nested model comparisonNeeds two fits — slight computational cost
Score (Rao)在 H₀ 下的梯度只需擬合限制模型;理論上接近 LRT實務罕用;某些分析(如 McNemar)等同 Gradient evaluated under H₀Needs only the restricted fit; close to LRT in theoryRarely used in practice; equivalent to McNemar in 2×2 designs
實務建議:整體模型適合度 → 用 LRT(R: anova(fit, test="Chisq");Python: statsmodels 的 LR-test)。報告個別係數 → 用 profile-likelihood 95% CI(R: confint(fit),預設即 profile)比 Wald CI 更可靠,尤其當 |β| 大、樣本小、或趨近完美分離。 Practical rule: for overall model fit, use the LRT (R: anova(fit, test="Chisq"); Python: statsmodels LR-test). For individual coefficients, prefer the profile-likelihood 95% CI (R's confint(fit) defaults to profile) over Wald CIs — much safer when |β| is large, n is small, or separation is approached.

Pseudo-R² 警告:三種常見指標 McFadden(1 − ℓ₁/ℓ₀)、Cox-Snell(有上限<1)、Nagelkerke(縮放到 [0,1])都不是線性 R² 的對等物。McFadden 0.2–0.4 已算「優秀」。報告時請明確標出哪一種,並不要把 0.3 跟線性 R² 的 0.3 比較。

Pseudo-R² caveats: the three common variants — McFadden (1 − ℓ₁/ℓ₀), Cox-Snell (capped < 1), and Nagelkerke (rescaled to [0,1]) — are NOT direct analogues of linear R². McFadden values of 0.2–0.4 already count as "excellent". Always state which variant, and do not compare a pseudo-R² of 0.3 to a linear-regression R² of 0.3.

三、GOF / Discrimination / Calibration / DCA

評估邏輯斯模型有四層,許多論文只報第二層就停下——這是 Steyerberg 2019《Clinical Prediction Models》與 TRIPOD 2015 一再批評的疏失。

A logistic model should be evaluated on four layers. Many papers stop at the second one — exactly the gap criticised by Steyerberg's Clinical Prediction Models (2019) and TRIPOD (2015).

適合度 GOF

Deviance:D = −2 ℓ。比較巢狀模型直接用 ΔD ~ χ². Hosmer-Lemeshow GOF:把 切 10 等分,比較預期 vs 觀察事件數。
警告(Hosmer-Lemeshow-Sturdivant 2013 3rd ed):HL 檢定對分組數 g 很敏感(g 從 8 變 12 結論可能反轉),且大 n 時容易顯著、小 n 時又力不足。現代建議:看 calibration plot,而不是只盯 HL p 值。

Deviance: D = −2 ℓ. Compare nested models with ΔD ~ χ². Hosmer-Lemeshow GOF: bin into 10 deciles and compare expected vs observed counts.
Warning (Hosmer-Lemeshow-Sturdivant 2013, 3rd ed): the HL test is sensitive to the bin count g (changing g from 8 to 12 can flip the verdict), tends to be over-significant at large n and under-powered at small n. Modern advice: look at the calibration plot, not the HL p-value alone.

區辨力(ROC / AUC)

ROC 曲線:在所有閾值下繪 TPR (sensitivity) vs FPR (1 − specificity)。AUC = 隨機抽一正一負,模型給正樣本較高分的機率。0.5 = 亂猜;0.7–0.8 普通;0.8–0.9 好;> 0.9 罕見、需檢查 overfit 或資訊洩漏。

陷阱:類別不平衡(如 1:99 罕病)時 ROC/AUC 可能虛胖——Saito & Rehmsmeier 2015 PLOS ONE 建議改用 Precision-Recall curve / AUPRC

ROC curve: TPR (sensitivity) vs FPR (1 − specificity) across thresholds. AUC = the probability the model scores a random positive higher than a random negative. 0.5 = chance; 0.7–0.8 fair; 0.8–0.9 good; > 0.9 rare — check for overfitting or data leakage.

Pitfall: with severe class imbalance (e.g. 1:99 rare disease), ROC/AUC can be misleadingly optimistic — Saito & Rehmsmeier 2015 PLOS ONE recommend the Precision-Recall curve / AUPRC instead.

校準度 Calibration

Calibration plot:x = 預測機率 (分十等分或用 loess 平滑),y = 該段的觀察事件率。完美時落在 y = x 對角線。
三層 (Van Calster 2019 BMC Med): 平均校準(calibration-in-the-large)、弱校準(intercept + slope)、強校準(每個 都對齊)。
Brier score = mean(i − Yi)²;越小越好;含 calibration + discrimination;可分解為 reliability − resolution + uncertainty。

Calibration plot: x = predicted probability (deciles or loess smooth), y = observed event rate. Perfect lies on the y = x diagonal.
Three layers (Van Calster 2019 BMC Med): mean calibration (calibration-in-the-large), weak (intercept + slope), strong (every aligned).
Brier score = mean(i − Yi)²; smaller is better; combines calibration + discrimination; decomposable into reliability − resolution + uncertainty.

臨床效益(DCA)

Decision Curve Analysis(Vickers & Elkin 2006 Med Decis Making):在各閾值 pt 計算 net benefit = (TP/n) − (FP/n) × (pt / (1 − pt))。同圖比較「模型 vs 全治療 vs 全不治療」三條線——模型只有在「贏過兩個極端策略」的閾值範圍內才有臨床價值。
實作:R 套件 dcurves(Vickers 維護);Python 套件 dcurves

Decision Curve Analysis (Vickers & Elkin 2006 Med Decis Making): at each threshold pt compute net benefit = (TP/n) − (FP/n) × (pt / (1 − pt)). Plot the model against "treat all" and "treat none" — the model has clinical value only over the threshold range where it beats both extremes.
Implementation: R package dcurves (maintained by Vickers); Python dcurves.

🚫
預設閾值 0.5 幾乎總是錯的。0.5 只在「false positive 與 false negative 等價且 prevalence ≈ 0.5」時才合理。實務上請:(1) 用 Youden's J = max(TPR − FPR) 找區辨最佳閾值;(2) 用 DCA 找臨床最佳閾值(依 utility 而非統計);(3) 提供完整 ROC + DCA 而非單一閾值的混淆矩陣。 The default 0.5 threshold is almost always wrong. It is only sensible when false positives and false negatives are equally costly and prevalence ≈ 0.5. Instead: (1) use Youden's J = max(TPR − FPR) for the discrimination-optimal cut; (2) use DCA for the clinical optimum (utility-driven, not statistical); (3) report the full ROC + DCA rather than a single confusion matrix at one cut.

四、EPV:從 1996 到 2020

EPV ≥ 10

Peduzzi et al. 1996 J Clin Epidemiol(DOI:10.1016/S0895-4356(96)00236-3)以模擬證明:每個 predictor 至少需要 10 個 events(注意:是「事件數」即少數類別的個數,不是總 n)。EPV < 10 時,β̂ 嚴重偏誤、CI 涵蓋率錯誤、變數選擇不穩定。例:若預期事件率 20%,目標 n = 200,則最多可放 200×0.2 / 10 = 4 個變數

Peduzzi et al. 1996 J Clin Epidemiol (DOI:10.1016/S0895-4356(96)00236-3) showed via simulation that each predictor needs at least 10 events (the count of the minority class, not the total n). Below EPV = 10, β̂ is biased, CI coverage is wrong, and variable selection becomes unstable. Example: expected event rate 20%, target n = 200 → at most 200×0.2 / 10 = 4 predictors.

Riley 2020 BMJ

Riley et al. 2020 BMJ「Calculating the sample size required for developing a clinical prediction model」指出 EPV ≥ 10 是粗糙的經驗法則。新方法基於三個目標:(1) 低 optimism (overfitting);(2) Cox-Snell R²app − R²true 小於 0.05;(3) calibration slope 接近 1。R 套件 pmsampsize 是官方實作;Python 對應為 pmsampsize-py

典型結果:當預測變數多、預期 R² 低、結局少見時,新公式需要的 n 往往遠大於 EPV=10 給的。例如 10 個變數、event rate 10%、預期 Cox-Snell R²=0.05 → 至少需 n ≈ 900(EPV 規則只說 n=100)。

Riley et al. 2020 BMJ ("Calculating the sample size required for developing a clinical prediction model") shows EPV ≥ 10 is a coarse rule of thumb. The newer derivation targets three criteria: (1) low optimism (overfitting); (2) Cox-Snell R²app − R²true < 0.05; (3) calibration slope close to 1. R package pmsampsize is the canonical implementation; Python has pmsampsize-py.

Typical outcome: with many predictors, low expected R², and a rare outcome, the modern formula usually demands far more than EPV=10. E.g. 10 predictors, 10% event rate, expected Cox-Snell R² = 0.05 → n ≈ 900 (EPV rule would say only n = 100).

陷阱:n 大不代表 EPV 大 罕見結局(rare outcome)的研究 n 可能很大(10,000 人)但只有 50 個事件——EPV 隨「事件」而非「總人數」放大。若放 8 個 predictor,EPV = 50/8 ≈ 6,仍嫌不足。請永遠先計算事件數。 A rare-outcome study can have huge n (10,000 people) yet only 50 events — EPV scales with the event count, not total sample size. Fit 8 predictors and EPV = 50/8 ≈ 6, still too few. Always count events first.

ROC 曲線實驗

調整兩組(正/負)的「分離度」(類似 Cohen's d)。1000 個模擬樣本在每個閾值下計算 TPR / FPR,畫出 ROC 曲線並計算 AUC。觀察:(1) 分離度 = 0 時 AUC ≈ 0.5(亂猜);(2) 分離度 = 3 時 AUC > 0.95;(3) Youden 最佳閾值(紅點)通常不是 0.5,會隨類別重疊改變。

Adjust the separation (Cohen's d-like) between the positive and negative groups. 1000 simulated points produce TPR / FPR at every threshold to draw the ROC curve and compute AUC. Watch: (1) separation = 0 → AUC ≈ 0.5 (chance); (2) separation = 3 → AUC > 0.95; (3) Youden's optimal cut (red dot) is rarely 0.5 and shifts with overlap.

綠線 = ROC · 灰虛線 = 亂猜對角 · 紅點 = Youden's J 最佳閾值Green = ROC · grey dashed = chance · red dot = Youden's J optimum

五、Logistic 還是別的家族?

🌳 結果型別決策樹

Q1:
Y 只有兩類(病/不病)?→ 是 → Logistic regression(logit link, family=binomial)。
Q2:
Y 是 0/1 但實驗模式為「相同 X,n 次試驗中成功幾次」?→ 是 → 仍是 logistic,但餵 cbind(success, fail) 或用 binomial GLM with weights
Q3:
結局常見且需 RR 而非 OR?→ 是 → log-binomialmodified Poisson with robust SE(Zou 2004)。
Q4:
Y 有 ≥ 3 類但有順序(mild/moderate/severe)?→ 是 → Ordinal (proportional odds) logistic,R: MASS::polr;先檢查 PO 假設。
Q5:
Y 有 ≥ 3 類但順序(藥物 A/B/C)?→ 是 → Multinomial logistic,R: nnet::multinom
Q6:
需要 latent normal 解讀(譬如心理計量)?→ 是 → 可考慮 probit;係數 × 1.6 ≈ logistic 係數,實質結論幾乎相同。
Q7:
群集 / 重複測量(同病人多次量測)?→ 是 → GEE(族群平均效應,geepack)或 GLMM(個體效應,lme4::glmer)。見 Step 13。
Q1:
Two categories (disease/no)? → Yes → Logistic regression (logit link, family=binomial).
Q2:
Y is 0/1 but design is "same X, n trials, k successes"? → Yes → still logistic, feed cbind(success, fail) or use binomial GLM with weights.
Q3:
Outcome common and you need RR not OR? → Yes → log-binomial or modified Poisson with robust SE (Zou 2004).
Q4:
≥ 3 categories with a natural order? → Yes → Ordinal (proportional odds) logistic via R MASS::polr; check the PO assumption first.
Q5:
≥ 3 categories without order? → Yes → Multinomial logistic (R: nnet::multinom).
Q6:
Need a latent-normal story (psychometrics)? → Yes → consider probit; coefficients × 1.6 ≈ logistic, conclusions barely change.
Q7:
Clustered / repeated measures? → Yes → GEE (population-average; geepack) or GLMM (subject-specific; lme4::glmer). See Step 13.

六、Perfect Separation & Firth

完美分離(complete separation):某個 X 可以 100% 區分 Y = 0 / 1 → 對應 β → ±∞,MLE 不存在。軟體會給「algorithm did not converge」或極大的 SE(如 SE = 10⁹)。準完美分離(quasi-complete):分離只在部分資料發生。常見於罕見事件、小樣本、稀有 category。

解法:Firth's 罰估計(Firth 1993 Biometrika)。對概似乘上 Jeffreys prior 的開根號項,等價於對 score 加上偏誤修正項。可得到有限的 β̂ 與合理的 CI。R 套件 logistf(Heinze & Schemper 2002 Stat Med);Python 有 firthlogist

近年也常見 L2 (Ridge)L1 (Lasso) 罰估計(glmnetsklearn LogisticRegression(penalty='l1'/'l2')),用於高維情境(變數 > n),但會犧牲係數的因果可解釋性,預測為主時才適用。

Complete separation: some X perfectly splits Y = 0 / 1 → the corresponding β → ±∞ and MLE does not exist. Software will throw "algorithm did not converge" or huge SE (e.g. 10⁹). Quasi-complete separation occurs partially. Common with rare events, small samples, or sparse categories.

Fix: Firth's penalised likelihood (Firth 1993 Biometrika). Multiplies the likelihood by the square root of the Jeffreys prior, equivalently adds a score-function bias correction. Yields finite β̂ and well-behaved CIs. R: logistf (Heinze & Schemper 2002 Stat Med); Python: firthlogist.

Modern alternatives include L2 (ridge) or L1 (lasso) penalties (glmnet, sklearn LogisticRegression(penalty='l1'/'l2')) for high-dimensional settings (p > n), but they sacrifice causal interpretability of coefficients — fine when prediction is the only goal.

⚠️
診斷檢核清單:(1) 任何 |β̂| > 10 或 SE > 10 → 懷疑分離;(2) glm 警告 "fitted probabilities 0 or 1" → 強烈警示;(3) cross-tab 看每個 binary X 對 Y 的 2×2 表,若有 0 cell → 高度可疑;(4) 解法優先序:合併稀有 category > Firth > 移除該變數 Diagnostic checklist: (1) any |β̂| > 10 or SE > 10 → suspect separation; (2) glm warns "fitted probabilities 0 or 1" → strong signal; (3) cross-tab each binary X against Y — any zero cell is a red flag; (4) priority order: collapse sparse categories > Firth > drop the variable.

七、實作範例

# R: glm + diagnostics + ROC + calibration + Brier + DCA
library(tidyverse); library(pROC); library(rms); library(dcurves); library(logistf); library(pmsampsize)

# --- Fit ---
fit <- glm(disease ~ age + bmi + smoke,
            data = df, family = binomial(link = "logit"))
summary(fit)
exp(cbind(OR = coef(fit), confint(fit)))   # profile-likelihood CI

# --- LRT vs Wald ---
anova(fit, test = "Chisq")                  # LRT (preferred)
fit0 <- update(fit, . ~ . - smoke)
anova(fit0, fit, test = "LRT")              # nested LRT

# --- Discrimination: ROC + AUC ---
p <- predict(fit, type = "response")
roc1 <- pROC::roc(df$disease, p, ci = TRUE)
roc1$auc; pROC::ci.auc(roc1)
pROC::coords(roc1, "best", ret = c("threshold", "sens", "spec"),
              best.method = "youden")

# --- Calibration plot + Brier (rms package) ---
ddist <- rms::datadist(df); options(datadist = "ddist")
fit_lrm <- rms::lrm(disease ~ age + bmi + smoke, data = df, x = TRUE, y = TRUE)
rms::calibrate(fit_lrm, method = "boot", B = 200) |> plot()
mean((p - df$disease)^2)                       # Brier score

# --- Hosmer-Lemeshow GOF (use with caution) ---
ResourceSelection::hoslem.test(df$disease, p, g = 10)

# --- Decision Curve Analysis (Vickers 2006) ---
dcurves::dca(disease ~ p, data = df, thresholds = seq(0.05, 0.5, 0.01)) |> plot()

# --- Perfect separation → Firth ---
fit_firth <- logistf::logistf(disease ~ rare_var + age, data = df)
summary(fit_firth)

# --- Sample size: Riley 2020 ---
pmsampsize::pmsampsize(type = "b", csrsquared = 0.05, parameters = 10,
                        prevalence = 0.10)
import numpy as np, pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import logit
from sklearn.metrics import roc_auc_score, roc_curve, brier_score_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# --- Fit ---
fit = logit("disease ~ age + bmi + smoke", df).fit(disp=0)
print(fit.summary())
OR  = np.exp(fit.params)
CI  = np.exp(fit.conf_int())                    # Wald CI

# --- LRT (preferred over Wald for inference) ---
fit0 = logit("disease ~ age + bmi", df).fit(disp=0)
LR   = 2*(fit.llf - fit0.llf)
from scipy.stats import chi2
p_LR = 1 - chi2.cdf(LR, df=1)

# --- ROC + AUC + Youden ---
p = fit.predict(df)
auc = roc_auc_score(df.disease, p)
fpr, tpr, thr = roc_curve(df.disease, p)
j = tpr - fpr; t_opt = thr[j.argmax()]

# --- Calibration plot + Brier ---
prob_pred, prob_obs = calibration_curve(df.disease, p, n_bins=10)
brier = brier_score_loss(df.disease, p)
plt.plot(prob_pred, prob_obs, marker="o"); plt.plot([0,1], [0,1], "k--")

# --- Decision Curve Analysis (dcurves package) ---
# pip install dcurves
from dcurves import dca, plot_graphs
res = dca(data=df, outcome="disease", modelnames=["p"],
          thresholds=np.arange(0.05, 0.5, 0.01))
plot_graphs(res)

# --- Firth penalised likelihood (separation) ---
# pip install firthlogist
from firthlogist import FirthLogisticRegression
flr = FirthLogisticRegression().fit(X, y)
flr.coef_, flr.ci_                              # finite β̂ + CIs
💡
實務工作流:(1) pmsampsize 算 n;(2) glm 擬合,profile CI;(3) LRT 報告整體 / 子組顯著性;(4) ROC + AUC + Youden 區辨;(5) calibration plot + Brier 校準;(6) DCA 評估臨床效益;(7) bootstrap 內部驗證(200–500 次)或外部驗證集;(8) 依 TRIPOD 2015 / TRIPOD+AI 2024 撰寫報告。 Workflow: (1) pmsampsize for n; (2) glm + profile CIs; (3) LRT for overall/sub-effects; (4) ROC + AUC + Youden for discrimination; (5) calibration plot + Brier; (6) DCA for clinical utility; (7) bootstrap internal validation (200–500 reps) or an external cohort; (8) write the report against TRIPOD 2015 / TRIPOD+AI 2024.

八、六個經典錯誤

OR 當 RR 解讀

「吸菸 OR=2 → 風險變兩倍」是常見誤讀。當基準率高時 OR 會誇大 RR;當基準率 < 5% 時 OR ≈ RR 才成立。常見結局請用 log-binomial 或 Poisson with robust SE(Zou 2004)。

"OR=2 for smoking means double the risk" is a common misread. With a common outcome OR overstates RR; only when prevalence < 5% is OR ≈ RR. For common outcomes use log-binomial or Poisson with robust SE (Zou 2004).

小 EPV → overfit

事件數少卻塞入十幾個變數是「醫學期刊 reviewer 看一眼就拒」的紅旗。先用 pmsampsize 算 n,事件多的最好;變數多時請改用 Lasso 或 penalised logistic 並做 bootstrap 內部驗證。

Stuffing a dozen predictors into a model with few events is a reviewer's instant red flag. Compute n with pmsampsize; with many variables use Lasso or penalised logistic plus bootstrap internal validation.

完美分離未處理

輸出顯示 OR = 4.7 × 10⁸、SE = 9.9 × 10⁹,仍照常報告——這個結果無意義。請改用 Firth (logistf) 或合併稀有 category。Heinze & Schemper 2002 是經典實作參考。

Reporting OR = 4.7 × 10⁸ with SE = 9.9 × 10⁹ as if it were real — that result is meaningless. Use Firth (logistf) or collapse sparse categories. Heinze & Schemper 2002 Stat Med is the canonical reference.

預設閾值 0.5

在不平衡或 false-positive / false-negative 成本不對稱時,0.5 幾乎肯定不是最佳。請用 Youden's J、Net Benefit、或最大化「clinical utility」的閾值;報告整條 ROC + DCA,而不是單一混淆矩陣。

With class imbalance or asymmetric costs of FP/FN, 0.5 is almost surely suboptimal. Use Youden's J, net benefit, or the threshold maximising "clinical utility"; report the full ROC + DCA, not a single confusion matrix.

只報 AUC

AUC 高的模型可能 systematically 高估或低估機率(calibration 差),臨床上仍會給錯藥。Van Calster 2019 BMC Med:必報 calibration plot + intercept + slope + Brier。

A model with high AUC can still systematically over- or under-estimate probabilities (poor calibration), leading clinicians astray. Van Calster 2019 BMC Med: always report calibration plot + intercept + slope + Brier.

內部 ≠ 外部驗證

train/test split 與 cross-validation 是內部驗證(同分布)。模型搬到不同醫院、不同年代、不同人口要做外部驗證,TRIPOD 2015 / TRIPOD+AI 2024 強調分開報告 development、internal validation、external validation 三階段。

Train/test splits and cross-validation are internal validation (same distribution). Transporting the model to a different hospital, era, or population requires external validation. TRIPOD 2015 / TRIPOD+AI 2024 require separate reporting of development, internal validation, and external validation.

九、必讀文獻

📝 自我檢測

1. 你訓練了一個 30 天死亡率的 logistic 模型,總 n = 1200,但事件僅 60,候選變數 12。下列何者最合適?

1. You build a logistic model for 30-day mortality, total n = 1200, events = 60, candidate predictors = 12. Best step?

A. n = 1200 很大,全部 12 變數直接放,看顯著性挑A. n = 1200 is large — drop all 12 in and pick by p-values
B. 樣本數靠 AUC 判斷B. Decide sample size from AUC
C. EPV = 60/12 = 5 不足;先用 pmsampsize,或縮減變數 / 改 Lasso / FirthC. EPV = 60/12 = 5 is too low — use pmsampsize, or shrink/Lasso/Firth
D. 直接 train/test split 解決D. Just do a train/test split

2. 一篇論文報告 AUC = 0.88 但沒有 calibration plot。下列何者是最強的疑慮?

2. A paper reports AUC = 0.88 but no calibration plot. Strongest concern?

A. AUC 太高一定 overfitA. AUC that high must be overfit
B. AUC > 0.85 就足夠了B. AUC > 0.85 is already enough
C. 沒問題,AUC 已經涵蓋所有性能面向C. No issue — AUC covers all aspects of performance
D. 區辨力(AUC)≠ 校準度;模型可能 systematically 高/低估機率,臨床決策仍可能錯D. Discrimination ≠ calibration; the model may systematically mis-estimate probabilities and mislead clinical use

3. glm() 警告 "fitted probabilities numerically 0 or 1 occurred",且某係數 SE = 1.2 × 10⁹。最佳處理?

3. glm() warns "fitted probabilities numerically 0 or 1 occurred" and one SE = 1.2 × 10⁹. Best fix?

A. 忽略警告,模型還是收斂了A. Ignore the warning — the model converged
B. 強行用 Wald 95% CI 報告B. Force a Wald 95% CI and report
C. 屬於(準)完美分離;用 Firth (logistf) 或合併稀有 categoryC. This is (quasi-)complete separation — switch to Firth (logistf) or collapse sparse categories
D. 換 random forestD. Switch to random forest

4. 30 天再入院率約 25%(常見結局)。研究者報告「OR=2.5 表示再入院風險變 2.5 倍」。下列何者最正確?

4. 30-day readmission rate ≈ 25% (common outcome). The authors say "OR=2.5 means 2.5× the risk." Most correct?

A. 正確,OR 等於 RRA. Correct — OR equals RR
B. 常見結局時 OR 會誇大 RR;應改用 log-binomial 或 Poisson with robust SE 直接估 RRB. With common outcomes, OR overstates RR — use log-binomial or Poisson with robust SE for RR directly
C. OR 和 RR 永遠相同C. OR and RR are always identical
D. 該說「機率 × 2.5」D. Phrase it as "probability × 2.5"