Step 9: Logistic Regression — Biostatistics Tutorial

總覽

從二元結果到一條機率曲線

當 Y 是二元結果（病/不病、活/死、有反應/無反應、30 天再入院/未再入院），直接套線性迴歸會產生不合理的負機率或大於 1 的機率，也違反同方差與常態假設。邏輯斯迴歸（logistic regression）用 logit 連結把機率轉到實數線：logit(p) = log(p/(1−p)) = β₀ + β₁X₁ + … + βₚXₚ。反過來，sigmoid（S 型）把線性預測值再映射回 [0, 1] 的機率區間。

它不只是「分類器」——更是機率模型。臨床預測模型（clinical prediction model）的主流，從 Framingham 心血管風險、APACHE II 重症評分、到 TRIPOD（Collins et al. 2015 BMJ；2024 升級為 TRIPOD+AI），骨幹都是 logistic / Cox。掌握 logistic 不只是學一個 GLM，是學會建構、驗證、報告一個可被臨床採用的風險模型。

When Y is a binary outcome (disease/healthy, alive/dead, responder/non-responder, 30-day readmission yes/no), linear regression will produce nonsensical probabilities below 0 or above 1 and violate homoscedasticity and normality. Logistic regression uses a logit link to push probabilities onto the real line: logit(p) = log(p/(1−p)) = β₀ + β₁X₁ + … + βₚXₚ. The inverse — the sigmoid S-curve — maps the linear predictor back into [0, 1].

It is more than a classifier — it is a probability model. The workhorse behind clinical prediction (Framingham, APACHE II, and the TRIPOD reporting guideline of Collins et al. 2015 BMJ, now TRIPOD+AI 2024) is logistic or Cox regression. Mastering logistic regression is mastering how to build, validate, and report a risk model that a clinician can actually use.

💡

核心訊息：logistic 不是「分類」工具，是「機率估計」工具。預測出的 p̂ 應該與真實事件率對齊——這叫 calibration，比 AUC 更難達成、卻常被忽視。一個 AUC=0.85 但 calibration 差的模型，臨床上可能比 AUC=0.75 但校準良好的模型更糟糕。 Core message: logistic is a probability model, not just a classifier. The predicted p̂ should track the observed event rate — that is calibration, harder to achieve than discrimination (AUC) and routinely ignored. A model with AUC=0.85 but poor calibration can be clinically worse than one with AUC=0.75 and good calibration.

核心概念

一、Logit、Sigmoid、OR、MLE

🔗

Logit 連結

logit(p) = log(p/(1−p)) = β₀ + β₁X
把 [0,1] 的機率拉到 (−∞, +∞)；對 X 線性。
Odds = p/(1−p)；機率 0.5 → odds 1 → logit 0。

logit(p) = log(p/(1−p)) = β₀ + β₁X.
Stretches [0,1] onto (−∞, +∞); linear in X.
Odds = p/(1−p); p = 0.5 → odds 1 → logit 0.

🌊

Sigmoid 反函數

p = 1/(1 + e^−η)，η = β₀ + β₁X。
50% 機率出現在 X = −β₀/β₁。
β₁ 越大 → 曲線越陡。

p = 1/(1 + e^−η), η = β₀ + β₁X.
50% probability at X = −β₀/β₁.
Larger |β₁| → steeper curve.

📈

Odds Ratio

X 增 1 單位 → odds 變 e^β₁ 倍。
注意：是 odds × e^β，不是 probability × e^β。
e^β₁ = 1 → 無關；> 1 → 風險升高。

+1 unit in X → odds multiplied by e^β₁.
Beware: it's odds × e^β, not probability × e^β.
e^β₁ = 1 → no effect; > 1 → higher odds.

⌜ logit(p) = log( p / (1 − p) ) = β₀ + β₁X · p = 1 / (1 + e^{−(β₀ + β₁X)}) · OR = e^β₁ ⌝ Logit 是 Berkson (1944) 提出，sigmoid 形式則更早可追溯到 Verhulst 1838 的人口成長曲線；現代 GLM 框架由 Nelder & Wedderburn 1972 J Roy Statist Soc A 統一。 ⌜ logit(p) = log( p / (1 − p) ) = β₀ + β₁X · p = 1 / (1 + e^{−(β₀ + β₁X)}) · OR = e^β₁ ⌝ The logit was named by Berkson (1944); the sigmoid form traces back to Verhulst's 1838 population curve. The unifying GLM framework comes from Nelder & Wedderburn (1972, JRSS-A).

估計：最大概似法（MLE）。沒有閉式解，標準做法是 IRLS（Iteratively Reweighted Least Squares，疊代加權最小平方）——每一輪用上一輪的 p̂ 計算工作反應變數和權重，做加權迴歸，更新 β，直到收斂。R 的 glm() 與 Python 的 statsmodels 內部都跑 IRLS。

觀察 n 個資料 (Xᵢ, Yᵢ)，對數概似為 ℓ(β) = Σ [Yᵢ log pᵢ + (1−Yᵢ) log(1−pᵢ)]。最大化 ℓ 等於最小化「log-loss / 交叉熵」——這正是機器學習裡 sigmoid 神經元的損失函數，logistic 是最簡單的單層神經網路。

Estimation: maximum likelihood (MLE). There is no closed-form solution; the standard procedure is IRLS (Iteratively Reweighted Least Squares) — each iteration uses the current p̂ to build a working response and weights, fits a weighted least-squares regression, and updates β until convergence. Both R's glm() and Python's statsmodels run IRLS under the hood.

For n observations (Xᵢ, Yᵢ) the log-likelihood is ℓ(β) = Σ [Yᵢ log pᵢ + (1−Yᵢ) log(1−pᵢ)]. Maximising ℓ equals minimising log-loss / cross-entropy — the very loss used by a sigmoid neuron in ML. Logistic regression is the simplest one-layer neural network.

⚠️

OR ≠ RR（風險比）！當事件常見（例如 incidence > 10%），OR 會明顯誇大 RR。例如 RR = 2 時，當基準率 1%，OR ≈ 2.02；當基準率 30%，OR ≈ 3.5。Zhang & Yu 1998 JAMA 提出常被引用（但有爭議）的校正公式；現代建議：常見結局直接用 log-binomial 或 Poisson with robust SE 估計 RR（Zou 2004 Am J Epidemiol）。報告時請明確寫「odds ratio」，不要混用「risk」、「likelihood」、「times more likely」等模糊詞。 OR ≠ RR (risk ratio)! When events are common (incidence > 10%), the OR materially exaggerates the RR. For RR = 2: with a 1% baseline OR ≈ 2.02, but with a 30% baseline OR ≈ 3.5. Zhang & Yu (1998 JAMA) popularised a (controversial) correction; modern practice for common outcomes is log-binomial or Poisson with robust SE to estimate RR directly (Zou 2004 Am J Epidemiol). In reporting, always write "odds ratio" — never the muddier "risk", "likelihood", or "times more likely".

互動模擬 ①

Sigmoid 遊樂場

調整 β₀（截距，移動曲線左右）與 β₁（斜率，控制陡度）。30 個模擬樣本（藍/橘 = 0/1）會跟著機率曲線重新生成。觀察重點：(1) β₁ > 0 時曲線向右上升，β₁ < 0 反向；(2) 機率 50% 落在 X = −β₀/β₁；(3) OR = e^β₁；(4) β₁ → 0 時曲線變平坦——X 幾乎不影響 Y。

Adjust β₀ (intercept — shifts left/right) and β₁ (slope — controls steepness). The 30 simulated points (blue = 0, orange = 1) re-draw from the current probability curve. Notice: (1) β₁ > 0 rises right, β₁ < 0 falls; (2) p = 0.5 at X = −β₀/β₁; (3) OR = e^β₁; (4) as β₁ → 0 the curve flattens — X carries little information.

β₀ -2

β₁ 1

綠線 = sigmoid 機率曲線 · 灰虛線 = 50% 閾值 · 點 = 30 個模擬樣本（藍=0, 橘=1）Green = sigmoid · grey dashed = 0.5 threshold · dots = 30 simulated samples (blue = 0, orange = 1)

統計推論

二、Wald / LRT / Score

對單一係數 β_j 檢定「H₀: β_j = 0」有三種等價的漸近方法，但在有限樣本下表現差異很大。

To test "H₀: β_j = 0" we have three asymptotically equivalent tests, but their finite-sample behaviour can differ a lot.

檢定	統計量	優點	風險
Wald	z = β̂ / SE(β̂)	最快、軟體預設輸出	小樣本或 \|β\| 大時不可靠（Hauck-Donner 1977）；可能反而拒絕變弱	Fastest, default software output	Unreliable for small n or large \|β\| (Hauck-Donner 1977); power paradoxically drops
Likelihood-Ratio (LRT)	2(ℓ₁ − ℓ₀) ~ χ²(df)	統計學界公認最可靠；對巢狀模型比較通用	需 refit 兩個模型；多了一次計算	Most reliable; works for nested model comparison	Needs two fits — slight computational cost
Score (Rao)	在 H₀ 下的梯度	只需擬合限制模型；理論上接近 LRT	實務罕用；某些分析（如 McNemar）等同	Gradient evaluated under H₀	Needs only the restricted fit; close to LRT in theory	Rarely used in practice; equivalent to McNemar in 2×2 designs

實務建議：整體模型適合度 → 用 LRT（R: anova(fit, test="Chisq")；Python: statsmodels 的 LR-test）。報告個別係數 → 用 profile-likelihood 95% CI（R: confint(fit)，預設即 profile）比 Wald CI 更可靠，尤其當 |β| 大、樣本小、或趨近完美分離。 Practical rule: for overall model fit, use the LRT (R: anova(fit, test="Chisq"); Python: statsmodels LR-test). For individual coefficients, prefer the profile-likelihood 95% CI (R's confint(fit) defaults to profile) over Wald CIs — much safer when |β| is large, n is small, or separation is approached.

Pseudo-R² 警告：三種常見指標 McFadden（1 − ℓ₁/ℓ₀）、Cox-Snell（有上限<1）、Nagelkerke（縮放到 [0,1]）都不是線性 R² 的對等物。McFadden 0.2–0.4 已算「優秀」。報告時請明確標出哪一種，並不要把 0.3 跟線性 R² 的 0.3 比較。

Pseudo-R² caveats: the three common variants — McFadden (1 − ℓ₁/ℓ₀), Cox-Snell (capped < 1), and Nagelkerke (rescaled to [0,1]) — are NOT direct analogues of linear R². McFadden values of 0.2–0.4 already count as "excellent". Always state which variant, and do not compare a pseudo-R² of 0.3 to a linear-regression R² of 0.3.

模型評估

三、GOF / Discrimination / Calibration / DCA

評估邏輯斯模型有四層，許多論文只報第二層就停下——這是 Steyerberg 2019《Clinical Prediction Models》與 TRIPOD 2015 一再批評的疏失。

A logistic model should be evaluated on four layers. Many papers stop at the second one — exactly the gap criticised by Steyerberg's Clinical Prediction Models (2019) and TRIPOD (2015).

① 適合度 GOF

Deviance：D = −2 ℓ。比較巢狀模型直接用 ΔD ~ χ². Hosmer-Lemeshow GOF：把 p̂ 切 10 等分，比較預期 vs 觀察事件數。
警告（Hosmer-Lemeshow-Sturdivant 2013 3rd ed）：HL 檢定對分組數 g 很敏感（g 從 8 變 12 結論可能反轉），且大 n 時容易顯著、小 n 時又力不足。現代建議：看 calibration plot，而不是只盯 HL p 值。

Deviance: D = −2 ℓ. Compare nested models with ΔD ~ χ². Hosmer-Lemeshow GOF: bin p̂ into 10 deciles and compare expected vs observed counts.
Warning (Hosmer-Lemeshow-Sturdivant 2013, 3rd ed): the HL test is sensitive to the bin count g (changing g from 8 to 12 can flip the verdict), tends to be over-significant at large n and under-powered at small n. Modern advice: look at the calibration plot, not the HL p-value alone.

② 區辨力（ROC / AUC）

ROC 曲線：在所有閾值下繪 TPR (sensitivity) vs FPR (1 − specificity)。AUC = 隨機抽一正一負，模型給正樣本較高分的機率。0.5 = 亂猜；0.7–0.8 普通；0.8–0.9 好；> 0.9 罕見、需檢查 overfit 或資訊洩漏。

陷阱：類別不平衡（如 1:99 罕病）時 ROC/AUC 可能虛胖——Saito & Rehmsmeier 2015 PLOS ONE 建議改用 Precision-Recall curve / AUPRC。

ROC curve: TPR (sensitivity) vs FPR (1 − specificity) across thresholds. AUC = the probability the model scores a random positive higher than a random negative. 0.5 = chance; 0.7–0.8 fair; 0.8–0.9 good; > 0.9 rare — check for overfitting or data leakage.

Pitfall: with severe class imbalance (e.g. 1:99 rare disease), ROC/AUC can be misleadingly optimistic — Saito & Rehmsmeier 2015 PLOS ONE recommend the Precision-Recall curve / AUPRC instead.

③ 校準度 Calibration

Calibration plot：x = 預測機率 p̂（分十等分或用 loess 平滑），y = 該段的觀察事件率。完美時落在 y = x 對角線。
三層 (Van Calster 2019 BMC Med): 平均校準（calibration-in-the-large）、弱校準（intercept + slope）、強校準（每個 p̂ 都對齊）。
Brier score = mean(p̂_i − Y_i)²；越小越好；含 calibration + discrimination；可分解為 reliability − resolution + uncertainty。

Calibration plot: x = predicted probability p̂ (deciles or loess smooth), y = observed event rate. Perfect lies on the y = x diagonal.
Three layers (Van Calster 2019 BMC Med): mean calibration (calibration-in-the-large), weak (intercept + slope), strong (every p̂ aligned).
Brier score = mean(p̂_i − Y_i)²; smaller is better; combines calibration + discrimination; decomposable into reliability − resolution + uncertainty.

④ 臨床效益（DCA）

Decision Curve Analysis（Vickers & Elkin 2006 Med Decis Making）：在各閾值 p_t 計算 net benefit = (TP/n) − (FP/n) × (p_t / (1 − p_t))。同圖比較「模型 vs 全治療 vs 全不治療」三條線——模型只有在「贏過兩個極端策略」的閾值範圍內才有臨床價值。
實作：R 套件 dcurves（Vickers 維護）；Python 套件 dcurves。

Decision Curve Analysis (Vickers & Elkin 2006 Med Decis Making): at each threshold p_t compute net benefit = (TP/n) − (FP/n) × (p_t / (1 − p_t)). Plot the model against "treat all" and "treat none" — the model has clinical value only over the threshold range where it beats both extremes.
Implementation: R package dcurves (maintained by Vickers); Python dcurves.

🚫

預設閾值 0.5 幾乎總是錯的。0.5 只在「false positive 與 false negative 等價且 prevalence ≈ 0.5」時才合理。實務上請：(1) 用 Youden's J = max(TPR − FPR) 找區辨最佳閾值；(2) 用 DCA 找臨床最佳閾值（依 utility 而非統計）；(3) 提供完整 ROC + DCA 而非單一閾值的混淆矩陣。 The default 0.5 threshold is almost always wrong. It is only sensible when false positives and false negatives are equally costly and prevalence ≈ 0.5. Instead: (1) use Youden's J = max(TPR − FPR) for the discrimination-optimal cut; (2) use DCA for the clinical optimum (utility-driven, not statistical); (3) report the full ROC + DCA rather than a single confusion matrix at one cut.

樣本數規則

四、EPV：從 1996 到 2020

EPV ≥ 10

Peduzzi et al. 1996 J Clin Epidemiol（DOI:10.1016/S0895-4356(96)00236-3）以模擬證明：每個 predictor 至少需要 10 個 events（注意：是「事件數」即少數類別的個數，不是總 n）。EPV < 10 時，β̂ 嚴重偏誤、CI 涵蓋率錯誤、變數選擇不穩定。例：若預期事件率 20%，目標 n = 200，則最多可放 200×0.2 / 10 = 4 個變數。

Peduzzi et al. 1996 J Clin Epidemiol (DOI:10.1016/S0895-4356(96)00236-3) showed via simulation that each predictor needs at least 10 events (the count of the minority class, not the total n). Below EPV = 10, β̂ is biased, CI coverage is wrong, and variable selection becomes unstable. Example: expected event rate 20%, target n = 200 → at most 200×0.2 / 10 = 4 predictors.

Riley 2020 BMJ

Riley et al. 2020 BMJ「Calculating the sample size required for developing a clinical prediction model」指出 EPV ≥ 10 是粗糙的經驗法則。新方法基於三個目標：(1) 低 optimism (overfitting)；(2) Cox-Snell R²_app − R²_true 小於 0.05；(3) calibration slope 接近 1。R 套件 pmsampsize 是官方實作；Python 對應為 pmsampsize-py。

典型結果：當預測變數多、預期 R² 低、結局少見時，新公式需要的 n 往往遠大於 EPV=10 給的。例如 10 個變數、event rate 10%、預期 Cox-Snell R²=0.05 → 至少需 n ≈ 900（EPV 規則只說 n=100）。

Riley et al. 2020 BMJ ("Calculating the sample size required for developing a clinical prediction model") shows EPV ≥ 10 is a coarse rule of thumb. The newer derivation targets three criteria: (1) low optimism (overfitting); (2) Cox-Snell R²_app − R²_true < 0.05; (3) calibration slope close to 1. R package pmsampsize is the canonical implementation; Python has pmsampsize-py.

Typical outcome: with many predictors, low expected R², and a rare outcome, the modern formula usually demands far more than EPV=10. E.g. 10 predictors, 10% event rate, expected Cox-Snell R² = 0.05 → n ≈ 900 (EPV rule would say only n = 100).

陷阱：n 大不代表 EPV 大罕見結局（rare outcome）的研究 n 可能很大（10,000 人）但只有 50 個事件——EPV 隨「事件」而非「總人數」放大。若放 8 個 predictor，EPV = 50/8 ≈ 6，仍嫌不足。請永遠先計算事件數。 A rare-outcome study can have huge n (10,000 people) yet only 50 events — EPV scales with the event count, not total sample size. Fit 8 predictors and EPV = 50/8 ≈ 6, still too few. Always count events first.

互動模擬 ②

ROC 曲線實驗

調整兩組（正/負）的「分離度」（類似 Cohen's d）。1000 個模擬樣本在每個閾值下計算 TPR / FPR，畫出 ROC 曲線並計算 AUC。觀察：(1) 分離度 = 0 時 AUC ≈ 0.5（亂猜）；(2) 分離度 = 3 時 AUC > 0.95；(3) Youden 最佳閾值（紅點）通常不是 0.5，會隨類別重疊改變。

Adjust the separation (Cohen's d-like) between the positive and negative groups. 1000 simulated points produce TPR / FPR at every threshold to draw the ROC curve and compute AUC. Watch: (1) separation = 0 → AUC ≈ 0.5 (chance); (2) separation = 3 → AUC > 0.95; (3) Youden's optimal cut (red dot) is rarely 0.5 and shifts with overlap.

分離度 1.5

正類比例 50%

綠線 = ROC · 灰虛線 = 亂猜對角 · 紅點 = Youden's J 最佳閾值Green = ROC · grey dashed = chance · red dot = Youden's J optimum

決策引導

五、Logistic 還是別的家族？

🌳 結果型別決策樹

Q1:

Y 只有兩類（病/不病）？→ 是 → Logistic regression（logit link, family=binomial）。

Q2:

Y 是 0/1 但實驗模式為「相同 X，n 次試驗中成功幾次」？→ 是 → 仍是 logistic，但餵 cbind(success, fail) 或用 binomial GLM with weights。

Q3:

結局常見且需 RR 而非 OR？→ 是 → log-binomial 或 modified Poisson with robust SE（Zou 2004）。

Q4:

Y 有 ≥ 3 類但有順序（mild/moderate/severe）？→ 是 → Ordinal (proportional odds) logistic，R: MASS::polr；先檢查 PO 假設。

Q5:

Y 有 ≥ 3 類但無順序（藥物 A/B/C）？→ 是 → Multinomial logistic，R: nnet::multinom。

Q6:

需要 latent normal 解讀（譬如心理計量）？→ 是 → 可考慮 probit；係數 × 1.6 ≈ logistic 係數，實質結論幾乎相同。

Q7:

有群集 / 重複測量（同病人多次量測）？→ 是 → GEE（族群平均效應，geepack）或 GLMM（個體效應，lme4::glmer）。見 Step 13。

Q1:

Two categories (disease/no)? → Yes → Logistic regression (logit link, family=binomial).

Q2:

Y is 0/1 but design is "same X, n trials, k successes"? → Yes → still logistic, feed cbind(success, fail) or use binomial GLM with weights.

Q3:

Outcome common and you need RR not OR? → Yes → log-binomial or modified Poisson with robust SE (Zou 2004).

Q4:

≥ 3 categories with a natural order? → Yes → Ordinal (proportional odds) logistic via R MASS::polr; check the PO assumption first.

Q5:

≥ 3 categories without order? → Yes → Multinomial logistic (R: nnet::multinom).

Q6:

Need a latent-normal story (psychometrics)? → Yes → consider probit; coefficients × 1.6 ≈ logistic, conclusions barely change.

Q7:

Clustered / repeated measures? → Yes → GEE (population-average; geepack) or GLMM (subject-specific; lme4::glmer). See Step 13.

特殊情境

六、Perfect Separation & Firth

完美分離（complete separation）：某個 X 可以 100% 區分 Y = 0 / 1 → 對應 β → ±∞，MLE 不存在。軟體會給「algorithm did not converge」或極大的 SE（如 SE = 10⁹）。準完美分離（quasi-complete）：分離只在部分資料發生。常見於罕見事件、小樣本、稀有 category。

解法：Firth's 罰估計（Firth 1993 Biometrika）。對概似乘上 Jeffreys prior 的開根號項，等價於對 score 加上偏誤修正項。可得到有限的 β̂ 與合理的 CI。R 套件 logistf（Heinze & Schemper 2002 Stat Med）；Python 有 firthlogist。

近年也常見 L2 (Ridge) 或 L1 (Lasso) 罰估計（glmnet、sklearn LogisticRegression(penalty='l1'/'l2')），用於高維情境（變數 > n），但會犧牲係數的因果可解釋性，預測為主時才適用。

Complete separation: some X perfectly splits Y = 0 / 1 → the corresponding β → ±∞ and MLE does not exist. Software will throw "algorithm did not converge" or huge SE (e.g. 10⁹). Quasi-complete separation occurs partially. Common with rare events, small samples, or sparse categories.

Fix: Firth's penalised likelihood (Firth 1993 Biometrika). Multiplies the likelihood by the square root of the Jeffreys prior, equivalently adds a score-function bias correction. Yields finite β̂ and well-behaved CIs. R: logistf (Heinze & Schemper 2002 Stat Med); Python: firthlogist.

Modern alternatives include L2 (ridge) or L1 (lasso) penalties (glmnet, sklearn LogisticRegression(penalty='l1'/'l2')) for high-dimensional settings (p > n), but they sacrifice causal interpretability of coefficients — fine when prediction is the only goal.

⚠️

診斷檢核清單：(1) 任何 |β̂| > 10 或 SE > 10 → 懷疑分離；(2) glm 警告 "fitted probabilities 0 or 1" → 強烈警示；(3) cross-tab 看每個 binary X 對 Y 的 2×2 表，若有 0 cell → 高度可疑；(4) 解法優先序：合併稀有 category > Firth > 移除該變數。 Diagnostic checklist: (1) any |β̂| > 10 or SE > 10 → suspect separation; (2) glm warns "fitted probabilities 0 or 1" → strong signal; (3) cross-tab each binary X against Y — any zero cell is a red flag; (4) priority order: collapse sparse categories > Firth > drop the variable.

程式碼

七、實作範例

# R: glm + diagnostics + ROC + calibration + Brier + DCA
library(tidyverse); library(pROC); library(rms); library(dcurves); library(logistf); library(pmsampsize)

# --- Fit ---
fit <- glm(disease ~ age + bmi + smoke,
            data = df, family = binomial(link = "logit"))
summary(fit)
exp(cbind(OR = coef(fit), confint(fit)))   # profile-likelihood CI

# --- LRT vs Wald ---
anova(fit, test = "Chisq")                  # LRT (preferred)
fit0 <- update(fit, . ~ . - smoke)
anova(fit0, fit, test = "LRT")              # nested LRT

# --- Discrimination: ROC + AUC ---
p <- predict(fit, type = "response")
roc1 <- pROC::roc(df$disease, p, ci = TRUE)
roc1$auc; pROC::ci.auc(roc1)
pROC::coords(roc1, "best", ret = c("threshold", "sens", "spec"),
              best.method = "youden")

# --- Calibration plot + Brier (rms package) ---
ddist <- rms::datadist(df); options(datadist = "ddist")
fit_lrm <- rms::lrm(disease ~ age + bmi + smoke, data = df, x = TRUE, y = TRUE)
rms::calibrate(fit_lrm, method = "boot", B = 200) |> plot()
mean((p - df$disease)^2)                       # Brier score

# --- Hosmer-Lemeshow GOF (use with caution) ---
ResourceSelection::hoslem.test(df$disease, p, g = 10)

# --- Decision Curve Analysis (Vickers 2006) ---
dcurves::dca(disease ~ p, data = df, thresholds = seq(0.05, 0.5, 0.01)) |> plot()

# --- Perfect separation → Firth ---
fit_firth <- logistf::logistf(disease ~ rare_var + age, data = df)
summary(fit_firth)

# --- Sample size: Riley 2020 ---
pmsampsize::pmsampsize(type = "b", csrsquared = 0.05, parameters = 10,
                        prevalence = 0.10)

import numpy as np, pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import logit
from sklearn.metrics import roc_auc_score, roc_curve, brier_score_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# --- Fit ---
fit = logit("disease ~ age + bmi + smoke", df).fit(disp=0)
print(fit.summary())
OR  = np.exp(fit.params)
CI  = np.exp(fit.conf_int())                    # Wald CI

# --- LRT (preferred over Wald for inference) ---
fit0 = logit("disease ~ age + bmi", df).fit(disp=0)
LR   = 2*(fit.llf - fit0.llf)
from scipy.stats import chi2
p_LR = 1 - chi2.cdf(LR, df=1)

# --- ROC + AUC + Youden ---
p = fit.predict(df)
auc = roc_auc_score(df.disease, p)
fpr, tpr, thr = roc_curve(df.disease, p)
j = tpr - fpr; t_opt = thr[j.argmax()]

# --- Calibration plot + Brier ---
prob_pred, prob_obs = calibration_curve(df.disease, p, n_bins=10)
brier = brier_score_loss(df.disease, p)
plt.plot(prob_pred, prob_obs, marker="o"); plt.plot([0,1], [0,1], "k--")

# --- Decision Curve Analysis (dcurves package) ---
# pip install dcurves
from dcurves import dca, plot_graphs
res = dca(data=df, outcome="disease", modelnames=["p"],
          thresholds=np.arange(0.05, 0.5, 0.01))
plot_graphs(res)

# --- Firth penalised likelihood (separation) ---
# pip install firthlogist
from firthlogist import FirthLogisticRegression
flr = FirthLogisticRegression().fit(X, y)
flr.coef_, flr.ci_                              # finite β̂ + CIs

💡

實務工作流：(1) pmsampsize 算 n；(2) glm 擬合，profile CI；(3) LRT 報告整體 / 子組顯著性；(4) ROC + AUC + Youden 區辨；(5) calibration plot + Brier 校準；(6) DCA 評估臨床效益；(7) bootstrap 內部驗證（200–500 次）或外部驗證集；(8) 依 TRIPOD 2015 / TRIPOD+AI 2024 撰寫報告。 Workflow: (1) pmsampsize for n; (2) glm + profile CIs; (3) LRT for overall/sub-effects; (4) ROC + AUC + Youden for discrimination; (5) calibration plot + Brier; (6) DCA for clinical utility; (7) bootstrap internal validation (200–500 reps) or an external cohort; (8) write the report against TRIPOD 2015 / TRIPOD+AI 2024.

常見陷阱

八、六個經典錯誤

❌ OR 當 RR 解讀

「吸菸 OR=2 → 風險變兩倍」是常見誤讀。當基準率高時 OR 會誇大 RR；當基準率 < 5% 時 OR ≈ RR 才成立。常見結局請用 log-binomial 或 Poisson with robust SE（Zou 2004）。

"OR=2 for smoking means double the risk" is a common misread. With a common outcome OR overstates RR; only when prevalence < 5% is OR ≈ RR. For common outcomes use log-binomial or Poisson with robust SE (Zou 2004).

❌ 小 EPV → overfit

事件數少卻塞入十幾個變數是「醫學期刊 reviewer 看一眼就拒」的紅旗。先用 pmsampsize 算 n，事件多的最好；變數多時請改用 Lasso 或 penalised logistic 並做 bootstrap 內部驗證。

Stuffing a dozen predictors into a model with few events is a reviewer's instant red flag. Compute n with pmsampsize; with many variables use Lasso or penalised logistic plus bootstrap internal validation.

❌ 完美分離未處理

輸出顯示 OR = 4.7 × 10⁸、SE = 9.9 × 10⁹，仍照常報告——這個結果無意義。請改用 Firth (logistf) 或合併稀有 category。Heinze & Schemper 2002 是經典實作參考。

Reporting OR = 4.7 × 10⁸ with SE = 9.9 × 10⁹ as if it were real — that result is meaningless. Use Firth (logistf) or collapse sparse categories. Heinze & Schemper 2002 Stat Med is the canonical reference.

❌ 預設閾值 0.5

在不平衡或 false-positive / false-negative 成本不對稱時，0.5 幾乎肯定不是最佳。請用 Youden's J、Net Benefit、或最大化「clinical utility」的閾值；報告整條 ROC + DCA，而不是單一混淆矩陣。

With class imbalance or asymmetric costs of FP/FN, 0.5 is almost surely suboptimal. Use Youden's J, net benefit, or the threshold maximising "clinical utility"; report the full ROC + DCA, not a single confusion matrix.

❌ 只報 AUC

AUC 高的模型可能 systematically 高估或低估機率（calibration 差），臨床上仍會給錯藥。Van Calster 2019 BMC Med：必報 calibration plot + intercept + slope + Brier。

A model with high AUC can still systematically over- or under-estimate probabilities (poor calibration), leading clinicians astray. Van Calster 2019 BMC Med: always report calibration plot + intercept + slope + Brier.

❌ 內部 ≠ 外部驗證

train/test split 與 cross-validation 是內部驗證（同分布）。模型搬到不同醫院、不同年代、不同人口要做外部驗證，TRIPOD 2015 / TRIPOD+AI 2024 強調分開報告 development、internal validation、external validation 三階段。

Train/test splits and cross-validation are internal validation (same distribution). Transporting the model to a different hospital, era, or population requires external validation. TRIPOD 2015 / TRIPOD+AI 2024 require separate reporting of development, internal validation, and external validation.

重要文獻

九、必讀文獻

Peduzzi P et al. 1996 J Clin Epidemiol — EPV ≥ 10 原始模擬。DOI:10.1016/S0895-4356(96)00236-3
Riley RD et al. 2020 BMJ — 現代樣本數規則，pmsampsize 對應。DOI:10.1136/bmj.m441
Firth D 1993 Biometrika — 罰估計法，解決完美分離。DOI:10.1093/biomet/80.1.27
Hosmer DW, Lemeshow S, Sturdivant RX 2013 Applied Logistic Regression 3rd ed, Wiley.
Steyerberg EW 2019 Clinical Prediction Models 2nd ed, Springer — 預測模型 SOP。
Collins GS, Reitsma JB, Altman DG, Moons KGM 2015 BMJ — TRIPOD reporting guideline。DOI:10.1136/bmj.g7594
Collins GS et al. 2024 BMJ — TRIPOD+AI 更新版。DOI:10.1136/bmj-2023-078378
Vickers AJ, Elkin EB 2006 Med Decis Making — Decision Curve Analysis。DOI:10.1177/0272989X06295361
Saito T, Rehmsmeier M 2015 PLOS ONE — 類別不平衡時 PR 優於 ROC。DOI:10.1371/journal.pone.0118432
Van Calster B et al. 2019 BMC Med — Calibration「未完成」的呼籲。DOI:10.1186/s12916-019-1466-7
Zou G 2004 Am J Epidemiol — Modified Poisson for RR。DOI:10.1093/aje/kwh090
Heinze G, Schemper M 2002 Stat Med — Firth 在邏輯斯迴歸中的實作 (logistf)。DOI:10.1002/sim.1047

Peduzzi P et al. 1996 J Clin Epidemiol — the original EPV ≥ 10 simulation. DOI:10.1016/S0895-4356(96)00236-3
Riley RD et al. 2020 BMJ — modern sample-size formulae behind pmsampsize. DOI:10.1136/bmj.m441
Firth D 1993 Biometrika — penalised likelihood for separation. DOI:10.1093/biomet/80.1.27
Hosmer DW, Lemeshow S, Sturdivant RX 2013 Applied Logistic Regression 3rd ed, Wiley.
Steyerberg EW 2019 Clinical Prediction Models 2nd ed, Springer — the prediction-modelling SOP.
Collins GS, Reitsma JB, Altman DG, Moons KGM 2015 BMJ — the TRIPOD guideline. DOI:10.1136/bmj.g7594
Collins GS et al. 2024 BMJ — TRIPOD+AI update. DOI:10.1136/bmj-2023-078378
Vickers AJ, Elkin EB 2006 Med Decis Making — Decision Curve Analysis. DOI:10.1177/0272989X06295361
Saito T, Rehmsmeier M 2015 PLOS ONE — PR over ROC under class imbalance. DOI:10.1371/journal.pone.0118432
Van Calster B et al. 2019 BMC Med — the "calibration is not done" call. DOI:10.1186/s12916-019-1466-7
Zou G 2004 Am J Epidemiol — modified Poisson for RR. DOI:10.1093/aje/kwh090
Heinze G, Schemper M 2002 Stat Med — Firth implementation in logistic regression (logistf). DOI:10.1002/sim.1047

📝 自我檢測

1. 你訓練了一個 30 天死亡率的 logistic 模型，總 n = 1200，但事件僅 60，候選變數 12。下列何者最合適？

1. You build a logistic model for 30-day mortality, total n = 1200, events = 60, candidate predictors = 12. Best step?

A. n = 1200 很大，全部 12 變數直接放，看顯著性挑A. n = 1200 is large — drop all 12 in and pick by p-values

B. 樣本數靠 AUC 判斷B. Decide sample size from AUC

C. EPV = 60/12 = 5 不足；先用 pmsampsize，或縮減變數 / 改 Lasso / FirthC. EPV = 60/12 = 5 is too low — use pmsampsize, or shrink/Lasso/Firth

D. 直接 train/test split 解決D. Just do a train/test split

2. 一篇論文報告 AUC = 0.88 但沒有 calibration plot。下列何者是最強的疑慮？

2. A paper reports AUC = 0.88 but no calibration plot. Strongest concern?

A. AUC 太高一定 overfitA. AUC that high must be overfit

B. AUC > 0.85 就足夠了B. AUC > 0.85 is already enough

C. 沒問題，AUC 已經涵蓋所有性能面向C. No issue — AUC covers all aspects of performance

D. 區辨力（AUC）≠ 校準度；模型可能 systematically 高/低估機率，臨床決策仍可能錯D. Discrimination ≠ calibration; the model may systematically mis-estimate probabilities and mislead clinical use

3. glm() 警告 "fitted probabilities numerically 0 or 1 occurred"，且某係數 SE = 1.2 × 10⁹。最佳處理？

3. glm() warns "fitted probabilities numerically 0 or 1 occurred" and one SE = 1.2 × 10⁹. Best fix?

A. 忽略警告，模型還是收斂了A. Ignore the warning — the model converged

B. 強行用 Wald 95% CI 報告B. Force a Wald 95% CI and report

C. 屬於（準）完美分離；用 Firth (logistf) 或合併稀有 categoryC. This is (quasi-)complete separation — switch to Firth (logistf) or collapse sparse categories

D. 換 random forestD. Switch to random forest

4. 30 天再入院率約 25%（常見結局）。研究者報告「OR=2.5 表示再入院風險變 2.5 倍」。下列何者最正確？

4. 30-day readmission rate ≈ 25% (common outcome). The authors say "OR=2.5 means 2.5× the risk." Most correct?

A. 正確，OR 等於 RRA. Correct — OR equals RR

B. 常見結局時 OR 會誇大 RR；應改用 log-binomial 或 Poisson with robust SE 直接估 RRB. With common outcomes, OR overstates RR — use log-binomial or Poisson with robust SE for RR directly

C. OR 和 RR 永遠相同C. OR and RR are always identical

D. 該說「機率 × 2.5」D. Phrase it as "probability × 2.5"