Step 17: Survival Analysis — Statistical Inference Tutorial

概覽

為什麼存活分析需要自成一格的工具

「存活時間」可以泛指任何「事件發生前所經歷的時間」——患者死亡、腫瘤復發、機器故障、細胞分化。這類資料有兩個一般迴歸無法處理的特性：

審查 (censoring)：在研究結束時，有些個體仍未發生事件——我們只知道「至少活到了 t 時間」，而非確切的存活時間。右審查最常見；左審查（事件發生前個體已存在）與區間審查也存在。
截斷 (truncation)：個體只在某時間範圍內被觀察到。例如延遲進入 (left truncation) 在 TCGA 隊列中常見。

把存活時間當成普通連續變數做 t 檢定或線性迴歸，會把審查資料當「事件已發生」，整體存活時間嚴重低估、效應方向甚至顛倒。存活分析的核心思想是：對每個時間點 t，問「在 t 時刻仍存活的個體中，下一刻死亡的瞬時風險是多少？」——這就是 hazard function h(t)。所有方法（KM、log-rank、Cox PH、Weibull AFT）都建立在這個 hazard 框架上。

"Survival time" generalises to any "time until an event": patient death, tumour relapse, machine failure, cell differentiation. Such data have two features ordinary regression cannot handle:

Censoring: at study end, some individuals have not yet had the event — we only know "alive past t", not the actual time. Right-censoring is most common; left and interval censoring also occur.
Truncation: individuals are only observed within a window. Left truncation (delayed entry) is common in TCGA cohorts.

Treating survival time as ordinary continuous data and running a t-test or linear regression treats censored observations as "events occurred", severely underestimating survival and sometimes flipping the effect direction. The core idea of survival analysis: at each time t, ask "among individuals still alive at t, what is the instantaneous risk of dying in the next moment?" — this is the hazard function h(t). Every method (KM, log-rank, Cox PH, Weibull AFT) builds on this hazard framework.

💡

核心關係：S(t) = 存活函數（活過 t 的機率）；h(t) = 瞬時風險率；累積風險 H(t) = ∫₀ᵗ h(u)du；S(t) = exp(−H(t))。Cox PH 不直接模 S(t)，而是模 h(t) 的協變數效應。 Key identity: S(t) = survival function (probability of surviving past t); h(t) = instantaneous hazard; cumulative hazard H(t) = ∫₀ᵗ h(u)du; S(t) = exp(−H(t)). Cox PH models covariate effects on h(t), not directly on S(t).

核心概念

一、KM、log-rank 與 Cox PH

📉

Kaplan-Meier (KM)

非參數的 S(t) 估計：S(t) = Π_{t_i ≤ t} (1 − d_i / n_i)。
在每個事件時間點根據「事件數 d / 風險集 n」更新存活機率，產生階梯式曲線。
逐點 CI 用 Greenwood 公式；中位存活時間 = S(t) 首次跨越 0.5 的 t。
適合描述、繪圖與報告；不調整共變數。

Non-parametric S(t) estimator: S(t) = Π_{t_i ≤ t} (1 − d_i / n_i).
Updates at each event time using d / n; produces a step function.
Pointwise CI via Greenwood's formula; median survival is where S(t) first crosses 0.5.
Good for description and figures; cannot adjust for covariates.

⚖️

log-rank 檢定

比較兩條或多條 KM 曲線的整體差異。
檢定觀察到的事件數 vs 各時點的期望事件數（在 H₀ 下兩組 hazard 相同）。
在比例風險假設下檢力最高；若曲線交叉，檢力會大降。
變體：Gehan-Breslow（早期加權）、Tarone-Ware、Fleming-Harrington（給特定時段加權）。

Compares two or more KM curves overall.
Tests observed vs expected events at each time point under H₀: equal hazards.
Most powerful under proportional hazards; loses power if curves cross.
Variants: Gehan-Breslow (early-weighted), Tarone-Ware, Fleming-Harrington (time-weighted).

📐

Cox PH 模型

半參數：h(t | x) = h₀(t) · exp(Xβ)，h₀(t) 不指定形式。
係數 β 是 log hazard ratio；exp(β) 是 HR。
partial likelihood 估計——不需估 h₀(t)。
連續共變數直接放——比 dichotomise 強大得多。
支援 strata（不同 baseline hazard）、time-varying 共變數。

Semi-parametric: h(t | x) = h₀(t) · exp(Xβ); baseline h₀(t) unspecified.
Coefficients β are log hazard ratios; exp(β) is HR.
Estimated via partial likelihood — h₀(t) is never explicitly modelled.
Continuous covariates enter directly — far more powerful than dichotomising.
Supports strata (separate baselines) and time-varying covariates.

🩺

PH 假設與補救

Schoenfeld 殘差：對時間有殘留趨勢 → PH 違反。cox.zph() 給每個共變數做檢定。
log-log plot：log(−log S(t)) 在不同組應為平行直線。
違反 → time-varying coefficient (tt())、stratification (strata())、或改用 RMST、AFT 模型。
競爭風險：當另一個事件會「先發生」（如非癌死亡讓癌復發無法觀察）→ Fine-Gray subdistribution hazards。

Schoenfeld residuals: residual trend vs time → PH violated. cox.zph() tests each covariate.
log-log plot: log(−log S(t)) should be parallel across groups.
If violated → time-varying coefficient (tt()), stratification (strata()), or switch to RMST / AFT.
Competing risks: when another event preempts the one of interest (non-cancer death prevents observing relapse) → Fine-Gray subdistribution hazards.

二、生信應用情境

TCGA pan-cancer survival：把基因表現量分位（或直接放連續值）與 OS / PFS 一起放入 Cox PH，篩出 HR 顯著的「預後基因」。連續放入 + spline 比中位切分檢力大很多，也避免 cutpoint hacking。

biomarker discovery：常見錯誤是「試遍所有 cutpoint，挑 p 最小的那個」——這會把 Type I 錯誤率拉到 30% 以上。正確做法：maxstat + 校正、Cox PH on 連續變數、或 cross-validation 選 cutpoint。

scRNA-seq + 預後：先做 cell-type 解構（CIBERSORT、scaden）得到每位病人的免疫成分→ 把成分當共變數放進 Cox。或先做 trajectory 取得 pseudotime，再分析 pseudotime 階段與生存的關聯。

臨床試驗：主要終點通常是 OS / PFS；分析計畫需事先指定（pre-specified）log-rank vs stratified log-rank，並設好 interim analysis 的 α 花費函數。

TCGA pan-cancer survival: feed gene expression (preferably continuous + spline) alongside OS / PFS into Cox PH to find prognostic genes by HR. Continuous + spline is far more powerful than median dichotomisation and avoids cutpoint hacking.

Biomarker discovery: a common mistake is "search all cutpoints, pick the smallest p" — Type I error blows up past 30%. Correct approaches: maxstat with adjustment, Cox PH on continuous values, or cross-validation for cutpoint selection.

scRNA-seq + outcomes: deconvolve cell types per patient (CIBERSORT, scaden) and feed fractions as covariates into Cox. Or extract pseudotime stages from trajectory analysis and test their association with survival.

Clinical trials: primary endpoint is typically OS / PFS; the statistical analysis plan must pre-specify log-rank vs stratified log-rank and set α-spending functions for interim analyses.

⚠️

連續變數中位切分的代價：把表現量在中位數切成「高 / 低」損失約 30% 的檢力（相當於樣本量縮 30%）；若再去「找最佳 cutpoint」，p-value 已被 inflated。不要在發表用的 Cox PH 中做這件事——除非有獨立驗證集確認 cutpoint。

Cost of median dichotomisation: splitting a continuous biomarker at its median loses ~30% of power (≈ a 30% drop in effective sample size); searching for the "best" cutpoint further inflates p. Do NOT do this in publication-grade Cox PH unless an independent validation set confirms the cutpoint.

互動模擬

KM 沙盒

合成兩臂試驗：每組 n 人，存活時間為指數分布，治療組的 hazard 為對照組乘以 HR。拖動滑桿觀察 KM 曲線、log-rank p、Cox HR 估計與其 95% CI 如何隨樣本量與審查率變化。

Synthetic two-arm trial: n per arm, exponential survival, treatment hazard = control hazard × HR. Slide sample size and censoring rate to watch the KM curves, log-rank p, Cox HR estimate, and 95% CI respond.

HR (Cox) — log-rank p — median —

真實 HR 0.65

n / arm 80

審查率 30

X: time | Y: S(t) | 藍：對照；紅：治療

決策引導

存活分析決策樹

🌳 從研究問題到方法選擇

Q1:

結局是「事件發生前的時間」（有審查）嗎？→ 是 → 進存活分析；否 → 一般迴歸 / 邏輯式。

Q2:

只想看單條曲線 / 描述整體存活？→ 是 → Kaplan-Meier（含中位生存、Greenwood CI）。

Q3:

比較兩或多組、PH 看似合理？→ 是 → log-rank 檢定 + Cox PH（HR ＋ CI）。

Q4:

共變數是連續的（如基因表現量）？→ 是 → Cox PH on continuous（必要時加 spline），不要 dichotomise。

Q5:

Schoenfeld 或 log-log 顯示 PH 違反？→ 是 → time-varying coefficient / stratification / RMST / Weibull-AFT。

Q6:

有競爭事件（如非癌死亡阻礙觀察癌復發）？→ 是 → Fine-Gray subdistribution hazards 或 cause-specific Cox。

Q7:

高維特徵（p ≫ n，如基因組）做存活預測？→ 是 → penalised Cox（glmnet）、random survival forest、scikit-survival。

Q1:

Is the outcome "time until an event" with censoring? → Yes → use survival analysis; otherwise → ordinary or logistic regression.

Q2:

Want a single curve / descriptive survival? → Yes → Kaplan-Meier (median + Greenwood CI).

Q3:

Comparing two/more groups with plausible PH? → Yes → log-rank + Cox PH (HR + CI).

Q4:

Continuous covariate (e.g. expression)? → Yes → Cox PH on the continuous variable (with splines if needed). Do not dichotomise.

Q5:

Schoenfeld / log-log shows PH violated? → Yes → time-varying coefficient / stratification / RMST / Weibull-AFT.

Q6:

Competing events (non-cancer death preempts relapse)? → Yes → Fine-Gray subdistribution hazards or cause-specific Cox.

Q7:

High-dimensional features (p ≫ n)? → Yes → penalised Cox (glmnet), random survival forest, scikit-survival.

情境	推薦方法	診斷
兩臂臨床試驗 OS	KM + log-rank + Cox HR	cox.zph	Two-arm trial OS	KM + log-rank + Cox HR	cox.zph
TCGA 基因預後	Cox PH on continuous + spline	spline 顯著性 + cox.zph	TCGA gene prognosis	Cox PH on continuous + spline	spline significance + cox.zph
免疫治療曲線交叉	RMST 或 weighted log-rank	cox.zph 顯著	Immunotherapy crossing curves	RMST or weighted log-rank	significant cox.zph
老年隊列（非癌死亡多）	Fine-Gray	competing event %	Elderly cohort (many non-cancer deaths)	Fine-Gray	competing event %
基因組高維預測	penalised Cox / RSF	C-index, time-dependent AUC	High-dim genomic prediction	penalised Cox / RSF	C-index, time-dependent AUC

程式碼

實作：完整流程

# --- R --- survival / survminer 標準流程
library(survival); library(survminer)

# df 需有 time, event(0/1), group, age, stage, gene_expr

# 1) Surv object 與 KM 估計
S <- Surv(df$time, df$event)
km <- survfit(S ~ group, data = df)
ggsurvplot(km, data=df, pval=TRUE, risk.table=TRUE, conf.int=TRUE,
           xlab="Months", ylab="Overall survival")

# 2) log-rank 檢定
survdiff(S ~ group, data = df)

# 3) Cox PH（連續共變數 + strata）
fit <- coxph(S ~ gene_expr + age + strata(stage), data = df)
summary(fit)                                # HR, CI, z, p
cox.zph(fit)                                # PH 診斷

# 4) 連續變數效應視覺化（spline）
library(rms); ddist <- datadist(df); options(datadist="ddist")
fit2 <- cph(S ~ rcs(gene_expr, 4) + age, data = df, x=TRUE, y=TRUE)
plot(Predict(fit2, gene_expr))

# 5) Time-varying coefficient / 時變共變數
df_tv <- tmerge(df, df, id=patient, event=event(time, event))
coxph(Surv(tstart, tstop, event) ~ treatment + tt(treatment), data=df_tv,
      tt = function(x, t, ...) x * log(t+1))

# 6) Fine-Gray 競爭風險
library(cmprsk); crr(ftime=df$time, fstatus=df$status, cov1=df[,c("age","gene")])

# --- Python --- lifelines / scikit-survival
from lifelines import KaplanMeierFitter, CoxPHFitter, CoxTimeVaryingFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test

# 1) KM curve
kmf = KaplanMeierFitter()
for g, sub in df.groupby("group"):
    kmf.fit(sub.time, sub.event, label=g).plot_survival_function()
print(kmf.median_survival_time_)

# 2) log-rank
res = logrank_test(t_A, t_B, event_observed_A=e_A, event_observed_B=e_B)
res.p_value, res.test_statistic

# 3) Cox PH on continuous + strata
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event",
        formula="gene_expr + age", strata=["stage"])
cph.print_summary(); cph.plot()

# 4) PH assumption check
cph.check_assumptions(df, p_value_threshold=0.05, show_plots=True)

# 5) Time-varying covariate
ctv = CoxTimeVaryingFitter()
ctv.fit(df_long, id_col="id", event_col="event", start_col="start", stop_col="stop")

# 6) High-dim: scikit-survival (penalised Cox)
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv
y = Surv.from_dataframe("event", "time", df)
CoxnetSurvivalAnalysis(l1_ratio=0.9).fit(X, y)

🚫

常見錯誤：(1) 把存活時間當連續變數做 t / 線性 — 審查資料被當成事件。(2) 中位切分連續生物標誌物再做 KM — 損失檢力 + cutpoint hacking。(3) 報 Cox HR 卻沒做 cox.zph — PH 違反時 HR 失去解釋。(4) 把 KM 中位生存當成「平均」 — 重審查時兩者差異很大。(5) 忽略競爭風險 — KM 在競爭事件多時高估 cause-specific 機率。 Common mistakes: (1) Treating survival time as ordinary continuous data — censoring is mis-coded. (2) Median-splitting a continuous biomarker before KM — power loss + cutpoint hacking. (3) Reporting Cox HRs without cox.zph — HR loses meaning under PH violation. (4) Reporting KM median as a "mean" — they diverge under heavy censoring. (5) Ignoring competing risks — KM overestimates cause-specific probabilities when competing events are frequent.

📝 自我檢測

1. 「非資訊性審查 (non-informative censoring)」假設的內容與為何重要？

1. What does the "non-informative censoring" assumption state and why does it matter?

A. 審查的個體永遠不會發生事件A. Censored individuals will never have the event

B. 審查時間必須是固定的B. Censoring time must be fixed

C. 在時間 t 被審查的個體在 t 後的事件機率，與在 t 仍在風險集中的個體相同；若違反（如病情越重越易失訪）→ KM 與 Cox 估計偏差C. At time t, censored individuals have the same future-event risk as those still at risk at t; if violated (sicker patients drop out), KM and Cox estimates are biased

D. 審查率必須 < 50%D. Censoring rate must be < 50%

2. Schoenfeld 殘差如何用來診斷 PH 違反？

2. How do Schoenfeld residuals diagnose violation of proportional hazards?

A. 殘差越大代表模型擬合越差A. Larger residuals = worse fit

B. 在 PH 成立下殘差應與時間獨立；若 cox.zph 檢測出殘差對時間有趨勢，代表 HR 隨時間變動 → PH 違反B. Under PH, residuals should be independent of time; if cox.zph detects a residual-vs-time trend, the HR varies over time → PH violated

C. Schoenfeld 殘差只能用在 Weibull 模型C. Schoenfeld residuals only apply to Weibull models

D. Schoenfeld 殘差用來估計 baseline hazardD. Schoenfeld residuals are used to estimate the baseline hazard

3. 為什麼在 Cox PH 中把連續生物標誌物在中位數切成「高 / 低」是統計上不佳的做法？

3. Why is median-dichotomising a continuous biomarker in Cox PH statistically poor practice?

A. Cox PH 不允許連續變數A. Cox PH does not allow continuous variables

B. 中位切分讓 HR 的解讀更清楚，沒有任何缺點B. Dichotomising makes HR easier to interpret with no drawbacks

C. 中位切分損失約 30% 的統計檢力；若再「搜尋最佳 cutpoint」會嚴重膨脹 Type I 錯誤率；連續放入 (含 spline) 才能保留訊號C. Median dichotomisation loses ~30% power; searching for the "best" cutpoint inflates Type I error severely; continuous entry (with splines) preserves signal

D. 中位切分會違反非資訊性審查D. Median dichotomisation violates non-informative censoring