STEP 8 / 16

生存分析

Kaplan-Meier 曲線、log-rank 檢定、Cox 比例風險迴歸──癌症研究的標準工具。

Kaplan-Meier curves, log-rank test, Cox proportional hazards — the standard toolkit in cancer research.

一、生存分析在處理什麼?

傳統 t-test/ANOVA 假設「結果是已測量的數值」(如基因表現量)。但臨床資料常見:

  • 結果是事件發生時間(死亡、復發、進展)。
  • 許多人「事件還沒發生」就追蹤結束(失聯、研究截止)──稱為 右設限 (right-censoring)
  • 不能因為「沒看到事件」就把人丟掉──這些人也提供「至少活過 X 年」的訊息。

生存分析就是正確處理 censoring 並估計「活過時間 t」的機率函數。

Classic t-test/ANOVA assumes the outcome is a measured number. Clinical data is different:

  • Outcome is time to an event (death, recurrence, progression).
  • Many subjects haven't experienced the event by study end — called right-censoring.
  • You can't just drop them — they still tell you "alive past time X".

Survival analysis handles censoring properly and estimates the probability of surviving past time t.

名詞意義
time觀察時間 (從 baseline 到事件或截止)
event / status1 = 事件發生;0 = 設限
S(t)生存函數 P(survive > t)
h(t)風險函數,瞬間風險
HR (hazard ratio)兩組風險比值
median survival中位生存時間

二、Kaplan-Meier 曲線

library(survival)

# 內建資料:lung 癌症 (228 病人)
data(lung, package = "survival")
str(lung)
# time   = 觀察天數
# status = 1=censored, 2=dead   ← 注意:survival 套件慣例是 1/2 不是 0/1
# sex    = 1=male, 2=female

# 建立生存物件 / Build a Surv object
# 注意:寫 status==2 是因為 lung 用 1/2 編碼
fit <- survfit(Surv(time, status == 2) ~ sex, data = lung)
print(fit)

# 中位生存時間 / Median survival
summary(fit)$table

# 整體存活機率 / Overall S(t)
summary(fit, times = c(180, 365, 730))   # 半年/一年/兩年

# 內建繪圖 / Base plot
plot(fit, col = c("steelblue", "firebrick"),
     xlab = "Days", ylab = "Survival probability",
     mark.time = TRUE)
legend("topright", legend = c("Male","Female"),
       col = c("steelblue","firebrick"), lty = 1)

# log-rank 檢定:兩組生存曲線是否顯著不同?
# Are two survival curves significantly different?
survdiff(Surv(time, status == 2) ~ sex, data = lung)

三、用 survminer 畫漂亮 KM

# install.packages('survminer')
library(survival); library(survminer)
data(lung, package = "survival")

fit <- survfit(Surv(time, status == 2) ~ sex, data = lung)

ggsurvplot(fit, data = lung,
           pval        = TRUE,            # 自動加 log-rank p
           pval.method = TRUE,
           conf.int    = TRUE,            # 95% CI 陰影
           risk.table  = TRUE,            # 下方 at-risk 表
           risk.table.col = "strata",
           palette     = c("steelblue","firebrick"),
           legend.title = "Sex",
           legend.labs = c("Male","Female"),
           xlab        = "Days",
           ylab        = "Overall survival",
           ggtheme     = theme_classic(base_size = 13),
           surv.median.line = "hv")       # 標出中位生存時間

# 多組(>2)也直接支援
fit2 <- survfit(Surv(time, status == 2) ~ ph.ecog, data = lung)
ggsurvplot(fit2, data = lung, pval = TRUE, risk.table = TRUE)

# 存圖 / Save
# p <- ggsurvplot(fit, data = lung, pval = TRUE)
# ggsave("figs/km.pdf", print(p), width = 7, height = 6)

四、Cox 比例風險迴歸 (Cox PH)

KM 適合單一類別變數;要校正多個共變項(年齡、性別、tumor stage、基因表現量)就需要 Cox PH。輸出的係數取 exp 就是 HR

KM works for one categorical variable; to adjust for multiple covariates (age, sex, stage, gene expression) you need Cox PH. exp(coef) gives the HR.

library(survival); library(survminer)
data(lung, package = "survival")

# 多變數 Cox / Multivariable Cox
cox <- coxph(Surv(time, status == 2) ~ age + sex + ph.ecog,
             data = lung)
summary(cox)

# 重點欄位:
# coef         = log HR
# exp(coef)    = HR
# Pr(>|z|)     = p-value (Wald test)
# Concordance  = 0.5(隨機) ~ 1(完美)──模型區辨力 (類似 AUC)
# Likelihood ratio test = 整體模型 vs null 的 p

# 取 HR + 95% CI 一條漂亮表 / Tidy table
broom::tidy(cox, exponentiate = TRUE, conf.int = TRUE)

# Forest plot — 一張圖看所有 HR 與 CI
ggforest(cox, data = lung)

# 比例風險假設檢驗 / Proportional hazards assumption test
test <- cox.zph(cox)
test                    # p < 0.05 表「違反 PH 假設」
plot(test)              # 殘差 vs 時間,水平線代表符合 PH

# 違反 PH 怎麼辦?
# 1. 加入時間互動項:tt(x) 或 strata()
# 2. 把連續變數切組
# 3. 改用其他模型 (e.g., AFT、Weibull、加速失敗時間模型)
⚠️
HR 解讀:連續變數(如年齡)的 HR 是「每增加一單位的相對風險」。HR = 1.05 表示每多一歲,死亡風險上升 5%。類別變數的 HR 是「相對於 reference level」的風險倍數。 Interpreting HR: for continuous variables, HR is the relative risk per one-unit increase. HR = 1.05 means each additional year of age raises hazard by 5%. For categorical, HR is the multiplier vs the reference level.

五、生資典型應用:基因表現對生存的影響

# 假設你已從 TCGA 拿到:
#   clinical: data.frame with time, status, age, sex
#   expr   : 基因 × 樣本 expression matrix (log2 TPM)
#
# 目標:TP53 高表現組是否有不同生存?

library(survival); library(survminer)

set.seed(7)
n <- 200
clin <- data.frame(
  sample = paste0("S", 1:n),
  time   = rexp(n, rate = 1/600),    # 模擬天數
  status = rbinom(n, 1, 0.6),
  age    = round(rnorm(n, 60, 10)),
  sex    = sample(c("M","F"), n, TRUE),
  TP53   = rnorm(n, 8, 2)            # 模擬 log2 TPM
)

# 1) 連續基因表現 → 切高低組(中位數)/ Continuous → high/low by median
clin$TP53_grp <- ifelse(clin$TP53 > median(clin$TP53), "High", "Low")
clin$TP53_grp <- factor(clin$TP53_grp, levels = c("Low","High"))

# 2) KM by 高低組
fit <- survfit(Surv(time, status) ~ TP53_grp, data = clin)
ggsurvplot(fit, data = clin, pval = TRUE, risk.table = TRUE,
           palette = c("steelblue","firebrick"),
           legend.title = "TP53", legend.labs = c("Low","High"))

# 3) 連續 Cox(不切組,校正 age + sex)
cox <- coxph(Surv(time, status) ~ TP53 + age + sex, data = clin)
summary(cox)

# 4) 對「20,000 個基因」一一做 Cox(迴圈)
# univariate_cox <- function(g) {
#   df <- data.frame(time=clin$time, status=clin$status, expr=expr[g, clin$sample])
#   fit <- coxph(Surv(time, status) ~ expr, data = df)
#   c(HR = exp(coef(fit)), p = summary(fit)$coefficients[, "Pr(>|z|)"])
# }
# results <- sapply(rownames(expr), univariate_cox)
# # 別忘了多重檢定校正
# results['padj', ] <- p.adjust(results['p',], method = 'BH')

六、KM 曲線互動模擬

調整兩組樣本的「真實 hazard ratio」與每組大小,看 KM 曲線分離程度與 log-rank p。

Adjust true HR and group size; see how separated the KM curves become and the log-rank p.

📝 自我檢測

1. 為什麼不能直接把 censored 病人從分析中移除?

1. Why can't we just drop censored patients from the analysis?

A. 樣本數會變太少A. Sample size shrinks
B. 只是慣例而已B. Just convention
C. censored 也提供資訊──「至少活過時間 X」對估計 S(t) 是必要的C. They still inform "alive at least until X" — required for unbiased S(t)
D. R 會自動補資料D. R imputes them automatically

2. Cox 模型的 summary(fit)exp(coef) = 1.5 對年齡的意義?

2. In Cox summary(fit), age has exp(coef) = 1.5. What does that mean?

A. 老人比年輕人高 50% 機率死A. Old people have 50% chance of dying
B. 每多一歲,瞬間死亡風險上升 50%(HR = 1.5)B. Each extra year of age multiplies the hazard by 1.5 (HR=1.5)
C. 年齡與生存無關C. Age is unrelated to survival
D. p-value = 1.5D. p-value = 1.5

3. 跑完 cox.zph(fit) 看到 p < 0.05,代表?

3. cox.zph(fit) returns p < 0.05 — what does this signal?

A. 模型整體不顯著A. Model overall is not significant
B. HR 太小B. HR is too small
C. 違反「比例風險」假設──HR 隨時間變化,需重新建模C. Proportional-hazards assumption is violated — HR changes over time, remodel
D. 不需理會D. Ignore it