一、生存分析在處理什麼?
傳統 t-test/ANOVA 假設「結果是已測量的數值」(如基因表現量)。但臨床資料常見:
- 結果是事件發生時間(死亡、復發、進展)。
- 許多人「事件還沒發生」就追蹤結束(失聯、研究截止)──稱為 右設限 (right-censoring)。
- 不能因為「沒看到事件」就把人丟掉──這些人也提供「至少活過 X 年」的訊息。
生存分析就是正確處理 censoring 並估計「活過時間 t」的機率函數。
Classic t-test/ANOVA assumes the outcome is a measured number. Clinical data is different:
- Outcome is time to an event (death, recurrence, progression).
- Many subjects haven't experienced the event by study end — called right-censoring.
- You can't just drop them — they still tell you "alive past time X".
Survival analysis handles censoring properly and estimates the probability of surviving past time t.
| 名詞 | 意義 |
|---|---|
| time | 觀察時間 (從 baseline 到事件或截止) |
| event / status | 1 = 事件發生;0 = 設限 |
| S(t) | 生存函數 P(survive > t) |
| h(t) | 風險函數,瞬間風險 |
| HR (hazard ratio) | 兩組風險比值 |
| median survival | 中位生存時間 |
二、Kaplan-Meier 曲線
library(survival)
# 內建資料:lung 癌症 (228 病人)
data(lung, package = "survival")
str(lung)
# time = 觀察天數
# status = 1=censored, 2=dead ← 注意:survival 套件慣例是 1/2 不是 0/1
# sex = 1=male, 2=female
# 建立生存物件 / Build a Surv object
# 注意:寫 status==2 是因為 lung 用 1/2 編碼
fit <- survfit(Surv(time, status == 2) ~ sex, data = lung)
print(fit)
# 中位生存時間 / Median survival
summary(fit)$table
# 整體存活機率 / Overall S(t)
summary(fit, times = c(180, 365, 730)) # 半年/一年/兩年
# 內建繪圖 / Base plot
plot(fit, col = c("steelblue", "firebrick"),
xlab = "Days", ylab = "Survival probability",
mark.time = TRUE)
legend("topright", legend = c("Male","Female"),
col = c("steelblue","firebrick"), lty = 1)
# log-rank 檢定:兩組生存曲線是否顯著不同?
# Are two survival curves significantly different?
survdiff(Surv(time, status == 2) ~ sex, data = lung)
三、用 survminer 畫漂亮 KM
# install.packages('survminer')
library(survival); library(survminer)
data(lung, package = "survival")
fit <- survfit(Surv(time, status == 2) ~ sex, data = lung)
ggsurvplot(fit, data = lung,
pval = TRUE, # 自動加 log-rank p
pval.method = TRUE,
conf.int = TRUE, # 95% CI 陰影
risk.table = TRUE, # 下方 at-risk 表
risk.table.col = "strata",
palette = c("steelblue","firebrick"),
legend.title = "Sex",
legend.labs = c("Male","Female"),
xlab = "Days",
ylab = "Overall survival",
ggtheme = theme_classic(base_size = 13),
surv.median.line = "hv") # 標出中位生存時間
# 多組(>2)也直接支援
fit2 <- survfit(Surv(time, status == 2) ~ ph.ecog, data = lung)
ggsurvplot(fit2, data = lung, pval = TRUE, risk.table = TRUE)
# 存圖 / Save
# p <- ggsurvplot(fit, data = lung, pval = TRUE)
# ggsave("figs/km.pdf", print(p), width = 7, height = 6)
四、Cox 比例風險迴歸 (Cox PH)
KM 適合單一類別變數;要校正多個共變項(年齡、性別、tumor stage、基因表現量)就需要 Cox PH。輸出的係數取 exp 就是 HR。
KM works for one categorical variable; to adjust for multiple covariates (age, sex, stage, gene expression) you need Cox PH. exp(coef) gives the HR.
library(survival); library(survminer)
data(lung, package = "survival")
# 多變數 Cox / Multivariable Cox
cox <- coxph(Surv(time, status == 2) ~ age + sex + ph.ecog,
data = lung)
summary(cox)
# 重點欄位:
# coef = log HR
# exp(coef) = HR
# Pr(>|z|) = p-value (Wald test)
# Concordance = 0.5(隨機) ~ 1(完美)──模型區辨力 (類似 AUC)
# Likelihood ratio test = 整體模型 vs null 的 p
# 取 HR + 95% CI 一條漂亮表 / Tidy table
broom::tidy(cox, exponentiate = TRUE, conf.int = TRUE)
# Forest plot — 一張圖看所有 HR 與 CI
ggforest(cox, data = lung)
# 比例風險假設檢驗 / Proportional hazards assumption test
test <- cox.zph(cox)
test # p < 0.05 表「違反 PH 假設」
plot(test) # 殘差 vs 時間,水平線代表符合 PH
# 違反 PH 怎麼辦?
# 1. 加入時間互動項:tt(x) 或 strata()
# 2. 把連續變數切組
# 3. 改用其他模型 (e.g., AFT、Weibull、加速失敗時間模型)
五、生資典型應用:基因表現對生存的影響
# 假設你已從 TCGA 拿到:
# clinical: data.frame with time, status, age, sex
# expr : 基因 × 樣本 expression matrix (log2 TPM)
#
# 目標:TP53 高表現組是否有不同生存?
library(survival); library(survminer)
set.seed(7)
n <- 200
clin <- data.frame(
sample = paste0("S", 1:n),
time = rexp(n, rate = 1/600), # 模擬天數
status = rbinom(n, 1, 0.6),
age = round(rnorm(n, 60, 10)),
sex = sample(c("M","F"), n, TRUE),
TP53 = rnorm(n, 8, 2) # 模擬 log2 TPM
)
# 1) 連續基因表現 → 切高低組(中位數)/ Continuous → high/low by median
clin$TP53_grp <- ifelse(clin$TP53 > median(clin$TP53), "High", "Low")
clin$TP53_grp <- factor(clin$TP53_grp, levels = c("Low","High"))
# 2) KM by 高低組
fit <- survfit(Surv(time, status) ~ TP53_grp, data = clin)
ggsurvplot(fit, data = clin, pval = TRUE, risk.table = TRUE,
palette = c("steelblue","firebrick"),
legend.title = "TP53", legend.labs = c("Low","High"))
# 3) 連續 Cox(不切組,校正 age + sex)
cox <- coxph(Surv(time, status) ~ TP53 + age + sex, data = clin)
summary(cox)
# 4) 對「20,000 個基因」一一做 Cox(迴圈)
# univariate_cox <- function(g) {
# df <- data.frame(time=clin$time, status=clin$status, expr=expr[g, clin$sample])
# fit <- coxph(Surv(time, status) ~ expr, data = df)
# c(HR = exp(coef(fit)), p = summary(fit)$coefficients[, "Pr(>|z|)"])
# }
# results <- sapply(rownames(expr), univariate_cox)
# # 別忘了多重檢定校正
# results['padj', ] <- p.adjust(results['p',], method = 'BH')
六、KM 曲線互動模擬
調整兩組樣本的「真實 hazard ratio」與每組大小,看 KM 曲線分離程度與 log-rank p。
Adjust true HR and group size; see how separated the KM curves become and the log-rank p.
📝 自我檢測
1. 為什麼不能直接把 censored 病人從分析中移除?
1. Why can't we just drop censored patients from the analysis?
2. Cox 模型的 summary(fit) 中 exp(coef) = 1.5 對年齡的意義?
2. In Cox summary(fit), age has exp(coef) = 1.5. What does that mean?
3. 跑完 cox.zph(fit) 看到 p < 0.05,代表?
3. cox.zph(fit) returns p < 0.05 — what does this signal?