Step 8: Survival — R Bioinformatics Tutorial

概念

一、生存分析在處理什麼？

傳統 t-test/ANOVA 假設「結果是已測量的數值」（如基因表現量）。但臨床資料常見：

結果是事件發生時間（死亡、復發、進展）。
許多人「事件還沒發生」就追蹤結束（失聯、研究截止）──稱為 右設限 (right-censoring)。
不能因為「沒看到事件」就把人丟掉──這些人也提供「至少活過 X 年」的訊息。

生存分析就是正確處理 censoring 並估計「活過時間 t」的機率函數。

Classic t-test/ANOVA assumes the outcome is a measured number. Clinical data is different:

Outcome is time to an event (death, recurrence, progression).
Many subjects haven't experienced the event by study end — called right-censoring.
You can't just drop them — they still tell you "alive past time X".

Survival analysis handles censoring properly and estimates the probability of surviving past time t.

名詞	意義
time	觀察時間 (從 baseline 到事件或截止)
event / status	1 = 事件發生；0 = 設限
S(t)	生存函數 P(survive > t)
h(t)	風險函數，瞬間風險
HR (hazard ratio)	兩組風險比值
median survival	中位生存時間

KM 曲線

二、Kaplan-Meier 曲線

library(survival)

# 內建資料：lung 癌症 (228 病人)
data(lung, package = "survival")
str(lung)
# time   = 觀察天數
# status = 1=censored, 2=dead   ← 注意：survival 套件慣例是 1/2 不是 0/1
# sex    = 1=male, 2=female

# 建立生存物件 / Build a Surv object
# 注意：寫 status==2 是因為 lung 用 1/2 編碼
fit <- survfit(Surv(time, status == 2) ~ sex, data = lung)
print(fit)

# 中位生存時間 / Median survival
summary(fit)$table

# 整體存活機率 / Overall S(t)
summary(fit, times = c(180, 365, 730))   # 半年/一年/兩年

# 內建繪圖 / Base plot
plot(fit, col = c("steelblue", "firebrick"),
     xlab = "Days", ylab = "Survival probability",
     mark.time = TRUE)
legend("topright", legend = c("Male","Female"),
       col = c("steelblue","firebrick"), lty = 1)

# log-rank 檢定：兩組生存曲線是否顯著不同？
# Are two survival curves significantly different?
survdiff(Surv(time, status == 2) ~ sex, data = lung)

出版品級

三、用 survminer 畫漂亮 KM

# install.packages('survminer')
library(survival); library(survminer)
data(lung, package = "survival")

fit <- survfit(Surv(time, status == 2) ~ sex, data = lung)

ggsurvplot(fit, data = lung,
           pval        = TRUE,            # 自動加 log-rank p
           pval.method = TRUE,
           conf.int    = TRUE,            # 95% CI 陰影
           risk.table  = TRUE,            # 下方 at-risk 表
           risk.table.col = "strata",
           palette     = c("steelblue","firebrick"),
           legend.title = "Sex",
           legend.labs = c("Male","Female"),
           xlab        = "Days",
           ylab        = "Overall survival",
           ggtheme     = theme_classic(base_size = 13),
           surv.median.line = "hv")       # 標出中位生存時間

# 多組（>2）也直接支援
fit2 <- survfit(Surv(time, status == 2) ~ ph.ecog, data = lung)
ggsurvplot(fit2, data = lung, pval = TRUE, risk.table = TRUE)

# 存圖 / Save
# p <- ggsurvplot(fit, data = lung, pval = TRUE)
# ggsave("figs/km.pdf", print(p), width = 7, height = 6)

Cox 模型

四、Cox 比例風險迴歸 (Cox PH)

KM 適合單一類別變數；要校正多個共變項（年齡、性別、tumor stage、基因表現量）就需要 Cox PH。輸出的係數取 exp 就是 HR。

KM works for one categorical variable; to adjust for multiple covariates (age, sex, stage, gene expression) you need Cox PH. exp(coef) gives the HR.

library(survival); library(survminer)
data(lung, package = "survival")

# 多變數 Cox / Multivariable Cox
cox <- coxph(Surv(time, status == 2) ~ age + sex + ph.ecog,
             data = lung)
summary(cox)

# 重點欄位：
# coef         = log HR
# exp(coef)    = HR
# Pr(>|z|)     = p-value (Wald test)
# Concordance  = 0.5(隨機) ~ 1(完美)──模型區辨力 (類似 AUC)
# Likelihood ratio test = 整體模型 vs null 的 p

# 取 HR + 95% CI 一條漂亮表 / Tidy table
broom::tidy(cox, exponentiate = TRUE, conf.int = TRUE)

# Forest plot — 一張圖看所有 HR 與 CI
ggforest(cox, data = lung)

# 比例風險假設檢驗 / Proportional hazards assumption test
test <- cox.zph(cox)
test                    # p < 0.05 表「違反 PH 假設」
plot(test)              # 殘差 vs 時間，水平線代表符合 PH

# 違反 PH 怎麼辦？
# 1. 加入時間互動項：tt(x) 或 strata()
# 2. 把連續變數切組
# 3. 改用其他模型 (e.g., AFT、Weibull、加速失敗時間模型)

⚠️

HR 解讀：連續變數（如年齡）的 HR 是「每增加一單位的相對風險」。HR = 1.05 表示每多一歲，死亡風險上升 5%。類別變數的 HR 是「相對於 reference level」的風險倍數。 Interpreting HR: for continuous variables, HR is the relative risk per one-unit increase. HR = 1.05 means each additional year of age raises hazard by 5%. For categorical, HR is the multiplier vs the reference level.

實戰

五、生資典型應用：基因表現對生存的影響

# 假設你已從 TCGA 拿到：
#   clinical: data.frame with time, status, age, sex
#   expr   : 基因 × 樣本 expression matrix (log2 TPM)
#
# 目標：TP53 高表現組是否有不同生存？

library(survival); library(survminer)

set.seed(7)
n <- 200
clin <- data.frame(
  sample = paste0("S", 1:n),
  time   = rexp(n, rate = 1/600),    # 模擬天數
  status = rbinom(n, 1, 0.6),
  age    = round(rnorm(n, 60, 10)),
  sex    = sample(c("M","F"), n, TRUE),
  TP53   = rnorm(n, 8, 2)            # 模擬 log2 TPM
)

# 1) 連續基因表現 → 切高低組（中位數）/ Continuous → high/low by median
clin$TP53_grp <- ifelse(clin$TP53 > median(clin$TP53), "High", "Low")
clin$TP53_grp <- factor(clin$TP53_grp, levels = c("Low","High"))

# 2) KM by 高低組
fit <- survfit(Surv(time, status) ~ TP53_grp, data = clin)
ggsurvplot(fit, data = clin, pval = TRUE, risk.table = TRUE,
           palette = c("steelblue","firebrick"),
           legend.title = "TP53", legend.labs = c("Low","High"))

# 3) 連續 Cox（不切組，校正 age + sex）
cox <- coxph(Surv(time, status) ~ TP53 + age + sex, data = clin)
summary(cox)

# 4) 對「20,000 個基因」一一做 Cox（迴圈）
# univariate_cox <- function(g) {
#   df <- data.frame(time=clin$time, status=clin$status, expr=expr[g, clin$sample])
#   fit <- coxph(Surv(time, status) ~ expr, data = df)
#   c(HR = exp(coef(fit)), p = summary(fit)$coefficients[, "Pr(>|z|)"])
# }
# results <- sapply(rownames(expr), univariate_cox)
# # 別忘了多重檢定校正
# results['padj', ] <- p.adjust(results['p',], method = 'BH')

互動

六、KM 曲線互動模擬

調整兩組樣本的「真實 hazard ratio」與每組大小，看 KM 曲線分離程度與 log-rank p。

Adjust true HR and group size; see how separated the KM curves become and the log-rank p.

每組樣本數 100

真實 HR (high vs low) 2.0

設限比例 (%) 30

— — —

📝 自我檢測

1. 為什麼不能直接把 censored 病人從分析中移除？

1. Why can't we just drop censored patients from the analysis?

A. 樣本數會變太少A. Sample size shrinks

B. 只是慣例而已B. Just convention

C. censored 也提供資訊──「至少活過時間 X」對估計 S(t) 是必要的C. They still inform "alive at least until X" — required for unbiased S(t)

D. R 會自動補資料D. R imputes them automatically

2. Cox 模型的 summary(fit) 中 exp(coef) = 1.5 對年齡的意義？

2. In Cox summary(fit), age has exp(coef) = 1.5. What does that mean?

A. 老人比年輕人高 50% 機率死A. Old people have 50% chance of dying

B. 每多一歲，瞬間死亡風險上升 50%（HR = 1.5）B. Each extra year of age multiplies the hazard by 1.5 (HR=1.5)

C. 年齡與生存無關C. Age is unrelated to survival

D. p-value = 1.5D. p-value = 1.5

3. 跑完 cox.zph(fit) 看到 p < 0.05，代表？

3. cox.zph(fit) returns p < 0.05 — what does this signal?

A. 模型整體不顯著A. Model overall is not significant

B. HR 太小B. HR is too small

C. 違反「比例風險」假設──HR 隨時間變化，需重新建模C. Proportional-hazards assumption is violated — HR changes over time, remodel

D. 不需理會D. Ignore it