Step 10: Bulk RNA-seq — R Bioinformatics Tutorial

總覽

一、Bulk RNA-seq 分析全流程

FASTQ → BAM

對 reads 比對到參考基因組（HISAT2、STAR、Salmon）。本章不討論這步──通常在 Linux/HPC 完成，輸出 BAM 或 transcript abundance。

Align reads to reference (HISAT2, STAR, Salmon). Not covered here — done on Linux/HPC, outputs BAM or abundance.

BAM → counts matrix

用 Rsubread::featureCounts() 或 tximport 把 BAM/Salmon 結果聚合成「基因 × 樣本」整數矩陣。

Aggregate BAM/Salmon to gene × sample integer matrix via Rsubread::featureCounts() or tximport.

QC + filter

移除低表現基因（如 rowSums(counts) < 10）；PCA 看樣本群聚是否合理。

Filter low-expression genes (e.g. rowSums(counts) < 10); PCA to sanity-check sample groupings.

Normalization + DGE

DESeq2 / edgeR / limma-voom 三選一。本章主推 DESeq2。

Pick DESeq2 / edgeR / limma-voom. We focus on DESeq2.

視覺化

Volcano、MA plot、heatmap of top DEGs（見第 12 章）。

Volcano, MA plot, heatmap of top DEGs (see Step 12).

功能富集

用 clusterProfiler 做 GO / KEGG / GSEA（見第 11 章）。

GO / KEGG / GSEA via clusterProfiler (see Step 11).

工具比較

二、DESeq2 vs edgeR vs limma-voom

—	DESeq2	edgeR	limma-voom
模型	Negative Binomial	Negative Binomial	Linear (after voom)
輸入	Raw integer counts	Raw integer counts	Raw integer counts
標準化	Median-of-ratios	TMM	TMM + voom (mean-variance)
速度	慢	中等	快
樣本數建議	3-100	3-100	100+ 大 cohort
特色	自動 outlier；shrinkage LFC	complex design 穩	大資料快

💡

選哪個？ 樣本 ≤ 50 → DESeq2（最普及，文件最詳）；> 100 大 cohort → limma-voom（速度優勢）。三者結果通常 80–90% 重疊；遇到爭議結果可以三個都跑、取交集。 Which? n ≤ 50 → DESeq2 (most popular, best docs); n > 100 → limma-voom (much faster). Results overlap 80–90%; for borderline genes run all three and intersect.

DESeq2 完整流程

三、DESeq2 完整範例

# BiocManager::install(c('DESeq2','airway'))
library(DESeq2); library(airway)
data(airway)            # 內建 SummarizedExperiment 範例
airway$dex <- relevel(airway$dex, ref = 'untrt')   # 設定 reference

## 1) 建立 DESeqDataSet
dds <- DESeqDataSet(airway, design = ~ cell + dex)   # cell 是批次校正
dds

## 2) 過濾低表現基因 (建議：至少 10 個樣本中有 >= 10 reads)
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]
nrow(dds)

## 3) 執行整套分析（一行搞定！）
dds <- DESeq(dds)        # 估計 size factors → dispersion → GLM fit → Wald

## 4) 取出結果
res <- results(dds, name = 'dex_trt_vs_untrt')   # 或 contrast = c('dex','trt','untrt')
summary(res)
head(res)

## 5) shrunken log2FC (推薦寫進論文)
res_shrunk <- lfcShrink(dds, coef = 'dex_trt_vs_untrt', type = 'apeglm')
plotMA(res_shrunk, ylim = c(-5, 5))

## 6) 排序、篩選 / Sort & filter
res_df <- as.data.frame(res_shrunk)
res_df$gene <- rownames(res_df)
sig <- subset(res_df, padj < 0.05 & abs(log2FoldChange) > 1)
nrow(sig); head(sig[order(sig$padj), ])

## 7) 視覺化用的 transformed counts（log-like，已穩定變異）
vsd <- vst(dds, blind = FALSE)        # 推薦
# rld <- rlog(dds, blind = FALSE)     # 老方法，較慢

## 8) PCA
plotPCA(vsd, intgroup = c('dex','cell'))

## 9) 樣本 distance 熱圖
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
# pheatmap::pheatmap(sampleDistMatrix)

## 10) 存結果 / Save
# write.csv(res_df, 'results/dge_DESeq2.csv', row.names = FALSE)
# saveRDS(dds, 'results/dds.rds')

設計矩陣

四、design formula 是設計的靈魂

design = ~ X + Y + Z 中，最後一項通常是「你想比較的條件」（DESeq2 預設用最後一項做 contrast）。前面項是要校正的共變項（如批次、性別）。

In design = ~ X + Y + Z, the last term is usually "the condition of interest" (DESeq2 contrasts on the last term by default). Earlier terms are covariates to adjust for (batch, sex...).

design	意義
`~ condition`	只有處理組差異
`~ batch + condition`	校正批次後比較處理
`~ sex + age + condition`	校正性別與年齡
`~ patient + condition`	paired design：同病人不同時點
`~ condition + condition:time`	條件 × 時間交互作用

# 多 contrast 同時拿 / Multiple contrasts
res_AvB <- results(dds, contrast = c('condition', 'A', 'B'))   # A vs B
res_AvC <- results(dds, contrast = c('condition', 'A', 'C'))   # A vs C

# Likelihood Ratio Test (LRT)：檢驗整個因子是否重要
# 適合「3+ 組」或「時序」資料
dds_lrt <- DESeq(dds, test = 'LRT', reduced = ~ batch)
res_lrt <- results(dds_lrt)

# 互動項 design / Interaction design
# design(dds) <- ~ time + condition + time:condition
# dds <- DESeq(dds)
# res_int <- results(dds, name = 'timeT2.conditionTrt')   # 看交互作用 p

替代方案

五、edgeR 與 limma-voom 速覽

# BiocManager::install('edgeR')
library(edgeR)

y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y, method = "TMM")

design <- model.matrix(~ group)
y <- estimateDisp(y, design)

fit <- glmQLFit(y, design)               # quasi-likelihood (推薦)
qlf <- glmQLFTest(fit, coef = 2)
topTags(qlf, n = 10)
res <- topTags(qlf, n = Inf)$table
res$padj <- res$FDR              # edgeR 用 FDR 欄

# BiocManager::install('limma')
library(limma); library(edgeR)

y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

design <- model.matrix(~ group)
v <- voom(y, design, plot = TRUE)        # mean-variance 轉換

fit <- lmFit(v, design)
fit <- eBayes(fit)                          # empirical Bayes shrinkage
topTable(fit, coef = 2, number = 10)
res <- topTable(fit, coef = 2, number = Inf, sort.by = "P")
# 欄名：logFC, AveExpr, t, P.Value, adj.P.Val, B

Salmon → DESeq2

六、Salmon/Kallisto quant 結果匯入

# 假設你跑完 Salmon，每個樣本有 quant.sf 在 quants/SAMPLE_NAME/quant.sf
# Salmon ran; each sample has quants//quant.sf

# library(tximport); library(GenomicFeatures); library(DESeq2)

# 1) 建 tx2gene 對照（transcript_id → gene_id）
# txdb <- makeTxDbFromGFF('annotation.gtf')
# k    <- keys(txdb, keytype = 'TXNAME')
# tx2gene <- select(txdb, k, 'GENEID', 'TXNAME')

# 2) 列出 quant.sf 檔
# samples <- read.csv('samples.csv')
# files <- file.path('quants', samples$id, 'quant.sf')
# names(files) <- samples$id
# all(file.exists(files))

# 3) 匯入並聚合到 gene-level
# txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
# str(txi, max.level = 1)
# # txi$counts (gene × sample), txi$length, txi$abundance

# 4) 直接給 DESeq2
# dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
# dds <- DESeq(dds)

cat('Salmon → DESeq2 demo. 需先跑 Salmon 並安裝 tximport / DESeq2。\n')

互動

七、為什麼用 Negative Binomial？

Poisson 假設「平均 = 變異」；但真實 RNA-seq counts 變異 >> 平均（過度離散 / overdispersion）。NB 加一個 dispersion φ 來描述這。下圖比較同樣平均下，調整 φ 觀察分佈如何變寬。

Poisson assumes "mean = variance"; real RNA-seq counts have variance ≫ mean (overdispersion). NB adds a dispersion φ. Below: same mean, vary φ to see how the distribution widens.

平均 μ 100

dispersion φ 0.10

— — —

📝 自我檢測

1. DESeq2 的輸入應該是？

1. DESeq2 expects which input?

A. log2 TPMA. log2 TPM

B. Z-score 後的表現量B. Z-scored expression

C. 原始整數 counts（raw integer counts）C. Raw integer counts

D. FPKMD. FPKM

2. 想在比較處理 vs 對照時校正批次效應，design 應寫？

2. To compare treated vs control while adjusting for batch — design?

A. ~ conditionA. ~ condition

B. ~ condition + batchB. ~ condition + batch

C. ~ batch + condition（要比較的放最後）C. ~ batch + condition (variable of interest last)

D. ~ batch * conditionD. ~ batch * condition

3. 為什麼 DESeq2 用 Negative Binomial 而非 Poisson？

3. Why does DESeq2 use Negative Binomial instead of Poisson?

A. NB 比較好寫A. NB is easier to code

B. RNA-seq counts 過度離散，變異 >> 平均；NB 多一個 dispersion 參數來描述B. RNA-seq counts are overdispersed (variance ≫ mean); NB adds a dispersion parameter

C. Poisson 不能處理整數C. Poisson can't handle integers

D. NB 比較快D. NB is faster