一、Bulk RNA-seq 分析全流程
FASTQ → BAM
對 reads 比對到參考基因組(HISAT2、STAR、Salmon)。本章不討論這步──通常在 Linux/HPC 完成,輸出 BAM 或 transcript abundance。
Align reads to reference (HISAT2, STAR, Salmon). Not covered here — done on Linux/HPC, outputs BAM or abundance.
BAM → counts matrix
用 Rsubread::featureCounts() 或 tximport 把 BAM/Salmon 結果聚合成「基因 × 樣本」整數矩陣。
Aggregate BAM/Salmon to gene × sample integer matrix via Rsubread::featureCounts() or tximport.
QC + filter
移除低表現基因(如 rowSums(counts) < 10);PCA 看樣本群聚是否合理。
Filter low-expression genes (e.g. rowSums(counts) < 10); PCA to sanity-check sample groupings.
Normalization + DGE
DESeq2 / edgeR / limma-voom 三選一。本章主推 DESeq2。
Pick DESeq2 / edgeR / limma-voom. We focus on DESeq2.
視覺化
Volcano、MA plot、heatmap of top DEGs(見第 12 章)。
Volcano, MA plot, heatmap of top DEGs (see Step 12).
功能富集
用 clusterProfiler 做 GO / KEGG / GSEA(見第 11 章)。
GO / KEGG / GSEA via clusterProfiler (see Step 11).
二、DESeq2 vs edgeR vs limma-voom
| — | DESeq2 | edgeR | limma-voom |
|---|---|---|---|
| 模型 | Negative Binomial | Negative Binomial | Linear (after voom) |
| 輸入 | Raw integer counts | Raw integer counts | Raw integer counts |
| 標準化 | Median-of-ratios | TMM | TMM + voom (mean-variance) |
| 速度 | 慢 | 中等 | 快 |
| 樣本數建議 | 3-100 | 3-100 | 100+ 大 cohort |
| 特色 | 自動 outlier;shrinkage LFC | complex design 穩 | 大資料快 |
三、DESeq2 完整範例
# BiocManager::install(c('DESeq2','airway'))
library(DESeq2); library(airway)
data(airway) # 內建 SummarizedExperiment 範例
airway$dex <- relevel(airway$dex, ref = 'untrt') # 設定 reference
## 1) 建立 DESeqDataSet
dds <- DESeqDataSet(airway, design = ~ cell + dex) # cell 是批次校正
dds
## 2) 過濾低表現基因 (建議:至少 10 個樣本中有 >= 10 reads)
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]
nrow(dds)
## 3) 執行整套分析(一行搞定!)
dds <- DESeq(dds) # 估計 size factors → dispersion → GLM fit → Wald
## 4) 取出結果
res <- results(dds, name = 'dex_trt_vs_untrt') # 或 contrast = c('dex','trt','untrt')
summary(res)
head(res)
## 5) shrunken log2FC (推薦寫進論文)
res_shrunk <- lfcShrink(dds, coef = 'dex_trt_vs_untrt', type = 'apeglm')
plotMA(res_shrunk, ylim = c(-5, 5))
## 6) 排序、篩選 / Sort & filter
res_df <- as.data.frame(res_shrunk)
res_df$gene <- rownames(res_df)
sig <- subset(res_df, padj < 0.05 & abs(log2FoldChange) > 1)
nrow(sig); head(sig[order(sig$padj), ])
## 7) 視覺化用的 transformed counts(log-like,已穩定變異)
vsd <- vst(dds, blind = FALSE) # 推薦
# rld <- rlog(dds, blind = FALSE) # 老方法,較慢
## 8) PCA
plotPCA(vsd, intgroup = c('dex','cell'))
## 9) 樣本 distance 熱圖
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
# pheatmap::pheatmap(sampleDistMatrix)
## 10) 存結果 / Save
# write.csv(res_df, 'results/dge_DESeq2.csv', row.names = FALSE)
# saveRDS(dds, 'results/dds.rds')
四、design formula 是設計的靈魂
design = ~ X + Y + Z 中,最後一項通常是「你想比較的條件」(DESeq2 預設用最後一項做 contrast)。前面項是要校正的共變項(如批次、性別)。
In design = ~ X + Y + Z, the last term is usually "the condition of interest" (DESeq2 contrasts on the last term by default). Earlier terms are covariates to adjust for (batch, sex...).
| design | 意義 |
|---|---|
~ condition | 只有處理組差異 |
~ batch + condition | 校正批次後比較處理 |
~ sex + age + condition | 校正性別與年齡 |
~ patient + condition | paired design:同病人不同時點 |
~ condition + condition:time | 條件 × 時間 交互作用 |
# 多 contrast 同時拿 / Multiple contrasts
res_AvB <- results(dds, contrast = c('condition', 'A', 'B')) # A vs B
res_AvC <- results(dds, contrast = c('condition', 'A', 'C')) # A vs C
# Likelihood Ratio Test (LRT):檢驗整個因子是否重要
# 適合「3+ 組」或「時序」資料
dds_lrt <- DESeq(dds, test = 'LRT', reduced = ~ batch)
res_lrt <- results(dds_lrt)
# 互動項 design / Interaction design
# design(dds) <- ~ time + condition + time:condition
# dds <- DESeq(dds)
# res_int <- results(dds, name = 'timeT2.conditionTrt') # 看交互作用 p
五、edgeR 與 limma-voom 速覽
# BiocManager::install('edgeR') library(edgeR) y <- DGEList(counts = counts, group = group) keep <- filterByExpr(y, group = group) y <- y[keep, , keep.lib.sizes = FALSE] y <- calcNormFactors(y, method = "TMM") design <- model.matrix(~ group) y <- estimateDisp(y, design) fit <- glmQLFit(y, design) # quasi-likelihood (推薦) qlf <- glmQLFTest(fit, coef = 2) topTags(qlf, n = 10) res <- topTags(qlf, n = Inf)$table res$padj <- res$FDR # edgeR 用 FDR 欄
# BiocManager::install('limma') library(limma); library(edgeR) y <- DGEList(counts = counts, group = group) keep <- filterByExpr(y, group = group) y <- y[keep, , keep.lib.sizes = FALSE] y <- calcNormFactors(y) design <- model.matrix(~ group) v <- voom(y, design, plot = TRUE) # mean-variance 轉換 fit <- lmFit(v, design) fit <- eBayes(fit) # empirical Bayes shrinkage topTable(fit, coef = 2, number = 10) res <- topTable(fit, coef = 2, number = Inf, sort.by = "P") # 欄名:logFC, AveExpr, t, P.Value, adj.P.Val, B
六、Salmon/Kallisto quant 結果匯入
# 假設你跑完 Salmon,每個樣本有 quant.sf 在 quants/SAMPLE_NAME/quant.sf # Salmon ran; each sample has quants//quant.sf # library(tximport); library(GenomicFeatures); library(DESeq2) # 1) 建 tx2gene 對照(transcript_id → gene_id) # txdb <- makeTxDbFromGFF('annotation.gtf') # k <- keys(txdb, keytype = 'TXNAME') # tx2gene <- select(txdb, k, 'GENEID', 'TXNAME') # 2) 列出 quant.sf 檔 # samples <- read.csv('samples.csv') # files <- file.path('quants', samples$id, 'quant.sf') # names(files) <- samples$id # all(file.exists(files)) # 3) 匯入並聚合到 gene-level # txi <- tximport(files, type = 'salmon', tx2gene = tx2gene) # str(txi, max.level = 1) # # txi$counts (gene × sample), txi$length, txi$abundance # 4) 直接給 DESeq2 # dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition) # dds <- DESeq(dds) cat('Salmon → DESeq2 demo. 需先跑 Salmon 並安裝 tximport / DESeq2。\n')
七、為什麼用 Negative Binomial?
Poisson 假設「平均 = 變異」;但真實 RNA-seq counts 變異 >> 平均(過度離散 / overdispersion)。NB 加一個 dispersion φ 來描述這。下圖比較同樣平均下,調整 φ 觀察分佈如何變寬。
Poisson assumes "mean = variance"; real RNA-seq counts have variance ≫ mean (overdispersion). NB adds a dispersion φ. Below: same mean, vary φ to see how the distribution widens.
📝 自我檢測
1. DESeq2 的輸入應該是?
1. DESeq2 expects which input?
2. 想在比較處理 vs 對照時校正批次效應,design 應寫?
2. To compare treated vs control while adjusting for batch — design?
3. 為什麼 DESeq2 用 Negative Binomial 而非 Poisson?
3. Why does DESeq2 use Negative Binomial instead of Poisson?