一、Bioconductor 是什麼?
Bioconductor 是 R 的「生資延伸宇宙」──超過 2,300 個專為生物資料設計的套件,從 RNA-seq、scRNA-seq、ChIP-seq、變異、註解、影像到流式細胞,幾乎包山包海。
核心特色:
- 共用資料結構:所有套件都圍繞幾個核心 S4 物件(
SummarizedExperiment、GRanges、DNAStringSet...)。 - 定期釋出:每年 4 月、10 月各一次,與 R 主版本綁定。
- 嚴格審查:每個套件都需通過 R CMD check + vignette + 持續整合測試。
- 強大社群:
https://support.bioconductor.org24h 內幾乎都有套件作者回覆。
Bioconductor is R's bioinformatics universe — 2,300+ packages purpose-built for biological data: RNA-seq, scRNA-seq, ChIP-seq, variants, annotation, imaging, flow cytometry — you name it.
Core features:
- Shared data structures: every package revolves around a few S4 cores (
SummarizedExperiment,GRanges,DNAStringSet...). - Twice-yearly releases: April & October, pinned to R majors.
- Strict review: R CMD check + vignettes + CI required.
- Active community:
https://support.bioconductor.orgusually answers within 24h.
BiocManager::install(),而非 install.packages()──前者會檢查與你的 R/Bioc 版本相容性。第一次使用先 install.packages("BiocManager")。
Install rule: always use BiocManager::install() — never install.packages() for Bioc packages. First-time setup: install.packages("BiocManager").
二、SummarizedExperiment──表現量資料的標準容器
「基因 × 樣本表現矩陣」+ 「基因註解」+ 「樣本中繼資料」三件包成一個物件。同步切片(subset)保證三者永遠對齊。DESeq2、edgeR、limma 都基於它。
Bundles "gene × sample matrix" + "gene annotation" + "sample metadata" into one object. Subsetting keeps all three aligned. DESeq2, edgeR, limma all build on it.
# BiocManager::install('SummarizedExperiment')
library(SummarizedExperiment)
# 建立 / Build
nGene <- 200; nSample <- 6
counts <- matrix(rpois(nGene * nSample, lambda = 100),
nrow = nGene,
dimnames = list(paste0("gene", 1:nGene),
paste0("S", 1:nSample)))
col_data <- DataFrame(condition = rep(c("ctrl","trt"), each = 3),
batch = rep(1:2, times = 3),
row.names = colnames(counts))
row_data <- DataFrame(symbol = paste0("g", 1:nGene),
length = sample(500:5000, nGene),
row.names = rownames(counts))
se <- SummarizedExperiment(assays = list(counts = counts),
colData = col_data,
rowData = row_data)
se
dim(se) # 200 6
assayNames(se) # "counts"
assay(se, "counts")[1:3, ]
colData(se)
rowData(se)
# 子集 / Subset — 同步切三件
se[1:10, 1:3]
se[, se$condition == "trt"] # 樣本 metadata 可直接 $
se[rowSums(assay(se)) > 100, ] # 過濾低表現基因
# 多個 assay (如 raw counts + normalized)
assays(se)$logcounts <- log2(counts + 1)
assayNames(se)
三、GenomicRanges──基因組座標的瑞士刀
GRanges 用來表示「染色體 + 起點 + 終點 + 鏈方向」。peak、exon、SNP、片段都用它。所有 ChIP-seq、ATAC-seq、變異分析都繞著它轉。
GRanges represents "chromosome + start + end + strand". Peaks, exons, SNPs, fragments — they all use it. Every ChIP-seq, ATAC-seq, variant analysis runs through it.
# BiocManager::install('GenomicRanges')
library(GenomicRanges)
# 建立 / Build
gr <- GRanges(seqnames = c("chr1","chr1","chr2","chrX"),
ranges = IRanges(start = c(100, 200, 50, 1000),
end = c(150, 280, 100, 1100)),
strand = c("+", "-", "+", "*"),
gene = c("TP53","BRCA1","EGFR","XIST"))
gr
length(gr)
seqnames(gr)
start(gr); end(gr); width(gr)
strand(gr)
mcols(gr) # metadata columns
# 排序、篩選 / Sort & filter
sort(gr)
gr[seqnames(gr) == "chr1"]
gr[width(gr) > 60]
# 重要操作 / Key ops
shift(gr, 100) # 整體移動 100bp
resize(gr, width = 200, fix = "center") # 以中心展開到 200bp
flank(gr, 50, start = TRUE) # 前 50bp 區域
# 兩組區間的關係 / Range relationships (核心!)
gr1 <- GRanges("chr1", IRanges(c(100, 500), c(200, 600)))
gr2 <- GRanges("chr1", IRanges(c(150, 700), c(180, 800)))
findOverlaps(gr1, gr2) # 找重疊
countOverlaps(gr1, gr2)
intersect(gr1, gr2) # 重疊區
union(gr1, gr2) # 聯集
setdiff(gr1, gr2) # gr1 中不與 gr2 重疊的部分
nearest(gr1, gr2) # 最近的對應
distanceToNearest(gr1, gr2)
distanceToNearest() 或 ChIPseeker::annotatePeak()。
Classic use: "I have 5,000 ChIP-seq peaks — what's the nearest promoter for each?" → distanceToNearest() or ChIPseeker::annotatePeak().
四、Biostrings──DNA/RNA/蛋白質序列
# BiocManager::install('Biostrings')
library(Biostrings)
# 建立序列 / Build
dna <- DNAString("ATGCGTACGTAGCTAGCTAG")
dna2 <- DNAStringSet(c(g1 = "ATGCGT", g2 = "GCATGCAT"))
# 基本操作 / Basic
length(dna2); width(dna2)
reverseComplement(dna)
translate(dna) # codon → amino acid
GC_content <- letterFrequency(dna2, letters = "GC", as.prob = TRUE)
GC_content
# 從 FASTA 讀 / Read FASTA
# fa <- readDNAStringSet('genome.fa')
# names(fa)
# fa[1:3]
# 模式匹配 / Pattern matching
m <- matchPattern("ATG", dna)
start(m); end(m)
# 多序列匹配 (找 motif 起始密碼子等)
vmatch <- vmatchPattern("ATG", dna2)
vmatch
# 取一段 / Extract subsequence
subseq(dna, start = 4, end = 9)
# 寫 FASTA / Write FASTA
# writeXStringSet(dna2, 'output.fa')
五、基因註解:Entrez ↔ Ensembl ↔ Symbol
不同資料庫用不同 ID(Ensembl ENSG00000141510、Entrez 7157、Symbol TP53),常需互轉。Bioconductor 的 org.Hs.eg.db(人類)/ org.Mm.eg.db(小鼠)等套件提供完整對應。
Different databases use different IDs (Ensembl, Entrez, Symbol); you'll constantly need to convert. org.Hs.eg.db (human) / org.Mm.eg.db (mouse) packages provide the full mapping.
# BiocManager::install(c('AnnotationDbi','org.Hs.eg.db'))
library(AnnotationDbi); library(org.Hs.eg.db)
# 看支援哪些 ID 類型 / What ID types are available?
keytypes(org.Hs.eg.db) # SYMBOL, ENTREZID, ENSEMBL, REFSEQ, UNIPROT, ...
columns(org.Hs.eg.db) # 可拿什麼資訊出來
# Symbol → Entrez 與 Ensembl
mapIds(org.Hs.eg.db,
keys = c("TP53","BRCA1","EGFR"),
column = "ENTREZID",
keytype = "SYMBOL")
mapIds(org.Hs.eg.db,
keys = c("TP53","BRCA1","EGFR"),
column = "ENSEMBL",
keytype = "SYMBOL",
multiVals = "first") # 一對多時取第一個
# 一次拿多個欄位 / Multiple columns at once
select(org.Hs.eg.db,
keys = c("TP53","BRCA1"),
columns = c("ENTREZID","ENSEMBL","GENENAME","CHR"),
keytype = "SYMBOL")
# 從 Ensembl 反查 Symbol / Reverse lookup
mapIds(org.Hs.eg.db,
keys = c("ENSG00000141510","ENSG00000012048"),
column = "SYMBOL",
keytype = "ENSEMBL")
org.Mm.eg.db、大鼠 org.Rn.eg.db、果蠅 org.Dm.eg.db 等。Bioc 每次釋出會更新 org 套件──別搭配舊版 R + 新版 org,會 ID 對不齊。
Version & species: org.Hs.eg.db is human only; mouse → org.Mm.eg.db, rat → org.Rn.eg.db, fly → org.Dm.eg.db, etc. Each Bioc release refreshes the org packages — never mix old R with newest org.
六、TxDb / GTF / BSgenome 套件
| 套件家族 | 提供 | 範例 |
|---|---|---|
| TxDb.* | 基因/轉錄本/外顯子座標 | TxDb.Hsapiens.UCSC.hg38.knownGene |
| EnsDb.* | Ensembl 版本,含 biotype | EnsDb.Hsapiens.v86 |
| BSgenome.* | 基因組序列 | BSgenome.Hsapiens.UCSC.hg38 |
| org.* | ID 對應、GO、KEGG 註解 | org.Hs.eg.db |
# 從 GTF 自製 TxDb / Make TxDb from GTF
# library(GenomicFeatures)
# txdb <- makeTxDbFromGFF('Homo_sapiens.GRCh38.110.gtf.gz', format = 'gtf')
# saveDb(txdb, 'data/hg38_v110.sqlite') # 之後用 loadDb() 即可
# 直接用現成的 TxDb / Use prebuilt
# library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# genes_gr <- genes(txdb) # GRanges of all genes
# exons_gr <- exons(txdb)
# tx_gr <- transcripts(txdb)
# promoters(genes_gr, upstream = 2000, downstream = 200)
# 從基因組讀某段 DNA / Pull DNA for a region
# library(BSgenome.Hsapiens.UCSC.hg38)
# bs <- BSgenome.Hsapiens.UCSC.hg38
# region <- GRanges('chr17', IRanges(7670000, 7680000))
# getSeq(bs, region)
# 任意 GTF/BED/bigWig 讀寫 / Read/write GTF/BED/bigWig
# library(rtracklayer)
# gtf <- import('annotation.gtf') # GRanges
# bed <- import('peaks.bed')
# export(bed, 'peaks_v2.bed')
cat('Demo only — install Bioc packages above to actually load these objects.\n')
七、AnnotationHub & ExperimentHub
不想下載一堆 .gz、自己寫解析?這兩個 hub 提供「R 一句話下載並讀進來」的雲端資料庫,包含 ENCODE、Roadmap、ExAC、TCGA 範例等大量資源。
Tired of hunting down .gz files and writing parsers? These hubs let you fetch ENCODE, Roadmap, ExAC, TCGA examples, etc. with one R call.
# library(AnnotationHub)
# ah <- AnnotationHub()
# query(ah, c('Homo sapiens', 'GTF', 'Ensembl', '110'))
# gtf <- ah[['AH116860']] # 直接拿 GRanges 物件
# library(ExperimentHub)
# eh <- ExperimentHub()
# query(eh, 'TCGA RNA-seq')
# counts <- eh[['EHxxxx']] # 拿 SummarizedExperiment
cat('Hub demo — needs internet & install AnnotationHub / ExperimentHub.\n')
📝 自我檢測
1. 為什麼 DESeq2/limma/edgeR 都用 SummarizedExperiment?
1. Why do DESeq2/limma/edgeR all use SummarizedExperiment?
2. 你有 1000 個 ChIP-seq peaks(GRanges),想找每個 peak 最近的基因。應使用?
2. You have 1,000 ChIP-seq peaks (GRanges); find the nearest gene for each — use?
3. 想把 Ensembl ID ENSG00000141510 轉成 gene symbol,正確套件與函式?
3. Convert Ensembl ID ENSG00000141510 to gene symbol — correct?