STEP 9 / 16

Bioconductor 入門

SummarizedExperiment、GenomicRanges、Biostrings、AnnotationDbi──進入生資世界的四把鑰匙。

SummarizedExperiment, GenomicRanges, Biostrings, AnnotationDbi — your four keys to bioinformatics.

一、Bioconductor 是什麼?

Bioconductor 是 R 的「生資延伸宇宙」──超過 2,300 個專為生物資料設計的套件,從 RNA-seq、scRNA-seq、ChIP-seq、變異、註解、影像到流式細胞,幾乎包山包海。

核心特色:

  • 共用資料結構:所有套件都圍繞幾個核心 S4 物件(SummarizedExperimentGRangesDNAStringSet...)。
  • 定期釋出:每年 4 月、10 月各一次,與 R 主版本綁定。
  • 嚴格審查:每個套件都需通過 R CMD check + vignette + 持續整合測試。
  • 強大社群https://support.bioconductor.org 24h 內幾乎都有套件作者回覆。

Bioconductor is R's bioinformatics universe — 2,300+ packages purpose-built for biological data: RNA-seq, scRNA-seq, ChIP-seq, variants, annotation, imaging, flow cytometry — you name it.

Core features:

  • Shared data structures: every package revolves around a few S4 cores (SummarizedExperiment, GRanges, DNAStringSet...).
  • Twice-yearly releases: April & October, pinned to R majors.
  • Strict review: R CMD check + vignettes + CI required.
  • Active community: https://support.bioconductor.org usually answers within 24h.
💡
安裝必看:務必用 BiocManager::install(),而非 install.packages()──前者會檢查與你的 R/Bioc 版本相容性。第一次使用先 install.packages("BiocManager") Install rule: always use BiocManager::install() — never install.packages() for Bioc packages. First-time setup: install.packages("BiocManager").

二、SummarizedExperiment──表現量資料的標準容器

基因 × 樣本表現矩陣」+ 「基因註解」+ 「樣本中繼資料」三件包成一個物件。同步切片(subset)保證三者永遠對齊。DESeq2、edgeR、limma 都基於它。

Bundles "gene × sample matrix" + "gene annotation" + "sample metadata" into one object. Subsetting keeps all three aligned. DESeq2, edgeR, limma all build on it.

# BiocManager::install('SummarizedExperiment')
library(SummarizedExperiment)

# 建立 / Build
nGene <- 200; nSample <- 6
counts <- matrix(rpois(nGene * nSample, lambda = 100),
                 nrow = nGene,
                 dimnames = list(paste0("gene", 1:nGene),
                                 paste0("S", 1:nSample)))
col_data <- DataFrame(condition = rep(c("ctrl","trt"), each = 3),
                      batch     = rep(1:2, times = 3),
                      row.names = colnames(counts))
row_data <- DataFrame(symbol = paste0("g", 1:nGene),
                      length = sample(500:5000, nGene),
                      row.names = rownames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
                           colData  = col_data,
                           rowData  = row_data)
se
dim(se)              # 200  6
assayNames(se)       # "counts"
assay(se, "counts")[1:3, ]
colData(se)
rowData(se)

# 子集 / Subset — 同步切三件
se[1:10, 1:3]
se[, se$condition == "trt"]   # 樣本 metadata 可直接 $
se[rowSums(assay(se)) > 100, ]  # 過濾低表現基因

# 多個 assay (如 raw counts + normalized)
assays(se)$logcounts <- log2(counts + 1)
assayNames(se)

三、GenomicRanges──基因組座標的瑞士刀

GRanges 用來表示「染色體 + 起點 + 終點 + 鏈方向」。peak、exon、SNP、片段都用它。所有 ChIP-seq、ATAC-seq、變異分析都繞著它轉。

GRanges represents "chromosome + start + end + strand". Peaks, exons, SNPs, fragments — they all use it. Every ChIP-seq, ATAC-seq, variant analysis runs through it.

# BiocManager::install('GenomicRanges')
library(GenomicRanges)

# 建立 / Build
gr <- GRanges(seqnames = c("chr1","chr1","chr2","chrX"),
              ranges   = IRanges(start = c(100, 200, 50, 1000),
                                 end   = c(150, 280, 100, 1100)),
              strand   = c("+", "-", "+", "*"),
              gene     = c("TP53","BRCA1","EGFR","XIST"))
gr

length(gr)
seqnames(gr)
start(gr); end(gr); width(gr)
strand(gr)
mcols(gr)         # metadata columns

# 排序、篩選 / Sort & filter
sort(gr)
gr[seqnames(gr) == "chr1"]
gr[width(gr) > 60]

# 重要操作 / Key ops
shift(gr, 100)               # 整體移動 100bp
resize(gr, width = 200, fix = "center")  # 以中心展開到 200bp
flank(gr, 50, start = TRUE)  # 前 50bp 區域

# 兩組區間的關係 / Range relationships (核心!)
gr1 <- GRanges("chr1", IRanges(c(100, 500), c(200, 600)))
gr2 <- GRanges("chr1", IRanges(c(150, 700), c(180, 800)))

findOverlaps(gr1, gr2)        # 找重疊
countOverlaps(gr1, gr2)
intersect(gr1, gr2)           # 重疊區
union(gr1, gr2)               # 聯集
setdiff(gr1, gr2)             # gr1 中不與 gr2 重疊的部分
nearest(gr1, gr2)             # 最近的對應
distanceToNearest(gr1, gr2)
💡
典型應用:「ChIP-seq 找到 5,000 個 peaks,每個 peak 最近的 promoter 是哪個基因?」就用 distanceToNearest()ChIPseeker::annotatePeak() Classic use: "I have 5,000 ChIP-seq peaks — what's the nearest promoter for each?" → distanceToNearest() or ChIPseeker::annotatePeak().

四、Biostrings──DNA/RNA/蛋白質序列

# BiocManager::install('Biostrings')
library(Biostrings)

# 建立序列 / Build
dna  <- DNAString("ATGCGTACGTAGCTAGCTAG")
dna2 <- DNAStringSet(c(g1 = "ATGCGT", g2 = "GCATGCAT"))

# 基本操作 / Basic
length(dna2); width(dna2)
reverseComplement(dna)
translate(dna)              # codon → amino acid
GC_content <- letterFrequency(dna2, letters = "GC", as.prob = TRUE)
GC_content

# 從 FASTA 讀 / Read FASTA
# fa <- readDNAStringSet('genome.fa')
# names(fa)
# fa[1:3]

# 模式匹配 / Pattern matching
m <- matchPattern("ATG", dna)
start(m); end(m)

# 多序列匹配 (找 motif 起始密碼子等)
vmatch <- vmatchPattern("ATG", dna2)
vmatch

# 取一段 / Extract subsequence
subseq(dna, start = 4, end = 9)

# 寫 FASTA / Write FASTA
# writeXStringSet(dna2, 'output.fa')

五、基因註解:Entrez ↔ Ensembl ↔ Symbol

不同資料庫用不同 ID(Ensembl ENSG00000141510、Entrez 7157、Symbol TP53),常需互轉。Bioconductor 的 org.Hs.eg.db(人類)/ org.Mm.eg.db(小鼠)等套件提供完整對應。

Different databases use different IDs (Ensembl, Entrez, Symbol); you'll constantly need to convert. org.Hs.eg.db (human) / org.Mm.eg.db (mouse) packages provide the full mapping.

# BiocManager::install(c('AnnotationDbi','org.Hs.eg.db'))
library(AnnotationDbi); library(org.Hs.eg.db)

# 看支援哪些 ID 類型 / What ID types are available?
keytypes(org.Hs.eg.db)       # SYMBOL, ENTREZID, ENSEMBL, REFSEQ, UNIPROT, ...
columns(org.Hs.eg.db)        # 可拿什麼資訊出來

# Symbol → Entrez 與 Ensembl
mapIds(org.Hs.eg.db,
       keys = c("TP53","BRCA1","EGFR"),
       column = "ENTREZID",
       keytype = "SYMBOL")

mapIds(org.Hs.eg.db,
       keys = c("TP53","BRCA1","EGFR"),
       column = "ENSEMBL",
       keytype = "SYMBOL",
       multiVals = "first")          # 一對多時取第一個

# 一次拿多個欄位 / Multiple columns at once
select(org.Hs.eg.db,
       keys = c("TP53","BRCA1"),
       columns = c("ENTREZID","ENSEMBL","GENENAME","CHR"),
       keytype = "SYMBOL")

# 從 Ensembl 反查 Symbol / Reverse lookup
mapIds(org.Hs.eg.db,
       keys = c("ENSG00000141510","ENSG00000012048"),
       column = "SYMBOL",
       keytype = "ENSEMBL")
⚠️
版本與物種對應:org.Hs.eg.db 只有人類;小鼠用 org.Mm.eg.db、大鼠 org.Rn.eg.db、果蠅 org.Dm.eg.db 等。Bioc 每次釋出會更新 org 套件──別搭配舊版 R + 新版 org,會 ID 對不齊。 Version & species: org.Hs.eg.db is human only; mouse → org.Mm.eg.db, rat → org.Rn.eg.db, fly → org.Dm.eg.db, etc. Each Bioc release refreshes the org packages — never mix old R with newest org.

六、TxDb / GTF / BSgenome 套件

套件家族提供範例
TxDb.*基因/轉錄本/外顯子座標TxDb.Hsapiens.UCSC.hg38.knownGene
EnsDb.*Ensembl 版本,含 biotypeEnsDb.Hsapiens.v86
BSgenome.*基因組序列BSgenome.Hsapiens.UCSC.hg38
org.*ID 對應、GO、KEGG 註解org.Hs.eg.db
# 從 GTF 自製 TxDb / Make TxDb from GTF
# library(GenomicFeatures)
# txdb <- makeTxDbFromGFF('Homo_sapiens.GRCh38.110.gtf.gz', format = 'gtf')
# saveDb(txdb, 'data/hg38_v110.sqlite')      # 之後用 loadDb() 即可

# 直接用現成的 TxDb / Use prebuilt
# library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# genes_gr  <- genes(txdb)             # GRanges of all genes
# exons_gr  <- exons(txdb)
# tx_gr     <- transcripts(txdb)
# promoters(genes_gr, upstream = 2000, downstream = 200)

# 從基因組讀某段 DNA / Pull DNA for a region
# library(BSgenome.Hsapiens.UCSC.hg38)
# bs <- BSgenome.Hsapiens.UCSC.hg38
# region <- GRanges('chr17', IRanges(7670000, 7680000))
# getSeq(bs, region)

# 任意 GTF/BED/bigWig 讀寫 / Read/write GTF/BED/bigWig
# library(rtracklayer)
# gtf <- import('annotation.gtf')          # GRanges
# bed <- import('peaks.bed')
# export(bed, 'peaks_v2.bed')

cat('Demo only — install Bioc packages above to actually load these objects.\n')

七、AnnotationHub & ExperimentHub

不想下載一堆 .gz、自己寫解析?這兩個 hub 提供「R 一句話下載並讀進來」的雲端資料庫,包含 ENCODE、Roadmap、ExAC、TCGA 範例等大量資源。

Tired of hunting down .gz files and writing parsers? These hubs let you fetch ENCODE, Roadmap, ExAC, TCGA examples, etc. with one R call.

# library(AnnotationHub)
# ah <- AnnotationHub()
# query(ah, c('Homo sapiens', 'GTF', 'Ensembl', '110'))
# gtf <- ah[['AH116860']]              # 直接拿 GRanges 物件

# library(ExperimentHub)
# eh <- ExperimentHub()
# query(eh, 'TCGA RNA-seq')
# counts <- eh[['EHxxxx']]             # 拿 SummarizedExperiment

cat('Hub demo — needs internet & install AnnotationHub / ExperimentHub.\n')

📝 自我檢測

1. 為什麼 DESeq2/limma/edgeR 都用 SummarizedExperiment?

1. Why do DESeq2/limma/edgeR all use SummarizedExperiment?

A. 純歷史包袱A. Historical baggage
B. 沒原因B. No reason
C. 把 counts、樣本中繼資料、基因註解綁一起,子集時三者保持對齊──避免常見錯誤C. It binds counts + sample metadata + gene info; subsetting keeps them aligned — avoids classic errors
D. 為了配合 PythonD. For Python compatibility

2. 你有 1000 個 ChIP-seq peaks(GRanges),想找每個 peak 最近的基因。應使用?

2. You have 1,000 ChIP-seq peaks (GRanges); find the nearest gene for each — use?

A. match()A. match()
B. setdiff()B. setdiff()
C. nearest() / distanceToNearest() / ChIPseeker::annotatePeak()C. nearest() / distanceToNearest() / ChIPseeker::annotatePeak()
D. 自己寫 for 迴圈算距離D. Hand-write a for loop

3. 想把 Ensembl ID ENSG00000141510 轉成 gene symbol,正確套件與函式?

3. Convert Ensembl ID ENSG00000141510 to gene symbol — correct?

A. AnnotationDbi::mapIds(org.Hs.eg.db, keys=..., column="SYMBOL", keytype="ENSEMBL")A. AnnotationDbi::mapIds(org.Hs.eg.db, keys=..., column="SYMBOL", keytype="ENSEMBL")
B. strsplit("ENSG", ...)B. strsplit("ENSG", ...)
C. install.packages("ENSG_to_SYMBOL")C. install.packages("ENSG_to_SYMBOL")
D. 不需任何套件,R 內建D. No package needed