STEP 15 / 16

ChIP-seq & ATAC-seq

Peak 註解、差異結合、motif 分析──從 BAM 到生物學洞見的 R 工作流。

Peak annotation, differential binding, motif analysis — BAM to biology in R.

一、ChIP-seq vs ATAC-seq

ChIP-seqATAC-seq
量什麼特定蛋白與 DNA 結合位點開放染色質區域
原理抗體 IP → 純化結合 DNA → 定序Tn5 切開放區 → adapter → 定序
細胞需求
輸出BAM → narrowPeak / broadPeak (MACS2)BAM → narrowPeak (MACS2)
代表用途TF target、組蛋白標記開放染色質、推估 TF 活性
💡
R 上的角色:R 跑 alignment 與 peak calling(這些用 Bowtie2、BWA、MACS2 等命令列工具在 Linux 跑)。R 接手的是:peak 註解、motif、差異結合、視覺化──這也是本章重點。 R's role: R does not do alignment or peak calling (use Bowtie2/BWA/MACS2 on Linux). R takes over for peak annotation, motif analysis, differential binding, visualization — that's this chapter.

二、ChIPseeker:peak 註解標準工具

# BiocManager::install(c('ChIPseeker','TxDb.Hsapiens.UCSC.hg38.knownGene','org.Hs.eg.db'))
# library(ChIPseeker); library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# library(clusterProfiler); library(org.Hs.eg.db)

# txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# 1) 讀 narrowPeak / BED 檔
# peaks <- readPeakFile('peaks/sample.narrowPeak')
# peaks      # GRanges

# 2) 註解 peak:每個 peak 對應到最近的基因 + genomic feature
# anno <- annotatePeak(peaks,
#                      TxDb = txdb,
#                      tssRegion = c(-3000, 3000),    # promoter 定義
#                      annoDb = 'org.Hs.eg.db',
#                      level = 'gene')
# anno_df <- as.data.frame(anno)
# head(anno_df[, c('seqnames','start','end','annotation','geneId','SYMBOL','distanceToTSS')])

# 3) 視覺化 / Visualize
# plotAnnoPie(anno)        # promoter / intron / intergenic / ... 圓餅
# plotAnnoBar(anno)
# vennpie(anno)
# upsetplot(anno)
# plotDistToTSS(anno)      # peak 距 TSS 距離分布

# 4) 多樣本比較 (e.g. 多個 ChIP)
# files <- list(s1 = 'peaks/s1.narrowPeak', s2 = 'peaks/s2.narrowPeak')
# annoList <- lapply(files, annotatePeak, TxDb = txdb, tssRegion = c(-3000, 3000))
# plotAnnoBar(annoList)

# 5) Peak heatmap:所有 peak 在 TSS 附近 ±3kb 的訊號分布
# promoter <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
# tagMatrix <- getTagMatrix(peaks, windows = promoter)
# tagHeatmap(tagMatrix, xlim = c(-3000, 3000), color = 'firebrick')
# plotAvgProf(tagMatrix, xlim = c(-3000, 3000))

# 6) GO/KEGG 富集 (拿 anno 中的 geneId)
# entrez_ids <- unique(as.data.frame(anno)$geneId)
# ego <- enrichGO(entrez_ids, OrgDb = org.Hs.eg.db, ont = 'BP', readable = TRUE)
# dotplot(ego)

cat('ChIPseeker demo — install Bioc packages above to run on real peaks.\n')

三、DiffBind:差異結合分析

「兩組樣本的 ChIP/ATAC peak intensity 是否顯著不同?」DiffBind 把所有樣本的 peak 合併、計算 read counts、走 DESeq2/edgeR 框架做差異分析。

"Are peak intensities significantly different between two groups?" DiffBind merges peaks across samples, counts reads, then runs DESeq2/edgeR for differential testing.

# BiocManager::install('DiffBind')
# library(DiffBind)

# 1) 準備 sample sheet (CSV,必填欄:SampleID, Condition, bamReads, Peaks, PeakCaller)
# samples <- read.csv('samples.csv')
# head(samples)
# #   SampleID  Tissue   Condition  Replicate  bamReads          Peaks                PeakCaller
# #   ctrl_1    HeLa     control    1          bam/ctrl_1.bam    peaks/ctrl_1.narrow  narrow

# 2) 建立 dba 物件
# dba <- dba(sampleSheet = samples)
# dba       # 摘要

# 3) 計算每個 peak 的 read count
# dba <- dba.count(dba, summits = 250)         # 以 peak summit ±250bp 為窗

# 4) 設定比較
# dba <- dba.contrast(dba, categories = DBA_CONDITION, minMembers = 2)

# 5) 執行差異分析(內部用 DESeq2)
# dba <- dba.analyze(dba)
# dba.show(dba, bContrasts = TRUE)

# 6) 取結果 / Extract results
# res <- dba.report(dba, contrast = 1, th = 0.05, fold = 1)
# res    # GRanges of differentially bound peaks
# as.data.frame(res)

# 7) 視覺化
# dba.plotMA(dba)            # MA plot
# dba.plotVolcano(dba)       # Volcano
# dba.plotHeatmap(dba, contrast = 1, correlations = FALSE)
# dba.plotPCA(dba, attributes = DBA_CONDITION)

cat('DiffBind demo — needs BAM + peak files + sample sheet to actually run.\n')

四、Motif 分析

工具用途
motifmatchrPWM 匹配(chromVAR 用)
JASPAR2024JASPAR motif 資料庫
TFBSToolsPWM 操作工具
memesMEME R wrapper
monaLisa差異 motif activity
universalmotifmotif 格式互轉
# BiocManager::install(c('motifmatchr','JASPAR2024','TFBSTools',
#                        'BSgenome.Hsapiens.UCSC.hg38'))

# library(motifmatchr); library(JASPAR2024); library(TFBSTools)
# library(BSgenome.Hsapiens.UCSC.hg38)

# 1) 從 JASPAR 拿 motif PWMs
# opts <- list(species = 9606, collection = 'CORE')   # 9606 = human
# pwms <- getMatrixSet(JASPAR2024, opts)
# length(pwms)

# 2) 對 peaks 做 motif 匹配
# bs <- BSgenome.Hsapiens.UCSC.hg38
# motif_matches <- matchMotifs(pwms, peaks, genome = bs)
# motif_matches  # SummarizedExperiment of TRUE/FALSE matches

# 3) 計算每個 peak 中匹配到的 motif 數量 + 對 TF activity (chromVAR)
# library(chromVAR)
# dev <- computeDeviations(object = se_atac,         # ATAC counts SE
#                          annotations = motif_matches)
# variability <- computeVariability(dev)
# head(variability[order(-variability$variability), ])

cat('Motif demo — install motifmatchr + JASPAR2024 + chromVAR for full workflow.\n')
💡
De novo motif discovery:R 中沒有原生工具──實務上會把 peaks 轉 FASTA 餵給 MEME / HOMER(命令列),結果再用 memes 套件讀回 R 視覺化。 De novo motif discovery: there's no native R tool — typically pipe peak FASTAs to MEME/HOMER (CLI) then read results back with memes.

五、在 R 中畫 IGV-like 基因組軌道

# BiocManager::install(c('Gviz','rtracklayer'))
# library(Gviz); library(rtracklayer)
# library(TxDb.Hsapiens.UCSC.hg38.knownGene)

# # 區域定義 / Define region
# chr <- 'chr17'; start <- 7670000; end <- 7700000

# # 基因軌道 / Gene track
# txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# gene_track <- GeneRegionTrack(txdb, chromosome = chr, start = start, end = end,
#                                name = 'Genes', transcriptAnnotation = 'symbol')

# # bigWig 訊號軌道 (e.g. ChIP-seq, ATAC-seq)
# bw1 <- DataTrack(range = 'data/sample1.bw', chromosome = chr,
#                  start = start, end = end, name = 'ChIP', type = 'h')
# bw2 <- DataTrack(range = 'data/sample2.bw', chromosome = chr,
#                  start = start, end = end, name = 'ATAC', type = 'h')

# # Peak 軌道
# peaks <- import('peaks/sample.narrowPeak')
# peak_track <- AnnotationTrack(peaks, chromosome = chr,
#                                start = start, end = end, name = 'Peaks')

# # 座標軸 + 染色體圖示
# axis  <- GenomeAxisTrack()
# ideo  <- IdeogramTrack(genome = 'hg38', chromosome = chr)

# # 一次畫所有軌道
# plotTracks(list(ideo, axis, gene_track, bw1, bw2, peak_track),
#            from = start, to = end)

# # 進階替代:karyoploteR、ggbio
cat('Gviz demo — needs bigWig / peak files. Replace paths to actually run.\n')

六、scATAC-seq 工具:Signac / ArchR

工具特色
Signac (Satija lab)Seurat 風格 API
ArchR效能極佳,大資料
chromVAR細胞 TF 活性
cisTopic主題模型 cis-module
# 詳見 scRNA-seq tutorial 對應章節(參考 sister 教程)
# 標準流程概念:
#   1) Cell Ranger ATAC → fragments.tsv.gz + filtered_peak_bc_matrix
#   2) Signac::CreateChromatinAssay → 整合進 Seurat
#   3) QC (TSS enrichment, nucleosome signal)
#   4) LSI dimensionality reduction → UMAP → clustering
#   5) GeneActivity matrix → 與 scRNA 整合
#   6) Motif analysis (chromVAR) → cell-type TF activity

cat('scATAC demo — see Signac vignette for full single-cell pipeline.\n')

📝 自我檢測

1. 你拿到 MACS2 的 narrowPeak 檔,想知道每個 peak 最近的基因,最方便的 R 工具?

1. From MACS2 narrowPeak, easiest R tool for nearest-gene annotation?

A. DESeq2A. DESeq2
B. limmaB. limma
C. ChIPseeker::annotatePeak()C. ChIPseeker::annotatePeak()
D. Biostrings::matchPattern()D. Biostrings::matchPattern()

2. 想做「兩組 ChIP-seq peak intensity 差異」最常用的 R 套件?

2. Most common R package for "differential ChIP-seq peak intensity"?

A. ggplot2A. ggplot2
B. SeuratB. Seurat
C. DiffBindC. DiffBind
D. tidyrD. tidyr

3. 關於 ATAC-seq vs ChIP-seq 何者正確?

3. Which statement about ATAC-seq vs ChIP-seq is correct?

A. ATAC-seq 量「開放染色質」;ChIP-seq 量「特定蛋白結合」──需要不同抗體與目的A. ATAC measures "open chromatin"; ChIP measures "specific protein binding" — different antibodies & goals
B. 它們是同義詞B. Synonyms
C. ATAC-seq 必須有抗體C. ATAC-seq needs antibody
D. ChIP-seq 適合 100 個細胞D. ChIP-seq works with 100 cells