integration — scRNA-seq Tutorial

Core Concepts

1. Batch Effects

When data from different experimental batches are combined, clustering may mainly reflect sample origin rather than cell type. Systematic differences between experiments (reagent lots, operator technique, sequencing platform) can mask the true biological signal.

💡 Diagnostic: if cells on the UMAP group mainly by sample rather than by cell type, integration is needed.
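The diagnostic above can also be checked numerically rather than by eye. The sketch below is a minimal illustration, assuming cluster and sample labels are available as plain lists: for each cluster it computes the normalized Shannon entropy of the sample labels. The function name `batch_entropy_per_cluster` and the toy labels are hypothetical and not part of any library. Values near 0 mean a cluster is dominated by one sample; values near 1 mean samples are evenly mixed.

```python
import math
from collections import Counter, defaultdict


def batch_entropy_per_cluster(clusters, batches):
    """Return {cluster: normalized Shannon entropy of batch labels}, in [0, 1]."""
    n_batches = len(set(batches))
    by_cluster = defaultdict(Counter)
    for cluster, batch in zip(clusters, batches):
        by_cluster[cluster][batch] += 1

    result = {}
    for cluster, counts in by_cluster.items():
        total = sum(counts.values())
        entropy = -sum((k / total) * math.log(k / total) for k in counts.values())
        # Normalize by the maximum possible entropy, log(number of batches).
        result[cluster] = entropy / math.log(n_batches) if n_batches > 1 else 0.0
    return result


# Toy labels: clusters 0 and 1 each contain a single batch (batch-driven),
# cluster 2 is an even mix of both batches (well mixed).
clusters = [0] * 4 + [1] * 4 + [2] * 4
batches = ["A"] * 4 + ["B"] * 4 + ["A", "B", "A", "B"]
scores = batch_entropy_per_cluster(clusters, batches)

for cluster, score in sorted(scores.items()):
    print(f"cluster {cluster}: batch mixing = {score:.2f}")
```

A low score alone is not proof of a batch effect: a cell type that genuinely exists in only one sample will also score near 0, so the result should be read together with marker genes.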

Interactive Simulation

2. Before and After Integration

Toggle between the view before integration (cells cluster by batch) and after integration (cells cluster by cell type). Points can be colored by batch or by cell type.
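The before/after contrast can be reproduced offline with a toy simulation. The sketch below is an illustration under stated assumptions, not any of the methods compared in this tutorial: it places two cell types in two batches, shifts one batch along a second axis, and then "integrates" by subtracting each batch's mean, a deliberately crude stand-in that only works here because both batches have the same cell composition. All function names and parameter values are hypothetical.

```python
import random


def simulate(n_per_group=50, batch_shift=6.0, seed=0):
    """Two cell types in two batches; batch2 is shifted along the second axis."""
    rng = random.Random(seed)
    cells = []  # each cell: ((x, y), batch, cell_type)
    for batch, shift in (("batch1", 0.0), ("batch2", batch_shift)):
        for cell_type, center in (("T", 0.0), ("B", 4.0)):
            for _ in range(n_per_group):
                xy = (rng.gauss(center, 0.5), rng.gauss(shift, 0.5))
                cells.append((xy, batch, cell_type))
    return cells


def center_per_batch(cells):
    """Subtract each batch's mean vector (a crude stand-in for integration)."""
    sums, counts = {}, {}
    for (x, y), batch, _ in cells:
        sx, sy = sums.get(batch, (0.0, 0.0))
        sums[batch] = (sx + x, sy + y)
        counts[batch] = counts.get(batch, 0) + 1
    means = {b: (sx / counts[b], sy / counts[b]) for b, (sx, sy) in sums.items()}
    return [((x - means[b][0], y - means[b][1]), b, t) for (x, y), b, t in cells]


def neighbor_agreement(cells, label_index):
    """Fraction of cells whose nearest neighbor shares the chosen label.

    label_index is 1 for batch and 2 for cell type.
    """
    hits = 0
    for i, cell in enumerate(cells):
        x, y = cell[0]
        nearest = min(
            (other for j, other in enumerate(cells) if j != i),
            key=lambda other: (other[0][0] - x) ** 2 + (other[0][1] - y) ** 2,
        )
        hits += cell[label_index] == nearest[label_index]
    return hits / len(cells)


before = simulate()
after = center_per_batch(before)
for name, cells in (("before", before), ("after", after)):
    print(
        name,
        "| same-batch neighbors:", round(neighbor_agreement(cells, 1), 2),
        "| same-type neighbors:", round(neighbor_agreement(cells, 2), 2),
    )
```

Before correction nearly every cell's nearest neighbor comes from its own batch; after correction the same-batch fraction should fall toward chance while cell types stay separated, which is the pattern the toggle is meant to show.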

Method Comparison

3. Method Comparison

Method	Logic	Speed	Best for
Harmony	Iterative correction in PCA space	⚡	General purpose, large datasets; the most commonly used
CCA (Seurat v3)	Finds shared correlation structure across batches	🔵	Batches with similar cell composition
RPCA (Seurat v5)	Reciprocal PCA projection	🔵	Batches whose cell composition differs substantially
scVI	Variational autoencoder (deep learning)	🐢	Complex batch structure; very large datasets
Scanorama	Panoramic stitching algorithm	🔵	Python ecosystem

🌳 Which one should I use?

Q1: First attempt or quick exploration? → Harmony.

Q2: Does cell composition differ substantially across batches (for example, one sample contains tumor and the other does not)? → RPCA.

Q3: More than 500,000 cells? → Harmony or scVI (scVI needs a GPU).

Q4: Complex batch structure (multi-center, multi-timepoint)? → scVI.
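The four questions can be written down as a small helper, which makes the rules easy to apply consistently across projects. This is only a sketch: the function name `suggest_methods` and its parameters are hypothetical, and the order in which the questions are checked (complex batch structure first, then composition, then size) is an assumption, since the guide does not rank them.

```python
def suggest_methods(n_cells, composition_differs=False,
                    complex_batches=False, has_gpu=False):
    """Return candidate integration methods following the Q1-Q4 guide."""
    if complex_batches:        # Q4: multi-center, multi-timepoint
        return ["scVI"]
    if composition_differs:    # Q2: e.g. tumor present in only one batch
        return ["RPCA"]
    if n_cells > 500_000:      # Q3: very large datasets; scVI needs a GPU
        return ["Harmony", "scVI"] if has_gpu else ["Harmony"]
    return ["Harmony"]         # Q1: first attempt or quick exploration


print(suggest_methods(20_000))
print(suggest_methods(800_000, has_gpu=True))
print(suggest_methods(50_000, composition_differs=True))
```

The output is a starting point, not a verdict; checking the integrated embedding against known marker genes remains necessary whichever method is chosen.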

⚠️ Over-integration risk: if a cell type genuinely exists in only one condition (e.g., tumor cells found only in disease samples), forcing integration may incorrectly merge it with other cell types. Always preserve true biological differences.
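One way to guard against this is to list condition-specific cell types before integrating and then confirm they still form their own cluster afterwards. The sketch below assumes per-cell annotations, condition labels, and post-integration cluster labels are available as plain lists; the function names and the 0.95 and 0.5 thresholds are hypothetical choices for illustration.

```python
from collections import Counter


def condition_specific_types(cell_types, conditions, min_fraction=0.95):
    """Map each cell type found almost only in one condition to that condition."""
    per_type = {}
    for cell_type, condition in zip(cell_types, conditions):
        per_type.setdefault(cell_type, Counter())[condition] += 1
    specific = {}
    for cell_type, counts in per_type.items():
        condition, n = counts.most_common(1)[0]
        if n / sum(counts.values()) >= min_fraction:
            specific[cell_type] = condition
    return specific


def merged_after_integration(cell_types, clusters, target_type, min_purity=0.5):
    """True if the target type's main cluster consists mostly of other cells."""
    home = Counter(
        c for t, c in zip(cell_types, clusters) if t == target_type
    ).most_common(1)[0][0]
    members = [t for t, c in zip(cell_types, clusters) if c == home]
    return members.count(target_type) / len(members) < min_purity


# Toy data: T cells occur in both conditions, tumor cells only in disease.
cell_types = ["T"] * 6 + ["tumor"] * 4
conditions = ["healthy"] * 3 + ["disease"] * 3 + ["disease"] * 4
clusters_kept = [0] * 6 + [1] * 4   # tumor cells keep their own cluster
clusters_merged = [0] * 10          # tumor cells absorbed into the T-cell cluster

specific = condition_specific_types(cell_types, conditions)
for cell_type in specific:
    print(cell_type, "merged (kept case):",
          merged_after_integration(cell_types, clusters_kept, cell_type))
    print(cell_type, "merged (merged case):",
          merged_after_integration(cell_types, clusters_merged, cell_type))
```

If a condition-specific population is flagged as merged, a more conservative method or weaker integration settings are worth trying before interpreting the clusters.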

Code

4. Worked Examples

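# Harmony (R); assumes a Seurat object with PCA already computed
library(Seurat)
library(harmony)
# Correct the PCA embedding for batch; the result is stored as the "harmony" reduction
pbmc <- RunHarmony(pbmc, group.by.vars = "batch")
# Build the neighbor graph, clusters, and UMAP on the corrected embedding
pbmc <- FindNeighbors(pbmc, reduction = "harmony", dims = 1:20)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, reduction = "harmony", dims = 1:20)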

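# Seurat v5 integration (RPCA)
# Split the RNA assay into one layer per batch
pbmc[["RNA"]] <- split(pbmc[["RNA"]], f = pbmc$batch)
# The Seurat v5 vignette runs NormalizeData, FindVariableFeatures, ScaleData, and RunPCA here
# Pass the integration function itself (not a call) as `method`
pbmc <- IntegrateLayers(pbmc, method = RPCAIntegration,
  orig.reduction = "pca", new.reduction = "integrated.rpca")
# Rejoin the per-batch layers for downstream analysis
pbmc[["RNA"]] <- JoinLayers(pbmc[["RNA"]])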

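# scVI (Python); the model expects raw counts in adata.X
import scanpy as sc
import scvi
# Register the AnnData object and tell scVI which obs column holds the batch
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=100)
# Store the batch-corrected latent space, then use it for the graph, clusters, and UMAP
adata.obsm["X_scVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.leiden(adata)
sc.tl.umap(adata)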