ADVANCED

Data Integration

Integrate multi-sample or multi-batch data to reduce batch effects while preserving true biological differences.

1. Batch Effects

When data from different experimental batches are combined, clustering may mainly reflect sample origin rather than cell type. Systematic differences between experiments (reagent lots, operator technique, sequencing platform) can mask the true biological signal.

💡 Diagnostic: if cells on the UMAP group mainly by sample rather than by cell type, integration is needed.
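The visual diagnostic above can be backed by a number. The following is a minimal sketch, not part of any library: it computes, for each cluster, the normalized Shannon entropy of the batch labels. A value near 0 means the cluster is made of a single batch (cells grouping by sample); a value near 1 means batches are evenly mixed. The function name `batch_entropy_per_cluster` and the toy labels are hypothetical; in practice the two lists would come from the cluster and batch columns of the cell metadata.

```python
import math
from collections import Counter, defaultdict

def batch_entropy_per_cluster(clusters, batches):
    """Return {cluster: normalized Shannon entropy of its batch labels}, in [0, 1]."""
    n_batches = len(set(batches))
    labels_by_cluster = defaultdict(list)
    for cluster, batch in zip(clusters, batches):
        labels_by_cluster[cluster].append(batch)

    scores = {}
    for cluster, labels in labels_by_cluster.items():
        total = len(labels)
        counts = Counter(labels)
        # Shannon entropy of the batch distribution inside this cluster
        entropy = sum(-(n / total) * math.log(n / total) for n in counts.values())
        # Divide by the maximum possible entropy so scores are comparable
        scores[cluster] = entropy / math.log(n_batches) if n_batches > 1 else 0.0
    return scores

# Toy labels (hypothetical): cluster "0" is pure batch A, cluster "1" is evenly mixed
clusters = ["0", "0", "0", "0", "1", "1", "1", "1"]
batches = ["A", "A", "A", "A", "A", "B", "A", "B"]
scores = batch_entropy_per_cluster(clusters, batches)
for cluster, score in sorted(scores.items()):
    print(cluster, round(score, 3))
```

Low entropy alone does not prove a batch effect: a cell type that truly exists in only one sample will also form a single-batch cluster, so the score is best read alongside marker genes.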

2. Before vs. After Integration

Compare the embedding before integration (cells cluster by batch) with the embedding after integration (cells cluster by cell type). Color each view by batch and then by cell type to see the difference.
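The before/after comparison can also be summarized with a simple k-nearest-neighbour mixing score: the average fraction of each cell's neighbours that come from a different batch. The sketch below uses toy 2D coordinates and a deliberately crude correction (subtracting each batch's mean), which is only a stand-in for illustration and is not how Harmony, RPCA, or scVI work. Per-batch centering is reasonable here only because both toy batches contain the same cell types in the same proportions. All names and values are hypothetical.

```python
import math

def knn_batch_mixing(coords, batches, k=3):
    """Mean fraction of each cell's k nearest neighbours that belong to another batch."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        # Distances from cell i to every other cell, nearest first
        dists = sorted((math.dist(coords[i], coords[j]), j) for j in range(n) if j != i)
        neighbours = [j for _, j in dists[:k]]
        total += sum(batches[j] != batches[i] for j in neighbours) / k
    return total / n

def center_per_batch(coords, batches):
    """Toy correction: subtract each batch's mean coordinate (illustration only)."""
    centered = []
    for (x, y), b in zip(coords, batches):
        members = [c for c, bb in zip(coords, batches) if bb == b]
        mean_x = sum(c[0] for c in members) / len(members)
        mean_y = sum(c[1] for c in members) / len(members)
        centered.append((x - mean_x, y - mean_y))
    return centered

# Two cell types (y near 0 and y near 5); batch B is shifted by about +10 along x
coords = [(0.0, 0.0), (0.2, 0.1), (0.0, 5.0), (0.1, 5.2),
          (10.1, 0.05), (10.3, 0.2), (10.05, 5.1), (10.2, 4.9)]
batches = ["A", "A", "A", "A", "B", "B", "B", "B"]

before = knn_batch_mixing(coords, batches, k=3)
after = knn_batch_mixing(center_per_batch(coords, batches), batches, k=3)
print(f"mixing before: {before:.2f}, after: {after:.2f}")
```

A higher score after correction indicates that batches are better interleaved. It says nothing about whether cell types were preserved, so it should be paired with a check on cell-type separation.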

3. Method Comparison

Method | Logic | Speed | Best suited for
Harmony | Iterative correction in PCA space | | General purpose, large datasets; most commonly used
CCA (Seurat v3) | Finds shared correlation structure across batches | 🔵 | When cell composition is similar across batches
RPCA (Seurat v5) | Reciprocal PCA projection | 🔵 | When cell composition differs substantially across batches
scVI | Variational autoencoder (deep learning) | 🐢 | Complex batch structure; very large datasets
Scanorama | Panoramic stitching algorithm | 🔵 | Python ecosystem

🌳 Which one should I use?

Q1: First attempt or quick exploration? Harmony.
Q2: Cell composition differs substantially across batches (for example, one contains tumor and the other does not)? RPCA.
Q3: More than 500K cells? Harmony or scVI (needs a GPU).
Q4: Complex batch structure (multi-center, multi-timepoint)? scVI.
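The four questions can be written down as a small helper, which makes the rule of thumb explicit and easy to adjust. This is only a sketch: the function name, its parameters, and in particular the order in which the questions are checked are assumptions, since the guide lists the questions without ranking them.

```python
def suggest_integration_method(n_cells, composition_differs=False,
                               complex_batches=False, has_gpu=False):
    """Return a suggested method name following the Q1-Q4 rule of thumb.

    The precedence (Q4, then Q3, then Q2, then Q1) is an assumption of this sketch.
    """
    if complex_batches:              # Q4: multi-center, multi-timepoint designs
        return "scVI"
    if n_cells > 500_000:            # Q3: very large datasets
        return "Harmony or scVI" if has_gpu else "Harmony"
    if composition_differs:          # Q2: e.g. tumor present in only one batch
        return "RPCA"
    return "Harmony"                 # Q1: default first attempt

print(suggest_integration_method(20_000))
print(suggest_integration_method(20_000, composition_differs=True))
print(suggest_integration_method(800_000, has_gpu=True))
```

The helper returns a starting point rather than a verdict; comparing two methods on the same data remains the more reliable way to choose.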
⚠️ Over-integration risk: if a cell type genuinely exists in only one condition (for example, tumor cells only in disease samples), forced integration may incorrectly merge it with other cells. Always check that true biological differences survive integration.
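One way to check for this risk is to compare annotations made before integration with the clusters obtained afterwards. The sketch below assumes per-cell labels are already available (cell type from a pre-integration annotation, condition, and post-integration cluster); the function name, the `min_purity` threshold of 0.5, and the toy data are all hypothetical. It flags cell types seen in only one condition whose cells ended up in a cluster dominated by other types.

```python
from collections import Counter, defaultdict

def flag_absorbed_types(cell_types, conditions, clusters, min_purity=0.5):
    """Return {cell_type: (cluster, purity)} for condition-specific types
    whose main post-integration cluster is mostly made of other cell types."""
    conditions_by_type = defaultdict(set)
    for cell_type, condition in zip(cell_types, conditions):
        conditions_by_type[cell_type].add(condition)
    specific = {t for t, conds in conditions_by_type.items() if len(conds) == 1}

    types_by_cluster = defaultdict(Counter)
    for cell_type, cluster in zip(cell_types, clusters):
        types_by_cluster[cluster][cell_type] += 1

    flagged = {}
    for cell_type in sorted(specific):
        # Cluster that holds the most cells of this type
        home = max(types_by_cluster, key=lambda k: types_by_cluster[k][cell_type])
        purity = types_by_cluster[home][cell_type] / sum(types_by_cluster[home].values())
        if purity < min_purity:
            flagged[cell_type] = (home, purity)
    return flagged

# Toy data: tumor cells (disease only) were merged into the epithelial cluster "c1"
cell_types = ["T cell"] * 4 + ["epithelial"] * 5 + ["tumor"] * 3
conditions = (["healthy", "healthy", "disease", "disease"]
              + ["healthy", "healthy", "healthy", "disease", "disease"]
              + ["disease"] * 3)
clusters = ["c0"] * 4 + ["c1"] * 5 + ["c1"] * 3

flagged = flag_absorbed_types(cell_types, conditions, clusters)
print(flagged)
```

A flagged type is a prompt to inspect marker genes or try a less aggressive method, not proof that integration failed.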

4. Implementation Examples

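# R: Harmony (assumes `pbmc` is a merged Seurat object with PCA already computed)
library(Seurat)
library(harmony)
pbmc <- RunHarmony(pbmc, group.by.vars = "batch")  # adds a "harmony" reduction
pbmc <- FindNeighbors(pbmc, reduction = "harmony", dims = 1:20)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, reduction = "harmony", dims = 1:20)

# R: Seurat v5 integration (split layers by batch, preprocess, integrate, rejoin)
pbmc[["RNA"]] <- split(pbmc[["RNA"]], f = pbmc$batch)
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc)
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc)
# Pass the integration function itself (no parentheses)
pbmc <- IntegrateLayers(pbmc, method = RPCAIntegration,
  orig.reduction = "pca", new.reduction = "integrated.rpca")
pbmc[["RNA"]] <- JoinLayers(pbmc[["RNA"]])

# Python: scVI (assumes `adata` is an AnnData object holding raw counts and a "batch" column)
import scanpy as sc
import scvi
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=100)
# Use the batch-corrected latent space for neighbors, clustering, and UMAP
adata.obsm["X_scVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.leiden(adata)
sc.tl.umap(adata)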