Clustering: scRNA-seq Tutorial

Core Concepts

1. Purpose of Clustering

In a typical scRNA-seq experiment, we obtain high-dimensional gene expression data for thousands to tens of thousands of cells. Clustering groups cells with similar transcriptomic profiles, which serves three purposes: characterizing cellular heterogeneity, discovering new cell types, and laying the foundation for downstream cell type annotation.

Three-Step Logic

2. The Three Steps of Clustering

Build the KNN graph

Compute distances between cells in PCA space, find the K nearest neighbors of each cell, and connect them to form a network. K is a tunable parameter that is usually left at the software default (Seurat's FindNeighbors uses k.param = 20; Scanpy's sc.pp.neighbors uses n_neighbors = 15).
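As a minimal sketch of this step, the pure-Python snippet below builds a KNN graph by brute force. The six 2-D points are made-up stand-ins for PCA coordinates, and knn_graph is a hypothetical helper, not a Seurat or Scanpy function. Real pipelines typically rely on optimized or approximate neighbor search over many more cells and dimensions.

```python
import math

def knn_graph(points, k):
    """Return {cell index: indices of its k nearest neighbors, closest first}."""
    graph = {}
    for i, p in enumerate(points):
        # Euclidean distance from cell i to every other cell
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Toy "PCA coordinates" (2 PCs) for six cells forming two separated groups
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.3),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
neighbors = knn_graph(cells, k=2)
for cell, nbrs in neighbors.items():
    print(cell, nbrs)
```

With k=2, every cell in this toy data picks its neighbors from its own group, so the graph already falls into two pieces before any community detection is run.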

Compute SNN weights

Reweight the edges using Shared Nearest Neighbor (SNN) overlap: the more neighbors two cells share, the stronger their connection, and the more likely they belong to the same population.
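One common way to quantify "shared neighbors" is the Jaccard index of the two cells' neighborhoods, which is also how Seurat's documentation describes its SNN construction. The sketch below assumes that definition, counts each cell as a member of its own neighborhood, and uses a hand-written toy KNN dictionary; snn_weights is a hypothetical helper, not a library function.

```python
from itertools import combinations

def snn_weights(knn):
    """Jaccard overlap of neighborhoods for every pair of cells that share a neighbor."""
    # Each neighborhood contains the cell itself plus its nearest neighbors
    hoods = {cell: set(nbrs) | {cell} for cell, nbrs in knn.items()}
    weights = {}
    for i, j in combinations(sorted(hoods), 2):
        shared = hoods[i] & hoods[j]
        if shared:
            weights[(i, j)] = len(shared) / len(hoods[i] | hoods[j])
    return weights

# Toy KNN graph (k=2); cell 3 sits between the two groups
knn = {0: [1, 2], 1: [0, 2], 2: [0, 1],
       3: [2, 4], 4: [3, 5], 5: [3, 4]}
weights = snn_weights(knn)
for pair, w in sorted(weights.items()):
    print(pair, round(w, 2))
```

Pairs inside a tight group get weights near 1, the pairs that link the two groups through cell 3 get low weights, and pairs with no shared neighbors get no edge at all. In practice, weak edges below a cutoff are usually pruned before clustering.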

Community detection

Partition the network into clusters with Louvain (modularity optimization) or Leiden (a refinement of Louvain that guarantees connected communities). Leiden is currently recommended because it fixes Louvain's tendency to occasionally produce badly connected or even disconnected communities.
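To make the "disconnected community" problem concrete, the sketch below checks whether the members of a cluster can all reach each other using only edges inside that cluster. It is a plain breadth-first search on a toy graph, not part of Louvain or Leiden themselves; is_connected and the adjacency dictionary are illustrative assumptions.

```python
from collections import deque

def is_connected(members, adjacency):
    """True if `members` form one connected piece using only edges within the group."""
    members = set(members)
    if not members:
        return True
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adjacency[node]:
            # Only follow edges that stay inside the cluster
            if nb in members and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen == members

# Toy undirected graph: a simple path 0-1-2-3-4
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ok_cluster = is_connected([0, 1, 2], adjacency)
broken_cluster = is_connected([0, 1, 3, 4], adjacency)
print(ok_cluster, broken_cluster)
```

The second cluster is internally disconnected because cell 2, the only link between its two halves, was assigned elsewhere. This is the kind of community that Leiden's refinement step is designed to rule out.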

Interactive Simulation

3. The Resolution Parameter

Resolution is the key parameter controlling clustering granularity. Low values produce a few large clusters; high values produce many small ones.
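Why resolution has this effect can be sketched with a resolution-weighted modularity score, Q = sum over communities of (e_c / m - gamma * (K_c / 2m)^2), where e_c is the number of edges inside community c, K_c is the summed degree of its nodes, m is the total number of edges, and gamma is the resolution. This is one common formulation and is assumed here for illustration; the exact objective depends on the tool and its settings. The snippet only scores three hand-picked partitions of a toy graph and does not run Louvain or Leiden.

```python
def modularity(edges, labels, gamma=1.0):
    """Resolution-weighted modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    inside = {}  # edges with both endpoints in community c
    degree = {}  # summed degree of the nodes in community c
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - gamma * (degree[c] / (2 * m)) ** 2
               for c in degree)

# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
candidates = {
    "one cluster": [0, 0, 0, 0, 0, 0],
    "two clusters": [0, 0, 0, 1, 1, 1],
    "six clusters": [0, 1, 2, 3, 4, 5],
}
best = {}
for gamma in (0.2, 1.0, 3.0):
    best[gamma] = max(candidates,
                      key=lambda name: modularity(edges, candidates[name], gamma))
print(best)
```

Among these three candidates, the single merged cluster scores highest at gamma = 0.2, the two triangles at gamma = 1.0, and the fully split partition at gamma = 3.0. Raising gamma increases the penalty on large communities, which is why higher resolution yields more, smaller clusters.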

Method Comparison

4. Method Comparison

	Graph-based	K-means
Default in	Seurat, Scanpy	Traditional ML
Number of clusters K set in advance?	No (granularity is controlled by resolution)	Yes, must be specified beforehand
Cluster shape	Handles non-convex, irregular shapes	Assumes spherical, similar-sized clusters

Common Pitfalls

5. Common Pitfalls

✂️ Over-clustering

Symptom: A single cell type is split into multiple clusters.
Cause: Resolution is set too high, or technical noise is mistaken for biological heterogeneity.
Check: The split clusters have very few differentially expressed (DE) genes between them, or only marginal differences.
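The check above can be sketched as a crude fold-change count between two clusters. Everything in this snippet is an illustrative assumption: the tiny expression matrices are made up, count_de_genes is a hypothetical helper, and the threshold and pseudocount are arbitrary. It is not a statistical test; in practice you would use a proper DE function such as Seurat's FindMarkers or Scanpy's sc.tl.rank_genes_groups.

```python
import math

def count_de_genes(expr_a, expr_b, min_abs_log2fc=1.0, pseudocount=1.0):
    """Count genes whose mean expression differs by at least min_abs_log2fc."""
    n_genes = len(expr_a[0])
    hits = 0
    for g in range(n_genes):
        mean_a = sum(cell[g] for cell in expr_a) / len(expr_a)
        mean_b = sum(cell[g] for cell in expr_b) / len(expr_b)
        # Pseudocount avoids division by zero for unexpressed genes
        log2fc = math.log2((mean_a + pseudocount) / (mean_b + pseudocount))
        if abs(log2fc) >= min_abs_log2fc:
            hits += 1
    return hits

# Toy data: rows are cells, columns are three genes
cluster_a = [[10, 0, 5], [12, 1, 4]]
cluster_b = [[11, 0, 5], [9, 1, 6]]    # looks like the same cell type as cluster_a
cluster_c = [[0, 20, 5], [1, 18, 4]]   # a clearly different profile

same_type_hits = count_de_genes(cluster_a, cluster_b)
diff_type_hits = count_de_genes(cluster_a, cluster_c)
print(same_type_hits, diff_type_hits)
```

If two clusters behave like cluster_a and cluster_b here, with no genes passing the threshold, that is a hint they may be one population split by an overly high resolution.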

🧪 Batch effects

Symptom: Clusters reflect sample origin rather than cell type.
Solution: Run integration (e.g., Harmony or RPCA) before building the neighbor graph and clustering.
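A quick way to spot this symptom is to tabulate sample origin within each cluster. The sketch below assumes two parallel lists of per-cell labels (in practice these would come from the object's metadata); the labels are made up and dominant_batch_fraction is a hypothetical helper.

```python
from collections import Counter, defaultdict

def dominant_batch_fraction(clusters, batches):
    """For each cluster, the fraction of its cells that come from its most common batch."""
    per_cluster = defaultdict(Counter)
    for cluster, batch in zip(clusters, batches):
        per_cluster[cluster][batch] += 1
    return {cluster: max(counts.values()) / sum(counts.values())
            for cluster, counts in per_cluster.items()}

# Toy per-cell labels: cluster "0" mixes both samples, cluster "1" is all sample s1
clusters = ["0", "0", "0", "0", "1", "1", "1", "1"]
batches = ["s1", "s2", "s1", "s2", "s1", "s1", "s1", "s1"]
fractions = dominant_batch_fraction(clusters, batches)
print(fractions)
```

A cluster made up almost entirely of one sample is a warning sign rather than proof: it can also reflect a cell type that is genuinely present in only that sample, so check its markers before concluding that integration is needed.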

👯 Doublets

Symptom: A small "bridge" cluster appears between two major groups on the UMAP, and its markers mix the signatures of both cell types.
Solution: Use a tool such as DoubletFinder to filter doublets during QC, before clustering.

Code

6. Worked Example

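# R (Seurat)
library(Seurat)
# Build the KNN/SNN graph on the first 15 PCs
pbmc <- FindNeighbors(pbmc, dims = 1:15)
# Leiden clustering (recommended); algorithm = 4 selects Leiden
pbmc <- FindClusters(pbmc, resolution = 0.5, algorithm = 4)
# Try several resolutions; each result is stored in its own metadata column.
# algorithm = 4 is repeated here, otherwise the default (Louvain) is used.
pbmc <- FindClusters(pbmc, resolution = c(0.2, 0.5, 0.8, 1.2), algorithm = 4)
# Visualize how clusters split and merge across resolutions with clustree
library(clustree)
clustree(pbmc)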

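# Python (Scanpy)
import scanpy as sc
# Build the neighbor graph on the first 15 PCs
sc.pp.neighbors(adata, n_pcs=15)
# Leiden clustering at a single resolution
sc.tl.leiden(adata, resolution=0.5)
# Try several resolutions, storing each result under its own key in adata.obs
for r in [0.2, 0.5, 0.8, 1.2]:
    sc.tl.leiden(adata, resolution=r, key_added=f"leiden_{r}")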

📝 Self-Check

Increasing the resolution parameter will:

A. Produce more, finer clusters

B. Produce fewer, larger clusters

C. Change the PCA results

D. Remove low-quality cells