1. From Clusters to Cell Types
Clustering is a purely mathematical grouping: it does not tell you which cell type each cluster represents. Annotation combines known marker genes with domain knowledge to assign a biological label to each cluster. Of all the steps in an scRNA-seq analysis, this is the one that depends most on biological expertise.
2. Annotation Methods
Finding markers by differential expression
Compare the target cluster against all other cells to find genes whose expression is significantly higher in that cluster. These genes are the cluster's "candidate markers". Then compare them with marker gene lists from the literature to infer the cell type.
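The one-vs-rest comparison can be sketched in plain Python. This is a minimal illustration, not what Seurat or Scanpy do internally: it only applies a fold-change cutoff and a minimum fraction of expressing cells (the parameter names `min_pct` and `logfc_threshold` are borrowed from `FindAllMarkers` for readability), and it skips the statistical test (for example a Wilcoxon rank-sum test) that a real analysis would add. The expression values, the pseudocount of 1, and the function name `candidate_markers` are assumptions made for the example.

```python
import math


def mean(values):
    return sum(values) / len(values)


def candidate_markers(expr, clusters, target, min_pct=0.25, logfc_threshold=0.25):
    """Rank genes that are higher in `target` than in all other cells."""
    inside_idx = [i for i, c in enumerate(clusters) if c == target]
    outside_idx = [i for i, c in enumerate(clusters) if c != target]
    hits = []
    for gene, values in expr.items():
        inside = [values[i] for i in inside_idx]
        outside = [values[i] for i in outside_idx]
        # Fraction of cells in the target cluster with any detected expression.
        pct_in = sum(v > 0 for v in inside) / len(inside)
        # log2 fold change of the means; the pseudocount of 1 avoids log(0).
        log2fc = math.log2((mean(inside) + 1) / (mean(outside) + 1))
        if pct_in >= min_pct and log2fc >= logfc_threshold:
            hits.append((gene, log2fc, pct_in))
    # Strongest upregulation first.
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Hypothetical toy data: gene -> normalized expression, one value per cell.
expr = {
    "CD3D":  [2.1, 1.8, 2.4, 0.0, 0.1, 0.0],
    "MS4A1": [0.0, 0.1, 0.0, 2.2, 1.9, 2.5],
    "LYZ":   [0.3, 0.2, 0.4, 0.3, 0.2, 0.4],
}
clusters = ["0", "0", "0", "1", "1", "1"]

for gene, log2fc, pct_in in candidate_markers(expr, clusters, target="0"):
    print(gene, round(log2fc, 2), round(pct_in, 2))
```

In this toy data LYZ is expressed equally everywhere, so it is filtered out for both clusters; only genes that separate the target cluster from the rest survive as candidates.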
Automated annotation tools
These tools algorithmically compare your data against a curated reference (a reference dataset, a pre-trained model, or a marker database):
- SingleR (R): compares each cell, one at a time, against the expression profiles of a reference dataset
- Azimuth (Seurat ecosystem): reference mapping, with pre-built references for many tissues
- CellTypist (Python): automated annotation based on machine-learning classifiers
- scType (R): automated annotation based on a marker gene database
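The core idea behind reference-based tools can be illustrated with a short sketch: give each cell the label of the reference profile it correlates with best. This is a simplified illustration under stated assumptions, not the actual algorithm of SingleR or any other tool listed above, which add steps such as gene selection and label fine-tuning. The reference profiles, query cells, and function names are all hypothetical.

```python
import math


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


def annotate_cell(query, reference):
    """Return (label, correlation) of the best-matching reference profile."""
    scores = {label: spearman(query, profile) for label, profile in reference.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical reference profiles over the genes
# CD3D, IL7R, MS4A1, CD79A, CD14, LYZ (in that order).
reference = {
    "T cell":   [3.0, 2.5, 0.1, 0.0, 0.2, 0.5],
    "B cell":   [0.1, 0.3, 3.2, 2.8, 0.0, 0.4],
    "Monocyte": [0.2, 0.1, 0.0, 0.1, 3.1, 3.5],
}
query_cells = {
    "cell_1": [2.2, 1.9, 0.0, 0.1, 0.0, 0.3],
    "cell_2": [0.0, 0.2, 2.9, 2.1, 0.1, 0.2],
}

for name, profile in query_cells.items():
    label, score = annotate_cell(profile, reference)
    print(name, label, round(score, 2))
```

A rank-based correlation is used here because it is less sensitive to differences in scale between the query data and the reference than a correlation on raw values.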
Expert manual verification
Domain experts review the algorithm's assignments one by one and confirm them against marker gene lists from the literature. This step cannot be skipped: automated tools are limited by the quality and completeness of their reference datasets, and they may fail to recognize novel or rare cell subpopulations.
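One way to organize the review is to summarize the per-cell automated labels within each cluster and look first at the clusters where they disagree. The sketch below is only an aid for prioritizing; it does not replace the expert check of every cluster. The cutoff `min_agreement=0.8`, the toy labels, and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def summarize_clusters(clusters, auto_labels, min_agreement=0.8):
    """Majority automated label per cluster, plus how consistent it is."""
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(clusters, auto_labels):
        labels_by_cluster[cluster].append(label)

    summary = {}
    for cluster, labels in labels_by_cluster.items():
        majority_label, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        summary[cluster] = {
            "majority_label": majority_label,
            "agreement": agreement,
            # Mixed labels may indicate a rare subtype, a cell type missing
            # from the reference, or a cluster that should be split.
            "needs_review_first": agreement < min_agreement,
        }
    return summary


# Hypothetical per-cell results from an automated tool.
clusters = ["0"] * 5 + ["1"] * 5
auto_labels = ["T cell"] * 5 + ["B cell", "B cell", "B cell", "NK", "T cell"]

for cluster, info in summarize_clusters(clusters, auto_labels).items():
    print(cluster, info)
```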
3. Dot Plot Explorer
Select a cluster to see the expression level of each marker gene. The combination of highly expressed genes points to a specific cell type, so try to work out each cluster's identity from its marker pattern.
Bar height = average expression
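The per-cluster average expression, together with the fraction of cells expressing each gene, are the two quantities that a dot plot such as Seurat's `DotPlot` or Scanpy's `sc.pl.dotplot` encodes (dot colour for the average, dot size for the fraction). The following sketch computes both from toy data; the values and the function name `dotplot_stats` are hypothetical.

```python
def dotplot_stats(expr, clusters):
    """Map (cluster, gene) -> (mean expression, fraction of expressing cells)."""
    stats = {}
    for cluster in sorted(set(clusters)):
        idx = [i for i, c in enumerate(clusters) if c == cluster]
        for gene, values in expr.items():
            in_cluster = [values[i] for i in idx]
            mean_expr = sum(in_cluster) / len(in_cluster)
            frac_expressing = sum(v > 0 for v in in_cluster) / len(in_cluster)
            stats[(cluster, gene)] = (mean_expr, frac_expressing)
    return stats


# Hypothetical toy data: gene -> normalized expression, one value per cell.
expr = {
    "CD3D":  [2.0, 1.0, 0.0, 0.0],
    "MS4A1": [0.0, 0.0, 3.0, 1.0],
}
clusters = ["0", "0", "1", "1"]

for (cluster, gene), (mean_expr, frac) in dotplot_stats(expr, clusters).items():
    print(f"cluster {cluster}  {gene:6s}  mean={mean_expr:.2f}  fraction={frac:.2f}")
```

Looking at both numbers helps separate a gene that is moderately expressed in most cells of a cluster from one that is very high in only a few cells.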
4. Common PBMC Markers
| Cell type | Marker genes |
|---|---|
| CD4+ T | CD3D, CD3E, IL7R |
| CD8+ T | CD3D, CD8A, CD8B |
| B cells | MS4A1 (CD20), CD79A |
| NK | GNLY, NKG7, KLRD1 |
| CD14 Mono | CD14, LYZ, S100A8 |
| FCGR3A Mono | FCGR3A, MS4A7 |
| DC | FCER1A, CST3 |
| Platelet | PPBP, PF4 |
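As a rough illustration of how such a table is used, the sketch below scores a cluster against each row by averaging the expression of that row's markers and picks the highest score. The marker lists are taken from the table above; the scoring rule, the cluster averages, and the function names are assumptions made for the example. Marker-database tools use more elaborate scoring than a plain average, and shared markers such as CD3D mean that a close runner-up deserves a manual look.

```python
# Marker lists copied from the table above.
PBMC_MARKERS = {
    "CD4+ T": ["CD3D", "CD3E", "IL7R"],
    "CD8+ T": ["CD3D", "CD8A", "CD8B"],
    "B cells": ["MS4A1", "CD79A"],
    "NK": ["GNLY", "NKG7", "KLRD1"],
    "CD14 Mono": ["CD14", "LYZ", "S100A8"],
    "FCGR3A Mono": ["FCGR3A", "MS4A7"],
    "DC": ["FCER1A", "CST3"],
    "Platelet": ["PPBP", "PF4"],
}


def score_cell_types(cluster_means, markers=PBMC_MARKERS):
    """Average marker expression per cell type; undetected genes count as 0."""
    return {
        cell_type: sum(cluster_means.get(g, 0.0) for g in genes) / len(genes)
        for cell_type, genes in markers.items()
    }


def best_cell_type(cluster_means, markers=PBMC_MARKERS):
    scores = score_cell_types(cluster_means, markers)
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical average expression of one cluster.
cluster_means = {
    "CD3D": 2.4, "CD3E": 2.0, "IL7R": 1.8,
    "CD8A": 0.1, "NKG7": 0.2, "LYZ": 0.3,
}

label, score = best_cell_type(cluster_means)
print(label, round(score, 2))
```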
5. Manual vs. Automated
🌳 Decision flow
6. Worked Example
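```r
# Assumes `pbmc` is a Seurat object that has already been clustered
library(Seurat)
library(dplyr)

# Find marker genes for every cluster
markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
markers %>%
  group_by(cluster) %>%
  slice_max(avg_log2FC, n = 5)

# Visual check
FeaturePlot(pbmc, features = c("CD3D", "MS4A1", "CD14", "GNLY"))
DotPlot(pbmc, features = c("CD3D", "MS4A1", "CD14")) + RotatedAxis()

# Assign cell types manually
# new.ids needs exactly one label per cluster, in the order of levels(pbmc)
new.ids <- c("CD4 T", "CD14 Mono", "B cell", "CD8 T", "NK")
names(new.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.ids)

# Automated annotation: SingleR
library(SingleR)
ref <- celldex::HumanPrimaryCellAtlasData()
pred <- SingleR(test = GetAssayData(pbmc), ref = ref, labels = ref$label.main)
```

```python
# Assumes `adata` is an AnnData object with Leiden clusters in adata.obs["leiden"]
import scanpy as sc
import celltypist

# Find marker genes
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=10)

# Visual check
sc.pl.umap(adata, color=["CD3D", "MS4A1", "CD14"])
sc.pl.dotplot(adata, var_names=["CD3D", "MS4A1", "CD14"], groupby="leiden")

# Assign cell types manually (clusters missing from the dict become NaN)
cluster_to_celltype = {"0": "CD4 T", "1": "CD14 Mono", "2": "B cell"}
adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype)

# Automated annotation: CellTypist
# CellTypist expects log1p-normalized expression (10,000 counts per cell) in adata.X
# download_models() only fetches the model file; it does not return the model
celltypist.models.download_models(model="Immune_All_Low.pkl")
predictions = celltypist.annotate(adata, model="Immune_All_Low.pkl")
```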
📝 Self-check
A cluster's top markers include CD3D and IL7R. Which cell type is it most likely to be?