annotation — scRNA-seq Tutorial

Core Concepts

1. From Clusters to Cell Types

Clustering is a purely mathematical grouping; it does not tell you what cell type each cluster represents. Annotation combines known marker genes with domain knowledge to assign a biological label to each cluster. Of all the steps in an scRNA-seq analysis, this one relies most heavily on biological expertise.

Three Strategies

2. Annotation Methods

Finding Markers by Differential Expression

Compare a target cluster against all other cells to find genes that are significantly upregulated in it. These genes are the cluster's "candidate markers". Then match them against known marker gene lists from the literature to infer the cell type.
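The one-vs-rest comparison described above can be sketched in plain Python. This is a simplified illustration, not how Seurat or Scanpy compute their statistics: it only calculates a log2 fold change and the fraction of expressing cells, and it omits the significance test (for example a Wilcoxon rank-sum test with multiple-testing correction) that a real analysis needs. The function name, the pseudocount of 1, and the toy expression values are all assumptions made for illustration; the two thresholds simply reuse the values seen in typical marker-finding calls.

```python
import math
from statistics import mean


def one_vs_rest_markers(expr, clusters, target, min_pct=0.25, logfc_threshold=0.25):
    """Rank genes that are upregulated in `target` versus all other cells.

    expr:     dict mapping gene -> list of normalized expression values (one per cell)
    clusters: list of cluster labels, in the same cell order as the expression lists
    """
    in_target = [c == target for c in clusters]
    results = []
    for gene, values in expr.items():
        inside = [v for v, t in zip(values, in_target) if t]
        outside = [v for v, t in zip(values, in_target) if not t]
        # Fraction of cells with non-zero expression inside and outside the cluster
        pct_in = sum(v > 0 for v in inside) / len(inside)
        pct_out = sum(v > 0 for v in outside) / len(outside)
        # Pseudocount of 1 (an assumption) avoids taking the log of zero
        log2fc = math.log2(mean(inside) + 1) - math.log2(mean(outside) + 1)
        if max(pct_in, pct_out) >= min_pct and log2fc >= logfc_threshold:
            results.append((gene, log2fc, pct_in, pct_out))
    # Highest fold change first
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy values invented for illustration: 3 cells in cluster "0", 3 cells in cluster "1"
toy_expr = {
    "CD3D": [3.0, 2.5, 2.8, 0.0, 0.1, 0.0],
    "MS4A1": [0.0, 0.0, 0.1, 2.9, 3.1, 2.7],
    "ACTB": [4.0, 4.1, 3.9, 4.0, 4.2, 3.8],
}
toy_clusters = ["0", "0", "0", "1", "1", "1"]

candidates = one_vs_rest_markers(toy_expr, toy_clusters, target="0")
for gene, log2fc, pct_in, pct_out in candidates:
    print(f"{gene}: log2FC={log2fc:.2f}, pct_in={pct_in:.2f}, pct_out={pct_out:.2f}")
```

In this toy input only CD3D passes both filters for cluster "0": MS4A1 is higher in the other cluster, and ACTB is expressed at a similar level everywhere, so neither counts as a candidate marker.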

Automated Annotation Tools

These tools use algorithms to compare your clustering results against high-quality reference datasets or marker databases:

SingleR (R): compares each cell against the gene expression profiles of a reference dataset
Azimuth (Seurat ecosystem): reference mapping, with pre-built references for many tissues
CellTypist (Python): automated annotation based on machine-learning classifiers
scType (R): automated annotation based on a marker gene database
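To make the idea of "comparing a cell against reference profiles" concrete, here is a minimal sketch that assigns a cell the label of the reference profile it correlates with best, using Spearman rank correlation. It is a toy illustration of the general principle only, not the actual implementation of SingleR or any other tool listed above. The function names, the four-gene panel, and every expression value are assumptions invented for the example.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def label_cell(cell, reference):
    """Score one cell against every reference profile and return the best label."""
    scores = {label: spearman(cell, profile) for label, profile in reference.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Gene order shared by the cell and the reference profiles (toy panel)
genes = ["CD3D", "MS4A1", "CD14", "GNLY"]
# Made-up reference profiles, one averaged expression vector per cell type
toy_reference = {
    "T cell": [3.0, 0.1, 0.0, 0.5],
    "B cell": [0.1, 3.0, 0.0, 0.2],
    "Monocyte": [0.0, 0.1, 3.0, 0.3],
}
toy_cell = [2.5, 0.0, 0.2, 0.4]

best_label, all_scores = label_cell(toy_cell, toy_reference)
print(best_label, all_scores)
```

Real tools work with thousands of genes, restrict the comparison to informative genes, and usually report a confidence score per cell, which is what makes the later manual review possible.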

Expert Manual Validation

A domain expert reviews the algorithm's classifications one by one and confirms them against marker gene lists from the literature. This step cannot be skipped: automated tools are limited by the quality and completeness of their reference datasets, and they may fail to recognize novel or rare cell subpopulations.
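One way to decide where to focus a manual review is to check how consistently the automated labels agree within each cluster. The sketch below tallies per-cell predicted labels by cluster and flags clusters whose majority label covers too few cells. The function name and the 0.7 cutoff are hypothetical choices for illustration, and the toy labels are invented; this is a triage aid, not a substitute for checking marker genes.

```python
from collections import Counter, defaultdict


def cluster_label_agreement(clusters, predicted_labels, min_fraction=0.7):
    """Summarize per-cell predicted labels by cluster and flag mixed clusters.

    min_fraction is an arbitrary illustrative cutoff, not a recommended value.
    """
    counts_by_cluster = defaultdict(Counter)
    for cluster, label in zip(clusters, predicted_labels):
        counts_by_cluster[cluster][label] += 1

    report = {}
    for cluster, counts in counts_by_cluster.items():
        majority_label, n_majority = counts.most_common(1)[0]
        fraction = n_majority / sum(counts.values())
        report[cluster] = {
            "majority_label": majority_label,
            "fraction": fraction,
            "needs_review": fraction < min_fraction,
        }
    return report


# Toy input: cluster "0" is labeled consistently, cluster "1" is split between two labels
toy_clusters = ["0"] * 5 + ["1"] * 4
toy_predictions = ["CD4 T"] * 5 + ["NK", "NK", "CD8 T", "CD8 T"]

agreement = cluster_label_agreement(toy_clusters, toy_predictions)
for cluster, info in agreement.items():
    print(cluster, info)
```

A cluster flagged here is a prompt to look at its marker genes directly; a cluster that is not flagged can still be mislabeled if the reference lacks the true cell type.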

Interactive Simulation

3. Dot Plot Explorer

Select a cluster to view the expression level of each marker gene. The combination of highly expressed genes points to a specific cell type, so try to work out each cluster's identity from its marker pattern.

Bar height = average expression | Identify the cell type from the marker expression pattern
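The numbers behind such a plot are simple per-cluster summaries. A dot plot typically encodes two of them for each gene and cluster: the mean expression and the fraction of cells expressing the gene. The following sketch computes both from a toy matrix; the function name and all values are assumptions for illustration.

```python
from collections import defaultdict


def dotplot_summary(expr, clusters, genes):
    """Compute mean expression and fraction of expressing cells per cluster and gene.

    expr:     dict mapping gene -> list of expression values (one per cell)
    clusters: list of cluster labels in the same cell order
    """
    cells_by_cluster = defaultdict(list)
    for index, cluster in enumerate(clusters):
        cells_by_cluster[cluster].append(index)

    summary = {}
    for cluster, indices in cells_by_cluster.items():
        summary[cluster] = {}
        for gene in genes:
            values = [expr[gene][i] for i in indices]
            summary[cluster][gene] = {
                "mean": sum(values) / len(values),
                "fraction_expressing": sum(v > 0 for v in values) / len(values),
            }
    return summary


# Toy values invented for illustration: two cells in cluster "A", two in cluster "B"
toy_expr = {
    "CD14": [0.0, 0.0, 2.0, 4.0],
    "CD3D": [3.0, 1.0, 0.0, 0.0],
}
toy_clusters = ["A", "A", "B", "B"]

summary = dotplot_summary(toy_expr, toy_clusters, genes=["CD14", "CD3D"])
for cluster, per_gene in summary.items():
    print(cluster, per_gene)
```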

Common Markers

4. Common PBMC Markers

Cell type	Marker genes
CD4+ T	CD3D, CD3E, IL7R
CD8+ T	CD3D, CD8A, CD8B
B cells	MS4A1 (CD20), CD79A
NK	GNLY, NKG7, KLRD1
CD14 Mono	CD14, LYZ, S100A8
FCGR3A Mono	FCGR3A, MS4A7
DC	FCER1A, CST3
Platelet	PPBP, PF4

⚠️ The table above applies only to human PBMCs. Marker genes can differ substantially between species and tissues, so always consult the relevant literature or databases such as CellMarker and PanglaoDB.
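Matching a cluster's candidate markers against a table like the one above can be sketched as a simple overlap score. The dictionary below copies the human PBMC table; the scoring rule (fraction of each cell type's listed markers found among the cluster's top markers) and the function name are assumptions chosen for illustration, not an established method.

```python
# Marker sets copied from the human PBMC table above
PBMC_MARKERS = {
    "CD4+ T": {"CD3D", "CD3E", "IL7R"},
    "CD8+ T": {"CD3D", "CD8A", "CD8B"},
    "B cells": {"MS4A1", "CD79A"},
    "NK": {"GNLY", "NKG7", "KLRD1"},
    "CD14 Mono": {"CD14", "LYZ", "S100A8"},
    "FCGR3A Mono": {"FCGR3A", "MS4A7"},
    "DC": {"FCER1A", "CST3"},
    "Platelet": {"PPBP", "PF4"},
}


def rank_cell_types(top_markers, marker_table=PBMC_MARKERS):
    """Rank cell types by the fraction of their known markers seen in top_markers."""
    observed = set(top_markers)
    scores = {
        cell_type: len(observed & known) / len(known)
        for cell_type, known in marker_table.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Example input: a cluster whose top markers include CD3D and IL7R
ranking = rank_cell_types(["CD3D", "IL7R"])
for cell_type, score in ranking[:3]:
    print(f"{cell_type}: {score:.2f}")
```

CD3D alone cannot separate CD4+ from CD8+ T cells because both rows list it; the additional marker is what breaks the tie. A score like this only shortlists candidates, and shared markers are exactly the cases that need a closer manual look.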

Decision Guide

5. Manual vs. Automated

🌳 Decision Flow

Q1:

Is there a good public reference dataset for your tissue? → Yes → Use Azimuth / SingleR for an initial automated annotation, then validate manually.

Q2:

No reference dataset? → Use differential expression analysis (DEA) to find markers and compare them with the literature for a fully manual annotation.

Q3:

Do some clusters have messy or atypical markers? → Consider whether they are doublets, the result of over-clustering, or a genuinely new subpopulation; you may need to go back and adjust QC or the clustering resolution.
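The three questions can also be written down as a small helper that returns a checklist. This is only a restatement of the decision flow in code; the function name and its parameters are hypothetical.

```python
def suggest_annotation_strategy(has_reference, atypical_clusters=()):
    """Turn the three decision questions into an ordered list of suggested steps."""
    steps = []
    if has_reference:
        # Q1: a good public reference exists for this tissue
        steps.append(
            "Run an initial automated annotation (e.g. Azimuth or SingleR), "
            "then validate manually"
        )
    else:
        # Q2: no reference dataset is available
        steps.append(
            "Find markers by differential expression and compare them with the "
            "literature (fully manual annotation)"
        )
    # Q3: clusters whose markers look messy or atypical
    for cluster in atypical_clusters:
        steps.append(
            f"Cluster {cluster}: check for doublets, over-clustering, or a new "
            "subpopulation; consider revisiting QC or resolution"
        )
    return steps


plan = suggest_annotation_strategy(has_reference=True, atypical_clusters=["7"])
for step in plan:
    print("-", step)
```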

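Code

6. Hands-on Examples

R (Seurat)

library(Seurat)
library(dplyr)  # provides %>%, group_by() and slice_max()

# Find marker genes for every cluster (positive markers only)
markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
# Keep the top 5 markers per cluster by log2 fold change
markers %>% group_by(cluster) %>% slice_max(avg_log2FC, n = 5)

# Visual validation
FeaturePlot(pbmc, features = c("CD3D", "MS4A1", "CD14", "GNLY"))
DotPlot(pbmc, features = c("CD3D", "MS4A1", "CD14")) + RotatedAxis()

# Manually assign cell types
# new.ids needs exactly one label per cluster, in the order of levels(pbmc)
new.ids <- c("CD4 T", "CD14 Mono", "B cell", "CD8 T", "NK")
stopifnot(length(new.ids) == length(levels(pbmc)))
names(new.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.ids)

# Automated annotation: SingleR
library(SingleR)
ref <- celldex::HumanPrimaryCellAtlasData()
# Pass the log-normalized expression matrix as the test data
pred <- SingleR(test = GetAssayData(pbmc), ref = ref, labels = ref$label.main)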

Python (Scanpy)

import scanpy as sc

# Find marker genes for every Leiden cluster
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=10)

# Visual validation
sc.pl.umap(adata, color=["CD3D", "MS4A1", "CD14"])
sc.pl.dotplot(adata, var_names=["CD3D", "MS4A1", "CD14"], groupby="leiden")

# Manual assignment: clusters missing from the dictionary become NaN
cluster_to_celltype = {"0": "CD4 T", "1": "CD14 Mono", "2": "B cell"}
adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype)

# Automated annotation: CellTypist
import celltypist
# download_models() only fetches the model file, so its result is not assigned
celltypist.models.download_models(model="Immune_All_Low.pkl")
# annotate() loads the model by name
predictions = celltypist.annotate(adata, model="Immune_All_Low.pkl")

📝 Self-Check

A cluster's top markers include CD3D and IL7R. Which cell type is it most likely to be?

A. B cell

B. NK cell

C. CD4+ T cell

D. Monocyte