Why not use all genes?
A typical scRNA-seq dataset contains 20,000–30,000 genes, but most of them vary little across cells. Housekeeping genes such as GAPDH and ACTB, for example, are expressed at similar levels in every cell. Such genes do not help distinguish cell types; they only add computational cost and noise.
What we need are genes that are highly expressed in some cells and lowly expressed in others. These highly variable genes (HVGs) carry the key signal for distinguishing cell populations.
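To make the contrast concrete, here is a minimal sketch that ranks four genes by their variance across six cells. The gene names are real, but the expression values are invented for illustration, and plain variance is used only as the simplest possible measure of variability.

```python
from statistics import pvariance

# Toy log-normalized expression for six cells: three T cells, then three B cells.
# All numbers are made up for illustration; they are not real measurements.
expression = {
    "GAPDH": [5.1, 5.0, 5.2, 5.1, 4.9, 5.0],  # housekeeping: similar in every cell
    "ACTB":  [6.0, 6.1, 5.9, 6.0, 6.2, 6.1],  # housekeeping: similar in every cell
    "CD3E":  [3.2, 3.0, 3.4, 0.0, 0.1, 0.0],  # high in the T cells only
    "MS4A1": [0.0, 0.1, 0.0, 2.9, 3.1, 3.0],  # high in the B cells only
}

# Rank genes from most to least variable across cells.
ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)

for gene in ranked:
    print(f"{gene:6s} variance = {pvariance(expression[gene]):.3f}")
```

The two marker genes come out on top because their expression differs between the two cell groups, while the housekeeping genes have a high mean but almost no variance.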
Mean-variance plot
The interactive plot for this section simulates the relationship between each gene's mean expression and its variance. Red dots mark the selected HVGs. Dragging the slider changes how many genes are selected, which shows how the selection count affects the range of genes covered.
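The kind of simulation behind such a plot can be sketched in a few lines. This is not the page's own code: it assumes a negative-binomial-style trend, variance = mean + phi * mean², with a larger phi for a subset of "truly variable" genes, and the names `simulate_genes`, `select_top`, `BASE_PHI`, and `VARIABLE_PHI` are hypothetical.

```python
import random

BASE_PHI = 0.1      # assumed overdispersion of ordinary genes
VARIABLE_PHI = 2.0  # assumed overdispersion of truly variable genes

def simulate_genes(n_genes=2000, n_variable=200, seed=0):
    """Return a list of (mean, variance, truly_variable) tuples, one per gene."""
    rng = random.Random(seed)
    genes = []
    for i in range(n_genes):
        mean = 10 ** rng.uniform(-2, 2)        # means spread over four decades
        truly_variable = i < n_variable
        phi = VARIABLE_PHI if truly_variable else BASE_PHI
        noise = rng.lognormvariate(0, 0.2)     # gene-to-gene scatter around the trend
        variance = (mean + phi * mean ** 2) * noise
        genes.append((mean, variance, truly_variable))
    return genes

def select_top(genes, n_top):
    """Indices of the n_top genes whose variance most exceeds the baseline trend."""
    def excess(i):
        mean, variance, _ = genes[i]
        return variance / (mean + BASE_PHI * mean ** 2)
    order = sorted(range(len(genes)), key=excess, reverse=True)
    return order[:n_top]

genes = simulate_genes()
for n_top in (100, 500, 1000):  # stand-in for the slider positions
    chosen = select_top(genes, n_top)
    hits = sum(genes[i][2] for i in chosen)
    print(f"n_top={n_top}: {hits} of the 200 truly variable genes selected")
```

Raising `n_top` can only add genes to the selection, so the count of recovered variable genes never decreases, but the extra picks increasingly come from ordinary genes that sit above the trend by chance.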
Comparison of selection methods
| Method | Description | When to use |
|---|---|---|
| VST (Seurat default) | Fits the mean-variance relationship with local regression and selects the genes with the largest residuals | After NormalizeData |
| SCTransform built-in | Negative binomial regression directly outputs a residual-variance ranking | Done automatically when using SCTransform |
| Scanpy default | Ranks by dispersion, or uses the Seurat-flavored VST | Scanpy pipeline |
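The idea shared by these methods, ranking genes by how far their variance sits above what their mean predicts, can be illustrated with a deliberately simplified sketch. It replaces the local regression with a per-bin median of log-variance, so it is not what Seurat, SCTransform, or Scanpy actually run; the function name `rank_by_trend_residual` and the toy numbers are hypothetical.

```python
import math
from statistics import median

def rank_by_trend_residual(means, variances, n_bins=10):
    """Rank gene indices by log-variance above a binned mean-variance trend.

    The trend is the median log10(variance) within equal-width bins of
    log10(mean), a crude stand-in for a fitted regression curve.
    Assumes every mean and variance is strictly positive.
    """
    log_mean = [math.log10(m) for m in means]
    log_var = [math.log10(v) for v in variances]
    lo, hi = min(log_mean), max(log_mean)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all means are equal
    bin_of = [min(int((x - lo) / width), n_bins - 1) for x in log_mean]

    # Expected log-variance for each occupied bin.
    trend = {
        b: median(v for v, bb in zip(log_var, bin_of) if bb == b)
        for b in set(bin_of)
    }

    residual = [v - trend[b] for v, b in zip(log_var, bin_of)]
    return sorted(range(len(means)), key=lambda i: residual[i], reverse=True)

# Toy data: five low-mean and five high-mean genes; genes 4 and 9 are inflated.
means = [0.1] * 5 + [10.0] * 5
variances = [0.1, 0.11, 0.09, 0.1, 0.5] + [20, 22, 18, 20, 100]

ranked = rank_by_trend_residual(means, variances)
print("Top two genes by residual:", sorted(ranked[:2]))
```

Ranking the same toy genes by raw variance would place every high-mean gene ahead of gene 4, which is why these methods correct for the mean before ranking.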
Implementation examples
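R (Seurat):

    # NormalizeData workflow
    library(Seurat)

    pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
    top10 <- head(VariableFeatures(pbmc), 10)

    # Visualize the mean-variance plot and label the top genes
    plot1 <- VariableFeaturePlot(pbmc)
    LabelPoints(plot = plot1, points = top10, repel = TRUE)

    # Note: if you use SCTransform, this step is already done automatically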
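Python (Scanpy):

    import scanpy as sc

    # Select highly variable genes; flavor="seurat_v3" expects raw counts
    sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
    sc.pl.highly_variable_genes(adata)  # visualize

    # Use only the HVGs in downstream analysis
    adata = adata[:, adata.var.highly_variable].copy()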
📝 Self-check
Why are housekeeping genes (e.g., GAPDH) typically not selected as HVGs?