STEP 3 / 9

Highly Variable Gene Selection

Select the most discriminative features from tens of thousands of genes: focus on biological signal and filter out noise.

Why not use all genes?

A typical scRNA-seq dataset contains 20,000–30,000 genes, but most of them vary very little across cells. Housekeeping genes such as GAPDH and ACTB, for example, are expressed at similar levels in every cell. Such genes do not help distinguish cell types; they only add computational burden and noise.

What we need are genes that are highly expressed in some cells and lowly expressed in others. These highly variable genes (HVGs) carry the key signals that separate cell populations.
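To make the contrast concrete, here is a minimal sketch that computes the per-gene mean and variance for a handful of cells. The expression values are invented for illustration, and the two marker genes (CD3E for T cells, MS4A1 for B cells) are assumptions added here, not part of the tutorial's dataset.

```python
from statistics import mean, pvariance

# Hypothetical log-normalized expression for 8 cells
# (cells 1-4: T cells, cells 5-8: B cells). All values are made up.
expression = {
    "GAPDH": [3.1, 3.0, 3.2, 3.1, 3.0, 3.1, 3.2, 3.0],  # housekeeping: similar everywhere
    "ACTB":  [3.4, 3.5, 3.3, 3.4, 3.5, 3.4, 3.3, 3.5],  # housekeeping: similar everywhere
    "CD3E":  [2.8, 3.0, 2.9, 3.1, 0.0, 0.1, 0.0, 0.0],  # on in T cells, off in B cells
    "MS4A1": [0.0, 0.0, 0.1, 0.0, 2.7, 2.9, 3.0, 2.8],  # off in T cells, on in B cells
}

# Per-gene (mean, variance) across cells
stats = {gene: (mean(vals), pvariance(vals)) for gene, vals in expression.items()}

# Rank genes from most to least variable
ranked = sorted(stats, key=lambda gene: stats[gene][1], reverse=True)
for gene in ranked:
    m, v = stats[gene]
    print(f"{gene:6s} mean={m:.2f} variance={v:.3f}")
```

The housekeeping genes have the highest means but almost no variance, so they end up at the bottom of the ranking, while the two markers rise to the top.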

Mean-Variance Plot

The plot below simulates the relationship between each gene's mean expression and its variance. Red dots are the selected HVGs. Drag the slider to change how many genes are selected and watch how the coverage changes.
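A simulation of this kind can be sketched in a few lines. The version below is an assumption-laden toy, not the code behind the plot: it uses a negative-binomial style trend (variance = mean + phi * mean^2, with phi = 0.1 chosen arbitrarily), a deterministic wobble instead of random noise, and ten "planted" genes with five times the expected variance. The `n_top` argument plays the role of the slider.

```python
import math

PHI = 0.1  # assumed overdispersion for the toy trend (not from the tutorial)

def expected_variance(mu, phi=PHI):
    """Negative-binomial style trend: variance grows with the mean."""
    return mu + phi * mu ** 2

def simulate_genes(n_genes=200, planted_every=20):
    """Return (name, mean, variance) tuples with a few planted variable genes."""
    genes = []
    for i in range(n_genes):
        mu = 10 ** (-2 + 4 * i / (n_genes - 1))          # means from 0.01 to 100
        jitter = 1 + 0.2 * math.sin(i)                   # deterministic +/-20% wobble
        boost = 5.0 if i % planted_every == 10 else 1.0  # planted genes: 5x excess variance
        genes.append((f"gene{i}", mu, expected_variance(mu) * jitter * boost))
    return genes

def select_hvgs(genes, n_top):
    """Rank genes by observed / expected variance and keep the n_top highest."""
    ranked = sorted(genes, key=lambda g: g[2] / expected_variance(g[1]), reverse=True)
    return [name for name, _, _ in ranked[:n_top]]

genes = simulate_genes()
by_raw_variance = [g[0] for g in sorted(genes, key=lambda g: g[2], reverse=True)[:10]]
by_excess = select_hvgs(genes, n_top=10)
print("top 10 by raw variance:   ", by_raw_variance)
print("top 10 by excess variance:", by_excess)
```

Ranking by raw variance mostly returns the most highly expressed genes, because variance rises with the mean. Ranking by variance relative to the trend recovers the planted genes across the whole expression range, which is why HVG methods model the mean-variance relationship first. Real pipelines estimate that trend from the data rather than assuming it, as this sketch does.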

💡
How many? 2000–3000 is a common rule of thumb. For highly heterogeneous data (e.g., a tumor microenvironment), increase to 3000–5000; for simpler data (e.g., a single purified lineage), 2000 is usually enough.

Comparison of selection methods

| Method | Description | When to use |
| --- | --- | --- |
| VST (Seurat default) | Fits the mean-variance relationship with local regression and selects the genes with the largest residual variance | After NormalizeData |
| SCTransform (built-in) | Negative binomial regression outputs a residual-variance ranking directly | Done automatically when using SCTransform |
| Scanpy default | Ranks genes by dispersion, or uses the Seurat-flavored VST | Scanpy pipeline |
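The idea shared by the regression-based methods above (fit a trend, then rank genes by how far they sit above it) can be illustrated with a simplified sketch. This is not Seurat's or Scanpy's implementation: it swaps the local regression (loess) for an ordinary least-squares line on log10 mean versus log10 variance, skips the standardization and clipping steps, and runs on made-up data in which genes 5, 25 and 45 are given ten times extra variance.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b*x (a stand-in for a loess curve)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rank_by_residual(means, variances, n_top):
    """Return indices of the n_top genes with the largest log-variance residual."""
    xs = [math.log10(m) for m in means]
    ys = [math.log10(v) for v in variances]
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    order = sorted(range(len(means)), key=lambda i: residuals[i], reverse=True)
    return order[:n_top]

# Toy data (made up): 50 genes on a log-spaced mean grid. Genes 5, 25 and 45
# get 10x extra variance so they sit well above the fitted trend.
n_genes = 50
planted = {5, 25, 45}
means = [10 ** (-2 + 4 * i / (n_genes - 1)) for i in range(n_genes)]
variances = [
    m ** 1.2 * 10 ** (0.05 * math.sin(i)) * (10.0 if i in planted else 1.0)
    for i, m in enumerate(means)
]

hvg_idx = rank_by_residual(means, variances, n_top=3)
print(sorted(hvg_idx))
```

A straight line is enough here because the toy trend is linear in log-log space; real count data bend, which is the reason the actual methods use a local fit.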

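Implementation examples

R (Seurat):

# NormalizeData workflow
library(Seurat)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10)
plot1 <- VariableFeaturePlot(pbmc)  # mean-variance plot
LabelPoints(plot = plot1, points = top10, repel = TRUE)  # label the top genes

# Note: if you use SCTransform, this step is already done automatically

Python (Scanpy):

import scanpy as sc

# Select highly variable genes
# flavor="seurat_v3" expects raw counts (not log-normalized data) and needs scikit-misc installed
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
sc.pl.highly_variable_genes(adata)  # visualize

# Use only the HVGs in downstream analysis (copy to avoid working on a view)
adata = adata[:, adata.var.highly_variable].copy()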

📝 Self-check

Why are housekeeping genes (e.g., GAPDH) typically not selected as HVGs?

A. Their average expression is too low
B. They are expressed only in specific cell types
C. They are stably expressed across all cells, so cell-to-cell variance is low
D. They are filtered out during QC