What does Scaling do?
After normalization, expression ranges still differ enormously between genes: a highly expressed structural protein gene can sit hundreds of times above a lowly expressed transcription factor. Fed directly into PCA, such data lets the high-expression genes dominate the principal components, since they typically carry the largest variance.
Scaling addresses this by Z-scoring each gene: subtract the gene's mean and divide by its standard deviation. Every gene then has mean 0 and variance 1, so all genes compete on the same scale regardless of their original expression level.
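The effect on PCA can be illustrated with a small NumPy sketch. The data are simulated and hypothetical (one high-expression gene, three low-expression genes), and PCA is done with a plain SVD rather than any particular single-cell library, so treat it as an illustration of the idea only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200

# Hypothetical toy data: one high-expression gene and three low-expression genes.
high = rng.normal(loc=500.0, scale=100.0, size=n_cells)
low = rng.normal(loc=2.0, scale=0.5, size=(n_cells, 3))
X = np.column_stack([high, low])  # shape: (cells, genes)


def first_pc_loadings(matrix):
    """Return the absolute gene loadings of the first principal component."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return np.abs(vt[0])


unscaled = first_pc_loadings(X)
scaled = first_pc_loadings((X - X.mean(axis=0)) / X.std(axis=0))

print("PC1 loadings without scaling:", unscaled.round(3))
print("PC1 loadings with scaling:   ", scaled.round(3))
```

Without scaling, the first component is almost entirely the high-expression gene; after Z-scoring, no single gene owns it by virtue of magnitude alone.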
📐 Formula
z = (x − μ) / σ
x = expression value of a given gene in a given cell
μ = mean of that gene across all cells
σ = standard deviation of that gene across all cells
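As a worked example of the formula, the sketch below applies it to a tiny, made-up matrix (rows are cells, columns are genes). It uses the population standard deviation (NumPy's default); libraries may use the sample form instead, which makes little difference when there are many cells.

```python
import numpy as np

# Hypothetical normalized expression matrix: rows are cells, columns are genes.
X = np.array([
    [120.0, 0.5],
    [80.0, 1.5],
    [100.0, 1.0],
    [100.0, 1.0],
])

mu = X.mean(axis=0)     # per-gene mean across cells
sigma = X.std(axis=0)   # per-gene standard deviation across cells (population form)
Z = (X - mu) / sigma    # z = (x - mu) / sigma, applied gene by gene

print(Z)
print("per-gene mean:", Z.mean(axis=0))
print("per-gene std: ", Z.std(axis=0))
```

Both genes end up with the same spread even though their raw values differ by two orders of magnitude. A gene with zero variance would cause a division by zero here, so a real pipeline needs to handle that case.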
⚠️ Difference
Normalization works "across genes": within each cell, it corrects for that cell's sequencing depth.
Scaling works "across cells": for each gene, it puts the values on a comparable scale.
They are complementary and should not be confused.
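The two directions can be made concrete with array axes. The sketch below assumes a cells-by-genes count matrix and a total-count plus log1p normalization with a target of 10,000 counts per cell. The counts and the specific normalization scheme are assumptions for illustration, not something this section prescribes.

```python
import numpy as np

# Hypothetical raw counts: rows are cells, columns are genes.
counts = np.array([
    [10.0, 0.0, 30.0],      # shallowly sequenced cell
    [200.0, 20.0, 180.0],   # deeply sequenced cell
    [60.0, 10.0, 130.0],
])

# Normalization: each cell (row) is divided by its own total count,
# rescaled to a common target, then log-transformed.
target_sum = 1e4  # assumed scale factor
per_cell_total = counts.sum(axis=1, keepdims=True)
normalized = np.log1p(counts / per_cell_total * target_sum)

# Scaling: each gene (column) is centered and divided by its own
# standard deviation, computed over all cells.
scaled = (normalized - normalized.mean(axis=0)) / normalized.std(axis=0)

print("per-cell totals after depth correction:", np.expm1(normalized).sum(axis=1))
print("per-gene means after scaling:", scaled.mean(axis=0).round(6))
```

Normalization reduces along `axis=1` (one statistic per cell), while scaling reduces along `axis=0` (one statistic per gene).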
Before and after Scaling
Regressing out technical variables
During Scaling, you can optionally "regress out" unwanted sources of variation, such as mitochondrial percentage (MT%), cell cycle scores, and sequencing batch.
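Conceptually, regressing out a variable means fitting each gene against that variable and keeping only the residuals. The sketch below does this with ordinary least squares on simulated data. The covariate `pct_mt`, the two genes, and the simple linear model are all assumptions for illustration, and it is not a reproduction of what `ScaleData` or `sc.pp.regress_out` do internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 300

# Hypothetical covariate (e.g. mitochondrial percentage per cell).
pct_mt = rng.uniform(0.0, 20.0, size=n_cells)

# Hypothetical expression of two genes: the first depends on pct_mt, the second does not.
gene_a = 0.3 * pct_mt + rng.normal(0.0, 1.0, size=n_cells)
gene_b = rng.normal(5.0, 1.0, size=n_cells)
X = np.column_stack([gene_a, gene_b])  # shape: (cells, genes)

# Design matrix with an intercept column and the covariate.
design = np.column_stack([np.ones(n_cells), pct_mt])

# Least-squares fit for every gene at once; the residuals are the part
# of each gene's expression that the covariate cannot explain.
coef, *_ = np.linalg.lstsq(design, X, rcond=None)
residuals = X - design @ coef

# Scale the residuals per gene (they already have mean 0 thanks to the intercept).
scaled = residuals / residuals.std(axis=0)

corr_before = np.corrcoef(pct_mt, X[:, 0])[0, 1]
corr_after = np.corrcoef(pct_mt, scaled[:, 0])[0, 1]
print(f"correlation with pct_mt before: {corr_before:.2f}, after: {corr_after:.2f}")
```

After the regression, the first gene no longer correlates linearly with the covariate, and the scaled residuals are what would be passed on to PCA.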
Implementation examples
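# R (Seurat)
library(Seurat)

# Basic scaling
pbmc <- ScaleData(pbmc)

# Advanced: regress out MT% and cell cycle scores
# (percent.mt, S.Score and G2M.Score must already be in the object's metadata)
pbmc <- ScaleData(pbmc, vars.to.regress = c("percent.mt", "S.Score", "G2M.Score"))

# Note: if you use SCTransform, this step is already done automatically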
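# Python (scanpy)
import scanpy as sc

# Basic scaling (values are clipped at 10)
sc.pp.scale(adata, max_value=10)

# Advanced (instead of the basic call): regress out MT%, then scale
# (pct_counts_mt must already be in adata.obs)
sc.pp.regress_out(adata, ["pct_counts_mt"])
sc.pp.scale(adata, max_value=10)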