I. The special challenge of ST normalization
scRNA-seq normalization aims to correct only for differences in library size between cells. In ST, library-size differences may themselves be signal: cell-dense regions really do yield more RNA than stromal regions. Unconditionally scaling every spot to the same total erases that real density information.
ST normalization therefore adds two considerations:
- Should cell-density differences be preserved? It depends on the downstream analysis (spatial domain detection usually wants to keep them; deconvolution not necessarily).
- When a spot contains a mixture of cells, SCTransform's negative binomial (NB) assumption may be inaccurate, because the mean–variance relationship of mixed cells need not follow an NB.
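The loss of density information can be seen in a minimal sketch. The setup is a toy assumption, not data from this tutorial: two groups of spots share one expression profile, and the "dense" group captures about five times more RNA than the "stromal" group. The function name `total_count_normalize` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 200

# Assumed toy setup: identical expression profile everywhere,
# but dense spots are expected to capture ~5x more UMIs than stromal spots.
profile = rng.dirichlet(np.ones(n_genes))
depth = np.array([5000] * 50 + [1000] * 50)  # expected total UMI per spot
counts = rng.poisson(depth[:, None] * profile[None, :])


def total_count_normalize(counts, target_sum=1e4):
    """Scale every spot to the same total (the step LogNormalize applies before log1p)."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals * target_sum


raw_totals = counts.sum(axis=1)
norm_totals = total_count_normalize(counts).sum(axis=1)

# Ratio of mean total signal between dense (first 50) and stromal (last 50) spots.
raw_ratio = raw_totals[:50].mean() / raw_totals[50:].mean()
norm_ratio = norm_totals[:50].mean() / norm_totals[50:].mean()

print(f"dense/stromal total-UMI ratio, raw:        {raw_ratio:.2f}")
print(f"dense/stromal total-UMI ratio, normalized: {norm_ratio:.2f}")
```

Before scaling, the ratio reflects the assumed density difference; after scaling, every spot has the same total by construction, so the ratio collapses to 1. If density matters downstream, one option is to keep the raw per-spot totals as a separate covariate.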
II. Three mainstream methods
| Method | Principle | Advantages for ST | Drawbacks |
|---|---|---|---|
| LogNormalize | Divide each spot by its total UMI count, multiply by 10,000, then apply log1p | Simplest, fastest, most robust | Does not address mean–variance bias |
| SCTransform | Negative binomial regression with library size as a covariate | Normalization and HVG selection in one step; downstream PCA is usually cleaner | NB assumption is distorted for mixed-cell spots; slow |
| scran size factors | Coarse pre-clustering, then deconvolution-based size factor estimation | Robust to zero inflation and heterogeneity | Requires pre-clustering; more complex to implement |
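The three principles in the table can be written out compactly. The notation is introduced here for illustration and is not from the original: $x_{sg}$ is the count of gene $g$ in spot $s$, $n_s$ the spot's total UMI count, $s_s$ a scran size factor, and $\theta_g$ an NB dispersion. The SCTransform line is a simplified form of its regularized NB regression; details such as parameter regularization and residual clipping are omitted.

```latex
\begin{align*}
n_s &= \sum_g x_{sg} \\[4pt]
\text{LogNormalize:}\quad
  y_{sg} &= \ln\!\left(1 + 10^{4}\,\frac{x_{sg}}{n_s}\right) \\[4pt]
\text{SCTransform:}\quad
  x_{sg} &\sim \mathrm{NB}(\mu_{sg}, \theta_g), \qquad
  \ln \mu_{sg} = \beta_{0g} + \beta_{1g}\,\log_{10} n_s, \\
  z_{sg} &= \frac{x_{sg} - \mu_{sg}}{\sqrt{\mu_{sg} + \mu_{sg}^{2}/\theta_g}} \\[4pt]
\text{scran:}\quad
  y_{sg} &= \log_2\!\left(\frac{x_{sg}}{s_s} + 1\right)
\end{align*}
```

In the first and third lines the only per-spot quantity is a scaling factor ($n_s$ or $s_s$), whereas the SCTransform residual $z_{sg}$ also rescales each gene by its modeled variance. That is why it addresses the mean–variance relationship and the other two do not.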
Interactive: effect of normalization on the mean–variance relationship
The same simulated data under different normalizations. What to look for: a good normalization keeps gene variance roughly flat across the whole expression range, so that highly expressed genes do not dominate.
Axes: X = log10(mean count); Y = log10(variance)
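The comparison can be approximated offline with a small simulation. This is a sketch under stated assumptions, not the simulation behind the interactive plot: counts are drawn from an NB with one shared dispersion `theta`, and `pearson_residuals` is a simplified stand-in for SCTransform-style residuals (known `theta`, no clipping, no regularization). All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_spots, n_genes, theta = 500, 300, 10.0

# Assumed simulation: gene proportions spanning ~3 orders of magnitude,
# spot depths varying ~4x, NB noise with a shared dispersion theta.
prop = 10 ** rng.uniform(-5, -2, n_genes)
prop /= prop.sum()
depth = rng.uniform(5_000, 20_000, n_spots)
mu = depth[:, None] * prop[None, :]
counts = rng.negative_binomial(theta, theta / (theta + mu)).astype(float)


def log_normalize(x, target_sum=1e4):
    """Scale each spot to target_sum, then log1p."""
    return np.log1p(x / x.sum(axis=1, keepdims=True) * target_sum)


def pearson_residuals(x, theta):
    """NB Pearson residuals with the expected count from spot and gene totals."""
    mu_hat = x.sum(axis=1, keepdims=True) * x.sum(axis=0, keepdims=True) / x.sum()
    return (x - mu_hat) / np.sqrt(mu_hat + mu_hat**2 / theta)


def trend_slope(values, raw_mean):
    """Slope of log10(variance) against log10(mean raw count) across genes."""
    return np.polyfit(np.log10(raw_mean), np.log10(values.var(axis=0)), 1)[0]


raw_mean = counts.mean(axis=0)
slopes = {
    "raw": trend_slope(counts, raw_mean),
    "log-normalized": trend_slope(log_normalize(counts), raw_mean),
    "Pearson residuals": trend_slope(pearson_residuals(counts, theta), raw_mean),
}
for name, slope in slopes.items():
    print(f"{name:>18}: slope of log10(var) vs log10(mean) = {slope:.2f}")
```

A slope near zero corresponds to the flat trend described above. A steep positive slope means variance grows with expression, so highly expressed genes would dominate downstream PCA.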
Implementation
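R (Seurat and Bioconductor):

    library(Seurat)
    library(scran)    # quickCluster(), computeSumFactors()
    library(scuttle)  # logNormCounts()

    # Method 1: LogNormalize
    vis <- NormalizeData(vis, normalization.method = "LogNormalize", scale.factor = 10000)

    # Method 2: SCTransform (recommended for a single section or single-cell-resolution data)
    vis <- SCTransform(vis, assay = "Spatial", verbose = FALSE)

    # Method 3: scran size factors (Bioconductor)
    # `spe` is a SingleCellExperiment-compatible object, e.g. a SpatialExperiment
    qclust <- quickCluster(spe, min.size = 100)
    spe <- computeSumFactors(spe, clusters = qclust)
    spe <- logNormCounts(spe)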
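Python (Scanpy):

    import scanpy as sc

    # Options A-C are alternatives: run only one, each starting from raw counts.

    # Option A: LogNormalize equivalent (scale each spot to 10,000, then log1p)
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Option B: scran-style size factors (e.g. via anndata2ri)
    # `scran_python` is a placeholder name, not a verified package;
    # substitute whichever scran wrapper you use.
    import scran_python as sp
    sf = sp.compute_size_factors(adata)
    adata.X = adata.X / sf[:, None]  # assumes a dense matrix; log-transform afterwards

    # Option C: SCTransform-style normalization (analytic Pearson residuals)
    sc.experimental.pp.normalize_pearson_residuals(adata)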
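The Python snippet above divides `adata.X` by a vector of per-spot size factors. Count matrices are often stored as SciPy sparse matrices, and dividing one by a dense array can behave differently across SciPy versions, so the sketch below shows one way to apply size factors that stays sparse. It is illustrative only: the size factors are simple library-size factors used as a stand-in for scran's pooled estimates, and `apply_size_factors` is a hypothetical helper.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)

# Assumed toy input: a sparse spots x genes count matrix in CSR format.
counts = sparse.csr_matrix(rng.poisson(0.3, size=(100, 50)).astype(np.float64))

# Stand-in size factors: library size scaled to mean 1.
# Factors from scran would be applied in exactly the same way.
lib = np.asarray(counts.sum(axis=1)).ravel()
size_factors = lib / lib.mean()


def apply_size_factors(counts, size_factors):
    """Divide each row by its size factor without densifying, then log1p."""
    scaled = sparse.diags(1.0 / size_factors) @ counts  # row-wise scaling
    return scaled.log1p().tocsr()  # log1p(0) == 0, so zeros stay implicit


normalized = apply_size_factors(counts, size_factors)
print(type(normalized).__name__, normalized.shape, "stored values:", normalized.nnz)
```

Left-multiplying by a diagonal matrix scales rows without creating a dense intermediate, which matters for full-size Visium or higher-resolution datasets.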
📝 Self-check
1. Why can forcing every spot to the same total lose information in spot-based ST?
2. For Visium spots with heavy cell mixing, which of the following statements is more reasonable?
3. Why are scran size factors more stable than simple library sizes?