Step 4: Normalization — Spatial Transcriptomics Tutorial

Challenge

1. Why normalization is harder for ST

scRNA-seq normalization aims to correct only one thing: differences in library size between cells. In ST, library-size differences can themselves be signal, because densely cellular regions really do contain more RNA than stromal regions. Unconditionally scaling every spot to the same total erases that real density information.

So ST normalization adds two considerations:

Should cell-density differences be preserved? It depends on the downstream analysis (spatial domain detection usually wants to keep them; deconvolution does not necessarily).
When a spot contains a mixture of cells, SCTransform's negative binomial (NB) assumption may be off, because the mean–variance relationship of mixed cells does not necessarily follow NB.
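To make the density argument concrete, here is a minimal NumPy sketch. It assumes a hypothetical toy tissue (the cell numbers, gene count, and expression rates are invented for illustration) and shows that scaling every spot to the same total removes the difference between dense and stromal spots.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy tissue: 3 dense spots (~20 cells) and 3 stromal spots (~4 cells).
cells_per_spot = np.array([20, 20, 20, 4, 4, 4])
n_genes = 50
# Assumed mean UMI count per gene per cell (illustrative values only).
per_cell_rate = rng.gamma(shape=2.0, scale=5.0, size=n_genes)

# Counts scale with the number of cells sitting under each spot.
counts = rng.poisson(cells_per_spot[:, None] * per_cell_rate[None, :])

totals = counts.sum(axis=1)
scaled = counts / totals[:, None] * 1e4  # force every spot to a total of 10,000
scaled_totals = scaled.sum(axis=1)

ratio_raw = totals[:3].mean() / totals[3:].mean()
ratio_scaled = scaled_totals[:3].mean() / scaled_totals[3:].mean()
print(f"dense/stroma total UMI, raw:    {ratio_raw:.2f}")
print(f"dense/stroma total UMI, scaled: {ratio_scaled:.2f}")
```

In this toy setup the raw ratio reflects the fivefold difference in cell number, while the scaled ratio is 1 by construction. Whether that loss matters depends on the downstream task, as noted above.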

Method comparison

2. Three mainstream methods

Method	Principle	Advantages for ST	Drawbacks
LogNormalize	Divide each spot by its total UMI count, multiply by 10,000, then log1p	Simplest, fastest, most stable	Does not address mean–variance bias
SCTransform	Negative binomial regression with library size as a covariate	Normalization and HVG selection in one step; downstream PCA is usually cleaner	NB assumption distorts when spots contain mixed cells; slow
scran size factors	Coarse pre-clustering, then pooling-based deconvolution to estimate size factors	Robust to zero-inflation and heterogeneity	Requires pre-clustering; more complex to implement
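The LogNormalize row of the table can be written out by hand. This is a minimal NumPy sketch of the formula as described there (divide by the spot total, multiply by a scale factor, apply log1p), not the Seurat or Scanpy implementation. The function name `log_normalize` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def log_normalize(counts, scale_factor=1e4):
    """Per-spot scaling to `scale_factor`, followed by log1p (spots in rows)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty spots
    return np.log1p(counts / totals * scale_factor)

# Toy matrix: rows 0 and 1 have the same composition but a 10x depth difference;
# row 2 is an empty spot.
toy = np.array([[10, 0, 90],
                [1, 0, 9],
                [0, 0, 0]])
norm = log_normalize(toy)
print(norm)
```

Rows 0 and 1 come out identical, which is exactly the behavior that removes library-size differences, wanted or not.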

💡

Consensus in scverse / Bioconductor since 2024: for spot-based ST, LogNormalize is a sensible default; consider SCTransform only if the downstream PCA structure looks noisy. Image-based (single-cell resolution) data returns to scRNA logic, where SCTransform is the better fit.

Interactive simulation

Interactive: how normalization affects the mean–variance relationship

The same simulated dataset is shown under different normalizations. What to look for: a good normalization keeps the variance roughly flat across the whole expression range, so that highly expressed genes do not dominate.

Plot axes: X = log10(mean count); Y = log10(variance)
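The idea behind the plot can be reproduced offline with a short simulation. The sketch below is not the tutorial's own simulation: it assumes a toy negative binomial model (the values of `theta`, the range of gene means, and the depth distribution are all illustrative) and summarizes the mean–variance trend as the slope of log10(variance) against log10(mean raw count).

```python
import numpy as np

rng = np.random.default_rng(1)
n_spots, n_genes = 500, 300

# Assumed toy model: NB counts, gene means spanning ~3 orders of magnitude,
# and a per-spot depth factor standing in for library-size variation.
gene_mean = 10 ** rng.uniform(-1, 2, size=n_genes)
theta = 5.0  # assumed NB inverse-dispersion parameter
depth = rng.lognormal(mean=0.0, sigma=0.5, size=n_spots)
mu = depth[:, None] * gene_mean[None, :]
counts = rng.negative_binomial(n=theta, p=theta / (theta + mu))

# LogNormalize: scale each spot to 10,000, then log1p.
totals = counts.sum(axis=1, keepdims=True)
lognorm = np.log1p(counts / totals * 1e4)

def loglog_slope(raw_mean, values):
    """Slope of log10(variance of `values`) against log10(mean raw count)."""
    var = values.var(axis=0)
    keep = (raw_mean > 0) & (var > 0)
    return np.polyfit(np.log10(raw_mean[keep]), np.log10(var[keep]), 1)[0]

raw_mean = counts.mean(axis=0)
slope_raw = loglog_slope(raw_mean, counts)
slope_lognorm = loglog_slope(raw_mean, lognorm)
print(f"raw slope: {slope_raw:.2f}, log-normalized slope: {slope_lognorm:.2f}")
```

A steep positive slope on raw counts means variance is driven by expression level. A much shallower slope after normalization corresponds to the flatter trend the interactive plot is meant to show. A single slope hides curvature, so plotting the points is still worthwhile.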

Code

Implementation

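# R (Seurat / Bioconductor). Run ONE of the methods below, each starting from raw counts.
# `vis` is assumed to be a Seurat Visium object and `spe` a SpatialExperiment from earlier steps.
library(Seurat)

# Method 1: LogNormalize
vis <- NormalizeData(vis, normalization.method = "LogNormalize", scale.factor = 10000)

# Method 2: SCTransform (recommended for a single section or single-cell-resolution data)
vis <- SCTransform(vis, assay = "Spatial", verbose = FALSE)

# Method 3: scran size factors (Bioconductor)
library(scran)
library(scuttle)  # provides logNormCounts()
qclust <- quickCluster(spe, min.size = 100)       # coarse pre-clustering
spe <- computeSumFactors(spe, clusters = qclust)  # pooling-based (deconvolution) size factors
spe <- logNormCounts(spe)                         # stores log-normalized values in "logcounts"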

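# Python (Scanpy). Run ONE of the methods below, each starting from raw counts in adata.X.
import numpy as np
import scipy.sparse as sparse
import scanpy as sc

# LogNormalize equivalent (target_sum is set explicitly; Scanpy's default is the median total)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Alternatively, scran-style size factors.
# NOTE: `scran_python` and `compute_size_factors` are kept from the original as placeholders,
# not a verified package or API. In practice, size factors are typically computed in R with
# scran::computeSumFactors and passed back to Python (e.g. via anndata2ri).
import scran_python as sp  # placeholder, see note above
sf = np.asarray(sp.compute_size_factors(adata)).ravel()
# Divide each spot by its size factor (works for sparse and dense X), then log-transform
adata.X = sparse.csr_matrix(adata.X.multiply(1.0 / sf[:, None])) if sparse.issparse(adata.X) else adata.X / sf[:, None]
sc.pp.log1p(adata)

# Or SCTransform-style normalization (analytic Pearson residuals); expects raw counts
sc.experimental.pp.normalize_pearson_residuals(adata)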
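
To show what the Pearson-residual option computes, here is a minimal NumPy sketch of analytic Pearson residuals under an NB model with a fixed overdispersion. It is a from-scratch illustration, not Scanpy's code. The values `theta=100` and clipping at sqrt(number of spots) follow the commonly described defaults and should be treated as assumptions, and genes with zero total counts are assumed to have been filtered out beforehand.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Analytic Pearson residuals for a spots-by-genes count matrix (dense sketch)."""
    counts = np.asarray(counts, dtype=float)
    n_spots = counts.shape[0]
    spot_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    # Expected count if each spot only differed by depth: outer product / grand total.
    mu = spot_totals @ gene_totals / counts.sum()
    # NB variance with fixed overdispersion: mu + mu^2 / theta.
    resid = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    clip = np.sqrt(n_spots)  # assumed clipping threshold
    return np.clip(resid, -clip, clip)

rng = np.random.default_rng(2)
demo_counts = rng.poisson(5.0, size=(200, 30))  # toy data, no all-zero genes
res = pearson_residuals(demo_counts)
print(res.shape, round(float(res.std()), 2))
```

Because the expected value is built from the spot totals, depth is absorbed into `mu` rather than divided out, which is the sense in which this approach resembles SCTransform.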

📝 Self-check

1. Why can forcing every spot to the same total lose information in spot-based ST?

A. Because NGS is inaccurate

B. Because the log transform is lossy

C. Because a spot's total UMI count partly reflects true cell density

D. Because Visium doesn't allow normalization

2. For Visium spots with heavy cell mixing, which statement is more reasonable?

A. SCTransform remains fully appropriate

B. The NB assumption may break down, and LogNormalize is usually a stable starting point

C. No normalization is needed

D. Only raw counts can be used

3. Why are scran size factors more stable than simple library sizes?

A. It pre-clusters and then uses deconvolution, which avoids domination by highly expressed genes

B. It simply erases library size

C. It only works on image-based platforms

D. It is unrelated to scRNA