STEP 5 / 15

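Dimensionality reduction and features: writing the spatial neighborhood into the embedding

HVG → PCA → UMAP is the baseline; folding the spatial neighborhood into the embedding, as BANKSY does, is specific to spatial transcriptomics (ST).

1. scRNA-style vs spatially-aware dimensionality reduction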

The classic ST pipeline mirrors scRNA: HVG → PCA → UMAP / Leiden. This approach ignores space: clusters are computed from expression alone and only then overlaid on the spatial map. Pros: simple, reproducible, fully compatible with the scRNA toolchain. Cons: the similarity between adjacent spots is never modeled, so spatial structure only shows up post hoc.

Spatially-aware methods (BANKSY, SpaceFlow, STAGATE) bring each spot's neighborhood into the embedding itself. BANKSY does this by concatenating the neighborhood's mean expression onto each spot's own expression vector as extra features; SpaceFlow and STAGATE instead learn the embedding with a graph neural network encoder over the spatial neighbor graph. Either way, the embedding reflects both "what I express" and "what my neighbors express," so downstream clusters tend to form spatially contiguous regions.
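The concatenation idea can be sketched in a few lines. This is a minimal illustration in plain Python, not BANKSY's implementation: `knn_indices` and `augment_with_neighbor_mean` are hypothetical helper names, the toy coordinates and expression values are made up, and the neighborhood mean here is unweighted, whereas BANKSY's own neighborhood mean is a weighted one.

```python
import math

def knn_indices(coords, k):
    """For each spot, return the indices of its k nearest spots (excluding itself)."""
    neighbors = []
    for i, ci in enumerate(coords):
        others = (j for j in range(len(coords)) if j != i)
        order = sorted(others, key=lambda j: math.dist(ci, coords[j]))
        neighbors.append(order[:k])
    return neighbors

def augment_with_neighbor_mean(expr, coords, k):
    """Concatenate each spot's own expression with the mean expression of its k neighbors."""
    nbrs = knn_indices(coords, k)
    n_genes = len(expr[0])
    augmented = []
    for i, own in enumerate(expr):
        nbr_mean = [
            sum(expr[j][g] for j in nbrs[i]) / len(nbrs[i])
            for g in range(n_genes)
        ]
        augmented.append(list(own) + nbr_mean)
    return augmented

# Toy data (made up): 4 spots on a line, 2 genes per spot
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
expr = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [0.0, 8.0]]

features = augment_with_neighbor_mean(expr, coords, k=2)
for row in features:
    print(row)  # first 2 values: own expression; last 2: neighborhood mean
```

Each spot now carries twice as many features as it has genes. Running PCA on `features` instead of `expr` is what makes the resulting embedding neighborhood-aware.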

Interactive: how BANKSY's neighborhood weight changes the embedding

BANKSY's core parameter λ (lambda) sets the weight of a spot's own expression versus its neighborhood-mean expression. λ = 0 is equivalent to ignoring space; the larger λ is, the more the embedding leans on the spatial neighborhood. Watch how the cluster structure on the right changes from "tangled" to "spatially contiguous patches."

Color: cluster label; background shading: the two biological regions
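The effect of λ on distances in the embedding can be worked through numerically. The sketch below assumes the scaling described for BANKSY, where own expression is multiplied by √(1 − λ) and the neighborhood mean by √λ before concatenation; `banksy_like_row` and `sq_dist` are hypothetical helper names and the expression values are made up.

```python
import math

def banksy_like_row(own, nbr_mean, lam):
    """Scale own expression by sqrt(1 - lam) and neighborhood mean by sqrt(lam), then concatenate."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    w_own = math.sqrt(1.0 - lam)
    w_nbr = math.sqrt(lam)
    return [w_own * x for x in own] + [w_nbr * x for x in nbr_mean]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two spots (made-up values) with similar own expression but very different neighborhoods
own_a, nbr_a = [1.0, 0.0], [4.0, 0.0]
own_b, nbr_b = [1.0, 1.0], [0.0, 3.0]

for lam in (0.0, 0.2, 0.8):
    row_a = banksy_like_row(own_a, nbr_a, lam)
    row_b = banksy_like_row(own_b, nbr_b, lam)
    print(f"lambda={lam}: squared distance = {sq_dist(row_a, row_b):.2f}")
```

With this scaling, the squared distance between two spots is (1 − λ) times their own-expression squared distance plus λ times their neighborhood squared distance. For the toy values that is (1 − λ)·1 + λ·25, so the two spots look nearly identical at λ = 0 and clearly different at λ = 0.8: the same pair of spots is pulled apart purely by where they sit in the tissue.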

Implementation

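# R (Seurat)
# Standard scRNA-style route; assumes `vis` was normalized in an earlier step
vis <- FindVariableFeatures(vis, nfeatures = 2000)
vis <- ScaleData(vis)
vis <- RunPCA(vis, npcs = 30)
vis <- RunUMAP(vis, dims = 1:30)
DimPlot(vis); SpatialDimPlot(vis)

# Spatially-aware route: BANKSY, called through RunBanksy() from SeuratWrappers
library(Banksy)
library(SeuratWrappers)
# dimx / dimy name the metadata columns that hold the spot coordinates;
# assay = "SCT" assumes SCTransform() was run earlier
vis <- RunBanksy(vis, lambda = 0.2, dimx = "x", dimy = "y", assay = "SCT")
# The BANKSY assay has no variable features set, so pass all of its rows to PCA,
# and store the result under its own name so the standard "pca" is not overwritten
vis <- RunPCA(vis, assay = "BANKSY", features = rownames(vis[["BANKSY"]]),
              reduction.name = "pca.banksy", npcs = 30)
vis <- FindNeighbors(vis, reduction = "pca.banksy", dims = 1:30) |> FindClusters(resolution = 0.6)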
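
# Python (scanpy)
import scanpy as sc

# Standard route; assumes `adata` is already normalized and log-transformed
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=30)
sc.pp.neighbors(adata, n_pcs=30); sc.tl.umap(adata)

# BANKSY (Banksy_py). The BANKSY matrix has more columns than `adata` has genes,
# so it is a separate AnnData rather than a layer of `adata`.
# Assumption: entry points and argument order follow the Banksy_py README; check your installed version.
from banksy.initialize_banksy import initialize_banksy
from banksy.embed_banksy import generate_banksy_matrix
coord_keys = ("x", "y", "spatial")  # assumed names: obs columns for x / y, obsm key for coordinates
banksy_dict = initialize_banksy(adata, coord_keys, 15, max_m=1)  # 15 = k_geom
banksy_dict, banksy_adata = generate_banksy_matrix(adata, banksy_dict, [0.2], 1)  # lambda = 0.2, max_m = 1
sc.tl.pca(banksy_adata, n_comps=30)
sc.pp.neighbors(banksy_adata, n_pcs=30); sc.tl.leiden(banksy_adata, resolution=0.6)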
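
To compare the two routes beyond eyeballing the spatial map, one simple check is the fraction of adjacent spot pairs that share a cluster label. The sketch below is a hypothetical helper, not part of Seurat, scanpy, or BANKSY; it assumes spots sit on a square grid indexed by (row, col), which is a simplification of real array layouts, and the two label maps are made up.

```python
def neighbor_agreement(labels):
    """labels: dict mapping (row, col) grid position -> cluster label.

    Returns the fraction of adjacent spot pairs (vertical or horizontal)
    whose two spots carry the same label.
    """
    same = 0
    total = 0
    for (r, c), lab in labels.items():
        # Look only down and right so that each adjacent pair is counted once
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in labels:
                total += 1
                if labels[nb] == lab:
                    same += 1
    return same / total if total else float("nan")

# 4 x 4 toy grid: left half is cluster 0, right half is cluster 1
contiguous = {(r, c): int(c >= 2) for r in range(4) for c in range(4)}
# Same number of spots per cluster, but arranged as a checkerboard
scattered = {(r, c): (r + c) % 2 for r in range(4) for c in range(4)}

print("contiguous:", neighbor_agreement(contiguous))
print("scattered: ", neighbor_agreement(scattered))
```

A value near 1 means clusters form contiguous patches; a value near 0 means labels alternate from spot to spot. Computing this once on the labels from the standard route and once on the BANKSY labels gives a rough number to put next to the two spatial plots.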

📝 Self-check

1. Why does plain PCA + Leiden often produce spatially discontinuous clusters in ST?

A. PCA isn't suitable for ST
B. It ignores space entirely and considers only expression
C. Because ST has no spatial information
D. Because BANKSY is wrong

2. What happens if BANKSY's λ is set very high?

A. The embedding follows the neighbors heavily, so clusters smooth out and boundaries may blur
B. It becomes equivalent to PCA
C. The cluster count always increases
D. It has no effect

3. What is the key difference between SVGs (spatially variable genes) and HVGs (highly variable genes)?

A. SVG methods only run in R
B. SVGs account for spatial autocorrelation; HVGs only look at overall variance
C. SVGs are always more numerous than HVGs
D. SVG and HVG are synonyms