Step 14: Cell Segmentation — Spatial Transcriptomics Tutorial

Why it's hard

1. Why is segmentation a hidden bottleneck?

Image-based platforms output a point cloud of transcripts, each with (x, y, z) coordinates. Before any downstream clustering, DE, or niche analysis, every transcript has to be assigned to a cell. A mistake at this step propagates through the entire analysis:

Over-segmentation: one cell is split into 2–3 fragments → expression is diluted → the true cell type is lost.
Under-segmentation: two adjacent cells are merged → a "synthetic doublet" → spurious mixed populations.
Boundary error: transcripts are assigned to a neighboring cell → marker expression leaks between cells.

A systematic study posted on bioRxiv (Aug 2025) reports that most published image-based ST studies apply the default segmentation without validating it, which makes this one of the field's biggest current blind spots.
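The three failure modes above can be made concrete with a toy count vector. The sketch below is illustrative only: the two marker genes and all counts are made-up values, not data from any platform.

```python
import numpy as np

# Hypothetical toy profiles over two marker genes: [CD3E, EPCAM].
t_cell = np.array([40, 0])       # a T cell expresses only CD3E
epithelial = np.array([0, 60])   # its epithelial neighbor expresses only EPCAM

# Over-segmentation: the T cell is split into three fragments,
# so each fragment carries only part of the true signal.
fragments = np.array([[20, 0], [12, 0], [8, 0]])

# Under-segmentation: both cells share one mask, so their counts add up
# and the merged "cell" appears to express both markers.
merged = t_cell + epithelial

# Boundary error: five EPCAM transcripts are assigned to the T cell.
leak = np.array([0, 5])
t_cell_leaky = t_cell + leak
epithelial_leaky = epithelial - leak

print("fragments:", fragments.tolist())
print("merged:", merged.tolist())
print("leaky T cell:", t_cell_leaky.tolist())
```

In each case the total number of transcripts is unchanged; only the assignment differs, which is why these errors are easy to miss in summary statistics.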

Main methods

2. Main methods

Method	Input	Principle	Strengths / weaknesses
Cellpose	DAPI / membrane stain	Pretrained U-Net	Fast and stable; default in MERFISH/Vizgen
Baysor	Transcript point cloud (image optional)	Probabilistic model that groups transcripts by co-occurrence	Works without a nuclear stain; parameter-sensitive
proseg	Transcript point cloud	Probabilistic, scalable (2024)	Very fast on large Xenium datasets
StarDist / Mesmer	Multi-channel images	Star-convex polygon prediction (StarDist); deep-learning whole-cell masks (Mesmer)	Strong on H&E and IF
UCS (2025)	Image + transcripts	Unified deep learning that fuses imaging and transcript signal	Newest unified cross-platform method

Workflow

3. What to do next

Cell × gene matrix

Sum the transcripts that fall inside each segmented cell, gene by gene, and store the result as a standard AnnData object.
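As a minimal sketch of this aggregation step, the counting can be done with NumPy alone. The transcript table here is a made-up toy, and treating `cell_id == 0` as "unassigned" is an assumption; platforms encode unassigned transcripts differently.

```python
import numpy as np

# Hypothetical toy transcript table: one entry per detected transcript.
# Assumption: cell_id 0 means "unassigned" (the convention differs by platform).
gene = np.array(["CD3E", "EPCAM", "CD3E", "EPCAM", "CD3E", "KRT8"])
cell_id = np.array([1, 2, 1, 2, 0, 2])

assigned = cell_id > 0
genes, gene_idx = np.unique(gene[assigned], return_inverse=True)
cells, cell_idx = np.unique(cell_id[assigned], return_inverse=True)

# Count transcripts per (cell, gene) pair; np.add.at handles repeated indices.
counts = np.zeros((cells.size, genes.size), dtype=np.int64)
np.add.at(counts, (cell_idx, gene_idx), 1)

print(genes.tolist())
print(counts.tolist())
```

The resulting matrix can then be wrapped as `anndata.AnnData(X=counts)` with `cells` and `genes` as the `obs` and `var` indices.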

QC

Cells with too few transcripts (< 10) or too many (above the 99th percentile) are usually segmentation errors.
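The two thresholds above translate into a simple boolean mask. This is a sketch with made-up per-cell totals; the function name `qc_mask` and its defaults are illustrative, and the cutoffs should be tuned per dataset and panel size.

```python
import numpy as np

def qc_mask(n_transcripts, min_count=10, upper_pct=99.0):
    """Return True for cells that pass both the lower and upper cutoff."""
    n = np.asarray(n_transcripts)
    upper = np.percentile(n, upper_pct)
    return (n >= min_count) & (n <= upper)

# Hypothetical transcripts-per-cell totals for ten cells.
totals = np.array([3, 9, 10, 50, 80, 120, 150, 200, 260, 5000])
keep = qc_mask(totals)

print(totals[keep].tolist())
```

With these toy values the two smallest cells and the one extreme outlier are dropped.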

Standard downstream analysis

From here the workflow is the same as for scRNA-seq: normalization → HVG → PCA → Leiden → marker-based annotation.
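To show what the first three steps compute, here is a NumPy-only sketch on simulated counts. It is a simplification: gene selection uses a plain variance ranking rather than a dedicated HVG method, and Leiden clustering and annotation are left out. In practice these steps are usually run with scanpy (`sc.pp.normalize_total`, `sc.pp.log1p`, `sc.pp.highly_variable_genes`, `sc.pp.pca`, `sc.pp.neighbors`, `sc.tl.leiden`).

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated cells x genes counts; the first 100 cells get extra signal
# in the first 5 genes to mimic a distinct cell type.
counts = rng.poisson(lam=2.0, size=(200, 50)).astype(float)
counts[:100, :5] += rng.poisson(lam=20.0, size=(100, 5))

# 1. Library-size normalization to the median total, then log1p.
totals = counts.sum(axis=1, keepdims=True)
norm = np.log1p(counts / totals * np.median(totals))

# 2. Keep the most variable genes (plain variance ranking, a simplification).
n_hvg = 20
hvg = np.argsort(norm.var(axis=0))[::-1][:n_hvg]
x = norm[:, hvg]

# 3. PCA via SVD on mean-centered data; rows of `pcs` are cell embeddings.
x = x - x.mean(axis=0)
u, s, vt = np.linalg.svd(x, full_matrices=False)
pcs = u[:, :10] * s[:10]

print(pcs.shape)
```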

Spatial analysis

Return to Step 11 (Niche) and Step 10 (CCC). This stage is easier with image-based data than with spot-based data, because the data are single-cell and true cell adjacency is known.
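Because each cell has a known centroid, a neighbor graph can be built directly from coordinates. The sketch below uses brute-force distances on made-up centroids, which is fine for illustration but does not scale; for real datasets a KD-tree or a dedicated function such as `squidpy.gr.spatial_neighbors` is the usual choice.

```python
import numpy as np

def knn_graph(centroids, k=3):
    """Return, for each cell, the indices of its k nearest cells."""
    xy = np.asarray(centroids, dtype=float)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a cell is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

# Hypothetical centroids: three cells near the origin, two far away.
centroids = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10]])
neighbors = knn_graph(centroids, k=2)

print(neighbors.shape)
```

Niche and cell-cell communication analyses then operate on this graph, for example by tabulating the cell types among each cell's neighbors.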

⚠️

Mandatory sanity checks: overlay the segmentation boundaries on H&E/DAPI for 3–5 sampled regions and inspect them by eye; report the mean transcripts per cell and the fraction of unassigned transcripts; run at least one alternative method as a comparison.
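The two summary statistics in this checklist are cheap to compute from per-transcript cell labels. The sketch below assumes a label of 0 means "unassigned" and uses made-up labels for a default segmentation and an alternative method; `segmentation_summary` is an illustrative helper, not part of any library.

```python
import numpy as np

def segmentation_summary(cell_id, unassigned=0):
    """Summarize one segmentation from its per-transcript cell labels."""
    cell_id = np.asarray(cell_id)
    assigned = cell_id != unassigned
    n_cells = np.unique(cell_id[assigned]).size
    return {
        "n_cells": n_cells,
        "mean_transcripts_per_cell": float(assigned.sum()) / max(n_cells, 1),
        "unassigned_fraction": 1.0 - float(assigned.mean()),
    }

# Hypothetical labels for the same ten transcripts under two methods.
default_ids = np.array([1, 1, 1, 2, 2, 0, 0, 3, 3, 3])
alt_ids = np.array([1, 1, 1, 1, 1, 0, 2, 2, 2, 2])

default_stats = segmentation_summary(default_ids)
alt_stats = segmentation_summary(alt_ids)
print(default_stats)
print(alt_stats)
```

Large disagreements between methods in cell number or unassigned fraction are a signal to go back to the image overlays before trusting either result.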

Interactive simulation

Interactive: segmentation radius

[Interactive figure: two adjacent ground-truth cells, drawn in black and gray, with DAPI shown as the inner circle. Adjusting the "assumed cell radius" changes how transcript points are assigned: too small → over-segmentation; too large → the two cells merge.]
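The same effect can be reproduced offline with a deliberately naive model. The sketch below is not the code behind the interactive figure; it assumes a greedy rule (a point joins the first seed within the radius, otherwise it starts a new cell) and simulated coordinates in arbitrary units.

```python
import numpy as np

def greedy_radius_segmentation(points, radius):
    """Assign each point to the first seed within `radius`, else open a new seed."""
    seeds = []
    labels = np.empty(len(points), dtype=int)
    for i, p in enumerate(points):
        for j, s in enumerate(seeds):
            if np.linalg.norm(p - s) <= radius:
                labels[i] = j
                break
        else:
            # No existing seed is close enough: this point founds a new cell.
            seeds.append(p)
            labels[i] = len(seeds) - 1
    return labels

rng = np.random.default_rng(0)
# Two simulated ground-truth cells with centers 60 units apart.
cell_a = rng.normal(loc=[0, 0], scale=10, size=(150, 2))
cell_b = rng.normal(loc=[60, 0], scale=10, size=(150, 2))
points = np.vstack([cell_a, cell_b])

n_cells = {
    r: int(np.unique(greedy_radius_segmentation(points, r)).size)
    for r in (5, 35, 500)
}
print(n_cells)
```

A very small radius fragments the two cells into many pieces, and a very large one collapses everything into a single cell; intermediate values fall somewhere in between, depending on where the first seeds land.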


Code

Implementation

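# --- R: Seurat v5, reading the default Xenium segmentation ---
library(Seurat)
xen <- LoadXenium("xenium_out/", fov = "fov")
ImageDimPlot(xen, fov = "fov", axes = TRUE, cols = "polychrome")
# FindClusters() needs a neighbor graph, so FindNeighbors() has to run first.
xen <- SCTransform(xen, assay = "Xenium") |>
  RunPCA() |>
  FindNeighbors(dims = 1:30) |>
  FindClusters(resolution = 0.4)

# --- Python: Cellpose on the DAPI channel (Cellpose 3.x API) ---
from cellpose import io, models
dapi_img = io.imread("dapi.tif")  # placeholder path to a 2-D DAPI image
mod = models.Cellpose(model_type="cyto3")
# channels=[0, 0] treats the input as a single grayscale channel.
masks, flows, styles, diams = mod.eval(dapi_img, diameter=25, channels=[0, 0])

# --- Shell: Baysor (CLI) ---
# The prior segmentation is passed as a second positional argument. Option names
# differ between Baysor releases, so confirm them with `baysor run --help`.
# baysor run -x x -y y -g gene -o baysor_out/ transcripts.csv priors.tif

# --- Shell: proseg (CLI, fast) ---
# Output options differ between proseg versions; confirm with `proseg --help`.
# proseg transcripts.csv --xenium -o proseg_out/

# --- Python: aggregate transcripts into an AnnData using the segmentation ---
import spatialdata as sd
sdata = sd.read_zarr("xenium.zarr")
# `values` names the transcripts element and `value_key` its gene column
# ("feature_name" is assumed here, as in Xenium data; adjust for other platforms).
agg = sdata.aggregate(values="transcripts", by="cell_boundaries",
                      value_key="feature_name", agg_func="count")
adata = agg.tables["table"]  # aggregate() returns a SpatialData object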

📝 Self-check

1. Over-segmentation in image-based ST causes:

A. A real cell is split into several small cells with diluted expression

B. No problem at all

C. Increased marker expression

D. The same thing as doublet detection

2. Without a nuclear stain image, which method helps most?

A. Cellpose

B. Mesmer

C. Baysor / proseg (use the transcript point cloud directly)

D. PASTE

3. What is the main warning about segmentation from the 2025 bioRxiv study?

A. Cellpose is deprecated

B. Most papers skip segmentation validation

C. Xenium needs no segmentation

D. Baysor is always the best choice