STEP 14 / 15

Cell segmentation: the hidden bottleneck of image-based ST

Single-cell resolution on Xenium / MERFISH / CosMx depends on first outlining every cell, and each segmentation method has its own bias.

I. Why is it a hidden bottleneck?

Image-based platforms output a point cloud of transcripts, each with (x, y, z) coordinates. Before any clustering / DE / niche analysis, every transcript must be assigned to a cell. An error at this step propagates through everything downstream:

  • Over-segmentation: one cell is split into 2–3 fragments → expression is diluted → the true cell type is lost.
  • Under-segmentation: two neighboring cells are merged → a "synthetic doublet" → spurious mixed populations.
  • Boundary error: transcripts are assigned to the neighboring cell → marker expression leaks between cells.

A systematic bioRxiv study (Aug 2025) reports that most published image-based ST studies apply the default segmentation without validating it, which is currently one of the field's biggest blind spots.

III. What to do next

Cell × gene matrix

Assign each transcript to its cell according to the segmentation labels and sum the counts per cell and gene, producing a standard AnnData.
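The aggregation itself is simple counting. The sketch below uses only the Python standard library and a made-up toy transcript list (the gene names, cell ids, and the `build_count_matrix` helper are illustrative assumptions, not part of any platform's output format). In practice the resulting matrix would be wrapped in an AnnData.

```python
from collections import Counter

# Hypothetical toy input: one (gene, cell_id) record per transcript.
# cell_id None means the transcript fell outside every segmented cell.
transcripts = [
    ("CD3E", "c1"), ("CD3E", "c1"), ("MS4A1", "c2"),
    ("CD3E", "c2"), ("MS4A1", "c2"), ("EPCAM", None),
]

def build_count_matrix(records):
    """Return (cells, genes, matrix) where matrix[i][j] counts gene j in cell i."""
    # Unassigned transcripts are skipped, so they never reach the matrix.
    counts = Counter((cell, gene) for gene, cell in records if cell is not None)
    cells = sorted({cell for cell, _ in counts})
    genes = sorted({gene for _, gene in counts})
    # Counter returns 0 for (cell, gene) pairs that were never observed.
    matrix = [[counts[(c, g)] for g in genes] for c in cells]
    return cells, genes, matrix

cells, genes, matrix = build_count_matrix(transcripts)
for cell, row in zip(cells, matrix):
    print(cell, dict(zip(genes, row)))
```

Note that a gene seen only in unassigned transcripts (EPCAM here) drops out of the gene list in this sketch; a real pipeline would usually keep the full panel as columns.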

QC

Cells that are too small (< 10 transcripts) or too large (above the 99th percentile) are usually segmentation errors.
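As a minimal sketch of this filter, the function below keeps cells whose total transcript count lies between the lower threshold and the upper percentile. It assumes the 99th percentile is taken over transcripts per cell (the text could equally mean cell area), uses a nearest-rank percentile, and runs on synthetic counts; `qc_filter` and the toy data are hypothetical.

```python
def qc_filter(counts_per_cell, min_transcripts=10, upper_pct=99):
    """Keep cells with min_transcripts <= total <= the upper_pct-th percentile."""
    totals = sorted(counts_per_cell.values())
    # Nearest-rank percentile with integer arithmetic: ceil(upper_pct * n / 100).
    rank = max(1, -(-upper_pct * len(totals) // 100))
    upper = totals[rank - 1]
    kept = {c: n for c, n in counts_per_cell.items()
            if min_transcripts <= n <= upper}
    return kept, upper

# Synthetic example: 196 ordinary cells plus two tiny and two huge ones.
counts = {f"c{i}": 50 + i % 20 for i in range(196)}
counts.update({"tiny1": 3, "tiny2": 9, "huge1": 900, "huge2": 1200})

kept, upper = qc_filter(counts)
print(f"upper cutoff = {upper}, kept {len(kept)} of {len(counts)} cells")
```

A percentile cutoff always flags roughly the top 1% of cells, so it is worth looking at the flagged cells rather than discarding them blindly.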

Standard downstream

Same as scRNA-seq: normalization → HVG → PCA → Leiden → marker annotation.

Spatial analysis

Return to Step 11 (Niche) / Step 10 (CCC). This stage is easier on image-based data than on spot-based data, because each cell is resolved individually and its true neighbors are known.
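To illustrate what "known neighbors" means in practice, here is a small sketch that derives a neighbor list from cell centroids with a fixed distance cutoff. The centroids, the cutoff of 12 units, and the `radius_neighbors` helper are assumptions for illustration; the brute-force loop is quadratic, so real datasets would use a spatial index or a dedicated library instead.

```python
import math

def radius_neighbors(centroids, radius):
    """Map each cell id to the sorted ids of cells whose centroid lies within radius."""
    ids = list(centroids)
    neighbors = {cid: [] for cid in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            # Each unordered pair is tested once and recorded in both directions.
            if math.dist(centroids[a], centroids[b]) <= radius:
                neighbors[a].append(b)
                neighbors[b].append(a)
    return {cid: sorted(n) for cid, n in neighbors.items()}

# Hypothetical centroids in arbitrary units (e.g. micrometers).
centroids = {"c1": (0, 0), "c2": (8, 0), "c3": (8, 6), "c4": (40, 40)}
graph = radius_neighbors(centroids, radius=12)
for cell, nbrs in graph.items():
    print(cell, nbrs)
```

The quality of this graph is bounded by the segmentation: merged or fragmented cells shift centroids and therefore change who counts as a neighbor.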

⚠️
Mandatory sanity checks: overlay the segmentation boundaries on H&E/DAPI for 3–5 sampled regions and inspect them by eye; report mean transcripts per cell and the unassigned-transcript fraction; run at least one alternative method as a sanity check.
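The two summary statistics are easy to compute once every transcript carries a cell label. The sketch below assumes a per-transcript list of cell ids with `None` for unassigned transcripts; the `segmentation_summary` helper and both toy label lists are hypothetical stand-ins for a default and an alternative segmentation.

```python
def segmentation_summary(cell_ids):
    """cell_ids: one entry per transcript; None marks an unassigned transcript."""
    total = len(cell_ids)
    assigned = [c for c in cell_ids if c is not None]
    n_cells = len(set(assigned))
    return {
        "n_cells": n_cells,
        "mean_transcripts_per_cell": len(assigned) / n_cells if n_cells else 0.0,
        "unassigned_fraction": (total - len(assigned)) / total if total else 0.0,
    }

# Toy labels for the same 100 transcripts under two segmentations.
default_labels = ["a"] * 40 + ["b"] * 40 + [None] * 20
alt_labels = ["a"] * 30 + ["b"] * 30 + ["c"] * 30 + [None] * 10

default_summary = segmentation_summary(default_labels)
alt_summary = segmentation_summary(alt_labels)
print("default:    ", default_summary)
print("alternative:", alt_summary)
```

Large disagreements between the two summaries do not say which method is right, only that the region deserves a visual check against the stain image.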

Interactive: segmentation radius

Below: the black and gray shapes are two adjacent ground-truth cells (DAPI shown as the inner circle). Drag the "assumed cell radius" and watch how transcripts are assigned: too small → over-segmentation; too large → the two cells merge.
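The effect of the assumed radius can also be reproduced offline with a deliberately crude model. This is not the logic behind the interactive figure or any real segmentation tool: it assumes two hypothetical nucleus centroids 20 units apart, transcripts on a regular grid inside each true cell, and a rule that a transcript belongs to every centroid within the assumed radius.

```python
CENTERS = {"A": (0, 0), "B": (20, 0)}  # hypothetical nucleus centroids

def make_transcripts(true_radius=9, step=3):
    """Place transcripts on a regular grid inside each ground-truth cell."""
    pts = []
    offsets = range(-true_radius, true_radius + 1, step)
    for cell, (cx, cy) in CENTERS.items():
        for dx in offsets:
            for dy in offsets:
                if dx * dx + dy * dy <= true_radius ** 2:
                    pts.append((cx + dx, cy + dy, cell))
    return pts

def assign(points, radius):
    """Count transcripts by outcome for a given assumed cell radius."""
    result = {"correct": 0, "wrong": 0, "ambiguous": 0, "unassigned": 0}
    for x, y, true_cell in points:
        hits = [c for c, (cx, cy) in CENTERS.items()
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]
        if not hits:
            result["unassigned"] += 1   # outside every assumed circle
        elif len(hits) > 1:
            result["ambiguous"] += 1    # circles overlap: cells start to merge
        else:
            result["correct" if hits[0] == true_cell else "wrong"] += 1
    return result

points = make_transcripts()
for r in (4, 10, 16):
    print(r, assign(points, r))
```

In this toy geometry a radius that is too small leaves most transcripts unassigned, while a radius that is too large makes the circles overlap so that transcripts near the shared border can no longer be attributed to a single cell.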

Implementation

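# R: Seurat v5 reads the default Xenium segmentation directly
library(Seurat)
xen <- LoadXenium("xenium_out/", fov = "fov")
ImageDimPlot(xen, fov = "fov", axes = TRUE, cols = "polychrome")
# FindClusters needs the SNN graph, so FindNeighbors must run first
xen <- SCTransform(xen, assay = "Xenium") |> RunPCA(npcs = 30) |>
  FindNeighbors(dims = 1:30) |> FindClusters(resolution = 0.4)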
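# Python: Cellpose on the DAPI channel (written against the Cellpose 3.x API;
# later releases changed this interface, so check your installed version)
from cellpose import models
# dapi_img: 2D NumPy array holding the DAPI image, loaded beforehand
mod = models.Cellpose(model_type="cyto3")
# diameter is in pixels; channels=[0, 0] means single-channel grayscale input
masks, flows, styles, diams = mod.eval(dapi_img, diameter=25, channels=[0, 0])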

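# Baysor (CLI). The prior segmentation is a positional argument after the transcript
# file; -p is Baysor's --plot flag. Confirm with `baysor run --help` for your version.
# baysor run -x x -y y -g gene -o baysor_out/ transcripts.csv priors.tif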

# proseg (CLI, fast)
# proseg transcripts.csv --xenium -o proseg_out/

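# Aggregate transcripts into an AnnData using the segmentation boundaries
import spatialdata as sd
sdata = sd.read_zarr("xenium.zarr")
# aggregate() returns a SpatialData; value_key names the gene column
# ("feature_name" in Xenium transcript tables; adjust for other platforms)
agg = sdata.aggregate(values="transcripts", by="cell_boundaries",
                      value_key="feature_name", agg_func="count")
adata = agg.tables["table"]  # cell × gene counts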

📝 Self-check

1. Over-segmentation in image-based ST causes:

A. A real cell is split into several small cells with diluted expression
B. No problem at all
C. Increased marker expression
D. The same effect as doublet detection

2. Without a nuclear stain image, which option helps most?

A. Cellpose
B. Mesmer
C. Baysor / proseg (work directly on the transcript point cloud)
D. PASTE

3. What is the main warning of the 2025 bioRxiv study on segmentation?

A. Cellpose is no longer usable
B. Most papers skip segmentation validation
C. Xenium needs no segmentation
D. Baysor is always the best choice