I. Why is segmentation a hidden bottleneck?
Image-based platforms output a point cloud of transcripts, each with (x, y, z) coordinates. Before any downstream clustering, differential expression (DE), or niche analysis, every transcript must be assigned to a cell. An error at this step propagates through everything downstream:
- Over-segmentation: one cell is split into 2 to 3 fragments → per-cell expression is diluted → the true cell type is obscured.
- Under-segmentation: two adjacent cells are merged → a "synthetic doublet" → spurious mixed populations.
- Boundary error: transcripts are assigned to a neighboring cell → marker expression leaks between cells.
A systematic study posted on bioRxiv in August 2025 reports that the vast majority of published image-based ST studies apply the default segmentation without validating it, which makes this one of the field's biggest current blind spots.
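To make the first two failure modes concrete, the sketch below counts genes per cell under a correct, an over-segmented, and an under-segmented assignment. The two cells, their marker genes, and the transcript counts are invented for illustration, and the round-robin split is only a stand-in for how a real algorithm might fragment a cell.

```python
from collections import Counter

# Hypothetical transcripts for two neighboring cells: (true cell, gene).
transcripts = (
    [("T_cell", "CD3E")] * 30 + [("T_cell", "PTPRC")] * 30
    + [("tumor", "EPCAM")] * 40 + [("tumor", "KRT8")] * 20
)

def profile(assignments):
    """Count genes per assigned cell label."""
    cells = {}
    for label, gene in assignments:
        cells.setdefault(label, Counter())[gene] += 1
    return cells

# Correct segmentation: every transcript keeps its true cell.
correct = profile(transcripts)

# Over-segmentation: the T cell's transcripts are dealt into 3 fragments.
over = profile(
    ((f"T_cell_frag{i % 3}" if cell == "T_cell" else cell), gene)
    for i, (cell, gene) in enumerate(transcripts)
)

# Under-segmentation: both cells collapse into one object.
under = profile(("merged", gene) for _, gene in transcripts)

for name, result in (("correct", correct), ("over", over), ("under", under)):
    print(name, {label: dict(genes) for label, genes in result.items()})
```

Each fragment ends up with a third of the T cell's transcripts, while the merged object carries both T-cell and epithelial markers, which is the pattern a clustering step would read as a mixed population.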
II. Mainstream methods
| Method | Input | Principle | Strengths / weaknesses |
|---|---|---|---|
| Cellpose | DAPI / membrane stain | Pretrained U-Net model | Fast and stable; default in MERFISH/Vizgen |
| Baysor | Transcript point cloud (image optional) | Probabilistic model that clusters transcripts by co-occurrence | Works without a nuclear stain; parameter-sensitive |
| proseg | Transcript point cloud | Recent (2024) probabilistic, scalable method | Excellent speed on large Xenium datasets |
| StarDist / Mesmer | Multi-channel images | Star-convex polygon prediction (StarDist); deep-learning nuclear and whole-cell segmentation (Mesmer) | Strong on H&E and IF |
| UCS (2025) | Image + transcripts | Unified deep learning that fuses imaging and transcript signals | Newest unified cross-platform method |
III. What comes next
Cell × gene matrix
Accumulate each transcript into the cell given by its segmentation label, then export the resulting counts as a standard AnnData object.
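The accumulation step can be sketched without any dependencies, as below. The input format (one `(cell_id, gene)` pair per transcript), the function name `build_count_matrix`, and the use of `-1` for transcripts outside every cell are assumptions made for this example; in practice the resulting matrix would be wrapped in an `anndata.AnnData` object.

```python
from collections import Counter, defaultdict

def build_count_matrix(transcripts, unassigned=-1):
    """Return (cell_ids, genes, matrix) from (cell_id, gene) pairs.

    matrix[i][j] is the number of transcripts of genes[j] in cell_ids[i].
    Transcripts labeled `unassigned` (outside every cell) are skipped.
    """
    counts = defaultdict(Counter)
    for cell_id, gene in transcripts:
        if cell_id != unassigned:
            counts[cell_id][gene] += 1
    cell_ids = sorted(counts)
    genes = sorted({gene for per_cell in counts.values() for gene in per_cell})
    matrix = [[counts[c][g] for g in genes] for c in cell_ids]
    return cell_ids, genes, matrix

# Hypothetical toy input: (segmentation label, gene name) per transcript.
toy = [(1, "CD3E"), (1, "CD3E"), (1, "PTPRC"),
       (2, "EPCAM"), (-1, "EPCAM"), (2, "KRT8")]
cell_ids, genes, matrix = build_count_matrix(toy)
print(cell_ids, genes, matrix)
```

Rows are cells and columns are genes, matching the orientation AnnData expects for its `X` matrix.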
QC
Cells that are too small (< 10 transcripts) or too large (above the 99th percentile) are usually segmentation errors.
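A minimal version of this filter is sketched below. It assumes that "too large" refers to the per-cell transcript total and uses a nearest-rank percentile; the function name `qc_filter` and the toy totals are made up for the example, and both thresholds should be tuned per dataset.

```python
import math

def qc_filter(totals, min_transcripts=10, upper_pct=99.0):
    """Return ids of cells whose transcript total passes both thresholds.

    totals: dict mapping cell id -> total transcript count.
    The upper cutoff is the nearest-rank percentile of the totals.
    """
    ranked = sorted(totals.values())
    rank = max(1, math.ceil(upper_pct * len(ranked) / 100))
    upper = ranked[rank - 1]
    return [cell for cell, n in totals.items() if min_transcripts <= n <= upper]

# Hypothetical totals: 98 ordinary cells, one fragment, one merged object.
totals = {f"cell{i}": 50 + i for i in range(98)}
totals["fragment"] = 4
totals["merged"] = 900
kept = qc_filter(totals)
print(len(kept), "of", len(totals), "cells kept")
```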
Standard downstream
From here the workflow is the same as for scRNA-seq: normalization → HVG selection → PCA → Leiden clustering → marker-based annotation.
Spatial analysis
Return to Step 11 (Niche) and Step 10 (CCC). This stage is easier on image-based data than on spot-based data because the data are single-cell and the true cell adjacency is known.
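As an illustration of what "known adjacency" provides, the sketch below builds a neighbor list from cell centroids with a fixed distance cutoff. The centroid coordinates, the cutoff, and the function name `radius_neighbors` are hypothetical, and the all-pairs loop is only suitable for small examples; real datasets would use a spatial index or a dedicated library.

```python
import math

def radius_neighbors(centroids, radius):
    """Map each cell id to the ids of cells whose centroid lies within `radius`."""
    ids = list(centroids)
    neighbors = {cell: [] for cell in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if math.dist(centroids[a], centroids[b]) <= radius:
                neighbors[a].append(b)
                neighbors[b].append(a)
    return neighbors

# Hypothetical centroids, in the same units as the cutoff.
centroids = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (100.0, 0.0)}
graph = radius_neighbors(centroids, radius=15.0)
print(graph)
```

A graph like this is the input that niche and cell-cell communication analyses operate on.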
Interactive: segmentation radius
The figure below shows two adjacent ground-truth cells in black and gray, with DAPI drawn as an inner circle. Drag the "assumed cell radius" control and watch how the transcript points are assigned: too small a radius gives over-segmentation, and too large a radius merges the two cells.
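The same experiment can be approximated in a few lines of code. This is a toy model and not the logic behind the figure: it assumes two touching circular cells of radius 10, places transcripts evenly inside each, grows a disc of the assumed radius around each center, and treats overlapping discs as one fused object. In this simplification a too-small radius shows up as transcripts left outside every cell.

```python
import math

GOLDEN_ANGLE = math.pi * (3 - math.sqrt(5))

def disc_points(center, radius, n):
    """Spread n points evenly over a disc using a sunflower spiral."""
    cx, cy = center
    points = []
    for k in range(n):
        r = radius * math.sqrt((k + 0.5) / n)
        theta = k * GOLDEN_ANGLE
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points

def simulate(assumed_radius, true_radius=10.0, n_per_cell=400):
    """Assign transcripts of two touching cells under an assumed cell radius."""
    centers = {"A": (0.0, 0.0), "B": (2 * true_radius, 0.0)}
    transcripts = [p for c in centers.values()
                   for p in disc_points(c, true_radius, n_per_cell)]
    # Toy rule: assumed discs that overlap are fused into a single object.
    merged = 2 * assumed_radius > math.dist(centers["A"], centers["B"])
    labels = []
    for p in transcripts:
        name, d = min(((n, math.dist(p, c)) for n, c in centers.items()),
                      key=lambda pair: pair[1])
        if d > assumed_radius:
            labels.append(None)  # outside every assumed cell
        else:
            labels.append("A+B" if merged else name)
    return {
        "merged": merged,
        "unassigned": labels.count(None) / len(labels),
        "objects": len(set(labels) - {None}),
    }

for radius in (5.0, 10.0, 15.0):
    print(radius, simulate(radius))
```

Halving the radius leaves most transcripts unassigned, the true radius recovers both cells, and a larger radius fuses them into one object.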
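Implementation

    # Seurat v5: load the default Xenium segmentation directly
    library(Seurat)
    xen <- LoadXenium("xenium_out/", fov = "fov")
    ImageDimPlot(xen, fov = "fov", axes = TRUE, cols = "polychrome")
    # FindClusters needs a neighbor graph, so FindNeighbors runs first
    xen <- SCTransform(xen, assay = "Xenium") |>
      RunPCA() |>
      FindNeighbors(dims = 1:30) |>
      FindClusters(resolution = 0.4)

    # Cellpose on the DAPI channel (written against the Cellpose 3 API;
    # other releases may expose a different class and return values)
    from cellpose import models
    # dapi_img: 2D array holding the DAPI image, loaded beforehand
    mod = models.Cellpose(model_type="cyto3")
    masks, flows, styles, diams = mod.eval(dapi_img, diameter=25, channels=[0, 0])

    # Baysor (CLI): the optional prior segmentation follows the transcript file;
    # confirm the argument order with `baysor run --help` for your version
    # baysor run -x x -y y -g gene -o baysor_out/ transcripts.csv priors.tif

    # proseg (CLI, fast)
    # proseg transcripts.csv --xenium -o proseg_out/

    # Aggregate transcripts into an AnnData using the segmentation boundaries.
    # `values` names the points element and `value_key` the gene column;
    # "feature_name" is the usual Xenium column, adjust if yours differs.
    import spatialdata as sd
    sdata = sd.read_zarr("xenium.zarr")
    agg = sdata.aggregate(values="transcripts", by="cell_boundaries",
                          value_key="feature_name", agg_func="count")
    adata = agg.tables["table"]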
📝 Self-check
1. What does over-segmentation cause in image-based ST?
2. When no nuclear stain image is available, which method helps most?
3. What is the main warning about segmentation from the 2025 bioRxiv study?