REFERENCES

scRNA-seq References Index

Original papers, official documentation, Best Practices reviews, benchmarks, and marker gene databases that back the 12 topics of the scRNA-seq Interactive Tutorial, each with DOI / URL links.

How to use this page

For every tool, algorithm, and biological concept mentioned in the tutorial, this page collects academic sources for deeper reading. Citation tag meanings:

📄 Paper: original papers with DOI / PubMed

📘 Doc: official documentation, vignettes, tutorials

Best Practice: systematic reviews and community recommendations

📊 Benchmark: method comparisons and independent benchmarks

🗄️ Database: marker gene and cell annotation databases

📚 Book: free online books and comprehensive textbooks

Contents

⭐ Best Practices Reviews

Authoritative, textbook-level reviews of the entire scRNA-seq analysis workflow. Start with these two papers to build an overview of the design tradeoffs across the full pipeline.

🧰 Core Analysis Frameworks

The two mainstream software ecosystems for scRNA-seq workflows. Seurat (R) and Scanpy (Python) largely overlap in functionality, and most methods are implemented in both.

🧪 Quality Control

Cell-level metrics (nFeature / nCount / percent.mt), gene-level filtering, and the more advanced steps of doublet detection and ambient RNA removal.

Doublet detection

Ambient RNA / Soup

Adaptive thresholds and QC tools

⚖️ Normalization

✨ Variable Features (highly variable gene selection)

📏 Scaling

📉 Principal Component Analysis (PCA)

🧩 Clustering

🗺️ UMAP / t-SNE Visualization

🏷️ Cell Annotation

🧾 Differential Expression

🔗 Integration

💬 Cell-Cell Communication

🛤️ Trajectory / Pseudotime

🗄️ Marker Gene Databases and Cell Atlases

The PBMC markers demonstrated in the tutorial’s “Annotation” chapter (CD3D, IL7R, MS4A1, CD79A, GNLY, NKG7, KLRD1, CD14, LYZ, S100A8, FCGR3A, MS4A7, FCER1A, CST3, PPBP, PF4, etc.) can be traced back to their original literature in the databases below.

📖 Lookup guide: all DOI links point to the publisher's page. To download a PDF, look for an open-access version on PubMed, PMC, arXiv, or bioRxiv.

📌 Tutorial Notes and Details

The notes below come from cross-checking the scRNA-seq tutorial HTML against this reference list. They flag places that may be incomplete, easily misread, or worth expanding. The original tutorial files are not modified; the clarifications are recorded here for cross-reference.

Notes

Normalization: SCTransform is not necessarily the best choice

If the tutorial recommends SCTransform (Hafemeister & Satija 2019) as the default "best" normalization, that needs correcting. Ahlmann-Eltze & Huber (2023, Nat Methods) systematically benchmarked 22 scRNA-seq normalization methods, comparing them on downstream cluster reproducibility, DE, and trajectory analysis, and conclude: (1) a simple shifted log, log(y/s + 1), with a size factor s (for example from scran::computeSumFactors) matches or beats more complex methods such as SCTransform, GLM-PCA, and scVI on most tasks; (2) Pearson residuals (SCT v2) have an advantage for highly variable gene selection, but using the full residual matrix may introduce spurious negative values; (3) there is no one-size-fits-all normalization. In practice: (a) for exploration, the shifted log from NormalizeData() is enough; (b) add size-factor regression only when sequencing depth differs by more than 10× between cells; (c) cross-validate cluster stability across normalization methods.

Sources: Ahlmann-Eltze C, Huber W (2023) Comparison of transformations for single-cell RNA-seq data, Nat Methods 20:665–672. DOI: 10.1038/s41592-023-01814-1; Hafemeister C, Satija R (2019) SCTransform, Genome Biology 20:296.
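The shifted log transform named above, log(y/s + 1), can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the simplest possible size factor (a cell's total count divided by the mean total across cells); scran::computeSumFactors uses a pooling-based estimate instead, and Seurat's NormalizeData() rescales to a fixed total before taking the log. The function names and the toy count matrix are hypothetical.

```python
import math

def size_factors(counts):
    """Per-cell size factor: total counts divided by the mean total across cells."""
    totals = [sum(cell) for cell in counts]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]

def shifted_log(counts):
    """Apply log(y / s + 1) to every count y of a cell with size factor s."""
    factors = size_factors(counts)
    return [[math.log(y / s + 1.0) for y in cell]
            for cell, s in zip(counts, factors)]

# Toy matrix (hypothetical values): rows are cells, columns are genes.
counts = [
    [10, 0, 5],    # shallow cell, 15 counts in total
    [20, 0, 10],   # same composition at twice the depth
    [0, 30, 15],   # different composition, 45 counts in total
]
normalized = shifted_log(counts)
```

Cells 0 and 1 differ only in depth, so their normalized profiles coincide, and zero counts stay at zero.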

Notes

Clustering: the effect of k.param=20

Tutorials often use FindNeighbors(..., k.param=20) + FindClusters(resolution=0.5) as defaults without comment. This choice is not neutral; it biases the result towards medium granularity: (1) too small a k (<10) amplifies noise and produces many micro-clusters; (2) too large a k (>50) absorbs rare cell types (such as plasmacytoid DCs or Tregs) into neighbouring populations; (3) on small datasets (<1,000 cells), k = 20 can approach the size of the smaller populations, so clusters become unstable; (4) k and resolution should be swept for each dataset scale and stability assessed with silhouette, ROGUE, or clustree. In practice: first use clustree::clustree() to inspect the resolution ladder and find a stable plateau, then use scran::clusterCells() or bluster::clusterRows() to compare cluster purity across values of k.

Sources: Hao Y et al. (2024) Seurat v5, Nat Biotechnol 42:293–304. DOI: 10.1038/s41587-023-01767-y; Zappia L, Oshlack A (2018) clustree, GigaScience 7:giy083; Liu et al. (2020) ROGUE, Nat Commun 11:3155.
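Why a large k swallows rare populations can be seen on a toy example. The sketch below is plain Python, not the Seurat or bluster API: it measures, for each label, the average fraction of a cell's k nearest neighbours that share its label. The one-dimensional coordinates, the labels, and the function name knn_purity are all hypothetical.

```python
def knn_purity(points, labels, k):
    """Mean fraction of each cell's k nearest neighbours sharing its label, per label."""
    purity = {}
    for i, (x, lab) in enumerate(zip(points, labels)):
        # Rank all other cells by distance to cell i and keep the k closest.
        others = sorted((abs(x - y), j) for j, y in enumerate(points) if j != i)
        same = sum(labels[j] == lab for _, j in others[:k])
        purity.setdefault(lab, []).append(same / k)
    return {lab: sum(v) / len(v) for lab, v in purity.items()}

# Hypothetical 1-D expression axis: 60 common cells near 0, 6 rare cells near 5.
points = [i * 0.01 for i in range(60)] + [5.0 + i * 0.01 for i in range(6)]
labels = ["common"] * 60 + ["rare"] * 6

for k in (5, 20):
    print(k, knn_purity(points, labels, k))
```

With k = 5 every neighbour of a rare cell is another rare cell; with k = 20 only 5 of the 20 neighbours can be, so the graph ties the rare cells mostly to the common population.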

Notes

Doublet: tool choice and benchmarks

If the tutorial recommends only DoubletFinder (McGinnis 2019), newer evidence should be added. Xi & Li (2021, Cell Systems) systematically benchmarked 9 doublet-detection tools on 16 ground-truth datasets, and Germain et al. (2021, F1000Research) repeated that benchmark with scDblFinder included. Taken together: (1) scDblFinder comes out best overall for accuracy, run time, and stability across datasets in the comparison by Germain et al.; (2) DoubletFinder, the most accurate of the nine tools in Xi & Li, follows, but needs a manually estimated expected doublet rate; (3) Scrublet (Wolock 2019) is fast but has lower recall; (4) the doublets called by different tools overlap only partly, so a consensus of ≥2 tools is advisable; (5) homotypic doublets (two cells of the same type) are poorly detected by every tool, so upper limits on nCount_RNA and nFeature_RNA are still needed. In practice: use scDblFinder as the first choice, and note that multimodal data such as 10x Multiome or Visium HD need modality-specific doublet detection (for example AMULET for ATAC).

Sources: Xi NM, Li JJ (2021) Benchmarking computational doublet-detection methods, Cell Systems 12:176–194. DOI: 10.1016/j.cels.2020.11.008; Germain PL et al. (2021) scDblFinder, F1000Research 10:979.
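The "consensus of ≥2 callers" rule amounts to vote counting over barcodes. A minimal sketch in plain Python follows; it assumes each tool's calls have already been exported as a list of barcodes, and the tool outputs, barcodes, and the function name consensus_doublets are hypothetical.

```python
from collections import Counter

def consensus_doublets(calls_by_tool, min_votes=2):
    """Barcodes flagged as doublets by at least min_votes tools."""
    votes = Counter(
        barcode
        for calls in calls_by_tool.values()
        for barcode in set(calls)  # a tool votes at most once per barcode
    )
    return {barcode for barcode, n in votes.items() if n >= min_votes}

# Hypothetical doublet calls exported from three tools.
calls_by_tool = {
    "scDblFinder": ["AAAC", "AAAG", "AACT"],
    "DoubletFinder": ["AAAC", "AAAG"],
    "Scrublet": ["AAAC", "AATT"],
}
flagged = consensus_doublets(calls_by_tool)
```

Barcodes called by a single tool (here AACT and AATT) are left for manual review rather than removed outright.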

Notes

Pseudotime: what it is and how it is misused

Tutorials often show Monocle3 / Slingshot / PAGA / scVelo trajectories and imply a "developmental order of cells". This needs clarifying: (1) pseudotime is a reconstruction, not ground truth; it only reflects a topological ordering by expression similarity within the sample and need not correspond to a real time axis; (2) Saelens et al. (2019, Nat Biotechnol) systematically benchmarked 45 trajectory inference tools and found that different methods can infer very different topologies from the same data; (3) RNA velocity (La Manno 2018; Bergen 2020, scVelo) adds kinetics, but Bergen et al. (2021, Mol Syst Biol) point out that its assumptions (constant splicing and degradation rates) are violated in many lineages, for example erythroid maturation, and can yield the wrong direction; (4) the choice of root cell alone determines the direction of pseudotime. In practice, a reported trajectory should (a) be run with ≥2 tools to confirm that the topology agrees, (b) be validated experimentally with lineage tracing or metabolic-labeling sequencing such as scNT-seq or sci-fate, and (c) avoid over-interpreting "leaves" as terminally differentiated states.

Sources: Saelens W et al. (2019) A comparison of single-cell trajectory inference methods, Nat Biotechnol 37:547–554. DOI: 10.1038/s41587-019-0071-9; Bergen V et al. (2021) RNA velocity—current challenges, Mol Syst Biol 17:e10282.
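Point (4), that the root choice fixes the direction, is easy to demonstrate. The sketch below treats pseudotime as the number of graph steps from a chosen root, which is a deliberate simplification and not the algorithm of Monocle3, Slingshot, or PAGA. The cell-state names and the function graph_pseudotime are hypothetical.

```python
from collections import deque

def graph_pseudotime(edges, root):
    """Pseudotime as the number of graph steps from the root (breadth-first search)."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    time = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in time:
                time[nxt] = time[node] + 1
                queue.append(nxt)
    return time

# Hypothetical chain of four cell states linked by expression similarity.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
from_a = graph_pseudotime(edges, "A")
from_d = graph_pseudotime(edges, "D")
```

The same graph gives A the earliest pseudotime under one root and the latest under the other; nothing in the expression data alone decides between the two.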

Notes

Annotation: balancing automation and expert review

If the tutorial recommends a single annotation tool, it should present three classes of strategy side by side: (1) Manual, marker-based: compare dot plots / violin plots against PanglaoDB, CellMarker 2.0 (Hu 2023, NAR), or Azimuth references. This is interpretable and controllable, but subjective, and rare cell types are easily missed. (2) SingleR / scmap / Symphony (reference-based): fast, and need a bulk or scRNA-seq reference; cell types absent from the reference are forced into a wrong label. (3) scANVI / CellTypist (Domínguez Conde 2022, Science) (probabilistic classifiers): report per-cell uncertainty and can be used with pretrained models or fine-tuned; scANVI is a deep generative model that benefits from a GPU, whereas CellTypist is a logistic regression classifier that runs on a CPU. The benchmarks of Tan & Cahan (2019, Cell Systems) and Abdelaal et al. (2019, Genome Biology) indicate that no single tool is best everywhere, and a majority vote across methods is the most stable. In practice: (a) any automatic annotation should be reviewed against marker genes by a domain expert; (b) reports should list the reference version, the fraction of unmapped cells, and the fraction of low-confidence calls.

Sources: Abdelaal T et al. (2019) A comparison of automatic cell identification methods for single-cell RNA sequencing data, Genome Biology 20:194. DOI: 10.1186/s13059-019-1795-z; Domínguez Conde C et al. (2022) Celltypist, Science 376:eabl5197.
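A cross-method majority vote can be sketched as follows. The example is plain Python and assumes the per-cell labels from each tool have already been harmonized to one vocabulary; the labels, cell names, and the function majority_label are hypothetical. Ties and weak agreement are returned as "unresolved" so they reach marker-based expert review instead of being guessed.

```python
from collections import Counter

def majority_label(labels, min_votes=2):
    """Most common label, or 'unresolved' on a tie or when agreement is too weak."""
    ranked = Counter(labels).most_common()
    top_label, top_votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    if tied or top_votes < min_votes:
        return "unresolved"
    return top_label

# Hypothetical labels from three annotation tools for three cells.
predictions = {
    "cell_1": ["CD14 Mono", "CD14 Mono", "CD16 Mono"],
    "cell_2": ["B", "NK", "CD4 T"],
    "cell_3": ["NK", "NK", "NK"],
}
consensus = {cell: majority_label(labels) for cell, labels in predictions.items()}
```

The share of "unresolved" cells is a natural number to report alongside the low-confidence fraction.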

Notes

Integration: method choice and over-correction

If the tutorial presents Harmony or Seurat CCA as a universal integration method, the tradeoffs between the main approaches need spelling out. The scIB benchmark of Luecken et al. (2022, Nat Methods) evaluated 16 integration methods on 13 tasks, scoring batch removal against biological conservation: (1) Harmony (Korsunsky 2019): fast soft clustering in PCA space; suits medium-sized data with small compositional differences; carries a high over-correction risk. (2) scVI / scANVI (Lopez 2018; Xu 2021): VAE models with the best combined score for conserving biological variation while correcting batch; but they benefit from a GPU, train slowly, and are sensitive to hyperparameters. (3) Seurat CCA / RPCA: anchor-based; CCA is more sensitive but prone to over-correction, and RPCA is the recommended compromise from v4 onwards. (4) BBKNN: fast KNN-graph correction, suited to atlas-scale data. In practice: (a) quantify the result with scIB metrics (kBET, iLISI, ARI); (b) watch for designs where batch differences coincide with biological differences (for example disease and control processed in different batches), because integration will then erase the real effect; (c) reports must state the method, its version, and the batch covariate.

Sources: Luecken MD et al. (2022) Benchmarking atlas-level data integration in single-cell genomics, Nat Methods 19:41–50. DOI: 10.1038/s41592-021-01336-8; Korsunsky I et al. (2019) Harmony, Nat Methods 16:1289.
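The confounded design described above can be checked from the sample sheet before any integration is run. The sketch below is plain Python and uses a deliberately strict definition (every batch holds exactly one condition); partial confounding would need a finer measure. The sample sheet and the function name is_confounded are hypothetical.

```python
def is_confounded(samples):
    """True when there are several conditions but each batch contains only one."""
    conditions_per_batch = {}
    for batch, condition in samples:
        conditions_per_batch.setdefault(batch, set()).add(condition)
    all_conditions = {condition for _, condition in samples}
    return len(all_conditions) > 1 and all(
        len(conds) == 1 for conds in conditions_per_batch.values()
    )

# Hypothetical sample sheets as (batch, condition) pairs.
confounded = [("batch1", "disease"), ("batch1", "disease"), ("batch2", "control")]
balanced = [
    ("batch1", "disease"), ("batch1", "control"),
    ("batch2", "disease"), ("batch2", "control"),
]
print(is_confounded(confounded), is_confounded(balanced))
```

When the check returns True, removing the batch effect also removes the condition effect, so the comparison cannot be rescued computationally.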

Notes

UMAP: mistaking geometry for biology

If the tutorial infers "relatedness of cell types" from the spacing of UMAP clusters, a strong warning is needed. Chari & Pachter (2023, PLOS Comput Biol), The specious art of single-cell genomics, argue systematically that: (1) mapping high-dimensional points to 2D with UMAP / t-SNE necessarily distorts global structure, so the "distance" between clusters has almost no relation to the true high-dimensional distance; (2) the same data with a different random seed can give a UMAP with a different topology; (3) UMAP "branches" are often misread as developmental trajectories when they are projection artifacts; (4) the Pearson correlation between UMAP distance and true distance is often <0.3. In practice: (a) use UMAP only to visualize local neighborhoods, and do not argue lineage relationships from inter-cluster spacing; (b) run trajectory analysis in PCA or diffusion-map space; (c) reports should (i) disclose the random seed, (ii) also show PCA pair plots, and (iii) make no statistical inference on UMAP coordinates. Kobak & Linderman (2021, Nat Biotechnol) show that PCA-initialized t-SNE / UMAP improves reproducibility, but this does not solve the global-distance problem.

Sources: Chari T, Pachter L (2023) The specious art of single-cell genomics, PLOS Comput Biol 19:e1011288. DOI: 10.1371/journal.pcbi.1011288; Kobak D, Linderman GC (2021) Initialization is critical for preserving global data structure in both t-SNE and UMAP, Nat Biotechnol 39:156.
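The distance-correlation check in point (4) can be reproduced on any embedding. The sketch below is plain Python: it computes all pairwise distances in the original space and in a 2D layout, then their Pearson correlation. The coordinates are hypothetical hand-made layouts, not real UMAP output, and the function names are illustrative.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: four cells in the original space and two 2-D layouts.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
faithful = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
distorted = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (6.0, 0.0)]  # far pairs drawn close

d_original = pairwise_distances(original)
r_faithful = pearson(d_original, pairwise_distances(faithful))
r_distorted = pearson(d_original, pairwise_distances(distorted))
```

On real data the same two functions can be applied to a subsample of cells, with PCA coordinates as the original space and the UMAP coordinates as the layout.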

Doublet detection: Xi & Li 2021 Cell Systems benchmark

When the QC chapter introduces doublet detection, it often lists tool names (Scrublet, DoubletFinder, scDblFinder) without giving selection criteria. Xi & Li 2021 Cell Systems, "Benchmarking Computational Doublet-Detection Methods for Single-Cell RNA Sequencing Data", compares 9 doublet-detection tools on 16 real datasets plus simulated data. DoubletFinder has the best detection accuracy among the nine, and scDblFinder, evaluated later on the same datasets by Germain et al., also ranks at the top by AUPRC; Scrublet is fast but weaker at recalling inter-cell-type doublets; intra-cell-type doublets (merges of two cells of the same type) are hard for all tools. Practical rules: (1) use scDblFinder (Germain 2021 F1000Research) as the default, because it combines cluster-based and random doublet simulation; (2) set the expected doublet rate from 10x's published value (~0.4% per 1,000 cells loaded; check the user guide for the chemistry in use); (3) always compare the nFeature / nCount distributions before and after detection; (4) joint doublet detection across samples can reduce batch interference.
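Rule (2) is simple arithmetic, sketched here in plain Python. The linear rate of 0.4% per 1,000 cells loaded is the figure quoted above; applying the resulting rate to the number of recovered cells to get an expected doublet count (as an nExp-style input) is an assumption of this sketch, and the function names are hypothetical.

```python
def expected_doublet_rate(cells_loaded, rate_per_1000=0.004):
    """Linear approximation: doublet rate grows by rate_per_1000 per 1,000 cells loaded."""
    return cells_loaded / 1000 * rate_per_1000

def expected_doublet_count(cells_loaded, cells_recovered, rate_per_1000=0.004):
    """Expected number of doublets among the recovered cells."""
    return round(expected_doublet_rate(cells_loaded, rate_per_1000) * cells_recovered)

# Hypothetical run: 10,000 cells loaded, 6,000 cells recovered after filtering.
rate = expected_doublet_rate(10_000)
n_expected = expected_doublet_count(10_000, 6_000)
print(f"rate = {rate:.1%}, expected doublets = {n_expected}")
```

Because the rate scales with loading, heavily loaded lanes need a proportionally larger expected count than lightly loaded ones.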

Tran 2020 Genome Biology: batch correction can damage signal when there is no real batch effect

Integration chapters often teach "batch-correct first, then analyse" as standard practice; a caution is needed. Tran, Ang, Chevrier, Zhang, Lee, Goh, Chen 2020 Genome Biology, "A benchmark of batch-effect correction methods for single-cell RNA sequencing data", compares 14 integration methods (including Harmony, BBKNN, Seurat v3 CCA, MNN, Scanorama, and LIGER) across 5 scenarios. Points to take from it: (1) correcting when no real batch exists (same experiment, same platform) is over-correction, and some methods (for example BBKNN at default parameters) can flatten distinct cell types into one cluster; (2) Harmony and Seurat v3 are robust in most scenarios but sensitive to hyperparameters; (3) scVI (not among the 14 methods in this benchmark; see Luecken 2022) has a clear advantage when batch effects are strong but can overfit small batches. Practical rules: (1) first gauge the strength of the batch effect with PCA / UMAP on the uncorrected data; (2) if silhouette-by-batch < 0.05 and the cell types already separate cleanly, batch correction can be skipped; (3) use the scIB-metrics of Luecken 2022 Nat Methods, which report batch removal and biological conservation on separate axes; (4) report whether marker-gene expression is suppressed after correction.
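Rule (2) relies on a silhouette score computed with batch as the grouping variable. The sketch below is a plain-Python version of the standard silhouette definition, with one simplification: cells that are alone in their batch are skipped rather than scored as 0. The coordinates stand in for PCA embeddings and are hypothetical, as is the function name.

```python
import math

def mean_silhouette(points, groups):
    """Mean silhouette width of points with respect to the given group labels."""
    scores = []
    for i, p in enumerate(points):
        dist_by_group = {}
        for j, q in enumerate(points):
            if i != j:
                dist_by_group.setdefault(groups[j], []).append(math.dist(p, q))
        own = dist_by_group.pop(groups[i], None)
        if not own or not dist_by_group:
            continue  # singleton group, or only one group present
        a = sum(own) / len(own)                                    # cohesion
        b = min(sum(d) / len(d) for d in dist_by_group.values())   # nearest other group
        if max(a, b) > 0:
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical embeddings: batches interleaved vs. batches far apart.
mixed_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
mixed_batches = ["b1", "b2", "b2", "b1"]
split_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
split_batches = ["b1", "b1", "b2", "b2"]

s_mixed = mean_silhouette(mixed_points, mixed_batches)
s_split = mean_silhouette(split_points, split_batches)
```

A score near or below zero means batches are interleaved, and a score near 1 means cells group by batch, which calls for correction or, in a confounded design, for caution.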

Notes

QC: DoubletFinder API names

If the tutorial still demonstrates doubletFinder_v3(), update it: that is the API of DoubletFinder versions before 2.0.4. Since DoubletFinder 2.0.4, released in 2023, the functions are named doubletFinder() and paramSweep(), without the _v3 suffix. Do not rely on the old names remaining available after an upgrade; check which functions the installed version exports, and use the new names in new code. The corresponding parameters (pN, pK, nExp) are unchanged.

Sources: GitHub: chris-mcginnis-ucsf/DoubletFinder README & release notes (v2.0.4, 2023).

Notes

Annotation: the common name for FCGR3A Mono

If the tutorial table lists "FCGR3A Mono", note that FCGR3A encodes CD16. The Seurat PBMC3K vignette labels this population FCGR3A+ Mono, the Azimuth PBMC reference labels it CD16 Mono, and immunology textbooks usually call it CD16+ monocyte or non-classical monocyte. The names (FCGR3A Mono / CD16+ Mono / non-classical Mono) refer to the same population in this context and are used interchangeably in the literature; in reports, CD16+ Monocyte is preferable because it matches the convention of the immunology community.

Sources: Hao Y et al. (2021) Integrated analysis of multimodal single-cell data, Cell 184:3573 (Azimuth PBMC reference); Seurat PBMC3K vignette.

Notes

Annotation: subdividing the DC markers

If the tutorial table labels DCs with FCER1A and CST3, refine as follows: (1) CST3 is co-expressed broadly across myeloid cells, monocytes, and DCs, so its specificity is low; (2) FCER1A mainly marks cDC2 (conventional DC type 2); (3) common cDC1 markers are CLEC9A, XCR1, and BATF3; (4) common pDC (plasmacytoid DC) markers are LILRA4, IL3RA (CD123), and CLEC4C; (5) AS-DC, the new subset described by Villani 2017 Science, is marked by AXL and SIGLEC6. For fine-grained annotation of DC subsets, use the Azimuth PBMC reference or the CellTypist immune model alongside these markers.

Sources: Domínguez Conde C et al. (2022) Cross-tissue immune cell analysis reveals tissue-specific features in humans, Science 376:eabl5197. DOI: 10.1126/science.abl5197; Villani AC et al. (2017) Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors, Science 356:eaah4573.

Notes

DE: the current consensus on cross-condition comparison

If the tutorial only mentions FindMarkers(test.use="wilcox") for cross-condition comparison, update it. Squair et al. 2021 Nat Commun, "Confronting false discoveries in single-cell differential expression", and Murphy & Skene 2022 Nat Commun both show that: (1) aggregating cluster × sample into pseudobulk and then testing with edgeR-LRT or DESeq2-LRT controls FDR clearly better than running Wilcoxon / MAST directly on single cells, because the latter treat cells from the same sample as independent replicates and inflate the type I error; (2) with too few samples (donors / replicates; n < 4 per group), pseudobulk is also unreliable and results need caution; (3) for rare cell types or strong within-sample zero inflation, a MAST hurdle model with a random effect can be used as a complement. In practice: (a) DE across donors / conditions must use pseudobulk; (b) reports must state the number of samples, the pseudobulk strategy, and the test used.

Sources: Squair JW et al. (2021) Confronting false discoveries in single-cell differential expression, Nat Commun 12:5692. DOI: 10.1038/s41467-021-25960-2; Murphy AE, Skene NG (2022) A balanced measure shows superior performance of pseudobulk methods in single-cell RNA-sequencing analysis, Nat Commun 13:7851. DOI: 10.1038/s41467-022-35519-4.
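The pseudobulk step itself is a grouped sum of raw counts. The sketch below shows only that aggregation in plain Python; the statistical test (edgeR or DESeq2) is then run on the aggregated table and is not reproduced here. The cell records and the function name pseudobulk are hypothetical.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts over all cells sharing the same (cluster, sample) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["cluster"], cell["sample"])
        for gene, count in cell["counts"].items():
            totals[key][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}

# Hypothetical cells with raw (not normalized) counts.
cells = [
    {"cluster": "B", "sample": "donor1", "counts": {"MS4A1": 3, "CD79A": 1}},
    {"cluster": "B", "sample": "donor1", "counts": {"MS4A1": 2}},
    {"cluster": "B", "sample": "donor2", "counts": {"MS4A1": 4, "CD79A": 2}},
    {"cluster": "NK", "sample": "donor1", "counts": {"NKG7": 5}},
]
bulk = pseudobulk(cells)
```

Each (cluster, sample) profile then counts as one observation, so the number of replicates in the test equals the number of donors rather than the number of cells.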