normalization — scRNA-seq Tutorial

Why

1. Why Is Normalization Needed?

Even identical cells can yield different read counts during an experiment because of the technical factors below. The core task of normalization is to correct these biases so that downstream analysis stays accurate.

📊

Sequencing depth

Different cells receive different total read counts from the sequencer, which shifts the baseline signal intensity of each cell. One cell might get 5,000 reads while another gets 50,000; that difference is purely technical.

🧲

Capture efficiency

The efficiency of reverse transcription to cDNA and of library preparation varies randomly from cell to cell, so not every RNA molecule is successfully captured and measured.

💧

Droplet size

In microfluidics-based technologies such as 10x Genomics, each droplet may contain a slightly different volume of reagents, which affects reaction efficiency.

🧬

RNA content differences

Cells of different sizes or types (for example, a quiescent lymphocyte versus an actively dividing tumor cell) carry different absolute amounts of total RNA. This is both a technical and a biological issue.
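To make the sequencing-depth effect concrete, here is a minimal sketch using the 5,000 and 50,000 read totals mentioned above. The four-gene expression profile is a made-up illustration, not real data.

```python
import numpy as np

# Hypothetical expression profile shared by both cells (fractions sum to 1).
profile = np.array([0.5, 0.3, 0.15, 0.05])

# Same biology, different sequencing depth (totals taken from the text above).
shallow_cell = profile * 5_000
deep_cell = profile * 50_000

# Raw counts differ tenfold for every gene...
fold_difference = deep_cell / shallow_cell

# ...but dividing each cell by its own total recovers the same proportions.
shallow_fraction = shallow_cell / shallow_cell.sum()
deep_fraction = deep_cell / deep_cell.sum()

print(fold_difference)
print(np.allclose(shallow_fraction, deep_fraction))
```

The raw counts suggest a tenfold expression difference that disappears once each cell is expressed relative to its own total, which is the intuition behind depth normalization.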

Interactive Simulation

2. Normalization Methods and Data Distributions

Each normalization method changes the mathematical distribution of the data in its own way. Single-cell data is extremely sparse (full of zeros), which has exposed the limits of traditional methods and driven the development of more advanced algorithms.
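The sparsity mentioned above can be illustrated with a small simulation. This is only a sketch: the Poisson model, its mean of 0.2, and the matrix size are assumptions chosen for illustration, not properties of any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the toy data is reproducible

# Toy count matrix (cells x genes) drawn from a Poisson with a low mean,
# an illustrative stand-in for sparse scRNA-seq counts.
counts = rng.poisson(lam=0.2, size=(200, 500))

zero_fraction = np.mean(counts == 0)

# log1p maps 0 to 0, so the transform compresses large values
# without changing how many entries are zero.
logged = np.log1p(counts)
zero_fraction_after = np.mean(logged == 0)

print(f"zeros before: {zero_fraction:.2%}, after log1p: {zero_fraction_after:.2%}")
```

Because a log(x+1) transform leaves zeros untouched, it reshapes the non-zero part of the distribution but does nothing about the sparsity itself.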

Method Comparison

3. Key Feature Comparison Matrix

The table below compares the core logic and typical use cases of the three main normalization methods.

Feature	LogNormalize	Scran	SCTransform
Core logic	Linear scaling + log transform	Pool-based deconvolution	Regularized NB regression residuals
Speed	⚡ Very fast	🔵 Moderate	🐢 Slower
Zero handling	Fair	Better	Best
Use case	Quick exploration, simple datasets	Large RNA content variation across cells	Formal analysis (currently recommended)
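As a rough illustration of the "linear scaling + log transform" logic in the LogNormalize column, the following NumPy sketch reimplements the idea on a toy matrix. The function name `log_normalize` is hypothetical, and this is not Seurat's or Scanpy's actual code; the scale factor of 10,000 matches the value used in this tutorial.

```python
import numpy as np

def log_normalize(counts, scale_factor=10_000):
    """Divide each cell by its total, multiply by scale_factor, apply log(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * scale_factor)

# Two toy cells (rows) with the same composition but a tenfold depth difference.
counts = np.array([[10, 0, 30, 60],
                   [100, 0, 300, 600]])
normalized = log_normalize(counts)
print(np.allclose(normalized[0], normalized[1]))
```

After the transform the two cells are indistinguishable, and zero counts remain exactly zero.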

🌳 Which method should I use?

Q1: Using Seurat? → Yes → The default LogNormalize is fine for initial exploration; SCTransform is recommended for formal analysis.

Q2: Using Scanpy? → Yes → The default is normalize_total + log1p (equivalent to LogNormalize when target_sum=1e4).

Q3: Does the data contain cells of extremely different sizes (e.g., mixed tumor and immune cells)? → Yes → Consider Scran's pooling deconvolution.

Workflow

4. Recommended Normalization Workflow

Initial filtering

Remove low-quality cells with a high mitochondrial percentage or too few detected genes.

Run SCTransform

Use a negative binomial regression model to remove technical noise: SCTransform(object, vars.to.regress = "percent.mt")

Dimensionality reduction and clustering

Use the Pearson residuals produced by SCTransform directly for PCA and clustering.
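To show what "Pearson residuals fed into PCA" looks like numerically, here is a simplified sketch. It swaps in analytic Pearson residuals under a negative binomial model with a fixed overdispersion `theta`, which is not the regularized per-gene regression that SCTransform performs. The value `theta=100`, the helper names, and the toy Poisson data are all assumptions for illustration.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Analytic Pearson residuals under a negative binomial with fixed theta.

    A simplified stand-in, not SCTransform's regularized regression.
    Assumes every cell and every gene has at least one count.
    """
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    # Expected count if a gene's share of reads were the same in every cell.
    mu = cell_totals * gene_totals / counts.sum()
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    # Clip extreme residuals to +/- sqrt(number of cells).
    clip = np.sqrt(counts.shape[0])
    return np.clip(residuals, -clip, clip)

def pca_scores(matrix, n_components=2):
    """Project cells onto the top principal components via SVD."""
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=2.0, size=(50, 30))  # toy cells x genes matrix
residuals = pearson_residuals(toy_counts)
scores = pca_scores(residuals)
print(residuals.shape, scores.shape)
```

The residuals already account for each cell's depth through the expected count `mu`, which is why they can go into PCA without a separate log-normalization step.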

FAQ

5. Frequently Asked Questions

Normalization corrects for sequencing depth differences so that cells share a common baseline. Scaling (e.g., ScaleData) transforms each gene to mean 0 and variance 1 (Z-score), mainly as input for PCA. They are separate steps.

Do not over-interpret normalized values. All normalization is, in essence, mathematical modeling and inference. Genes with extremely low expression in particular may produce false positives after transformation, so always validate results against the biological context.

Normalization cannot remove batch effects. It handles technical differences within a single sample; systematic biases across batches require dedicated integration methods (such as Harmony, RPCA, or scVI), covered in the Integration chapter.

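Code

6. Implementation Examples

# --- R (Seurat) ---
library(Seurat)

# Method 1: LogNormalize (default)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

# Method 2: SCTransform (recommended for formal analysis)
# Performs normalization + feature selection + scaling in one call;
# assumes percent.mt has already been added to the object metadata
pbmc <- SCTransform(pbmc, vars.to.regress = "percent.mt", verbose = FALSE)

# SCTransform v2 (faster and more stable)
pbmc <- SCTransform(pbmc, vst.flavor = "v2")

# --- Python (Scanpy) ---
import scanpy as sc

# LogNormalize equivalent
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()  # keep a copy of the log-normalized data

# Scran is an R/Bioconductor package, so calling it from Python requires an R bridge
# In R: sce <- scran::computeSumFactors(sce)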

💡

Normalization ≠ Scaling. Normalization corrects cell-level sequencing depth differences; scaling is a gene-level Z-score transformation. The two steps are complementary but distinct.
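The scaling half of this distinction can be sketched in a few lines of NumPy. The function name `scale_genes` and the toy values are illustrative assumptions, and the sketch uses the population standard deviation, so real tools may differ in that detail.

```python
import numpy as np

def scale_genes(normalized):
    """Z-score each gene (column): mean 0 and variance 1 across cells."""
    normalized = np.asarray(normalized, dtype=float)
    mean = normalized.mean(axis=0, keepdims=True)
    std = normalized.std(axis=0, keepdims=True)
    std[std == 0] = 1.0  # constant genes stay at 0 instead of dividing by zero
    return (normalized - mean) / std

# Toy log-normalized matrix: 4 cells x 3 genes (values are illustrative).
normalized = np.array([[0.0, 2.0, 5.0],
                       [1.0, 2.0, 7.0],
                       [2.0, 2.0, 6.0],
                       [3.0, 2.0, 8.0]])
scaled = scale_genes(normalized)
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Note that the operation runs per gene (down each column), whereas depth normalization runs per cell (across each row).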

📝 Self-Check

1. What are the three core steps of LogNormalize?

A. Subtract the mean, divide by the standard deviation, take the log

B. Divide by the cell's total counts, multiply by the scale factor (10000), take log(x+1)

C. Fit a negative binomial regression model and take each gene's residuals

D. Scale all genes to the 0–1 range