Step 1: QC — scRNA-seq Tutorial

Overview

Why is QC the most critical first step?

In the scRNA-seq data analysis pipeline, quality control (QC) is a critical first step. Because of the limitations of single-cell capture technology and differences in cell state, raw data inevitably contains a large amount of noise, such as dead cells, ruptured droplets, and doublets.

Why does this matter? Without stringent QC, low-quality data distorts downstream dimensionality reduction, clustering, and differential expression analysis, and can lead to completely wrong biological conclusions. For example, dead cells may form a spurious cluster of their own because of their high mitochondrial fraction, and doublets may appear as an artificial "transitional" population on a UMAP plot.

💡

Core principle: There is no universal QC threshold. Best practice is to adjust thresholds to your species, tissue type (e.g., cardiomyocytes naturally have a high MT%), and sequencing technology. Always examine violin plots and scatter plots before setting thresholds.

Core metrics

I. Core cell-level QC metrics

Cell-level filtering aims to remove poor-quality cell barcodes. The three most important metrics are:

🔢

nFeature_RNA

Number of unique genes detected in a cell.
• Too low (<200): a lysed cell or an empty droplet that captured only a small amount of ambient RNA.
• Too high (>4000–6000): suggests a doublet. The genes of two cells are merged, so the detected gene count is abnormally high.

📊

nCount_RNA

Total number of transcript molecules detected in a cell, counted as UMIs (Unique Molecular Identifiers). It should correlate strongly and positively with nFeature_RNA: healthy cells lie along the diagonal of the nCount_RNA vs. nFeature_RNA scatter plot, and cells that deviate from this trend usually have quality problems. Abnormally high and abnormally low values both warrant attention.

🔋

percent.mt

One of the most important indicators of cell viability. When a cell ruptures, cytoplasmic mRNA leaks out, but mitochondrial RNA is enclosed by the organelle's own double membrane and tends to remain in the droplet. A high MT% therefore suggests a dead or dying cell.
• Common threshold: remove cells above a cutoff chosen somewhere between 5% and 20%, depending on the tissue.
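The three metrics above can be computed directly from a count matrix. The following is a minimal sketch in plain Python, assuming a small cells × genes matrix of UMI counts and human gene symbols (mitochondrial genes prefixed with "MT-"). The function name `cell_qc_metrics` and the toy data are made up for illustration and are not part of Seurat or Scanpy.

```python
def cell_qc_metrics(counts, gene_names, mt_prefix="MT-"):
    """Return nFeature_RNA, nCount_RNA and percent_mt for each cell (row)."""
    mt_cols = [j for j, g in enumerate(gene_names) if g.startswith(mt_prefix)]
    metrics = []
    for row in counts:
        n_count = sum(row)                        # total UMIs in the cell
        n_feature = sum(1 for c in row if c > 0)  # genes with at least one UMI
        mt_count = sum(row[j] for j in mt_cols)   # UMIs from mitochondrial genes
        pct_mt = 100.0 * mt_count / n_count if n_count > 0 else 0.0
        metrics.append({"nFeature_RNA": n_feature,
                        "nCount_RNA": n_count,
                        "percent_mt": pct_mt})
    return metrics

# Hypothetical toy data: 2 cells x 5 genes
genes = ["CD3E", "MS4A1", "LYZ", "MT-CO1", "MT-ND1"]
counts = [
    [5, 0, 2, 1, 0],  # mostly nuclear-encoded transcripts
    [0, 0, 1, 4, 3],  # mostly mitochondrial transcripts
]
metrics = cell_qc_metrics(counts, genes)
for m in metrics:
    print(m)
```

In this toy example both cells have the same nCount_RNA and nFeature_RNA, yet the second cell's percent_mt is far higher, which is why the three metrics are read together rather than one at a time.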

Interactive simulation

QC scatter plot simulator

[Interactive widget, not available in this text version.] Sliders set the minimum gene count (default 200), the maximum gene count (default 4000), and the maximum MT% (default 10%). Green points are cells that pass QC; red points are cells that are removed. Healthy cells should lie along the diagonal; cells that deviate from this trend usually have quality problems.

X: nCount_RNA | Y: nFeature_RNA | Point size: MT%
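The rule behind the simulator is a simple conjunction of three thresholds. As a rough sketch, assuming each cell is a dictionary of precomputed metrics, the pass/remove decision could look like this. The function `passes_qc` and the four example cells are hypothetical, and the defaults mirror the slider defaults above.

```python
def passes_qc(cell, min_genes=200, max_genes=4000, max_mt=10.0):
    """Return True when a cell lies inside all three thresholds."""
    return (min_genes < cell["nFeature_RNA"] < max_genes
            and cell["percent_mt"] < max_mt)

# Hypothetical cells illustrating each failure mode
cells = [
    {"id": "healthy", "nFeature_RNA": 1500, "nCount_RNA": 5200, "percent_mt": 3.1},
    {"id": "empty_droplet", "nFeature_RNA": 120, "nCount_RNA": 300, "percent_mt": 2.0},
    {"id": "possible_doublet", "nFeature_RNA": 5200, "nCount_RNA": 30000, "percent_mt": 4.0},
    {"id": "dying", "nFeature_RNA": 900, "nCount_RNA": 2500, "percent_mt": 24.0},
]

labels = {c["id"]: ("pass" if passes_qc(c) else "remove") for c in cells}
print(labels)
```

Raising `max_mt` (for example for cardiac tissue) would keep the "dying" cell, which is the kind of trade-off the sliders are meant to make visible.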

Gene level

II. Gene-level QC

Besides filtering cells, we also need to remove genes with no analytical value, which reduces both the computational burden and statistical noise.

Low-expression gene filtering: genes detected in fewer than 3 cells are typically removed. Such genes may be technical artifacts, or they are expressed so sparsely that they cannot support any statistically meaningful result in downstream analyses.
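The "fewer than 3 cells" rule can be sketched in a few lines of plain Python. This is an illustrative sketch only: `filter_genes` here is a hypothetical helper operating on a list-of-lists matrix, not the Scanpy function of the same name, and the gene names are placeholders.

```python
def filter_genes(counts, gene_names, min_cells=3):
    """Keep only genes detected (count > 0) in at least min_cells cells."""
    n_genes = len(gene_names)
    cells_per_gene = [sum(1 for row in counts if row[j] > 0)
                      for j in range(n_genes)]
    keep = [j for j in range(n_genes) if cells_per_gene[j] >= min_cells]
    kept_names = [gene_names[j] for j in keep]
    kept_counts = [[row[j] for j in keep] for row in counts]
    return kept_counts, kept_names

# Hypothetical toy data: 4 cells x 3 genes
gene_names = ["GeneA", "GeneB", "GeneC"]
counts = [
    [3, 1, 0],
    [2, 0, 1],
    [5, 0, 2],
    [1, 4, 1],
]
kept_counts, kept_names = filter_genes(counts, gene_names, min_cells=3)
print(kept_names)  # GeneB is detected in only 2 cells and is dropped
```

Note that the filter counts the number of cells in which a gene is detected, not its total UMI count, so a gene with a few high counts in two cells is still removed.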

🧬 Ribosomal gene fraction

An abnormally low fraction sometimes suggests poor cell quality, but this metric is less widely applicable than MT%. It is usually calculated over genes whose symbols start with RPS or RPL. Some pipelines selectively remove or flag these genes, but this is not standard practice.

🩸 Hemoglobin (red blood cell) gene fraction

For blood or vascular-rich tissues, cells with very high HBA/HBB expression are usually filtered out unless red blood cells are the subject of the study. These barcodes may be contaminating red blood cell fragments, and they interfere with downstream clustering.
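Both fractions above follow the same pattern as MT%: sum the counts of genes matching a set of symbol prefixes and divide by the cell's total. A minimal sketch, assuming human gene symbols and a hypothetical helper `percent_by_prefix`; matching by prefix is a simplification, since a prefix can also catch unrelated or pseudogene symbols.

```python
def percent_by_prefix(row, gene_names, prefixes):
    """Percentage of a cell's counts that come from genes with the given prefixes."""
    total = sum(row)
    if total == 0:
        return 0.0
    hits = sum(c for c, g in zip(row, gene_names)
               if g.startswith(tuple(prefixes)))
    return 100.0 * hits / total

# Hypothetical barcode dominated by hemoglobin transcripts
gene_names = ["RPS6", "RPL13", "HBB", "HBA1", "CD3E"]
row = [10, 10, 60, 15, 5]

percent_ribo = percent_by_prefix(row, gene_names, ("RPS", "RPL"))
percent_hb = percent_by_prefix(row, gene_names, ("HBA", "HBB"))
print(percent_ribo, percent_hb)
```

A barcode like this one, with most of its counts in HBA/HBB genes, is the kind that would typically be flagged in a non-erythroid study.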

Advanced QC

III. Advanced QC: handling technical artifacts

Basic threshold filtering cannot solve every problem. Even after extreme values are removed, artifacts that are hard to recognize remain in the data. Modern analyses therefore add the following advanced steps:

👯 Doublet detection

Two cells captured in the same droplet are called a doublet. Doublets form artificial "transitional states" in UMAP clustering. A simple upper limit on nCount/nFeature cannot fully remove doublets made of two small cells, because their metrics may be only slightly above the normal range.

Solution: algorithms simulate doublet expression profiles by randomly merging the expression profiles of two real cells, then score each cell by how similar it is to these simulated doublets (the doublet score). High-scoring cells are flagged as suspected doublets and removed.

Common tools:

DoubletFinder (R): simulates doublets in PCA space and computes a pANN score
Scrublet (Python): similar logic, fast, suitable for large datasets
scDblFinder (R/Bioconductor): classifier-based (the original describes it as a random forest), high accuracy
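The simulate-then-score idea can be illustrated with a toy example. This is a simplified sketch, not the algorithm of any of the tools listed: it merges raw counts of chosen cell pairs, compares library-size-normalized profiles with Euclidean distance (real tools work in PCA space on many more cells), and scores each real cell by the fraction of simulated doublets among its k nearest neighbours. The pairs are hard-coded so the result is reproducible, whereas real tools draw them at random; all names and data are hypothetical.

```python
import math

def normalize(profile):
    """Scale a count vector so it sums to 1."""
    total = sum(profile)
    return [c / total for c in profile]

def simulate_doublets(cells, pairs):
    """Create artificial doublets by summing the counts of each (i, j) pair."""
    return [[a + b for a, b in zip(cells[i], cells[j])] for i, j in pairs]

def doublet_scores(cells, simulated, k=3):
    """Fraction of simulated doublets among each real cell's k nearest neighbours."""
    real = [normalize(c) for c in cells]
    sims = [normalize(s) for s in simulated]
    scores = []
    for i, query in enumerate(real):
        neighbours = [(math.dist(query, r), False)
                      for j, r in enumerate(real) if j != i]
        neighbours += [(math.dist(query, s), True) for s in sims]
        neighbours.sort(key=lambda pair: pair[0])
        scores.append(sum(is_sim for _, is_sim in neighbours[:k]) / k)
    return scores

# Hypothetical data: 4 genes; cells 0-3 are type A, 4-7 are type B, 8 is an A+B doublet
cells = [
    [10, 8, 0, 0], [12, 9, 1, 0], [9, 8, 0, 1], [11, 10, 0, 0],
    [0, 0, 10, 9], [1, 0, 12, 10], [0, 1, 9, 9], [0, 0, 11, 11],
    [10, 9, 10, 10],
]
# Hard-coded pairs for reproducibility (real tools sample pairs at random)
pairs = [(0, 4), (1, 5), (2, 6), (3, 7), (0, 1), (4, 5)]

simulated = simulate_doublets(cells, pairs)
scores = doublet_scores(cells, simulated, k=3)
for idx, s in enumerate(scores):
    print(idx, round(s, 2))
```

In this toy setup the mixed-profile cell (index 8) is surrounded by simulated A+B doublets and scores highest, while cells inside a pure cluster score low because most of their neighbours are real cells. The cutoff on the score is a separate decision that the sketch leaves out.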

🫧 Ambient background RNA (soup)

RNA from ruptured cells is released into the suspension and is encapsulated into droplets together with intact cells. As a result, a cell that should not express a gene appears to express it, for example T cells showing the hepatocyte-specific gene ALB. This contamination is called "soup."

Solution: use empty droplets (droplets that contain no cell) to estimate the background RNA distribution, then subtract this contamination from each cell's expression profile.

Common tools:

SoupX (R): uses known marker genes to estimate the contamination fraction
CellBender (Python/GPU): deep learning approach that removes ambient RNA and empty droplets at the same time
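The estimate-then-subtract idea can be sketched as follows. This is a deliberately simplified illustration, not the SoupX or CellBender algorithm: the soup profile is the pooled gene fractions of the empty droplets, the contamination fraction `rho` is assumed to be known (real tools estimate it), and the expected soup counts are subtracted gene by gene with a floor at zero. All names and numbers are hypothetical.

```python
def soup_profile(empty_droplets):
    """Pooled fraction of ambient counts contributed by each gene."""
    n_genes = len(empty_droplets[0])
    gene_totals = [sum(d[j] for d in empty_droplets) for j in range(n_genes)]
    grand_total = sum(gene_totals)
    return [t / grand_total for t in gene_totals]

def subtract_soup(cell, profile, rho):
    """Remove the expected ambient counts, given contamination fraction rho."""
    n_umi = sum(cell)
    corrected = []
    for observed, fraction in zip(cell, profile):
        expected_soup = rho * n_umi * fraction
        corrected.append(max(0.0, observed - expected_soup))
    return corrected

# Hypothetical genes: CD3E (T cell marker), ALB (hepatocyte marker), LYZ
empty_droplets = [[0, 8, 2], [1, 7, 2]]  # ambient RNA is dominated by ALB
t_cell = [90, 6, 4]                      # a T cell with a spurious ALB signal

profile = soup_profile(empty_droplets)
corrected = subtract_soup(t_cell, profile, rho=0.08)
print(profile)
print(corrected)  # the ALB count drops to roughly zero; CD3E is barely changed
```

The design point the sketch tries to show is that the correction is gene-specific: genes abundant in the soup lose the most counts, while the cell's own markers are almost untouched.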

Decision guide

IV. QC threshold strategy and best practices

🌳 Threshold strategy decision flow

Q1: Do you have multiple samples or batches? → Yes → Use adaptive (MAD-based) thresholds so that each sample's thresholds are derived from its own distribution.

Q2: Does your tissue have an unusual MT% profile (e.g., cardiomyocytes have a naturally high MT%)? → Yes → Do not apply a generic 5% threshold; raise it according to the tissue's biology.

Q3: Are you interested in rare cell populations? → Yes → Err on the lenient side; overly strict filtering may remove quiescent or small cells that are mistaken for low-quality cells.

Q4: Is this your first pass through the data? → Yes → Use lenient thresholds for an initial clustering; if a cluster consists entirely of high-MT% cells and lacks specific markers, go back and remove it at the QC step.

Strategy	Fixed thresholds	Adaptive thresholds
Concept	Fixed, subjectively chosen values (e.g., MT% < 5%)	Data-driven, based on the observed distribution
Method	Empirical values from the literature	Median ± 3 × MAD
Tools	Seurat (manual)	scater `isOutlier()`
Best for	Single sample, quick exploration	Multi-sample integration, formal analysis
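The "median ± 3 × MAD" rule from the table can be sketched with the standard library alone. This is an illustrative sketch, not the scater implementation: the helper `mad_bounds` is hypothetical, and the 1.4826 scaling factor (which makes the MAD comparable to a standard deviation for normally distributed data) is an assumption about the convention in use. In practice count-based metrics are often log-transformed first, and the bounds are computed per sample.

```python
import statistics

def mad_bounds(values, nmads=3.0, scale=1.4826):
    """Return (lower, upper) bounds at median -/+ nmads * scaled MAD."""
    med = statistics.median(values)
    mad = scale * statistics.median([abs(v - med) for v in values])
    return med - nmads * mad, med + nmads * mad

# Hypothetical percent.mt values for one sample
percent_mt = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 25.0]

lower, upper = mad_bounds(percent_mt)
outliers = [v for v in percent_mt if v < lower or v > upper]
print(round(lower, 4), round(upper, 4), outliers)
```

For MT% only the upper bound is usually of interest, so a one-sided check (`v > upper`) would be the more typical choice; the two-sided form is shown because it applies equally to nFeature_RNA and nCount_RNA.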

⚠️

Expert tips:
① There are no universal thresholds.
② Joint distributions (scatter plots) are more informative than any single metric: healthy cells lie along the diagonal.
③ Prefer slightly lenient filtering over over-filtering: overly strict QC may remove rare subpopulations.
④ Treat QC as an iterative process: do a lenient initial clustering, then go back and tighten thresholds if needed.

Code

QC implementation example

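# R (Seurat): load 10X data
library(Seurat)
pbmc.data <- Read10X(data.dir = "filtered_feature_bc_matrix/")
# min.cells = 3 drops genes detected in fewer than 3 cells;
# min.features = 200 drops barcodes with fewer than 200 detected genes
pbmc <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200)

# Calculate MT% (human gene symbols; mouse mitochondrial genes start with "mt-")
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# Visualize first, then decide thresholds
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

# Filter (adjust based on what the plots show)
pbmc <- subset(pbmc,
  subset = nFeature_RNA > 200 & nFeature_RNA < 4000 & percent.mt < 10
)

# Advanced: doublet detection
# DoubletFinder needs a normalized object on which PCA has already been run
library(DoubletFinder)
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc)
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc)
# pK = 0.09 and the 5% expected doublet rate are example values; tune them per dataset.
# Newer DoubletFinder releases name this function doubletFinder() (no _v3 suffix).
pbmc <- doubletFinder_v3(pbmc, PCs = 1:15, pN = 0.25, pK = 0.09,
  nExp = round(0.05 * nrow(pbmc@meta.data)))

# Advanced: ambient RNA removal
# load10X() expects the top-level Cell Ranger output folder (the one holding both the
# raw and filtered matrices), not the raw matrix folder itself; "outs/" is a placeholder
library(SoupX)
sc <- load10X("outs/")
sc <- autoEstCont(sc)
out <- adjustCounts(sc)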

# Python (Scanpy + Scrublet)
import scanpy as sc
import scrublet as scr

# Load data
adata = sc.read_10x_mtx("filtered_feature_bc_matrix/", var_names="gene_symbols")

# Gene-level filtering: drop genes detected in fewer than 3 cells
sc.pp.filter_genes(adata, min_cells=3)

# Calculate QC metrics (human gene symbols; mitochondrial genes start with "MT-")
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Visualize first, then decide thresholds
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], multi_panel=True)
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts", color="pct_counts_mt")

# Filter (adjust based on the plots); .copy() turns the view into an independent object
adata = adata[(adata.obs.n_genes_by_counts > 200) &
              (adata.obs.n_genes_by_counts < 4000) &
              (adata.obs.pct_counts_mt < 10)].copy()

# Advanced: Scrublet doublet detection (expects raw counts, which adata.X still holds here)
scrub = scr.Scrublet(adata.X)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
adata.obs["doublet_score"] = doublet_scores
adata.obs["predicted_doublet"] = predicted_doublets
adata = adata[~predicted_doublets].copy()

⚠️

Visualize before setting thresholds! The numbers above (200 / 4000 / 10%) are only examples. In a real analysis, always run VlnPlot or sc.pl.violin and examine the distribution of your data before deciding.

📝 Self-check

1. A cell has extremely high nFeature_RNA and extremely high nCount_RNA. What is the most likely cause?

A. The cell is highly transcriptionally active

B. It is a normal cell that was sequenced unusually deeply

C. It is likely a doublet

D. Its mitochondrial gene fraction is too high

2. You are analyzing scRNA-seq data from cardiac tissue and find that many cells have an MT% between 8% and 15%. What is the best approach?

A. Still use 5% as the threshold and filter strictly

B. Raise the threshold to suit the tissue, because cardiomyocytes naturally have a high mitochondrial content

C. Remove all MT-related genes

D. Skip QC and go straight to downstream analysis

3. Which statement about doublet detection is correct?

A. Setting an upper limit on nFeature completely removes all doublets

B. Specialized algorithms (such as DoubletFinder or Scrublet) are needed to simulate doublets and compute a doublet score

C. Doublets occur only on the 10X Genomics platform

D. Doublets do not affect downstream analysis results