Why is QC the most critical first step?
In the scRNA-seq data analysis pipeline, quality control (QC) is the critical first step. Because of the limits of single-cell capture technology and differences in cell state, raw data inevitably contains a large amount of noise: dead cells, ruptured cells, empty droplets, doublets, and so on.
Why does this matter? Without stringent QC, low-quality data severely interferes with downstream dimensionality reduction, clustering, and differential expression analysis, and can lead to completely wrong biological conclusions. For example, dead cells may form a spurious cluster of their own because of their high mitochondrial fraction, and doublets may appear as artificial "transitional" populations on a UMAP plot.
I. Core cell-level QC metrics
Cell-level filtering aims to remove poor-quality cell barcodes. The three core metrics are:
nFeature_RNA
Number of unique genes detected per cell.
• Too low (<200): a lysed cell or an empty droplet that captured only a small amount of ambient RNA.
• Too high (>4000–6000): suggests a doublet, because the genes of two cells are merged and the detected gene count is abnormally high.
nCount_RNA
Total number of transcript molecules (UMI, Unique Molecular Identifier, counts) per cell. It should correlate strongly and positively with nFeature_RNA: healthy cells lie along the diagonal of an nCount_RNA versus nFeature_RNA scatter plot, and cells that deviate from this trend usually have quality problems. Abnormally high and abnormally low values both warrant attention.
percent.mt
The most important indicator of cell viability and health. When a cell ruptures, cytoplasmic mRNA leaks out, but mitochondrial transcripts, enclosed by the organelle's own double membrane, tend to stay in the droplet. A high MT% therefore suggests a dead or dying cell.
• Common threshold: remove cells above a cutoff chosen somewhere between 5% and 20%, depending on the tissue.
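The three metrics above can be computed directly from a cell-by-gene count matrix. The following is a minimal NumPy sketch on a made-up toy matrix; the gene names and counts are hypothetical, and the `MT-` prefix assumes human gene symbols.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (hypothetical values).
gene_names = np.array(["MT-CO1", "MT-ND1", "CD3E", "LYZ", "ACTB"])
counts = np.array([
    [ 2,  1, 10,  0, 30],   # healthy-looking cell
    [40, 25,  1,  0,  4],   # high mitochondrial fraction
    [ 0,  0,  0,  1,  0],   # near-empty droplet
])

n_count = counts.sum(axis=1)                    # nCount_RNA: total UMIs per cell
n_feature = (counts > 0).sum(axis=1)            # nFeature_RNA: genes with at least one UMI
is_mt = np.char.startswith(gene_names, "MT-")   # assumed human mitochondrial prefix
percent_mt = 100 * counts[:, is_mt].sum(axis=1) / n_count

for i in range(counts.shape[0]):
    print(f"cell {i}: nCount={n_count[i]}, nFeature={n_feature[i]}, "
          f"percent.mt={percent_mt[i]:.1f}")
```

In this toy data the second cell has most of its UMIs in mitochondrial genes and the third has a single UMI, which is the kind of pattern the thresholds above are meant to catch.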
[Interactive figure: QC scatter plot simulator. X: nCount_RNA | Y: nFeature_RNA | point size: MT%. Green points pass QC and red points are removed as the threshold sliders move. Healthy cells lie along the diagonal; cells off that trend usually have quality problems.]
II. Gene-level QC
Besides filtering cells, we also remove genes with no analytical value, which reduces both computational burden and statistical noise.
Low-expression gene filtering: genes detected in fewer than 3 cells are typically removed. They are most likely sequencing artifacts, or are expressed at levels too low to reach statistical significance in any downstream analysis.
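As an illustration of the rule above, here is a small NumPy sketch that keeps only genes detected in at least `min_cells` cells. The function name `filter_genes` and the toy data are assumptions made for this example.

```python
import numpy as np

def filter_genes(counts, gene_names, min_cells=3):
    """Keep genes detected (count > 0) in at least `min_cells` cells."""
    cells_per_gene = (counts > 0).sum(axis=0)
    keep = cells_per_gene >= min_cells
    return counts[:, keep], gene_names[keep]

# Hypothetical matrix: 4 cells (rows) x 3 genes (columns).
gene_names = np.array(["GENE_A", "GENE_B", "GENE_C"])
counts = np.array([
    [5, 0, 1],
    [3, 0, 0],
    [8, 2, 0],
    [1, 0, 0],
])

filtered, kept = filter_genes(counts, gene_names, min_cells=3)
print(kept)   # only GENE_A is detected in 3 or more cells
```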
🧬 Ribosomal gene fraction
An abnormally low fraction sometimes suggests poor cell quality, but this metric is less universally applicable than MT%. It is usually calculated over genes whose names start with RPS or RPL. Some pipelines optionally remove or flag these genes, but this is not standard practice.
🩸 Red blood cell gene fraction
For blood or highly vascularized tissues, cells with very high HBA or HBB expression are usually filtered out unless red blood cells are the subject of the study. These barcodes may be contaminating red blood cell fragments that distort downstream clustering.
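Both fractions can be computed the same way as MT%, by matching gene names against a prefix pattern. The sketch below is illustrative only: the helper `percent_matching`, the regular expressions, and the toy counts are assumptions, and the patterns presume human gene symbols.

```python
import re
import numpy as np

def percent_matching(counts, gene_names, pattern):
    """Percentage of each cell's UMIs that come from genes matching a regex."""
    mask = np.array([re.match(pattern, g) is not None for g in gene_names])
    totals = counts.sum(axis=1)
    # np.maximum guards against division by zero for barcodes with no UMIs.
    return 100 * counts[:, mask].sum(axis=1) / np.maximum(totals, 1)

gene_names = ["RPS6", "RPL13", "HBA1", "HBB", "CD3E"]
counts = np.array([
    [20, 10,  0,  0, 70],   # lymphocyte-like cell
    [ 1,  1, 50, 46,  2],   # red-blood-cell-like barcode
])

percent_ribo = percent_matching(counts, gene_names, r"^RP[SL]")
percent_hb = percent_matching(counts, gene_names, r"^HB[AB]")
print(percent_ribo, percent_hb)
```

A barcode like the second row, dominated by hemoglobin transcripts, is the kind of cell this filter is meant to flag.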
III. Advanced QC: handling technical artifacts
Basic threshold filtering cannot solve every problem. Even after extreme values are removed, artifacts that are hard to spot remain in the data. Modern analyses add the following steps:
👯 Doublet detection
Two cells captured in the same droplet are called a doublet. Doublets form artificial "transitional states" in UMAP clustering. A simple upper limit on nCount/nFeature cannot fully remove doublets made of two small cells, because their metrics may be only slightly above the normal range.
Solution: an algorithm simulates doublet expression profiles by randomly merging the profiles of two real cells, then scores each cell by its similarity to these simulated doublets (the doublet score). High-scoring cells are flagged as suspected doublets and removed.
Common tools:
- DoubletFinder (R): simulates doublets, then computes a pANN score in PCA space
- Scrublet (Python): similar logic, fast, suitable for large datasets
- scDblFinder (R/Bioconductor): trains a gradient-boosted tree classifier on simulated doublets, high accuracy
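The simulate-then-score idea described above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not the algorithm of DoubletFinder, Scrublet, or scDblFinder: the function `doublet_scores`, its parameters, and the toy Poisson data are all assumptions, and distances are computed by brute force on log-normalized counts rather than in PCA space.

```python
import numpy as np

def doublet_scores(counts, n_sim=200, k=10, seed=0):
    """Fraction of simulated doublets among each observed cell's k nearest neighbours."""
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    # Simulate doublets by summing the counts of two distinct, randomly chosen cells.
    first = rng.integers(0, n_cells, size=n_sim)
    partner = (first + rng.integers(1, n_cells, size=n_sim)) % n_cells
    simulated = counts[first] + counts[partner]
    combined = np.vstack([counts, simulated]).astype(float)
    # Library-size normalise and log-transform (assumes every profile has UMIs).
    combined = np.log1p(combined / combined.sum(axis=1, keepdims=True) * 1e4)
    is_sim = np.concatenate([np.zeros(n_cells, dtype=bool), np.ones(n_sim, dtype=bool)])
    # Squared Euclidean distance from every observed cell to every profile.
    observed = combined[:n_cells]
    d2 = ((observed[:, None, :] - combined[None, :, :]) ** 2).sum(axis=2)
    d2[np.arange(n_cells), np.arange(n_cells)] = np.inf   # ignore self-matches
    nearest = np.argsort(d2, axis=1)[:, :k]
    return is_sim[nearest].mean(axis=1)

# Toy data: two cell types plus five planted doublets (one cell of each type summed).
rng = np.random.default_rng(1)
type_a = rng.poisson([30, 30, 30, 5, 5, 5], size=(50, 6))
type_b = rng.poisson([5, 5, 5, 30, 30, 30], size=(50, 6))
planted = type_a[:5] + type_b[:5]
counts = np.vstack([type_a, type_b, planted])

scores = doublet_scores(counts)
print("mean score, singlets:", scores[:100].mean())
print("mean score, planted doublets:", scores[-5:].mean())
```

Singlets do not score near zero here, because a simulated doublet built from two cells of the same type looks like a singlet after normalization. Real tools therefore calibrate a score threshold rather than treating any nonzero score as a doublet.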
🫧 Ambient background RNA (soup)
RNA from ruptured cells is released into the suspension and gets encapsulated into droplets along with intact cells. As a result, a cell can appear to express a gene it should not express at all, for example a T cell showing the hepatocyte-specific gene ALB. This contamination is called "soup."
Solution: estimate the background RNA distribution from empty droplets (droplets that contain no cell), then "subtract" that contamination from each cell's expression matrix.
Common tools:
- SoupX (R): uses known marker genes to estimate the contamination fraction
- CellBender (Python/GPU): deep learning approach that removes ambient RNA and empty droplets at the same time
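The estimate-and-subtract idea can be illustrated with a toy NumPy sketch. It is a simplification and not the actual procedure used by SoupX or CellBender: the function names and data are hypothetical, and the contamination fraction is simply assumed to be known (0.05 here), whereas real tools estimate it from the data.

```python
import numpy as np

def estimate_soup_profile(empty_droplet_counts):
    """Ambient expression profile: pooled gene fractions across empty droplets."""
    pooled = empty_droplet_counts.sum(axis=0).astype(float)
    return pooled / pooled.sum()

def subtract_soup(cell_counts, soup_profile, contamination_fraction):
    """Remove the expected ambient counts from each cell, clipping at zero."""
    n_umi = cell_counts.sum(axis=1, keepdims=True)
    expected_soup = contamination_fraction * n_umi * soup_profile
    return np.clip(cell_counts - expected_soup, 0, None)

genes = ["ALB", "CD3E", "ACTB"]
empty = np.array([[8, 0, 2], [6, 1, 3], [10, 0, 0]])   # hypothetical empty droplets
t_cell = np.array([[4, 40, 56]])                        # T cell with spurious ALB counts

profile = estimate_soup_profile(empty)
corrected = subtract_soup(t_cell, profile, contamination_fraction=0.05)
print(dict(zip(genes, corrected[0].round(2))))
```

Because ALB dominates the ambient profile in this toy data, nearly all of the T cell's ALB counts are attributed to soup, while CD3E is barely changed.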
IV. QC threshold strategies and best practices
🌳 Threshold strategy decision flow
| Strategy | Fixed thresholds | Adaptive thresholds |
|---|---|---|
| Concept | Fixed, subjectively chosen values (e.g., MT% < 5%) | Data-driven, based on the actual distribution |
| Method | Literature-based empirical values | Median ± 3 × MAD |
| Tools | Seurat (manual) | scater isOutlier() |
| Best for | Single sample, quick exploration | Multi-sample integration, formal analysis |
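The adaptive rule in the table (median ± 3 × MAD) can be sketched as follows. This is a minimal NumPy illustration, not the scater implementation: the function `mad_bounds` and the example values are assumptions, and the 1.4826 factor follows the common convention (also the default of R's `mad()`) of scaling the MAD to estimate a standard deviation.

```python
import numpy as np

def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds at median +/- n_mads * MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # Scale so the MAD is comparable to a standard deviation under normality.
    mad = 1.4826 * np.median(np.abs(values - median))
    return median - n_mads * mad, median + n_mads * mad

# Hypothetical percent.mt values for eight cells; the last one is an outlier.
percent_mt = np.array([2.1, 3.0, 2.7, 3.4, 2.9, 3.1, 2.5, 18.0])

lower, upper = mad_bounds(percent_mt)
keep = percent_mt <= upper   # for MT%, only the upper bound is of interest
print(f"upper bound: {upper:.2f}, cells kept: {keep.sum()} of {keep.size}")
```

The cutoff adapts to each sample's own distribution, which is why this approach suits multi-sample analyses better than a single fixed number.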
① No universal thresholds exist.
② Joint distributions (scatter plots) matter more than any single metric: healthy cells follow the diagonal.
③ Better slightly lenient than over-filtered: overly strict QC may remove rare subpopulations.
④ Treat QC as an iterative process: cluster with lenient thresholds first, then go back and tighten them if needed.
QC implementation example
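```r
# Load 10X data
library(Seurat)
pbmc.data <- Read10X(data.dir = "filtered_feature_bc_matrix/")
pbmc <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200)

# Calculate MT% ("^MT-" matches human gene symbols; mouse symbols use "^mt-")
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# Visualize FIRST, then decide thresholds
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

# Filter (adjust based on observations!)
pbmc <- subset(pbmc,
               subset = nFeature_RNA > 200 & nFeature_RNA < 4000 & percent.mt < 10)

# Advanced: doublet detection.
# DoubletFinder expects a normalized object with PCA already computed.
library(DoubletFinder)
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc)
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc)
# pK = 0.09 and the 5% doublet rate are example values; tune them for your data.
# Newer DoubletFinder releases rename this function doubletFinder().
pbmc <- doubletFinder_v3(pbmc, PCs = 1:15, pN = 0.25, pK = 0.09,
                         nExp = round(0.05 * nrow(pbmc@meta.data)))

# Advanced: ambient RNA removal.
# load10X() expects the top-level Cell Ranger output directory (the one that
# contains both raw_feature_bc_matrix/ and filtered_feature_bc_matrix/).
library(SoupX)
sc <- load10X(".")
sc <- autoEstCont(sc)
out <- adjustCounts(sc)  # corrected counts; usually computed before building the Seurat object
```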
```python
import scanpy as sc
import scrublet as scr

# Load data
adata = sc.read_10x_mtx("filtered_feature_bc_matrix/", var_names="gene_symbols")

# Gene-level filtering
sc.pp.filter_genes(adata, min_cells=3)

# Calculate QC metrics ("MT-" matches human gene symbols)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Visualize
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"])
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")

# Filter (copy so later steps work on real data rather than a view)
adata = adata[
    (adata.obs.n_genes_by_counts > 200)
    & (adata.obs.n_genes_by_counts < 4000)
    & (adata.obs.pct_counts_mt < 10)
].copy()

# Advanced: Scrublet doublet detection on the raw counts
scrub = scr.Scrublet(adata.X)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
adata.obs["doublet_score"] = doublet_scores
adata = adata[~predicted_doublets].copy()
```
Visualize before setting thresholds! The numbers above (200/4000/10%) are only examples. Always run VlnPlot or sc.pl.violin to examine your data distribution before deciding.
📝 Self-check
1. A cell has extremely high nFeature_RNA and extremely high nCount_RNA. What is the most likely cause?
2. You are analyzing scRNA-seq data from cardiac muscle tissue and find that many cells have MT% between 8% and 15%. What is the best approach?
3. Which statement about doublet detection is correct?