STEP 5 / 9

Principal Component Analysis (PCA)

Compress thousands of gene dimensions into tens of principal components, retaining the key variation while reducing noise.

The Role of PCA

Even with just 2000 highly variable genes (HVGs), the data is still 2000-dimensional, which is too high for clustering and visualization to work well (the "curse of dimensionality"). PCA finds the directions of maximum variance in the data (the principal components, PCs). The first N PCs typically capture most of the real biological signal, while the later PCs are mostly noise.
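
To make this concrete, here is a minimal NumPy sketch of PCA computed through an SVD. It is not how Seurat or Scanpy implement it internally. The matrix is synthetic, and the sizes (300 cells, 2000 genes, 5 hidden "programs") are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a scaled expression matrix: cells x genes.
# Assumption: the real signal comes from 5 latent programs; the rest is noise.
n_cells, n_genes, n_latent = 300, 2000, 5
latent = rng.normal(size=(n_cells, n_latent)) * 3.0
loadings = rng.normal(size=(n_latent, n_genes))
X = latent @ loadings + rng.normal(size=(n_cells, n_genes))

# PCA via SVD of the gene-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep the first n_pcs components: one row per cell, one column per PC.
n_pcs = 20
X_pca = U[:, :n_pcs] * S[:n_pcs]

# Fraction of total variance explained by each PC.
var_ratio = S**2 / np.sum(S**2)

print(X_pca.shape)                 # (300, 20)
print(np.round(var_ratio[:8], 3))  # the first 5 PCs dominate in this toy data
```

The number of cells is unchanged; only the gene axis is compressed, here from 2000 columns to 20.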

🗜️

Dimensionality reduction

Compress from ~2000 to 10–50 dimensions.

🧹

Denoising

Discarding low-variance PCs effectively filters out noise.

Speed-up

Clustering and UMAP run much faster on the reduced data.
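
As a rough illustration of the denoising claim above, the sketch below builds a synthetic low-rank signal, adds noise, and compares the error before and after keeping only the top PCs. All sizes and the noise level are assumptions. Real expression data will not separate this cleanly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption: a rank-4 "true" signal plus unit-variance Gaussian noise.
n_cells, n_genes, n_latent = 200, 500, 4
signal = rng.normal(size=(n_cells, n_latent)) @ rng.normal(size=(n_latent, n_genes))
noisy = signal + rng.normal(scale=1.0, size=signal.shape)

# PCA via SVD of the centered noisy matrix.
mean = noisy.mean(axis=0)
U, S, Vt = np.linalg.svd(noisy - mean, full_matrices=False)

def reconstruct(k):
    """Rebuild the matrix from the first k principal components."""
    return (U[:, :k] * S[:k]) @ Vt[:k] + mean

def mse_to_signal(X):
    """Mean squared error against the noise-free signal."""
    return float(np.mean((X - signal) ** 2))

err_noisy = mse_to_signal(noisy)               # error of the raw noisy data
err_top4 = mse_to_signal(reconstruct(4))       # keep only the top 4 PCs
err_all = mse_to_signal(reconstruct(len(S)))   # keep every PC: nothing is filtered

print(err_noisy, err_top4, err_all)
```

Keeping every PC reproduces the noisy matrix, so the error does not improve. Dropping the low-variance PCs removes most of the noise while the signal survives.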

Elbow Plot

The elbow plot shows how much variance each PC explains, ranked from PC1 onward. Look for the "elbow", the point where the curve starts to flatten, and use it as a starting choice for the number of PCs.

🌳 How to choose nPCs?

Method 1: Elbow Plot. Choose the point where the curve flattens, typically 10–20 PCs.
Method 2: JackStraw test. A permutation-based statistical test of each PC's significance (computationally slow).
Rule of thumb: 15–30 PCs are usually sufficient. Erring slightly high is safer than erring low: including a little noise costs less than missing real signal.
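
Reading the elbow by eye can be complemented with a simple heuristic. The sketch below picks the PC where the curve sits farthest below the straight line joining its first and last points, then adds a small safety margin in the spirit of the rule of thumb. This is one common heuristic, not the method used by Seurat or Scanpy. The function name `elbow_index`, the variance values, and the margin of 5 are all hypothetical.

```python
def elbow_index(values):
    """Return the 1-based index of the elbow in a decreasing curve.

    Heuristic: the elbow is the point with the largest vertical gap
    below the chord drawn from the first value to the last value.
    """
    n = len(values)
    if n < 3:
        return n
    first, last = values[0], values[-1]
    best_i, best_gap = 0, float("-inf")
    for i, y in enumerate(values):
        chord_y = first + (last - first) * i / (n - 1)
        gap = chord_y - y
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i + 1


# Hypothetical variance ratios for the first 12 PCs (illustrative only).
var_ratio = [0.20, 0.12, 0.08, 0.05, 0.03, 0.02,
             0.015, 0.012, 0.011, 0.010, 0.0095, 0.009]

elbow = elbow_index(var_ratio)
margin = 5  # assumption: keep a few extra PCs rather than risk losing signal
n_pcs = min(elbow + margin, len(var_ratio))
print(elbow, n_pcs)
```

The margin reflects the preference for choosing slightly too many PCs over too few. Treat the result as a starting point and check it against the plot.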

Implementation example

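# --- R (Seurat) ---
library(Seurat)
# Run PCA on the highly variable genes, then inspect the embedding and the elbow plot
pbmc <- RunPCA(pbmc, features = VariableFeatures(pbmc))
DimPlot(pbmc, reduction = "pca")
ElbowPlot(pbmc, ndims = 50)
# JackStraw (optional, slower); dims must cover every PC scored below
pbmc <- JackStraw(pbmc, dims = 30, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:30)
JackStrawPlot(pbmc, dims = 1:30)

# --- Python (Scanpy) ---
import scanpy as sc
# Compute 50 PCs, then plot the embedding and the variance ratio per PC
sc.tl.pca(adata, n_comps=50)
sc.pl.pca(adata)
sc.pl.pca_variance_ratio(adata, n_pcs=50)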

📝 Self-check

Which dimension does PCA primarily compress?

A. Reduce cell count (rows)
B. Compress gene dimensions (columns), from thousands to tens
C. Reduce to 2D for visualization
D. Remove batch effects