Step 16: Permutation, Bootstrap, GSEA — Statistical Inference Tutorial

概覽

為何在生信中重抽樣是必要技能

許多生物資料分析的統計量並沒有漂亮的閉合分布：基因集富集分數 (ES)、網絡中介中心度、scRNA-seq 軌跡距離、複雜的 ratio 與 trimmed mean——這些都很難套教科書的 t/F 分布。重抽樣用「資料本身」生成虛無分布或不確定度估計，繞過參數假設：

排列檢定：在 H₀ 下「標籤可交換」，重複洗牌組標籤，計算每次的統計量，比較觀察值與這些排列值——黃金標準的 FWER / FDR 控制。
Bootstrap：從原資料抽樣（with replacement），重複計算統計量得其抽樣分布——拿來算 SE、CI、bias。
GSEA：以「樣本標籤排列」對排序後的基因清單做 KS-like 富集檢定——是排列檢定在生物路徑層級的應用。

Many biological statistics lack a closed-form distribution: GSEA enrichment scores, network betweenness, scRNA-seq trajectory distances, complex ratios and trimmed means — none fit textbook t / F. Resampling uses the data itself to generate a null or an uncertainty estimate, bypassing parametric assumptions:

Permutation tests: under H₀ labels are exchangeable; shuffle labels repeatedly, recompute the statistic, compare the observed value to the permutation distribution — gold-standard FWER / FDR control.
Bootstrap: resample with replacement from the data; recompute the statistic to obtain its sampling distribution — gives SE, CI, bias.
GSEA: applies sample-label permutation to a KS-like enrichment score on a ranked gene list — permutation testing at the pathway level.

💡

選擇原則：檢定 H₀ → 排列；估計 SE / CI → bootstrap。兩者用法不可互換。Bootstrap 的 CI 並不保證 Type I 控制；排列的不確定度不能直接拿來當 SE。 How to choose: testing H₀ → permutation; estimating SE / CI → bootstrap. They are NOT interchangeable. A bootstrap CI does not guarantee Type I control; permutation does not directly give a SE.

核心概念

一、排列檢定與 Bootstrap

🔀

排列檢定 (permutation)

H₀：兩組來自同一分布，組別標籤可交換。
洗牌標籤 B 次，每次重算統計量 → 得到虛無分布。
p = (未來排列中 ≥ 觀察值的比例)。
對小樣本、非常態資料、自訂統計量都有效。
限制：n 太小 → 排列數有限 → p 無法低於 1/(B+1)。

H₀: both groups come from the same distribution; labels are exchangeable.
Shuffle labels B times, recompute the statistic → null distribution.
p = (fraction of permutations ≥ observed value).
Works for small samples, non-normal data, custom statistics.
Caveat: very small n → few unique permutations → p cannot go below 1/(B+1).

🥾

Bootstrap

從原資料抽 n 個（with replacement），重複 B 次。
得到統計量的經驗抽樣分布，可算 SE、percentile CI、BCa CI。
適用：複雜統計量（中介效應、變異成分、網絡度量）、無 SE 公式的場合。
需大量重抽樣（≥1000）；可能在尾端表現不佳。
對相依資料用 block bootstrap 或 cluster bootstrap。

Sample n observations with replacement from the data; repeat B times.
Yields the empirical sampling distribution: SE, percentile CI, BCa CI.
Use for: complex statistics (mediation, variance components, network metrics) where no SE formula exists.
Needs many resamples (≥1000); tail behaviour can be poor.
For dependent data use block or cluster bootstrap.

二、GSEA 與基因集檢定

把所有基因按某個統計量（log fold change、moderated t、signal-to-noise）排序，GSEA 的問題是：某條基因集 S（例如 Hallmark INTERFERON_ALPHA）的成員，是否「集中在排序的上端或下端」？

Running Enrichment Score (ES)：從上而下掃描排序清單，遇到 S 的成員就加分（依排序值加權），遇到非成員就減分。ES = 全程偏離 0 的最大絕對值。顯著性由樣本標籤排列得到——重新洗組別→重排基因→重算 ES→得 ES 虛無分布。

兩種對立的虛無：(a) self-contained（如 roast）—— H₀：S 中無基因有差異表現。(b) competitive（如 GSEA、camera）—— H₀：S 中的基因「比 S 外的基因更顯著」。生物學家通常想要 competitive，因為它回答「這條 pathway 比背景更被擾動嗎？」。

ORA（Over-Representation Analysis，hypergeometric / Fisher）用 DE 基因清單對基因集的交集做超幾何檢定——速度快，但忽略了排序、丟失了亞顯著的累積訊號，且不處理基因間相關性。camera / roast 校正了相關性；fgsea 用適應性排列大幅加速。

Rank all genes by a statistic (log fold change, moderated t, signal-to-noise). GSEA asks: do the members of a gene set S (e.g. Hallmark INTERFERON_ALPHA) cluster at the top or bottom of this ranking?

Running Enrichment Score (ES): walk the ranked list top to bottom; add to a running sum at each S member (weighted by rank value), subtract at non-members. ES is the maximum absolute deviation from 0. Significance comes from sample-label permutation — shuffle group labels → re-rank genes → recompute ES → build an ES null distribution.

Two competing nulls: (a) self-contained (e.g. roast) — H₀: no gene in S is differentially expressed. (b) competitive (e.g. GSEA, camera) — H₀: genes in S are no more differential than genes outside S. Biologists usually want the competitive null because it answers "is this pathway more perturbed than background?".

ORA (Over-Representation Analysis, hypergeometric / Fisher's exact) computes the overlap of a DE gene list with a set — fast, but ignores ranking, loses sub-significant accumulating signal, and ignores inter-gene correlation. camera / roast correct for correlation; fgsea uses adaptive permutation for huge speedups.

⚠️

樣本 vs 基因排列：GSEA 預設用樣本標籤排列——保留基因間相關性，是正確的虛無。基因排列（隨機洗基因 ID）會破壞 pathway 內基因間共變異，膨脹 ES 顯著性、產生大量假陽性。當樣本太少（n < 8）以致排列數不足時，採用 fgsea 的 gamma 近似或 roast 的旋轉檢定。

Sample vs gene permutation: GSEA's default is sample-label permutation — it preserves inter-gene correlation and is the correct null. Gene-set permutation (shuffling gene IDs) destroys within-pathway correlation, inflates ES significance, and floods false positives. When n is too small (n < 8) for enough sample permutations, use fgsea's gamma approximation or roast's rotation test.

三、計算捷徑

對 m=20,000 基因 × 1,000 排列 = 2×10⁷ 次測試，naive 實作會在 2026 年也跑很久。常用加速：

fgsea::fgsea：用 adaptive multilevel 排列，把 p < 10⁻⁶ 的 pathway 也能精準估計。
limma::camera：用 inter-gene correlation 解析調整 t 統計量，無須排列。
roast / mroast：用旋轉檢定（rotation test）取代排列——適合 n ≥ 3 的小樣本。
SAM (Significance Analysis of Microarrays)：對排列得到的虛無分布做 FDR——早期但仍實用。
permutation-based GWAS p-value：對表型排列以校正 LD 結構。

For m = 20,000 genes × 1,000 permutations = 2×10⁷ tests, naive code would still be slow in 2026. Standard accelerations:

fgsea::fgsea: adaptive multilevel permutation — accurately estimates p < 10⁻⁶ for top pathways.
limma::camera: analytic correction using inter-gene correlation — no permutation needed.
roast / mroast: rotation test replaces permutation — viable for n ≥ 3.
SAM: early FDR-via-permutation method, still useful.
Permutation GWAS p-values: phenotype permutation to retain LD structure.

互動模擬

GSEA：滾動 ES 與排列虛無

下方是合成的 500 基因排序清單，目標基因集（綠色標記）共 30 個成員，集中於上端。running ES 沿排序累積：紅曲線上揚＝集合的「命中」加分，下落＝「未命中」減分。調整集中強度 與 n_perm，右下小圖即時更新排列虛無，黑線為觀察 ES，紅虛線為 p < 0.05 閾值。

Below is a synthetic ranked list of 500 genes; the target gene set (green marks) has 30 members concentrated near the top. The running ES walks the list: red rises at "hits", falls at "misses". Adjust concentration and n_perm; the inset updates the permutation null in real time — black line is observed ES, red dashed line is the p = 0.05 quantile.

ES — p — B —

集中強度 0.80

B 500

主圖：running ES（紅）+ 命中位置（綠線） · 副圖：排列下 ES 虛無分布

程式碼

實作

# --- R --- 排列 / Bootstrap / GSEA

# 1) 自製排列檢定（兩組均值差）
obs <- mean(x[g=="A"]) - mean(x[g=="B"])
perm <- replicate(10000, {
  gp <- sample(g)
  mean(x[gp=="A"]) - mean(x[gp=="B"])
})
mean(abs(perm) >= abs(obs))             # two-sided p

# 2) Bootstrap CI of a complex statistic
library(boot)
b <- boot(data, statistic = function(d, i) cor(d[i,1], d[i,2]), R = 2000)
boot.ci(b, type = c("perc", "bca"))

# 3) fgsea — fast GSEA on ranked stats
library(fgsea); library(msigdbr)
pathways <- split(msigdbr_df$gene_symbol, msigdbr_df$gs_name)
ranked   <- setNames(res$stat, res$gene)               # named numeric
fg <- fgsea(pathways, stats = ranked, minSize=15, maxSize=500)
head(fg[order(fg$padj), ], 10)

# 4) limma::camera / roast — correlation-aware
library(limma)
idx <- ids2indices(pathways, rownames(v$E))
camera(v, idx, design, contrast = cont)
mroast(v, idx, design, contrast = cont, nrot = 9999)

# 5) ORA（hypergeometric / Fisher）
library(clusterProfiler)
enrichGO(de_genes, OrgDb="org.Hs.eg.db", ont="BP", pAdjustMethod="BH")

# --- Python ---
import numpy as np
from scipy import stats

# 1) scipy permutation_test — vectorised, exact when feasible
def mean_diff(a, b): return a.mean() - b.mean()
res = stats.permutation_test((x_A, x_B), mean_diff, n_resamples=10000,
                             alternative="two-sided", permutation_type="independent")
res.statistic, res.pvalue

# 2) Bootstrap CI
res = stats.bootstrap((x,), np.median, n_resamples=2000, method="BCa")
res.confidence_interval

# 3) GSEA via gseapy
import gseapy as gp
pre = gp.prerank(rnk=ranked_df, gene_sets="MSigDB_Hallmark_2020",
                 min_size=15, max_size=500, permutation_num=1000)
pre.res2d.sort_values("NOM p-val").head(10)

# 4) Hypergeometric ORA
gp.enrichr(gene_list=de_genes, gene_sets="GO_Biological_Process_2023")

# 5) Block permutation (for time-series / clustered data)
def block_perm(x, block_size, B):
    blocks = x.reshape(-1, block_size)
    out = []
    for _ in range(B):
        idx = np.random.permutation(blocks.shape[0])
        out.append(blocks[idx].ravel())
    return np.array(out)

🚫

常見錯誤：(1) 用基因排列跑 GSEA — 破壞相關性結構、假陽性爆增。(2) n=3 vs 3 跑只有 20 種唯一排列，卻報 p = 0.001 — 這個 p 不可能存在。(3) 把 bootstrap percentile CI 直接拿來宣稱「95% Type I 控制」— bootstrap 是估計工具，不是檢定。(4) ORA 用「padj < 0.05」的 DE 基因清單跑 — 對小研究幾乎沒結果；GSEA 用全部排序基因更穩健。 Common mistakes: (1) Using gene permutation in GSEA — destroys correlation, inflates false positives. (2) n=3 vs 3 has only 20 unique permutations, yet reporting p = 0.001 — that p is impossible. (3) Using a bootstrap percentile CI as a "95% Type I" claim — bootstrap estimates uncertainty, not error rate. (4) ORA over the "padj < 0.05" DE list — yields nothing in underpowered studies; GSEA over the full ranking is more robust.

📝 自我檢測

1. 為什麼樣本標籤排列得到的 GSEA p-value 比基因排列的更保守、更可信？

1. Why does sample-label permutation yield a more conservative and reliable GSEA p-value than gene-set permutation?

A. 樣本排列計算速度更慢A. Sample permutation runs more slowly

B. 樣本排列只測試 H₀ 的一部分B. Sample permutation only tests part of H₀

C. 樣本排列保留基因間的相關結構；基因排列破壞共變異，讓 ES 的虛無太窄、p 太小C. Sample permutation preserves inter-gene correlation; gene permutation destroys it, making the ES null too narrow and p-values too small

D. 樣本排列不需要 FDR 校正D. Sample permutation requires no FDR correction

2. 何種情境下，bootstrap 比排列檢定更合適？

2. When is the bootstrap preferable to a permutation test?

A. 要嚴格控制 Type I 錯誤率A. When strict Type I control is required

B. 要估計複雜統計量的 SE / CI（沒有自然「可交換標籤」可洗牌）B. When estimating SE / CI for a complex statistic (no natural exchangeable labels to permute)

C. 樣本量極小（n=3）C. With extremely small samples (n=3)

D. 兩者完全可互換D. The two are completely interchangeable

3. Competitive 基因集檢定（如 GSEA、camera）與 self-contained（如 roast）的本質差別？

3. The essential difference between competitive (GSEA, camera) and self-contained (roast) gene-set tests?

A. competitive 只能用排列；self-contained 必須用旋轉A. Competitive must use permutation; self-contained must use rotation

B. competitive 只適用單樣本；self-contained 適用兩樣本B. Competitive only fits one-sample designs; self-contained fits two-sample designs

C. competitive 問「S 中基因比 S 外更顯著嗎？」；self-contained 問「S 中至少一個基因有差異嗎？」C. Competitive asks "are S members more differential than non-S?"; self-contained asks "is any gene in S differentially expressed?"

D. competitive 不能校正基因間相關D. Competitive cannot correct for inter-gene correlation