為何在生信中重抽樣是必要技能
許多生物資料分析的統計量並沒有漂亮的閉合分布:基因集富集分數 (ES)、網絡中介中心度、scRNA-seq 軌跡距離、複雜的 ratio 與 trimmed mean——這些都很難套教科書的 t/F 分布。重抽樣用「資料本身」生成虛無分布或不確定度估計,繞過參數假設:
- 排列檢定:在 H₀ 下「標籤可交換」,重複洗牌組標籤,計算每次的統計量,比較觀察值與這些排列值——黃金標準的 FWER / FDR 控制。
- Bootstrap:從原資料抽樣(with replacement),重複計算統計量得其抽樣分布——拿來算 SE、CI、bias。
- GSEA:以「樣本標籤排列」對排序後的基因清單做 KS-like 富集檢定——是排列檢定在生物路徑層級的應用。
Many biological statistics lack a closed-form distribution: GSEA enrichment scores, network betweenness, scRNA-seq trajectory distances, complex ratios and trimmed means — none fit textbook t / F. Resampling uses the data itself to generate a null or an uncertainty estimate, bypassing parametric assumptions:
- Permutation tests: under H₀ labels are exchangeable; shuffle labels repeatedly, recompute the statistic, compare the observed value to the permutation distribution — gold-standard FWER / FDR control.
- Bootstrap: resample with replacement from the data; recompute the statistic to obtain its sampling distribution — gives SE, CI, bias.
- GSEA: applies sample-label permutation to a KS-like enrichment score on a ranked gene list — permutation testing at the pathway level.
一、排列檢定與 Bootstrap
排列檢定 (permutation)
- H₀:兩組來自同一分布,組別標籤可交換。
- 洗牌標籤 B 次,每次重算統計量 → 得到虛無分布。
- p = (未來排列中 ≥ 觀察值的比例)。
- 對小樣本、非常態資料、自訂統計量都有效。
- 限制:n 太小 → 排列數有限 → p 無法低於 1/(B+1)。
- H₀: both groups come from the same distribution; labels are exchangeable.
- Shuffle labels B times, recompute the statistic → null distribution.
- p = (fraction of permutations ≥ observed value).
- Works for small samples, non-normal data, custom statistics.
- Caveat: very small n → few unique permutations → p cannot go below 1/(B+1).
Bootstrap
- 從原資料抽 n 個(with replacement),重複 B 次。
- 得到統計量的經驗抽樣分布,可算 SE、percentile CI、BCa CI。
- 適用:複雜統計量(中介效應、變異成分、網絡度量)、無 SE 公式的場合。
- 需大量重抽樣(≥1000);可能在尾端表現不佳。
- 對相依資料用 block bootstrap 或 cluster bootstrap。
- Sample n observations with replacement from the data; repeat B times.
- Yields the empirical sampling distribution: SE, percentile CI, BCa CI.
- Use for: complex statistics (mediation, variance components, network metrics) where no SE formula exists.
- Needs many resamples (≥1000); tail behaviour can be poor.
- For dependent data use block or cluster bootstrap.
二、GSEA 與基因集檢定
把所有基因按某個統計量(log fold change、moderated t、signal-to-noise)排序,GSEA 的問題是:某條基因集 S(例如 Hallmark INTERFERON_ALPHA)的成員,是否「集中在排序的上端或下端」?
Running Enrichment Score (ES):從上而下掃描排序清單,遇到 S 的成員就加分(依排序值加權),遇到非成員就減分。ES = 全程偏離 0 的最大絕對值。顯著性由樣本標籤排列得到——重新洗組別→重排基因→重算 ES→得 ES 虛無分布。
兩種對立的虛無:(a) self-contained(如 roast)—— H₀:S 中無基因有差異表現。(b) competitive(如 GSEA、camera)—— H₀:S 中的基因「比 S 外的基因更顯著」。生物學家通常想要 competitive,因為它回答「這條 pathway 比背景更被擾動嗎?」。
ORA(Over-Representation Analysis,hypergeometric / Fisher)用 DE 基因清單對基因集的交集做超幾何檢定——速度快,但忽略了排序、丟失了亞顯著的累積訊號,且不處理基因間相關性。camera / roast 校正了相關性;fgsea 用適應性排列大幅加速。
Rank all genes by a statistic (log fold change, moderated t, signal-to-noise). GSEA asks: do the members of a gene set S (e.g. Hallmark INTERFERON_ALPHA) cluster at the top or bottom of this ranking?
Running Enrichment Score (ES): walk the ranked list top to bottom; add to a running sum at each S member (weighted by rank value), subtract at non-members. ES is the maximum absolute deviation from 0. Significance comes from sample-label permutation — shuffle group labels → re-rank genes → recompute ES → build an ES null distribution.
Two competing nulls: (a) self-contained (e.g. roast) — H₀: no gene in S is differentially expressed. (b) competitive (e.g. GSEA, camera) — H₀: genes in S are no more differential than genes outside S. Biologists usually want the competitive null because it answers "is this pathway more perturbed than background?".
ORA (Over-Representation Analysis, hypergeometric / Fisher's exact) computes the overlap of a DE gene list with a set — fast, but ignores ranking, loses sub-significant accumulating signal, and ignores inter-gene correlation. camera / roast correct for correlation; fgsea uses adaptive permutation for huge speedups.
fgsea 的 gamma 近似或 roast 的旋轉檢定。fgsea's gamma approximation or roast's rotation test.三、計算捷徑
對 m=20,000 基因 × 1,000 排列 = 2×10⁷ 次測試,naive 實作會在 2026 年也跑很久。常用加速:
fgsea::fgsea:用 adaptive multilevel 排列,把 p < 10⁻⁶ 的 pathway 也能精準估計。limma::camera:用 inter-gene correlation 解析調整 t 統計量,無須排列。roast/mroast:用旋轉檢定(rotation test)取代排列——適合 n ≥ 3 的小樣本。- SAM (Significance Analysis of Microarrays):對排列得到的虛無分布做 FDR——早期但仍實用。
- permutation-based GWAS p-value:對表型排列以校正 LD 結構。
For m = 20,000 genes × 1,000 permutations = 2×10⁷ tests, naive code would still be slow in 2026. Standard accelerations:
fgsea::fgsea: adaptive multilevel permutation — accurately estimates p < 10⁻⁶ for top pathways.limma::camera: analytic correction using inter-gene correlation — no permutation needed.roast/mroast: rotation test replaces permutation — viable for n ≥ 3.- SAM: early FDR-via-permutation method, still useful.
- Permutation GWAS p-values: phenotype permutation to retain LD structure.
GSEA:滾動 ES 與排列虛無
下方是合成的 500 基因排序清單,目標基因集(綠色標記)共 30 個成員,集中於上端。running ES 沿排序累積:紅曲線上揚=集合的「命中」加分,下落=「未命中」減分。調整集中強度 與 n_perm,右下小圖即時更新排列虛無,黑線為觀察 ES,紅虛線為 p < 0.05 閾值。
Below is a synthetic ranked list of 500 genes; the target gene set (green marks) has 30 members concentrated near the top. The running ES walks the list: red rises at "hits", falls at "misses". Adjust concentration and n_perm; the inset updates the permutation null in real time — black line is observed ES, red dashed line is the p = 0.05 quantile.
主圖:running ES(紅)+ 命中位置(綠線) · 副圖:排列下 ES 虛無分布
實作
# --- R --- 排列 / Bootstrap / GSEA # 1) 自製排列檢定(兩組均值差) obs <- mean(x[g=="A"]) - mean(x[g=="B"]) perm <- replicate(10000, { gp <- sample(g) mean(x[gp=="A"]) - mean(x[gp=="B"]) }) mean(abs(perm) >= abs(obs)) # two-sided p # 2) Bootstrap CI of a complex statistic library(boot) b <- boot(data, statistic = function(d, i) cor(d[i,1], d[i,2]), R = 2000) boot.ci(b, type = c("perc", "bca")) # 3) fgsea — fast GSEA on ranked stats library(fgsea); library(msigdbr) pathways <- split(msigdbr_df$gene_symbol, msigdbr_df$gs_name) ranked <- setNames(res$stat, res$gene) # named numeric fg <- fgsea(pathways, stats = ranked, minSize=15, maxSize=500) head(fg[order(fg$padj), ], 10) # 4) limma::camera / roast — correlation-aware library(limma) idx <- ids2indices(pathways, rownames(v$E)) camera(v, idx, design, contrast = cont) mroast(v, idx, design, contrast = cont, nrot = 9999) # 5) ORA(hypergeometric / Fisher) library(clusterProfiler) enrichGO(de_genes, OrgDb="org.Hs.eg.db", ont="BP", pAdjustMethod="BH")
# --- Python --- import numpy as np from scipy import stats # 1) scipy permutation_test — vectorised, exact when feasible def mean_diff(a, b): return a.mean() - b.mean() res = stats.permutation_test((x_A, x_B), mean_diff, n_resamples=10000, alternative="two-sided", permutation_type="independent") res.statistic, res.pvalue # 2) Bootstrap CI res = stats.bootstrap((x,), np.median, n_resamples=2000, method="BCa") res.confidence_interval # 3) GSEA via gseapy import gseapy as gp pre = gp.prerank(rnk=ranked_df, gene_sets="MSigDB_Hallmark_2020", min_size=15, max_size=500, permutation_num=1000) pre.res2d.sort_values("NOM p-val").head(10) # 4) Hypergeometric ORA gp.enrichr(gene_list=de_genes, gene_sets="GO_Biological_Process_2023") # 5) Block permutation (for time-series / clustered data) def block_perm(x, block_size, B): blocks = x.reshape(-1, block_size) out = [] for _ in range(B): idx = np.random.permutation(blocks.shape[0]) out.append(blocks[idx].ravel()) return np.array(out)
📝 自我檢測
1. 為什麼樣本標籤排列得到的 GSEA p-value 比基因排列的更保守、更可信?
1. Why does sample-label permutation yield a more conservative and reliable GSEA p-value than gene-set permutation?
2. 何種情境下,bootstrap 比排列檢定更合適?
2. When is the bootstrap preferable to a permutation test?
3. Competitive 基因集檢定(如 GSEA、camera)與 self-contained(如 roast)的本質差別?
3. The essential difference between competitive (GSEA, camera) and self-contained (roast) gene-set tests?