Step 14: Multiple Testing & FDR — Statistical Inference Tutorial

概覽

為什麼基因組規模的篩查必須校正？

假設你檢定 m = 20,000 個基因，全部在虛無假設下，仍會期待 m·α = 1,000 個 p < 0.05 的「顯著」結果。不校正的 p-value 在多重檢定下沒有意義。校正方法分兩條路：(1) FWER（Family-Wise Error Rate）控制「至少一個」假陽性的機率——非常嚴格，適合驗證型研究；(2) FDR（False Discovery Rate）控制「被宣告為顯著之中的假陽性比例」——更貼近發現型研究的需求。

本章介紹三個現代 FDR 標準：(a) Benjamini-Hochberg (BH) 程序——簡單、無分布假設、PRDS 相依下仍然有效；(b) Storey 的 pFDR / q-value——估計 π₀（真的虛無比例），比 BH 更有檢力；(c) Independent Hypothesis Weighting (IHW)——利用「均值表現量」等共變數，把檢力分配給更有信號的子集，仍保有 FDR 控制。

If you test m = 20,000 genes all under the null, you expect m·α = 1,000 "significant" results at p < 0.05. Unadjusted p-values are meaningless at genome scale. Two correction families exist: (1) FWER (Family-Wise Error Rate) controls the probability of any false positive — strict, suits confirmatory work; (2) FDR (False Discovery Rate) controls the proportion of false positives among rejections — better suited to discovery.

This chapter introduces three modern FDR standards: (a) Benjamini-Hochberg (BH) — simple, distribution-free, valid under PRDS dependence; (b) Storey's pFDR / q-value — estimates π₀ (true-null fraction), more powerful than BH; (c) Independent Hypothesis Weighting (IHW) — uses a covariate (e.g. mean expression) to redirect power toward signal-rich subsets while preserving FDR control.

💡

核心原則：FWER 答「我能否信任任何一個結果」；FDR 答「我可以容忍清單中 5% 是假的」。基因組研究幾乎一律選 FDR——因為「至少一個假陽性」在 20,000 個檢定下幾乎必然發生，卻不影響整體推論。 Core principle: FWER answers "can I trust any single result"; FDR answers "I'm willing to accept 5% of my hits being false". Genome-wide work almost universally chooses FDR — "at least one false positive" is nearly inevitable across 20,000 tests yet does not invalidate the overall list.

核心概念

一、FWER vs FDR 與 BH 程序

🛡️

FWER：Bonferroni / Holm

Bonferroni：在 α/m 下拒絕——最簡單但最保守。
Holm：步降程序——比 Bonferroni 嚴格更有檢力，仍控 FWER。
適用情境：驗證型研究、安全性報告、臨床終點。
代價：在 m=20,000 時幾乎無檢力，會錯失真實效應。

Bonferroni: reject at α/m — simplest but most conservative.
Holm: step-down — strictly more powerful than Bonferroni, still FWER-valid.
Best for: confirmatory work, safety endpoints, clinical primary outcomes.
Cost: at m=20,000 power collapses; real effects get missed.

📊

FDR：BH / BY / Storey q

BH (1995)：排序 p-value，找最大的 i 使 p₍ᵢ₎ ≤ i·q/m，拒絕前 i 個。
BY (2001)：對任意相依有效——乘以調和級數，較保守。
Storey q：估計 π₀ 後得 q-value——通常比 BH 多 10–30% 發現。
local FDR (lfdr)：每個特徵的「來自虛無」後驗機率。

BH (1995): sort p-values; find largest i with p₍ᵢ₎ ≤ i·q/m; reject top i.
BY (2001): valid under arbitrary dependence (× harmonic factor) — conservative.
Storey q: estimates π₀ → q-value; typically 10–30% more discoveries than BH.
local FDR (lfdr): per-feature posterior probability of being null.

⚖️

IHW：共變數加權

用獨立共變數（如基因平均表現量）對檢定加權。
有資料的子集 → 更多 α 預算；低表現子集 → 較少預算。
共變數需與 p-value 在 H₀ 下獨立——否則破壞 FDR 控制。
DESeq2 預設用 IHW 替代 BH。

Weights tests by an independent covariate (e.g. mean expression).
Signal-rich strata get more α budget; low-expression strata less.
Covariate must be independent of p-value under H₀ — else FDR breaks.
DESeq2 ships with IHW as a drop-in replacement for BH.

🔗

相依性與 BH 的有效性

PRDS（正相依）下 BH 仍有效——大多數基因表現、GWAS LD 屬此類。
任意相依 → 用 BY，或permutation-based FDR。
強相關（基因網絡、長 LD 區塊）會讓 BH 略保守，但不會失效。
對極端相依結構，permutation 是黃金標準。

Under PRDS (positive dependence) BH remains valid — covers most expression / GWAS LD.
Arbitrary dependence → use BY or permutation-based FDR.
Strong correlation (networks, long LD blocks) makes BH slightly conservative, not invalid.
For extreme dependence, permutation is the gold standard.

⚠️

q-value ≠ p-value：q = 0.05 不代表「5% Type I 錯誤率」，而是「若我把這個基因列入，整批發現中假陽性的期望比例是 5%」。把 q-value 當 p-value 報告會誤導讀者。

q-value is NOT a p-value: q = 0.05 does NOT mean a 5% Type I error rate. It means "if I include this gene, the expected proportion of false positives among rejections is 5%". Reporting q-values as if they were p-values is misleading.

二、過濾、診斷與生信流程整合

p-value 直方圖是 FDR 最重要的診斷：H₀ 下 p-value 應為 Uniform(0,1)，所以直方圖應該平坦——加上 0 附近的尖峰（真實訊號）。如果出現「中段凸起」或「右側上升」，表示你的檢定本身偏離 H₀ 分布（批次效應、隱性共變數、模型錯置），這時做任何 FDR 校正都無效——必須先修檢定。

獨立過濾（independent filtering）：在做 FDR 前，移除與 p-value 在 H₀ 下獨立的低訊號特徵（如極低 count 的基因）——這會增加檢力但不破壞 FDR。但若過濾準則基於檢定統計量本身（如先取 |LFC| > 1 才做 FDR），則 FDR 失效。IHW 把過濾的精神原則化：用共變數加權，理論完備。

生信標準流程：RNA-seq DE 用 BH 或 IHW（DESeq2 預設）；GWAS 用 5×10⁻⁸ 的 Bonferroni 門檻；MS 蛋白質體（FragPipe / MaxQuant）用 target-decoy FDR；ChIP-seq peak calling 用 MACS2 的 q-value；single-cell DE 用每組獨立的 BH。

The p-value histogram is the single most important FDR diagnostic: under H₀ p is Uniform(0,1), so the histogram should be flat with a spike near 0 (real signal). A bump in the middle or rising right tail means your test itself deviates from its null distribution (batch effects, hidden covariates, misspecified model) — no amount of FDR correction can fix this; fix the test first.

Independent filtering: remove low-signal features whose distribution is independent of the p-value under H₀ (e.g. very-low-count genes) before applying FDR — this boosts power without invalidating FDR. But filtering on the test statistic itself (e.g. first keeping |LFC| > 1 and then computing FDR) breaks FDR. IHW formalises filtering into principled covariate weighting.

Standard bioinfo pipelines: RNA-seq DE uses BH or IHW (DESeq2 default); GWAS uses the 5×10⁻⁸ Bonferroni threshold; MS proteomics (FragPipe / MaxQuant) uses target-decoy FDR; ChIP-seq peak calling uses MACS2 q-values; single-cell DE applies BH within each cluster.

互動模擬

BH 程序視覺化

下方散布圖中，每點是一個基因的 p-value 與排序 i 的對應位置。紅線是 i·q/m——BH 接受所有低於此線的點。拖動滑桿觀察：q 增加，紅線陡升，更多點被接受；π₀ 降低（更多基因是真實訊號），p-value 集中在左下角，BH 與 Storey 的差距變大。

Each dot is a gene at rank i with its p-value. The red line is i·q/m — BH rejects every dot below it. Slide q upward and the threshold rises; lower π₀ (more truly-non-null genes) packs p-values into the lower-left, widening the gap between BH and Storey.

BH — Storey — Bonferroni —

q 門檻 0.05

π₀ 0.80

m 2000

X: 排序 i | Y: p-value | 紅線：BH 閾值 i·q/m

程式碼

實作：BH / Storey / IHW

# --- R --- 多重檢定校正
# p 是長度 m 的 p-value 向量；res 是 DESeq2 結果

# 1) 基本 BH / Holm / Bonferroni
padj_bh   <- p.adjust(p, method="BH")      # FDR
padj_holm <- p.adjust(p, method="holm")    # FWER
padj_bon  <- p.adjust(p, method="bonferroni")
sum(padj_bh < 0.05)                          # 發現數

# 2) Storey q-value（更高檢力）
library(qvalue)
qv <- qvalue(p)
qv$pi0                                          # 估計的真實虛無比例
sum(qv$qvalues < 0.05)

# 3) IHW —— 用共變數加權 (DESeq2 預設)
library(IHW); library(DESeq2)
ihw_res <- ihw(pvalue ~ baseMean, data=res, alpha=0.05)
rejections(ihw_res)
plot(ihw_res)                                  # 看各組權重分布

# 4) p-value 直方圖診斷（最重要的 sanity check）
hist(p, breaks=50, main="p-value histogram")
# 理想：左側尖峰 + 右側平坦; 中段凸起 = 模型有問題

# --- Python ---
import numpy as np, matplotlib.pyplot as plt
from statsmodels.stats.multitest import multipletests

# 1) BH / Holm / Bonferroni
rej, padj, _, _ = multipletests(p, alpha=0.05, method="fdr_bh")
np.sum(rej)
_, padj_holm, _, _ = multipletests(p, method="holm")

# 2) Storey π₀ 估計（簡易版）
def pi0_storey(p, lam=0.5):
    return np.mean(p > lam) / (1 - lam)
pi0 = pi0_storey(p)

# 3) IHW 透過 rpy2 橋接 R（沒有原生 Python 版）
from rpy2.robjects import r, FloatVector
r.assign("p", FloatVector(p)); r.assign("bm", FloatVector(base_mean))
r('library(IHW); ihw_res <- ihw(p, bm, alpha=0.05)')

# 4) p-value 直方圖診斷
plt.hist(p, bins=50); plt.xlabel("p-value"); plt.show()
# 平坦 + 左側尖峰 → 模型 OK

🚫

常見錯誤：(1) 先用 |LFC| > 1 過濾再做 BH —— 破壞 FDR。(2) 在 p-value 直方圖右側上升時直接報 BH —— 你校正的是錯的 p-value。(3) 把 q-value 標成「Type I 錯誤率 5%」—— 概念錯置。(4) 對 GWAS 跨整個基因組做 BH 卻忽視 LD —— 應該先做 LD-aware clumping。 Common mistakes: (1) Filtering by |LFC| > 1 then applying BH — breaks FDR. (2) Running BH on p-values whose histogram rises on the right — you're correcting the wrong null. (3) Labelling q-values as "Type I error rate 5%" — wrong concept. (4) Applying BH genome-wide for GWAS while ignoring LD — do LD-aware clumping first.

📝 自我檢測

1. 對基因組規模篩查（m = 20,000），為什麼 FDR 通常比 FWER 更合理？

1. For a genome-wide screen (m = 20,000), why is FDR generally preferred over FWER?

A. FDR 在數學上比 FWER 嚴格A. FDR is mathematically stricter than FWER

B. FWER 不適用於連續資料B. FWER does not apply to continuous data

C. FWER 在 20,000 檢定下會犧牲幾乎所有檢力；FDR 容忍少量假陽性換取大量真陽性發現C. FWER sacrifices nearly all power across 20,000 tests; FDR trades a few false positives for many true discoveries

D. FDR 不需要校正D. FDR requires no correction

2. 「這個基因的 q-value = 0.05」最精確的解讀是？

2. The most precise interpretation of "this gene's q-value = 0.05" is:

A. 此基因為假陽性的機率為 5%A. The probability that this gene is a false positive is 5%

B. 在 H₀ 下看到此 p-value 或更極端的機率為 5%B. Probability of seeing this p-value or more extreme under H₀ is 5%

C. 若把此基因納入發現清單，整批發現中假陽性的期望比例為 5%C. If this gene is included in the discovery list, the expected proportion of false positives among rejections is 5%

D. 5% 的 Type I 錯誤率D. A 5% Type I error rate

3. BH 程序在何種相依性下嚴格有效？

3. Under what dependence is the BH procedure strictly valid?

A. 僅在完全獨立時A. Only under complete independence

B. 在獨立或 PRDS（正相依）下B. Under independence or PRDS (positive dependence)

C. 在任意相依下都嚴格有效C. Strictly valid under arbitrary dependence

D. 僅在 p-value 為常態時D. Only when p-values are normally distributed