STEP 12 / 13

多重檢定與 FDR (Multiple Testing & FDR)

當你一次測 20,000 個基因,α = 0.05 不再是「5%」——它是 1,000 個偽陽性。多重檢定校正不是錦上添花,是高通量資料的入場券。

When you run 20,000 tests at once, α = 0.05 is no longer "5%" — it is 1,000 false positives. Multiple-testing correction is not a finishing touch; it is the price of admission to high-throughput data.

為什麼需要校正?

單一檢定下 α = 0.05 的意思是:「如果 H₀ 真的成立,我有 5% 機率犯第一型錯誤」。這在「一篇論文一個假設」的年代沒問題。但當代生物學一次測整個基因組:RNA-seq 大約 m ≈ 20,000 個基因、GWAS 大約 m ≈ 10⁶ 個 SNP、proteomics 大約 m ≈ 5,000 個 peptide。

假設所有 H₀ 都成立(沒有任何基因真的差異表現)。在 α = 0.05 下,預期偽陽性數 = 0.05 × m:m = 20 時 1 個、m = 1,000 時 50 個、m = 20,000 時 1,000 個。一篇宣稱「我們找到 1,200 個 DEG, p < 0.05」的論文,其中 1,000 個可能根本是噪音。

更精確地說,「至少 1 個偽陽性」的機率叫 FWER(Family-Wise Error Rate)。獨立檢定下 FWER = 1 − (1 − α)^m。m = 14 時 FWER 已經 ≈ 51%;m = 100 時 ≈ 99.4%。多重檢定校正就是把這個失控的錯誤率拉回可控範圍。

For a single test, α = 0.05 says: "If H₀ is true, I have a 5% chance of a Type I error." That worked fine in the one-hypothesis-per-paper era. Modern biology measures the whole genome at once: RNA-seq has m ≈ 20,000 genes, GWAS m ≈ 10⁶ SNPs, proteomics m ≈ 5,000 peptides.

Suppose all H₀ are true (no gene is truly differential). At α = 0.05, the expected number of false positives is 0.05 × m: 1 for m = 20, 50 for m = 1,000, and 1,000 for m = 20,000. A paper claiming "we found 1,200 DEGs at p < 0.05" may well be reporting 1,000 noise hits.

The probability of at least one false positive is the FWER (Family-Wise Error Rate). Under independence, FWER = 1 − (1 − α)^m. By m = 14, FWER already ≈ 51%; by m = 100, ≈ 99.4%. Multiple-testing correction drags this runaway error rate back into something controllable.

🚨
歷史警鐘 — Ioannidis 2005 PLOS Med:「Why Most Published Research Findings Are False」。Ioannidis 用貝氏推理證明:在低先驗機率、高 m、未校正的研究領域,多數「顯著」發現是錯的。這篇文章催生了 reproducibility crisis 的整個論述。FDR 校正不只是統計品味,是科學誠信。 Historical alarm — Ioannidis 2005 PLOS Medicine: "Why Most Published Research Findings Are False." Using Bayesian reasoning Ioannidis showed that in fields with low prior probability, high m, and no correction, most "significant" findings are wrong. The paper launched the entire reproducibility-crisis conversation. FDR control is not a stylistic preference — it is scientific integrity.

一、FWER vs FDR

多重檢定有兩個競爭目標。FWER 控制「至少一個錯」的機率——適合臨床試驗、確認性研究,但對 m 大時極度保守。FDR(Benjamini & Hochberg 1995)改控制「宣稱為顯著的之中,有多少比例是錯」——犧牲嚴格性換來在 m 大時的可用 power,是高通量資料的事實標準。

Multiple testing has two competing objectives. FWER controls the chance of any false positive — appropriate for clinical trials and confirmatory work, but ruthlessly conservative when m is large. FDR (Benjamini & Hochberg 1995) instead controls the proportion of false positives among declared discoveries — trading strict guarantees for usable power at high m. It is the de-facto standard in high-throughput biology.

概念 定義 保證什麼 適用情境
FWERP(V ≥ 1),V = 偽陽性數整批分析中「連 1 個錯都不要」的機率 ≤ α少量檢定、確認性試驗、Phase III 臨床試驗、藥物 multiple endpointP(V ≥ 1), V = # false positivesChance of any false positive in the family ≤ αFew tests, confirmatory trials, Phase III, regulatory multiple endpoints
FDRE[V / R],R = 宣稱顯著數(V/R = 0 當 R = 0)在被「宣稱顯著」的清單中,偽陽性的期望比例 ≤ qRNA-seq DEG、GWAS、proteomics、scRNA-seq、screen——所有 m > 100 的探索性分析E[V / R], R = # rejections (V/R := 0 if R = 0)Expected fraction of false positives among declared discoveries ≤ qRNA-seq DE, GWAS, proteomics, scRNA-seq, screens — any exploratory analysis with m > 100
pFDRE[V/R | R > 0](Storey 2002)條件版本,給予 q-value 的精確語意當「至少有一個顯著」幾乎必然時(高訊號資料)E[V/R | R > 0] (Storey 2002)Conditional version, defining the q-valueWhen ≥ 1 discovery is essentially certain (high-signal data)
local FDRP(H₀ true | T = t)(Efron 2008)給定觀察到的統計量值,個別檢定為偽陽性的後驗機率empirical Bayes、需要逐 gene 的偽陽性風險P(H₀ true | T = t) (Efron 2008)Posterior probability of false positive for an individual testEmpirical Bayes; when you need per-gene false-positive risk
⌜ FWER = P(V ≥ 1)  ·  FDR = E[V / max(R, 1)]  ·  pFDR = E[V/R | R > 0] ⌝ 關鍵直覺:FWER 是「機率」(介於 0–1);FDR 是「比例的期望值」(同樣 0–1 但意義不同)。FDR ≤ q = 0.05 表示「100 個被宣稱為顯著者中,平均有 5 個是錯的」——這跟「p < 0.05」完全不同。 ⌜ FWER = P(V ≥ 1)  ·  FDR = E[V / max(R, 1)]  ·  pFDR = E[V/R | R > 0] ⌝ Key intuition: FWER is a probability (0–1); FDR is an expected proportion (also 0–1 but different meaning). FDR ≤ q = 0.05 means "of 100 declared discoveries, an average of 5 are false" — which is not at all the same as "p < 0.05".
💡
FDR 的革命性:1995 年以前,分析 m = 10,000 的微陣列等於把所有 p 乘以 10,000——幾乎不可能找到任何「顯著」基因。BH 把問題重新框架為「我願意接受 5% 的清單是錯的」,瞬間讓基因體學變得統計上可行。Benjamini & Hochberg 1995 是統計學引用次數最多的論文之一。 Why FDR was revolutionary: Before 1995, analysing an m = 10,000 microarray meant multiplying every p-value by 10,000 — almost nothing survived. BH reframed the question as "I will tolerate 5% of my list being wrong," which made genomics statistically feasible overnight. Benjamini & Hochberg 1995 is now one of the most-cited statistics papers ever.

m 越大,問題越大

拖動 m 滑桿(1 → 20,000)。觀察三個量:(1) 未校正的 FWER(至少一個偽陽性的機率,獨立時 = 1 − 0.95^m,m 大時極快趨近 1);(2) 預期偽陽性數 = 0.05 × m(線性增長);(3) Bonferroni 校正後個別檢定的 α' = 0.05/m。試試 m = 14:FWER 已過 50%;m = 20,000:α' = 2.5 × 10⁻⁶——這就是 GWAS 用 5×10⁻⁸ 的 Bonferroni 邏輯。

Drag the m slider (1 → 20,000). Watch three quantities: (1) uncorrected FWER (probability of ≥1 false positive; under independence = 1 − 0.95^m, blows up fast); (2) expected false positives = 0.05 × m (linear); (3) per-test Bonferroni threshold α' = 0.05/m. Try m = 14: FWER already > 50%. m = 20,000: α' = 2.5 × 10⁻⁶ — the same logic behind GWAS's 5×10⁻⁸ Bonferroni cut-off.

三條曲線:未校正 FWER(橘)· 預期 FP 數(沙)· Bonferroni 後每檢定 α'(綠)Three curves: uncorrected FWER (orange) · expected FP (sand) · Bonferroni per-test α' (teal)

二、方法家譜

FWER 派從 Bonferroni(1936)走到 Holm(1979)和 Hochberg(1988),追求嚴格的「至少 1 個錯」控制。FDR 派則從 BH(1995)出發,衍生出 BY(任意相依)、Storey q-value(自適應 π₀)、local FDR(Efron empirical Bayes)、IHW(共變量加權)、2dGBH(結構化資料)。

The FWER tradition runs Bonferroni (1936) → Holm (1979) → Hochberg (1988), insisting on strict "no false positives at all" control. The FDR tradition starts at BH (1995) and branches into BY (arbitrary dependence), Storey q-value (adapts to π₀), local FDR (Efron empirical Bayes), IHW (covariate-weighted), and 2dGBH (structured data).

Bonferroni

拒絕 H₀ᵢ 若 pᵢ < α/m。控制 FWER 在任何相依結構下(這是它的賣點與包袱)。缺點:m 大時極度保守,幾乎丟掉所有 power。

Carlo E. Bonferroni 在 1936 年提出,奠基於 Bonferroni 不等式 P(∪ Aᵢ) ≤ Σ P(Aᵢ)。GWAS 的 5×10⁻⁸ 門檻 = 0.05/10⁶。

Reject H₀ᵢ if pᵢ < α/m. Controls FWER under any dependence (its selling point and its burden). Downside: brutal at large m — drops nearly all power.

Carlo E. Bonferroni proposed it in 1936, leaning on P(∪ Aᵢ) ≤ Σ P(Aᵢ). The 5×10⁻⁸ GWAS threshold is just 0.05 / 10⁶.

Holm step-down

把 p 值由小到大排序 p₍₁₎ ≤ p₍₂₎ ≤ ⋯;從最小開始檢視 p₍ᵢ₎ ≤ α / (m − i + 1),若不滿足則停止,後面全部接受。仍控制 FWER、永遠至少和 Bonferroni 一樣強(uniformly more powerful)——沒有理由用 Bonferroni 而不用 Holm。

Sort p-values smallest first p₍₁₎ ≤ p₍₂₎ ≤ ⋯; starting from the smallest, reject as long as p₍ᵢ₎ ≤ α / (m − i + 1); stop on first failure. Still controls FWER, but is uniformly more powerful than Bonferroni — there is no reason to use Bonferroni instead of Holm.

Hochberg step-up

反向:從最大 p 往小檢視,找最大的 i 使 p₍ᵢ₎ ≤ α / (m − i + 1),拒絕所有 ≤ i 的假設。比 Holm 略 strong,但需要獨立或 PRDS 假設(Bonferroni / Holm 不需要)。

Reverse direction: scan from the largest p downward, find the largest i with p₍ᵢ₎ ≤ α / (m − i + 1), reject all j ≤ i. Slightly more powerful than Holm but requires independence or PRDS (Bonferroni and Holm do not).

BH-FDR ⭐

排序 p₍₁₎ ≤ ⋯ ≤ p₍ₘ₎,找最大的 k 滿足 p₍ₖ₎ ≤ (k/m)·q,拒絕所有 i ≤ k。在獨立或 PRDS(Positive Regression Dependence on Subsets)下控制 FDR ≤ q。大多數基因表達資料、GWAS 同一染色體 SNP 都滿足 PRDS。高通量資料的工業標準。

原始論文 JRSS-B 57:289–300,DOI 10.1111/j.2517-6161.1995.tb02031.x。

Sort p₍₁₎ ≤ ⋯ ≤ p₍ₘ₎; find the largest k with p₍ₖ₎ ≤ (k/m)·q; reject all i ≤ k. Controls FDR ≤ q under independence or PRDS (Positive Regression Dependence on Subsets), satisfied by most gene-expression tests and same-chromosome GWAS SNPs. The industry standard for high-throughput data.

JRSS-B 57:289–300, DOI 10.1111/j.2517-6161.1995.tb02031.x.

BY for arbitrary dependence

當相依結構未知或負相依時,BH 不保證控制 FDR。BY 把 BH 的門檻乘以 1/Σᵢ₌₁ᵐ (1/i) ≈ 1/(ln m + 0.5772)。m = 10,000 時這倍率約 1/9.8——非常保守,但任何相依結構都成立。Ann. Statist. 29:1165–1188。

When dependence is unknown or negative, BH may not control FDR. BY multiplies BH's threshold by 1/Σᵢ₌₁ᵐ (1/i) ≈ 1/(ln m + 0.5772). For m = 10,000 that's roughly 1/9.8 — very conservative, but valid under any dependence. Ann. Statist. 29:1165–1188.

Storey q-value ⭐

估計 π₀(真 H₀ 比例),用 π₀·m 取代 BH 的 m。如果只有 60% 基因為真 null,q-value 比 BH power 高 ~40%。q-value(i) = 「使第 i 個檢定恰好被宣稱顯著時的最小 pFDR」——具有與 p-value 對偶的逐項解讀。

qvalue R 套件(Storey, Bass, Dabney, Robinson);scRNA-seq DE 的主力工具。Storey 2002 JRSS-B 64:479; Storey & Tibshirani 2003 PNAS 100:9440。

Estimates π₀ (the true-null proportion) and replaces m by π₀·m in BH. If only 60% of genes are truly null, q-value yields ~40% more power than BH. q-value(i) = "the smallest pFDR at which test i would just be rejected" — a per-test reading dual to the p-value.

The qvalue R package (Storey, Bass, Dabney, Robinson) is the scRNA-seq DE workhorse. Storey 2002 JRSS-B 64:479; Storey & Tibshirani 2003 PNAS 100:9440.

Local FDR

fdr(z) = P(H₀ | T = z) = π₀·f₀(z) / f(z),empirical-Bayes 框架。q-value 是「尾端偽陽性比例」,local FDR 是「該點偽陽性機率」。Efron 2007 JASA、2008 Stat Sci、2010 書《Large-Scale Inference》系統論述。R: locfdr

fdr(z) = P(H₀ | T = z) = π₀·f₀(z) / f(z), an empirical-Bayes formulation. The q-value gives the tail false-positive rate; local FDR gives the pointwise posterior. See Efron 2007 JASA, 2008 Stat Sci, and the 2010 book Large-Scale Inference. R: locfdr.

IHW — covariate-weighted

許多檢定有共變量(covariate)和 power 相關但和 p-value 在 H₀ 下獨立——例如 RNA-seq DE 中的 mean expression、GWAS 中的 MAF。IHW(Independent Hypothesis Weighting)按 covariate 分箱、學習每箱權重,在控制 FDR 下大幅提升 power。Nat Methods 13:577,DOI 10.1038/nmeth.3885。Bioconductor: IHW

Many tests have a covariate correlated with power but independent of the p-value under H₀ — mean expression in RNA-seq DE, MAF in GWAS. IHW (Independent Hypothesis Weighting) bins by the covariate, learns per-bin weights, and gains substantial power while still controlling FDR. Nat Methods 13:577, DOI 10.1038/nmeth.3885. Bioconductor: IHW.

2dGBH — structured / spatial

當檢定有二維結構(如空間轉錄組的 spot × gene、scATAC 的 peak × cluster)時,傳統一維 BH 忽略此結構。2dGBH 在二維 group 上做 group BH,提升 power 同時控制 FDR。Liu 2024 提出。

When tests share a two-dimensional structure (spatial transcriptomics spot × gene, scATAC peak × cluster, etc.), 1D BH ignores it. 2dGBH applies group BH across the 2D grouping, raising power while keeping FDR control. Proposed in Liu 2024.

實務原則:沒有「最強」的方法,只有「最適合資料結構」的方法。標準分析流程:(1) 探索性高通量 → BH 或 q-value;(2) 同時用 IHW + covariate 加權若 covariate 存在;(3) 任意相依(如 metabolomics 強相關 features)→ BY;(4) 確認性 / 臨床 / 監管 → Bonferroni 或 Holm。但永遠在分析之前決定方法——事後挑選等於 p-hacking。 Practical rule: there is no "best" method, only "best fit for the data structure." Standard pipeline: (1) exploratory high-throughput → BH or q-value; (2) add IHW if a covariate exists; (3) arbitrary dependence (e.g. strongly correlated metabolomics features) → BY; (4) confirmatory / clinical / regulatory → Bonferroni or Holm. Decide before you analyse — picking afterwards is p-hacking.

BH-FDR demonstrator

模擬 m 個 p-value:比例 π₀ 為真 null(U(0,1))、其餘為真 alt(偏小 p)。排序後畫散佈圖 (i, p₍ᵢ₎),疊上 BH 線 y = q·i/mBH 規則:找最大的 i 使點落在線下,拒絕所有 ≤ 此 i 的假設。拖滑桿改 π₀、q、m,觀察拒絕門檻怎麼移動——π₀ 越小(真訊號越多),步距越長,BH 也越強。

Simulate m p-values: a fraction π₀ are true nulls (U(0,1)), the rest are true alternatives (small p). Sort and scatter (i, p₍ᵢ₎), overlay the BH line y = q·i/m. BH rule: find the largest i with the point under the line; reject all j ≤ i. Drag the sliders for π₀, q, m and watch the rejection threshold slide — the smaller π₀ is (more real signal), the further BH walks.

綠 = 拒絕(≤ k)· 灰 = 接受 · 紅虛線 = BH 門檻 y = q·i/m · 藍虛線 = Bonferroni α/mGreen = rejected (≤ k) · grey = accepted · red dashed = BH line y = q·i/m · blue dashed = Bonferroni α/m

三、決策樹

🌳 多重檢定方法選擇

Q1:
研究是確認性 / 監管 / 臨床終點嗎?→ 是 →Bonferroni 或 Holm 控制 FWER(永遠選 Holm,比 Bonferroni 強)。
Q2:
探索性高通量(m > 100),檢定間獨立或正相依(PRDS)→ 是 →BH-FDR(基本款)或 Storey q-value(power 較高,當 π₀ 顯著 < 1)。
Q3:
檢定間負相依或結構不明(如 metabolomics 高度相關 features)?→ 是 →BY(任意相依下保證 FDR ≤ q)。
Q4:
存在與 power 相關但與 H₀ 下 p 值獨立的共變量(如 mean expression、MAF)?→ 是 →IHW(covariate-weighted FDR)。RNA-seq DE 主流:DESeq2 + IHW。
Q5:
需要逐項後驗偽陽性風險?→ 是 →local FDR(Efron empirical Bayes)。
Q6:
檢定有二維結構(空間 / 多 cluster)?→ 是 → 考慮 2dGBH(Liu 2024)或 grouped FDR。
Q7:
m 很小(< 20)?→ 是 → FDR 估計不可靠,回到 Holm事先指定主要與次要 endpoint
Q1:
Confirmatory / regulatory / clinical endpoints? → Yes → Use Bonferroni or Holm (always prefer Holm — uniformly stronger than Bonferroni).
Q2:
Exploratory high-throughput (m > 100), tests independent or PRDS? → Yes → Use BH-FDR (baseline) or Storey q-value (higher power when π₀ < 1).
Q3:
Tests negatively or unknown-structure dependent (e.g., highly correlated metabolomics features)? → Yes → Use BY (FDR ≤ q under arbitrary dependence).
Q4:
A covariate linked to power but independent of p under H₀ (mean expression, MAF)? → Yes → Use IHW. Mainstream RNA-seq DE pipeline: DESeq2 + IHW.
Q5:
Need a per-test posterior false-positive risk? → Yes → Use local FDR (Efron empirical Bayes).
Q6:
Tests have a 2D structure (spatial / multi-cluster)? → Yes → Consider 2dGBH (Liu 2024) or grouped FDR.
Q7:
m is small (< 20)? → Yes → FDR estimates are unreliable — return to Holm or pre-specified primary/secondary endpoints.

四、方法比較表

方法 控制什麼 假設 Power 適用情境
Bonferroni 1936FWER任意相依最低少量檢定、GWAS 5×10⁻⁸Any dependenceLowestFew tests, GWAS 5×10⁻⁸
Holm 1979FWER任意相依略高於 Bonferroni(始終 ≥)取代 Bonferroni 的預設選項Any dependence≥ Bonferroni, alwaysDefault replacement for Bonferroni
Hochberg 1988FWER獨立 / PRDS≥ Holm少量檢定、相依結構良好Independence / PRDS≥ HolmFew tests with mild dependence
BH 1995 ⭐FDR ≤ q獨立 / PRDSRNA-seq, proteomics, microbiome — 預設Independence / PRDSHighRNA-seq, proteomics, microbiome — default
BY 2001FDR ≤ q任意相依≈ BH × 1/(ln m + 0.58)負相依或未知結構Any dependence≈ BH × 1/(ln m + 0.58)Negative or unknown dependence
Storey q-valueFDR / pFDR ≤ q獨立 / 弱相依、估 π₀ 可靠≥ BH(π₀ < 1 時更高)scRNA-seq DE、microarray、π₀ 顯著 < 1 的場景Independence / weak dep., reliable π₀≥ BH (more so when π₀ < 1)scRNA-seq DE, microarray, π₀ < 1 regimes
IHW 2016FDR ≤ q獨立 covariate(H₀ 下 p ⊥ covariate)> BH(covariate 資訊量大時顯著)DESeq2 預設、ATAC peak filteringIndependent covariate (p ⊥ covariate under H₀)> BH (large gain when covariate informative)DESeq2 default, ATAC peak filtering
local FDRper-test 後驗大 m(> 500)、可估 f, f₀不直接比較(不同 metric)empirical Bayes、需要逐 gene 風險Large m (> 500), estimable f, f₀Different metric, not directly comparableEmpirical Bayes, per-gene risk

五、組學慣例

🧬

RNA-seq DE

m ≈ 15,000–25,000 基因。預設:BH 或 BH + IHW(DESeq2 預設)。apeglm shrinkage 後配 BH。報告 padj < 0.05 或 0.1。

m ≈ 15,000–25,000 genes. Default: BH or BH + IHW (DESeq2 default). apeglm shrinkage then BH. Report padj < 0.05 or 0.1.

🧪

GWAS

m ≈ 10⁶ SNP。歷史共識:Bonferroni α/m = 5×10⁻⁸("genome-wide significance")。同一染色體 LD 區塊滿足 PRDS,可用 BH 但傳統仍偏好 Bonferroni 以利跨研究 meta。

m ≈ 10⁶ SNPs. Historical consensus: Bonferroni α/m = 5×10⁻⁸ ("genome-wide significance"). Same-chromosome LD blocks satisfy PRDS so BH is valid, but Bonferroni remains preferred for cross-study meta-analysis.

🧫

Proteomics

m ≈ 5,000–10,000 peptide/protein。BH 在 PSM、peptide、protein 三層分別校正。MaxLFQ 後常用 limma + BH。

m ≈ 5,000–10,000 peptides / proteins. BH applied separately at PSM, peptide, protein levels. Post-MaxLFQ, the limma + BH pipeline is standard.

🦠

scRNA-seq DE

cluster-vs-rest 每 cluster 數萬個比較。Seurat FindMarkers 預設 BH;Storey q-value 在 ASAP、MAST 中常見。注意 pseudobulk vs single-cell-level 的選擇影響 m 與 FDR 估計。

cluster-vs-rest yields tens of thousands of comparisons per cluster. Seurat FindMarkers defaults to BH; Storey q-value appears in ASAP / MAST. Note: pseudobulk vs single-cell-level analysis changes both m and the FDR estimate.

🌐

ChIP / ATAC

MACS2 peak calling 用 q-value(BH-FDR)作為 cutoff。peak × condition 差異分析則用 DESeq2 + BH,或 DiffBind + IHW(mean signal 作 covariate)。

MACS2 peak calling uses a q-value (BH-FDR) cutoff. Differential peak × condition is DESeq2 + BH or DiffBind + IHW (mean signal as covariate).

📊

Microbiome

m ≈ 100–2,000 OTU/ASV,相依強(compositionality)。ANCOM-BC + BHBY 對任意相依更穩健。比例壓制下假陽性極多——多重檢定校正只是底線,先做 compositional transform。

m ≈ 100–2,000 OTU/ASV with strong dependence (compositionality). ANCOM-BC + BH or BY for arbitrary dependence. False positives are rampant under compositional constraint — correction is the floor; transform first.

六、實作範例

# R — Multiple-testing toolbox
library(qvalue)        # Storey q-value
library(IHW)           # covariate-weighted FDR
library(locfdr)        # Efron local FDR

# Suppose p_vec is a length-m vector of p-values
m <- length(p_vec)

# --- FWER family ---
p_bonf <- p.adjust(p_vec, method = "bonferroni")
p_holm <- p.adjust(p_vec, method = "holm")
p_hoch <- p.adjust(p_vec, method = "hochberg")

# --- FDR family ---
p_bh   <- p.adjust(p_vec, method = "BH")        # Benjamini-Hochberg ⭐
p_by   <- p.adjust(p_vec, method = "BY")        # arbitrary dependence

# Storey q-value (adapts to π₀)
qv <- qvalue::qvalue(p_vec)
qv$pi0; qv$qvalues
summary(qv)            # reports π₀ and number of discoveries by q

# IHW with a covariate (e.g. mean expression for RNA-seq DE)
res <- IHW::ihw(p_vec ~ mean_expr, alpha = 0.05)
rejections(res)
adj_p <- IHW::adj_pvalues(res)

# Local FDR (Efron empirical Bayes; needs z-scores, not p)
z <- qnorm(1 - p_vec)
lf <- locfdr::locfdr(z, plot = 0)
lf$fdr                  # per-test local FDR

# --- DESeq2 pipeline (the modern default) ---
# library(DESeq2); res <- results(dds, filterFun = ihw); res$padj
import numpy as np
from statsmodels.stats.multitest import multipletests
from scipy.stats import norm

# --- FWER family ---
_, p_bonf, _, _ = multipletests(p_vec, method='bonferroni')
_, p_holm, _, _ = multipletests(p_vec, method='holm')
_, p_hoch, _, _ = multipletests(p_vec, method='hommel')   # Hommel ~ Hochberg variant

# --- FDR family ---
rej, p_bh,   _, _ = multipletests(p_vec, method='fdr_bh',    alpha=0.05)
_,   p_by,   _, _ = multipletests(p_vec, method='fdr_by',    alpha=0.05)
_,   p_tsbh, _, _ = multipletests(p_vec, method='fdr_tsbh',  alpha=0.05)  # two-stage

# --- Storey q-value (requires the qvalue package on PyPI) ---
from qvalue import estimate
result = estimate(p_vec)              # returns qvalues + pi0

# --- IHW in Python: via rpy2 or pyihw (Bioconductor port) ---
from pyihw import ihw
adj_p, weights = ihw(p_vec, mean_expr, alpha=0.05)

# --- Local FDR: convert to z, fit two-group empirical Bayes ---
z = norm.isf(p_vec)
# custom fit or use sklearn / locfdr-python

# --- DESeq2 (via pyDESeq2) pipeline ---
# from pydeseq2.dds import DeseqDataSet
# res = dds.results(); res['padj']  # already BH
💡
實用 cheatsheet:R 中 p.adjust(p, method="BH") 與 statsmodels 的 multipletests(p, method='fdr_bh') 是 99% 的場合會用到的;其餘是 IHW(power 增益最顯著)與 q-value(π₀ 估計)。寫論文 Methods:「P-values were adjusted using the Benjamini-Hochberg procedure (BH; Benjamini and Hochberg, 1995). Significant if BH-adjusted p < 0.05.」 Practical cheatsheet: R's p.adjust(p, method="BH") and statsmodels' multipletests(p, method='fdr_bh') cover 99% of cases. The rest are IHW (biggest power gain) and q-value (π₀ estimation). Methods boilerplate: "P-values were adjusted using the Benjamini-Hochberg procedure (BH; Benjamini and Hochberg, 1995). Discoveries were declared at BH-adjusted p < 0.05."

七、常見陷阱

BH 在負相依下不保證

BH 保證 FDR ≤ q 需要獨立或 PRDS。當資料負相依(例如 compositional microbiome、強相關 features),BH 可能失效。實務做法:改用 BY(永遠保證但更保守)或先做 decorrelation / compositional transform。Benjamini & Yekutieli 2001 給的反例至今仍是教科書。

BH guarantees FDR ≤ q only under independence or PRDS. With negative dependence (compositional microbiome, strongly correlated features), BH may fail. Fix: use BY (always valid, more conservative) or decorrelate / compositionally transform first. Benjamini & Yekutieli 2001 give the canonical counter-example.

raw p 不該是主要結果

m 很大時,raw p < 0.05 幾乎沒有資訊。論文常見錯誤:「我們找到 1,200 個基因 p < 0.05(其中 80 個 padj < 0.05)」——80 才是結論,1,200 是噪音。主要結果一定是 padj / q,raw p 只能輔助。

For large m, raw p < 0.05 carries almost no information. The classic paper error: "We found 1,200 genes at p < 0.05 (80 at padj < 0.05)" — 80 is the result; 1,200 is noise. The headline must be padj / q-value; raw p is supplementary at best.

事後挑 q 門檻

「q < 0.05 沒幾個顯著,那我用 q < 0.10⋯⋯還是不夠,q < 0.20 試試」——這是嚴重 p-hacking。q 必須在看資料前選定。若需要寬鬆 cutoff,預先註明(如 q < 0.10 作 hypothesis generation, q < 0.05 作 confirmation)並 pre-register。

"q < 0.05 gives nothing... try q < 0.10... still few... q < 0.20?" — that's serious p-hacking. The q threshold must be chosen before looking at the data. If a generous cutoff is needed, state it (e.g., q < 0.10 for hypothesis generation, q < 0.05 for confirmation) and pre-register it.

小 m 估 FDR 不可靠

FDR 的良好行為來自大 m 的大數法則。m < 20 時,FDR 估計的變異性大、π₀ 幾乎沒法估、q-value 不穩。建議 m < 20 用 Holm;m = 20–100 用 BH 但解讀保守;m > 100 才放心用 q-value / IHW。

FDR's good behaviour comes from the law of large numbers on m. With m < 20, FDR estimates are highly variable, π₀ is essentially unestimable, and q-values wobble. Recommendation: m < 20 → Holm; m = 20–100 → BH but interpret cautiously; m > 100 → q-value / IHW are reliable.

FDR 不是 p

「padj < 0.05」不代表「該基因為 null 的機率 < 5%」。padj < 0.05 表示:在所有 padj < 0.05 的基因中,平均 5% 是錯的——這是清單層級而非單一基因層級的保證。要單一層級用 local FDR 或 Bayes posterior。

"padj < 0.05" does not mean "this gene has < 5% chance of being null." It means: among all genes with padj < 0.05, an average of 5% are false — a list-level, not per-gene, guarantee. For per-gene risk, use local FDR or Bayesian posterior probabilities.

同一檢定校正兩次

常見錯誤:先在 cluster 內做 BH,再對所有 cluster 結果再做一次 BH。這會雙重收縮,過度保守。正確做法是一次性對整批檢定(或 grouped FDR 方法如 2dGBH)做校正。

Common mistake: BH within each cluster, then BH again across clusters. This double-shrinks and is over-conservative. Apply correction once to the full family, or use a grouped-FDR method (2dGBH) that handles structure explicitly.

獨立 filter 才不算多重性

RNA-seq 常先用 mean count 過濾低表達基因再做 DE。filter 必須與後續檢定獨立(如用 row-mean 過濾、不用 group difference 過濾),否則 FDR 失效。這正是 IHW(Ignatiadis 2016)的數學基礎——它允許 filter,前提是 filter 與 H₀ 下 p 獨立。

RNA-seq often pre-filters low-count genes by mean. The filter must be independent of the downstream test (row-mean OK; group-difference filter not OK), or FDR control breaks. This is the mathematical basis of IHW (Ignatiadis 2016) — it permits filters as long as they are independent of p under H₀.

事後挑校正法

「Bonferroni 都沒顯著,那我用 BH」——分析決定。Methods 段預先寫明方法是 reproducibility 的基本要求。換方法等於增加 type-I error 的搜索空間。

"Bonferroni found nothing — try BH" inflates Type-I error by silently expanding the search space. Decide before analysing; pre-specify in Methods. This is reproducibility 101.

陷阱總結 FDR 校正不能修復糟糕的實驗設計、不能解決 confounding、不能讓 underpowered 研究變可信。它只控制多重檢定這個錯誤源。如果你的 effect 大、樣本充足、設計乾淨,BH 校正後仍是顯著的;如果你需要 q = 0.20 才有任何發現,問題不在 FDR,在於資料本身。Begley & Ellis 2012 Nature 警告:preclinical 研究 reproducibility ~10%。多重檢定校正只是「不要再讓事情更糟」。 FDR control cannot rescue bad experimental design, confounding, or underpowered studies. It only addresses the multiple-testing source of error. With a real effect, adequate n, and a clean design, results survive BH; if you need q = 0.20 to find anything, the problem is the data, not the FDR. Begley & Ellis 2012 Nature: preclinical reproducibility ~10%. Correction is "do no further harm" — not a cure.

📝 自我檢測

1. 你做 RNA-seq DE,m = 18,000 個基因,要報告主要結果。最合適的多重檢定校正?

1. You run RNA-seq DE on m = 18,000 genes and need a headline result. Best correction?

A. 只報 raw p < 0.05,反正基因多總會有的A. Just report raw p < 0.05 — plenty of genes
B. Bonferroni α/m = 2.8×10⁻⁶B. Bonferroni at α/m = 2.8×10⁻⁶
C. BH-FDR(或 BH + IHW with mean expression)C. BH-FDR (or BH + IHW with mean expression)
D. 不校正,留給 reviewer 處理D. No correction, leave it to the reviewers

2. 「padj < 0.05」的正確解讀是?

2. What does "padj < 0.05" actually mean?

A. 該基因為 null 的機率 < 5%A. This gene has < 5% chance of being null
B. 該基因為差異表現的機率 > 95%B. This gene is differentially expressed with > 95% probability
C. raw p × m < 0.05C. raw p × m < 0.05
D. 在所有 padj < 0.05 的基因中,平均約 5% 是偽陽性D. Among all genes with padj < 0.05, ~5% are expected to be false

3. m = 20,000 個 RNA-seq 檢定,假設所有 H₀ 為真,在未校正 α = 0.05 下預期偽陽性數約是?

3. m = 20,000 RNA-seq tests, all H₀ true, uncorrected α = 0.05. Expected false positives?

A. 5 個A. 5
B. 1,000 個B. 1,000
C. 50 個C. 50
D. 5,000 個D. 5,000

4. 你的代謝體學資料 features 之間強相關(負相依),BH 校正可能失效。最穩健的選擇?

4. Your metabolomics features are strongly (negatively) correlated; BH may fail. Most robust choice?

A. 直接用 Storey q-valueA. Use Storey q-value directly
B. 用 IHW 與 mean intensityB. Use IHW with mean intensity
C. 用 BY(Benjamini-Yekutieli 2001)——任意相依下保證 FDRC. Use BY (Benjamini-Yekutieli 2001) — FDR controlled under any dependence
D. 不校正,因為 features 相關互相抵消D. Skip correction — correlations cancel out