The Essence of Hypothesis Testing
The procedure: (1) state H₀ and H₁; (2) choose a test statistic T; (3) derive the distribution of T under H₀; (4) compute the p-value, i.e. the probability under H₀ of a T at least as extreme as the one observed. If the p-value is below α, reject H₀.
Vocabulary: α = Type I error rate (rejecting a true H₀); β = Type II error rate (failing to reject a false H₀); power = 1 − β = the probability of rejecting H₀ when H₁ is true. This chapter also covers effect size and the minimum detectable effect (MDE), one- vs two-sided tests, equivalence / non-inferiority testing (TOST), and pre-registration.
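The four steps above can be sketched with an exact permutation test. This is a swapped-in technique chosen because its null distribution can be enumerated directly; it is not the t-test used elsewhere in this chapter. The data values and the function name `permutation_p_value` are made up for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    # Step 2: the test statistic T is the difference in group means.
    observed = mean(group_a) - mean(group_b)

    # Step 3: under H0 the group labels are exchangeable, so every way of
    # picking n_a values to call "group A" is equally likely.
    null_stats = []
    indices = range(len(pooled))
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        null_stats.append(mean(a) - mean(b))

    # Step 4: p-value = share of null statistics at least as extreme as T.
    # The small tolerance guards against floating-point ties.
    extreme = sum(abs(t) >= abs(observed) - 1e-12 for t in null_stats)
    return observed, extreme / len(null_stats)


# Step 1: H0 "no difference in means" vs H1 "some difference".
# Hypothetical measurements, for illustration only.
treated = [5.1, 4.9, 5.6, 5.8]
control = [4.2, 4.0, 4.5, 4.4]
observed, p_value = permutation_p_value(treated, control)
alpha = 0.05
print(f"T = {observed:.3f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

With these made-up numbers every treated value exceeds every control value, so only the observed split and its mirror image are as extreme, giving p = 2/70 ≈ 0.029.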
I. Two Types of Error and the Decision Matrix
| | H₀ true | H₀ false |
|---|---|---|
| Reject H₀ | Type I error (prob α) | ✅ Correct rejection, true positive (power = 1−β) |
| Fail to reject H₀ | ✅ Correct non-rejection, true negative (1−α) | Type II error (prob β) |
Effect Size and MDE
Effect size (Cohen's d, log₂FC, OR, etc.) quantifies how large a difference is. The minimum detectable effect (MDE) is the smallest effect that can be detected at a target power (conventionally 0.8) for a fixed n and α. Always report the effect size and its CI alongside the p-value; the p-value by itself is not an effect.
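As a rough illustration, Cohen's d and an approximate MDE can be computed with the standard library alone. This sketch assumes a two-sided, two-sample comparison with equal group sizes and uses a normal approximation, d ≈ (z₁₋α/₂ + z_power)·√(2/n), so it slightly understates the MDE of a t-test at small n. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def minimum_detectable_effect(n_per_group, alpha=0.05, power=0.8):
    """Approximate MDE in units of Cohen's d (normal approximation)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return z_sum * sqrt(2 / n_per_group)


# Hypothetical measurements, for illustration only.
treated = [5.1, 4.9, 5.6, 5.8]
control = [4.2, 4.0, 4.5, 4.4]
d_hat = cohens_d(treated, control)
print(f"Cohen's d = {d_hat:.2f}")

# The MDE shrinks as the per-group sample size grows.
for n in (3, 10, 25, 100):
    print(f"n = {n:>3} per group -> MDE ≈ {minimum_detectable_effect(n):.2f}")
```

Reading the loop output from top to bottom shows why small designs can only detect very large effects.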
One-Sided vs Two-Sided / TOST
If H₁ is simply "there is a difference", use a two-sided test (the default); choose a one-sided test only when a clear directional rationale was specified in advance. Equivalence testing (TOST, two one-sided tests): to show that "|difference| < Δ", swap the roles of the hypotheses, so that H₀ is |difference| ≥ Δ and H₁ is |difference| < Δ. A non-significant standard test does not establish "no difference"; that claim requires TOST.
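To make the swapped hypotheses concrete, here is a minimal sketch of TOST computed from summary statistics, assuming two independent groups, a pooled-variance t statistic, and equivalence bounds given in raw data units. The function name `tost_from_summary` and the numbers are illustrative assumptions; the bounds ±0.12 are simply 0.4 standard deviations when the SD is 0.3.

```python
from math import sqrt

from scipy import stats


def tost_from_summary(m1, m2, sd1, sd2, n1, n2, low, high):
    """TOST for two independent groups (pooled variance, raw-unit bounds)."""
    df = n1 + n2 - 2
    pooled_sd = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    se = pooled_sd * sqrt(1 / n1 + 1 / n2)
    diff = m1 - m2
    # Test 1: H0 diff <= low  vs  H1 diff > low  (upper-tail p-value).
    p_lower = stats.t.sf((diff - low) / se, df)
    # Test 2: H0 diff >= high vs  H1 diff < high (lower-tail p-value).
    p_upper = stats.t.cdf((diff - high) / se, df)
    # Equivalence is claimed only if BOTH one-sided tests reject,
    # so the overall p-value is the larger of the two.
    return max(p_lower, p_upper), p_lower, p_upper


# Hypothetical summary statistics, for illustration only.
p_tost, p_lower, p_upper = tost_from_summary(
    m1=10.1, m2=10.0, sd1=0.3, sd2=0.3, n1=30, n2=30, low=-0.12, high=0.12
)
print(f"p_lower = {p_lower:.4f}, p_upper = {p_upper:.4f}, TOST p = {p_tost:.4f}")
```

In this made-up case the lower-bound test rejects but the upper-bound test does not (p roughly 0.4), so equivalence cannot be claimed. An ordinary two-sided t-test on the same numbers is also non-significant (p roughly 0.2), which is exactly the inconclusive situation the paragraph warns about.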
II. Determinants of Power
Power is jointly determined by four quantities; fix any three and solve for the fourth: (1) sample size n, (2) effect size Δ/σ, (3) significance level α, (4) power itself. RNA-seq designs typically target power ≥ 0.8 at α = 0.05. With too few samples (e.g. n = 3 per group and log₂FC = 0.5), power is usually below 0.3, so many true differences are missed.
RNA-seq-specific tools: RNASeqPower and ssizeRNA fold dispersion into the power formula. In scRNA-seq, dropout and cell-to-cell variability are larger; per-cell power is high, but per-cluster and cell-type comparisons often become unreliable because cluster sizes are extremely unbalanced.
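The "fix three, solve for the fourth" relationship can be sketched with a normal approximation for a two-sided, two-sample comparison with equal group sizes. This is an illustrative shortcut rather than the exact noncentral-t calculation that dedicated power tools use, so results differ slightly at small n. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()


def approx_power(n_per_group, d, alpha=0.05):
    """Two-sided, two-sample power under a normal approximation."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # mean of the test statistic under H1
    # Probability of landing in either rejection region under H1.
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def approx_n_per_group(d, alpha=0.05, power=0.8):
    """Smallest n per group reaching the target power (same approximation)."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(2 * (z_sum / d) ** 2)


# Solve for power given n, d, alpha.
print(f"power at n = 25, d = 0.5: {approx_power(25, 0.5):.3f}")
# Solve for n given d, alpha, power.
print(f"n per group for d = 0.5, power 0.8: {approx_n_per_group(0.5)}")
# Power rises with n for a fixed effect size.
for n in (5, 10, 25, 50, 100):
    print(n, round(approx_power(n, 0.5), 3))
```

The approximation gives 63 per group for d = 0.5, one fewer than the t-based answer, which is a reminder to use an exact tool for the final design.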
α / β / Power Visualizer
Left panel: sampling distributions of the test statistic under H₀ (blue) and H₁ (orange). Shading: red = α (Type I), tan = β (Type II). Drag the effect size d and the sample size n to watch the distributions pull apart, β shrink, and power rise. The stats row on the right shows the live power.
X-axis: standardized t statistic; blue: H₀; orange: H₁.
III. Five Common Misuses of the p-value
- "p > 0.05 means H₀ is true": absence of evidence is not evidence of absence. To establish "no difference", use TOST.
- p-hacking: trying many models, subgroups, or endpoints and reporting whichever is significant, which inflates the Type I error rate dramatically.
- HARKing (Hypothesising After the Results are Known): writing the hypothesis after seeing the data and presenting it as pre-specified.
- Optional stopping: peeking at p while collecting data and stopping as soon as it is significant. Under repeated peeking the actual Type I error rate can reach 20–50%. Use sequential designs (group-sequential, alpha-spending) instead.
- Statistical ≠ biological significance: across 20,000 genes, tiny effects reach p < 0.05 once n is large. Apply an effect-size threshold (|log₂FC| > 1) or use lfcShrink.
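The optional-stopping item can be checked with a small simulation. The sketch below assumes a deliberately simple setting: data are N(0, 1) with known variance, H₀ (mean = 0) is true, and a z-test is run after every batch. The batch size, number of looks, and function name are illustrative choices, not a recommended design.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05


def false_positive_rate(looks, batch=10, n_sims=2000, seed=0):
    """Share of null experiments declared significant at ANY of `looks` peeks.

    H0 is true in every simulated experiment, so each rejection is a
    Type I error.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(looks):
            total += sum(rng.gauss(0, 1) for _ in range(batch))
            n += batch
            # z = sample mean * sqrt(n) = total / sqrt(n) when sigma = 1.
            if abs(total / sqrt(n)) > Z_CRIT:
                rejections += 1
                break  # stop at the first "significant" peek
    return rejections / n_sims


single_look = false_positive_rate(looks=1, batch=100)
ten_looks = false_positive_rate(looks=10, batch=10)
print(f"one look at n = 100:     {single_look:.3f}")
print(f"ten looks up to n = 100: {ten_looks:.3f}")
```

A single look should stay near the nominal 0.05, while ten uncorrected looks push the false-positive rate to several times that, which is the inflation that group-sequential and alpha-spending designs are meant to control.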
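Hands-On: Power, TOST, and Simulation
```r
# --- R --- Power and TOST
# 1) Solve for any one of the four quantities
library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")  # solve for n
pwr.t.test(n = 25, d = 0.5, sig.level = 0.05, type = "two.sample")       # solve for power

# 2) RNA-seq sample-size planning (power is left out, so it is the value returned)
library(RNASeqPower)
rnapower(depth = 20, n = 3, cv = 0.4, effect = 2, alpha = 0.05)

# 3) TOST (two one-sided tests); bounds are given in Cohen's d units
library(TOSTER)
TOSTtwo(m1 = 10.1, m2 = 10.0, sd1 = 0.3, sd2 = 0.3, n1 = 30, n2 = 30,
        low_eqbound_d = -0.4, high_eqbound_d = 0.4)

# 4) Check power by simulation
sim_power <- function(n, d, M = 5000) {
  # var.equal = TRUE matches the pooled-variance test that pwr.t.test assumes
  ps <- replicate(M, t.test(rnorm(n, d), rnorm(n), var.equal = TRUE)$p.value)
  mean(ps < 0.05)
}
sim_power(25, 0.5)  # should be close to the pwr.t.test result
```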
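```python
# --- Python --- Power and TOST
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(0)  # seeded for reproducibility

# 1) Solve for any one of the power-analysis quantities
ana = TTestIndPower()
ana.solve_power(effect_size=0.5, alpha=0.05, power=0.8)  # solve for n per group
ana.solve_power(effect_size=0.5, nobs1=25, alpha=0.05)   # solve for power

# 2) TOST; low/upp are equivalence bounds in raw data units
# group1, group2: synthetic placeholder samples (assumed for illustration)
group1 = rng.normal(10.1, 1.0, 30)
group2 = rng.normal(10.0, 1.0, 30)
# p is the overall TOST p-value; lo and hi hold the two one-sided test results
p, lo, hi = ttost_ind(group1, group2, low=-0.4, upp=0.4, usevar="pooled")

# 3) Check power by simulation
def sim_power(n, d, M=5000):
    ps = [stats.ttest_ind(rng.normal(d, 1, n), rng.normal(0, 1, n)).pvalue
          for _ in range(M)]
    return np.mean(np.array(ps) < 0.05)

sim_power(25, 0.5)  # should be close to the solve_power result
```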
📝 Self-Check
1. Why is "p > 0.05" not evidence for H₀?
2. Power increases with which of the following?
3. What is the difference between a non-significant standard test and a significant equivalence test?