STEP 4 / 17

Hypothesis Testing and Errors

H₀, H₁, α, β, power, and p-values, plus the critical RNA-seq warning that "statistically significant ≠ biologically meaningful".

The essence of hypothesis testing

The procedure: (1) state H₀ and H₁; (2) choose a test statistic T; (3) derive the distribution of T under H₀; (4) compute the tail probability of the observed T under that H₀ distribution, which is the p-value. If this probability is small (< α), reject H₀.

Vocabulary: α = Type I error rate (rejecting a true H₀); β = Type II error rate (failing to reject a false H₀); power = 1 − β = the probability of rejecting H₀ when H₁ is true. This chapter also covers effect size and the minimum detectable effect (MDE), one- vs two-sided tests, equivalence / non-inferiority testing (TOST), and pre-registration.

⚠️
The correct definition of a p-value: "assuming H₀ is true, the probability of obtaining data at least as extreme as those observed". It is NOT: ① the probability that H₀ is true; ② the probability that a repeated experiment gives the same result; ③ a measure of effect size.
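The four-step procedure and the tail-probability definition can be made concrete with a permutation test, where the H₀ distribution of T is built by shuffling group labels rather than derived analytically. This is a minimal sketch using only the Python standard library; the function name `permutation_p_value` and the expression values are hypothetical and chosen purely for illustration.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    # Step 2: the test statistic T is the absolute difference in group means.
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        # Step 3: under H0 the group labels are exchangeable, so shuffling
        # them generates draws from the H0 distribution of T.
        rng.shuffle(pooled)
        t_perm = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if t_perm >= observed:
            extreme += 1
    # Step 4: tail probability; the +1 keeps the estimate away from exactly 0.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical expression values (illustrative numbers only).
control = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2]
treated = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0]

alpha = 0.05
p = permutation_p_value(control, treated)
print(f"p = {p:.4f}; reject H0 at alpha = {alpha}: {p < alpha}")
```

The returned value only answers "how often would shuffled labels give a difference at least this large?". It says nothing about the probability that H₀ is true or about how large the effect is.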

I. The two types of error and the decision matrix

                  | H₀ true                                      | H₀ false
Reject H₀         | Type I error (prob. α)                       | ✅ Correct rejection / true positive (power = 1−β)
Fail to reject H₀ | ✅ Correct non-rejection / true negative (1−α) | Type II error (prob. β)
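The four cells of the matrix can be checked by simulation: generate data with H₀ true, then with H₀ false, and count how often the test rejects. The sketch below assumes a one-sample z-test with known σ = 1 so that it needs only the standard library; `rejection_rate` and its defaults are illustrative choices, not part of the original material.

```python
import random
from statistics import NormalDist

def rejection_rate(true_mean, n=20, alpha=0.05, n_sim=5000, seed=1):
    """Fraction of simulated two-sided z-tests (known sigma = 1) rejecting H0: mean = 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        sample_mean = sum(rng.gauss(true_mean, 1.0) for _ in range(n)) / n
        z = sample_mean * n ** 0.5          # standardised test statistic
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sim

type_i = rejection_rate(true_mean=0.0)   # H0 true: every rejection is a Type I error
power = rejection_rate(true_mean=0.5)    # H0 false: every rejection is a correct decision
type_ii = 1 - power                      # H0 false: failures to reject are Type II errors

print(f"Type I rate ~ {type_i:.3f} (should sit near alpha = 0.05)")
print(f"power ~ {power:.3f}, Type II rate ~ {type_ii:.3f}")
```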
📊

Effect size and MDE

Effect size (Cohen's d, log₂FC, OR, etc.) quantifies "how large the difference is". The minimum detectable effect (MDE) is the smallest effect that can be detected at a fixed n and α with the target power (here 0.8). Always report the effect size and its CI alongside p; the p-value itself is not an effect.
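As a rough illustration of reporting an effect size with its CI and of computing an MDE, the sketch below uses the pooled-SD form of Cohen's d, a common large-sample approximation to its standard error, and the normal-approximation MDE formula d ≈ (z₁₋α/₂ + z_power)·√(2/n). These are assumptions for the sake of a dependency-free example; a t-based calculation gives slightly larger MDEs at small n. The data values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of two independent groups."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def d_confidence_interval(d, n_a, n_b, level=0.95):
    """Approximate (large-sample) confidence interval for Cohen's d."""
    se = ((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b))) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se

def minimum_detectable_d(n_per_group, alpha=0.05, power=0.8):
    """Normal-approximation MDE (in units of d) for a two-sided two-sample test."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * (2 / n_per_group) ** 0.5

# Hypothetical measurements (illustrative numbers only).
treated = [6.2, 5.8, 6.5, 6.0, 6.4, 5.9, 6.1, 6.3]
control = [5.6, 5.9, 5.4, 6.0, 5.7, 5.5, 5.8, 6.1]

d = cohens_d(treated, control)
lo, hi = d_confidence_interval(d, len(treated), len(control))
print(f"d = {d:.2f}, approx. 95% CI = ({lo:.2f}, {hi:.2f})")
print(f"MDE at n = 25 per group: d ~ {minimum_detectable_d(25):.2f}")
```

A wide interval around d is itself informative: it signals that the design cannot distinguish a small effect from a large one, whatever the p-value says.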

🎯

One-sided vs two-sided / TOST

Use a two-sided test (the default) when H₁ is "there is a difference"; choose a one-sided test only when a clear directional rationale was fixed in advance. Equivalence testing (TOST): to show "the difference between two groups is < Δ", swap the roles of H₀ and H₁, so that H₀ becomes "|difference| ≥ Δ" and H₁ becomes "|difference| < Δ". A non-significant standard test cannot be read as "no difference"; establishing that requires TOST.
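The "two one-sided tests" logic can be sketched from summary statistics. The version below is a large-sample normal approximation written with the standard library only, so its p-values will differ slightly from t-based implementations; the function name `tost_from_summary` is hypothetical, and the equivalence bounds are expressed on the raw scale of the difference.

```python
from statistics import NormalDist

def tost_from_summary(m1, m2, sd1, sd2, n1, n2, low, high):
    """Large-sample TOST p-value for a difference in means.

    H0: diff <= low or diff >= high (not equivalent)
    H1: low < diff < high           (equivalent)
    """
    diff = m1 - m2
    se = (sd1 ** 2 / n1 + sd2 ** 2 / n2) ** 0.5
    z = NormalDist()
    p_lower = 1 - z.cdf((diff - low) / se)   # one-sided test of H0: diff <= low
    p_upper = z.cdf((diff - high) / se)      # one-sided test of H0: diff >= high
    # Equivalence is claimed only if BOTH one-sided tests reject,
    # so the overall p-value is the larger of the two.
    return max(p_lower, p_upper)

# Bounds of +/-0.12 correspond to d = +/-0.4 when the SD is 0.3.
p_small_n = tost_from_summary(10.1, 10.0, 0.3, 0.3, 30, 30, low=-0.12, high=0.12)
p_large_n = tost_from_summary(10.0, 10.0, 0.3, 0.3, 200, 200, low=-0.12, high=0.12)

print(f"n = 30 per group, diff = 0.1: TOST p = {p_small_n:.3f}")   # roughly 0.4: equivalence not shown
print(f"n = 200 per group, diff = 0.0: TOST p = {p_large_n:.2g}")  # far below 0.05: equivalence shown
```

The first call is the instructive one: an observed difference of 0.1 with 30 samples per group would not be significant in a standard test either, so neither "different" nor "equivalent" has been established.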

II. What determines power

Power is tied to three other quantities; fix any three of the four and solve for the fourth: (1) sample size n, (2) effect size Δ/σ, (3) significance level α, (4) power itself. RNA-seq designs typically target power ≥ 0.8 at α = 0.05. With too few samples (e.g. n = 3 per group and log₂FC = 0.5), power is usually < 0.3, so many true differences are missed.

RNA-seq specific tools: RNASeqPower and ssizeRNA fold dispersion into the power formula. scRNA-seq has more dropout and larger cell-to-cell variability; per-cell power is high, but per-cluster and cell-type comparisons are often unreliable because cluster sizes are extremely unbalanced.
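The "fix three, solve for the fourth" relationship can be written down explicitly under a normal approximation for a two-sided, two-sample comparison with equal group sizes: power ≈ Φ(d·√(n/2) − z₁₋α/₂) + Φ(−d·√(n/2) − z₁₋α/₂). The sketch below is that approximation only; it ignores RNA-seq dispersion and will be somewhat optimistic at very small n compared with t-based or count-based tools. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

_Z = NormalDist()

def approx_power(n_per_group, d, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample test (equal n)."""
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5     # expected value of the test statistic under H1
    return _Z.cdf(shift - z_crit) + _Z.cdf(-shift - z_crit)

def approx_n_per_group(d, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a target power."""
    z_sum = _Z.inv_cdf(1 - alpha / 2) + _Z.inv_cdf(power)
    return ceil(2 * (z_sum / d) ** 2)

# Assumed standardised effect d = 0.5 for illustration.
for n in (3, 10, 25, 64):
    print(f"n = {n:>2} per group -> power ~ {approx_power(n, 0.5):.2f}")
print(f"n per group for 80% power at d = 0.5: {approx_n_per_group(0.5)}")
```

Even under this optimistic approximation, three samples per group with a standardised effect of 0.5 leaves power well below 0.3.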

α / β / power distribution visualizer

Left panel: sampling distributions under H₀ (blue) and H₁ (orange). Shading: red = α (Type I), tan = β (Type II). Drag the effect size d and n to see the two distributions pull apart, β shrink, and power rise. The stats row on the right shows the live power.


X-axis: standardized t statistic; blue: H₀; orange: H₁.

III. Five common misuses of the p-value

  1. "p > 0.05 means H₀ is true": absence of evidence ≠ evidence of absence. To establish "no difference", use TOST.
  2. p-hacking: trying many models, subgroups, and endpoints, then reporting whichever is significant. This inflates the Type I error rate severely.
  3. HARKing (Hypothesising After Results are Known): writing the hypothesis after seeing the data and presenting it as pre-specified.
  4. Optional stopping: checking p while collecting data and stopping as soon as it is significant. Under realistic sequences of looks, the actual Type I error rate can reach 20–50%. Remedy: sequential analysis (group-sequential, alpha-spending designs).
  5. Statistical ≠ biological significance: across 20,000 genes, tiny effects reach p < 0.05 once n is large. Apply an effect-size threshold (|log₂FC| > 1) or use lfcShrink.
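The inflation caused by optional stopping (item 4) is easy to reproduce. The sketch below simulates data for which H₀ is true and compares a single test at the planned sample size with a strategy that tests after every 10 observations and stops at the first significant result. It assumes a one-sample z-test with known σ = 1; the function name and the look schedule are illustrative choices.

```python
import random
from statistics import NormalDist

def false_positive_rate(peek, n_max=100, step=10, alpha=0.05, n_sim=2000, seed=2):
    """Type I error rate of a z-test on H0-true data, with or without interim looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sim):
        total = 0.0
        rejected = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)                  # H0 is true: mean = 0, sigma = 1
            at_look = (n % step == 0) if peek else (n == n_max)
            if at_look and abs(total / n ** 0.5) > z_crit:
                rejected = True                           # stop at the first "significant" look
                break
        hits += rejected
    return hits / n_sim

fixed = false_positive_rate(peek=False)
peeking = false_positive_rate(peek=True)
print(f"single test at n = 100:        Type I ~ {fixed:.3f}")
print(f"test every 10 obs, stop early: Type I ~ {peeking:.3f}")
```

With ten unadjusted looks the false-positive rate lands far above the nominal 5%, which is why group-sequential and alpha-spending designs tighten the threshold used at each look.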
🚫
From the ASA 2016 statement: "P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone"; "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold". Best practice: CI + effect size + pre-registration + replication.

Hands-on: power, TOST, and simulation

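# --- R --- Power and TOST
# 1) Solve for any one of the four quantities
library(pwr)
pwr.t.test(d=0.5, sig.level=0.05, power=0.8, type="two.sample")  # solve for n (per group)
pwr.t.test(n=25, d=0.5, sig.level=0.05, type="two.sample")       # solve for power

# 2) RNA-seq sample-size planning
#    depth = sequencing depth for the gene, n = samples per group,
#    cv = biological coefficient of variation, effect = fold change
library(RNASeqPower)
rnapower(depth=20, n=3, cv=0.4, effect=2, alpha=0.05)            # returns power

# 3) TOST (two one-sided tests); bounds are given in units of Cohen's d
#    Note: TOSTtwo() is deprecated in recent TOSTER releases in favour of tsum_TOST()
library(TOSTER)
TOSTtwo(m1=10.1, m2=10.0, sd1=0.3, sd2=0.3, n1=30, n2=30,
        low_eqbound_d=-0.4, high_eqbound_d=0.4)

# 4) Check power by simulation
sim_power <- function(n, d, M=5000) {
  # M simulated two-sample t-tests with true standardized difference d
  ps <- replicate(M, t.test(rnorm(n, d), rnorm(n))$p.value)
  mean(ps < 0.05)                                    # fraction of rejections = estimated power
}
sim_power(25, 0.5)                                   # should be close to pwr.t.test
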
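# --- Python --- Power and TOST
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(0)

# 1) Solve for any one quantity in the power calculation
ana = TTestIndPower()
ana.solve_power(effect_size=0.5, alpha=0.05, power=0.8)        # solve for n (per group)
ana.solve_power(effect_size=0.5, nobs1=25, alpha=0.05)         # solve for power

# 2) TOST; low/upp are on the raw data scale, not in units of Cohen's d
# group1 / group2 are placeholder data (SD = 1) so the example runs as-is
group1 = rng.normal(0.1, 1, 30)
group2 = rng.normal(0.0, 1, 30)
p, lo, hi = ttost_ind(group1, group2, low=-0.4, upp=0.4, usevar="pooled")
# p: overall TOST p-value; lo, hi: (t, p, df) for the lower and upper one-sided tests

# 3) Check power by simulation
def sim_power(n, d, M=5000):
    ps = [stats.ttest_ind(rng.normal(d, 1, n),
                          rng.normal(0, 1, n)).pvalue
          for _ in range(M)]
    return np.mean(np.array(ps) < 0.05)    # fraction of rejections = estimated power
sim_power(25, 0.5)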

📝 Self-check

1. Why is "p > 0.05" not evidence for H₀?

A. Because α = 0.05 is arbitrary
B. Because p-values are always biased
C. Power may be insufficient: "not detected" ≠ "not present". To establish equivalence, use TOST.
D. Because p-values are Bayesian quantities

2. Power increases with which of the following?

A. Only larger α
B. Only larger n
C. Only larger effect size
D. All of them: larger n, larger effect size, larger α (and smaller σ)

3. What is the difference between a non-significant standard test and a significant equivalence test?

A. Exactly the same
B. The former only says "no difference was detected"; the latter actively shows "difference < Δ"
C. One is Bayesian, the other frequentist
D. The latter cannot be used in clinical trials