Step 4: Hypothesis Testing & Errors — Statistical Inference Tutorial

Overview

The Essence of Hypothesis Testing

The procedure has four steps: (1) state H₀ and H₁; (2) choose a test statistic T; (3) derive the distribution of T under H₀; (4) compute the tail probability of the observed T under that null distribution, which is the p-value. If the p-value is small (< α), reject H₀.

Vocabulary: α is the Type I error rate (rejecting a true H₀); β is the Type II error rate (failing to reject a false H₀); power = 1 − β is the probability of rejecting H₀ when H₁ is true. This chapter also covers effect size and the minimum detectable effect (MDE), one-sided vs two-sided tests, equivalence and non-inferiority testing (TOST), and pre-registration.
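The four steps can be traced in a few lines of code. The sketch below swaps in a permutation test, because it builds the null distribution of T directly by reshuffling group labels and needs only the standard library; it is one possible instance of the procedure, not the only one. The function name `permutation_p_value` and the two data lists are hypothetical and purely illustrative.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    # Step 2: the test statistic T is the difference in group means.
    t_obs = _mean(x) - _mean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        # Step 3: under H0 the group labels are exchangeable, so
        # reshuffling them draws from the null distribution of T.
        rng.shuffle(pooled)
        t_perm = _mean(pooled[:n_x]) - _mean(pooled[n_x:])
        # Step 4: count shuffles at least as extreme as the observed T.
        if abs(t_perm) >= abs(t_obs):
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (extreme + 1) / (n_perm + 1)


# Step 1: H0 = "no difference in means", H1 = "the means differ".
# Hypothetical measurements (illustrative only).
control = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.7, 5.3]
treated = [5.6, 5.9, 5.4, 5.8, 6.0, 5.5, 5.7, 6.1]

alpha = 0.05
p_value = permutation_p_value(control, treated)
reject_h0 = p_value < alpha
print(f"p = {p_value:.4f}, reject H0: {reject_h0}")
```

Because the two hypothetical groups do not overlap, almost no shuffle reproduces a difference as large as the observed one, so the p-value is small and H₀ is rejected at α = 0.05.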

⚠️

The correct definition of the p-value: "assuming H₀ is true, the probability of obtaining data at least as extreme as those observed." It is NOT: ① the probability that H₀ is true; ② the probability that a repeated experiment would give the same result; ③ a measure of effect size.

Core Concepts

I. Two Types of Error and the Decision Matrix

	H₀ true	H₀ false
Reject H₀	Type I error (prob. α)	✅ Correct rejection (power = 1−β)
Fail to reject H₀	✅ Correct non-rejection (prob. 1−α)	Type II error (prob. β)
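The two columns of the matrix can be checked by simulation: generate data with H₀ true and count rejections (an estimate of α), then generate data with H₀ false and count rejections (an estimate of power, so 1 minus that estimates β). The sketch below is a simplified illustration that assumes a two-sample z-test with known σ = 1; the function names and the true difference of 0.8 are hypothetical choices.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_p_value(x, y, sigma=1.0):
    """Two-sided two-sample z-test p-value, assuming a known common sigma."""
    n_x, n_y = len(x), len(y)
    se = sigma * (1.0 / n_x + 1.0 / n_y) ** 0.5
    z = (sum(x) / n_x - sum(y) / n_y) / se
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def rejection_rate(true_diff, n=25, alpha=0.05, n_sim=4000, seed=1):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.gauss(true_diff, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(x, y) < alpha:
            rejections += 1
    return rejections / n_sim


type_i_rate = rejection_rate(true_diff=0.0)   # H0 true: estimates alpha
power = rejection_rate(true_diff=0.8)         # H0 false: estimates 1 - beta
type_ii_rate = 1.0 - power                    # estimates beta
print(f"Type I ~ {type_i_rate:.3f}, Type II ~ {type_ii_rate:.3f}, power ~ {power:.3f}")
```

The first rate should land near the nominal α of 0.05; the second depends entirely on the assumed true difference and sample size.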

📊

Effect Size and MDE

Effect size (Cohen's d, log₂FC, odds ratio, etc.) quantifies how large a difference is. The minimum detectable effect (MDE) is the smallest effect that can be detected for a fixed n and α at a target power (conventionally 0.8). Always report the effect size with its confidence interval alongside the p-value; the p-value by itself does not measure an effect.
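As a rough illustration of both quantities, the sketch below computes Cohen's d from two samples and an approximate MDE for a two-sided, two-sample comparison. The MDE formula, (z₁₋α/₂ + z_power)·√(2/n), relies on a normal approximation, which is an assumption of this example; a t-based calculation gives slightly larger values for small n. The function names are hypothetical.

```python
from statistics import NormalDist, mean, variance


def cohens_d(x, y):
    """Standardized mean difference using the pooled sample SD."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5


def minimum_detectable_effect(n_per_group, alpha=0.05, power=0.8):
    """Approximate MDE in SD units for a two-sided, two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)   # critical value of the test
    z_power = z.inv_cdf(power)               # quantile matching the target power
    return (z_alpha + z_power) * (2.0 / n_per_group) ** 0.5


mde_25 = minimum_detectable_effect(25)     # about 0.79 SD units
mde_100 = minimum_detectable_effect(100)   # about 0.40 SD units
print(f"MDE at n=25: {mde_25:.2f}; at n=100: {mde_100:.2f}")
```

Quadrupling the sample size halves the MDE, which is why small studies can only detect large effects.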

🎯

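One-Sided vs Two-Sided Tests / TOST

If H₁ is simply "there is a difference", use a two-sided test (the default); choose a one-sided test only when there is a clear directional justification fixed in advance. Equivalence testing (TOST, two one-sided tests) is for showing that a difference is smaller than a margin Δ. It swaps the roles of the hypotheses: H₀ becomes "|difference| ≥ Δ" and H₁ becomes "|difference| < Δ", and equivalence is declared only when both one-sided tests reject. A non-significant standard test does not establish "no difference"; that claim requires TOST.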
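To make the swapped hypotheses concrete, here is a minimal TOST sketch working from summary statistics. It assumes two equal-sized groups with a common SD and uses a normal approximation in place of the t distribution; the function name `tost_from_summary` is hypothetical, and the summary values are illustrative (equivalence bounds of ±0.4 SD, which is ±0.12 in raw units when SD = 0.3).

```python
from statistics import NormalDist


def tost_from_summary(m1, m2, sd, n, low, high):
    """TOST for two equal-sized groups with a common SD (normal approximation).

    low and high are equivalence bounds on the raw mean difference.
    Returns (p_tost, p_lower, p_upper, p_difference).
    """
    z = NormalDist()
    diff = m1 - m2
    se = sd * (2.0 / n) ** 0.5
    # First one-sided null: diff <= low. Rejected when diff sits well above low.
    p_lower = 1.0 - z.cdf((diff - low) / se)
    # Second one-sided null: diff >= high. Rejected when diff sits well below high.
    p_upper = z.cdf((diff - high) / se)
    # Equivalence needs BOTH nulls rejected, so the larger p-value decides.
    p_tost = max(p_lower, p_upper)
    # Ordinary two-sided test of "no difference", for comparison.
    p_difference = 2.0 * (1.0 - z.cdf(abs(diff) / se))
    return p_tost, p_lower, p_upper, p_difference


p_tost, p_lower, p_upper, p_diff = tost_from_summary(
    m1=10.1, m2=10.0, sd=0.3, n=30, low=-0.12, high=0.12)
print(f"difference test p = {p_diff:.3f}, TOST p = {p_tost:.3f}")
```

With these illustrative numbers neither p-value falls below 0.05: the data fail to show a difference and also fail to establish equivalence, which is exactly the inconclusive case that a non-significant standard test alone would hide.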

II. What Determines Power

Power is jointly determined by four quantities; fix any three and the fourth can be solved for: (1) sample size n, (2) effect size Δ/σ, (3) significance level α, (4) power itself. RNA-seq designs typically target power ≥ 0.8 at α = 0.05. With very small samples (e.g. n = 3 per group and log₂FC = 0.5), power is usually below 0.3, so many true differences go undetected.

RNA-seq specific tools: RNASeqPower and ssizeRNA incorporate dispersion into the power calculation. scRNA-seq has more dropout and greater cell-to-cell variability; per-cell power is high, but per-cluster and cell-type comparisons often become unreliable because cluster sizes are highly uneven.
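The "fix three, solve for the fourth" idea can be sketched for a plain two-sample comparison. The code below uses a normal approximation to the test statistic, which is an assumption of this example; it ignores dispersion, so it is not a substitute for RNA-seq specific tools. Both function names are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist()


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z_crit = Z.inv_cdf(1.0 - alpha / 2.0)
    # Mean of the standardized test statistic when the true effect is d.
    shift = d * math.sqrt(n_per_group / 2.0)
    # Probability of landing in either rejection tail under H1.
    return (1.0 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)


def n_per_group_for_power(d, alpha=0.05, power=0.8):
    """Smallest n per group that reaches the target power."""
    n = 2
    while power_two_sample(d, n, alpha) < power:
        n += 1
    return n


power_n25 = power_two_sample(0.5, 25)    # roughly 0.42
n_needed = n_per_group_for_power(0.5)    # 63 under this approximation
print(f"power at n=25: {power_n25:.2f}; n for 80% power: {n_needed}")
```

A t-based calculation such as `pwr.t.test` returns a slightly larger n, because the t distribution has heavier tails than the normal.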

Interactive Simulation

α / β / Power Distribution Visualizer

Left panel: the sampling distributions under H₀ (blue) and H₁ (orange). Shaded areas: red = α (Type I), tan = β (Type II). Dragging the effect size d and n shows the two distributions pulling apart, β shrinking, and power rising. The stats row on the right shows the live power.

Default settings: effect size d = 0.5, n = 25 per group, α = 0.05.

X-axis: standardized t statistic; blue: H₀; orange: H₁.
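The shaded areas in such a plot can be computed directly. The sketch below treats both sampling distributions as unit-variance normal curves on the standardized axis, which is a simplification of the t distributions the axis label refers to; `error_areas` is a hypothetical helper name.

```python
from statistics import NormalDist


def error_areas(d, n_per_group, alpha=0.05):
    """Return (critical value, beta, power) for a two-sided test."""
    h0 = NormalDist(0.0, 1.0)                               # null curve, centred at 0
    h1 = NormalDist(d * (n_per_group / 2.0) ** 0.5, 1.0)    # alternative curve, shifted
    crit = h0.inv_cdf(1.0 - alpha / 2.0)
    # beta is the mass of the H1 curve that falls inside the acceptance region.
    beta = h1.cdf(crit) - h1.cdf(-crit)
    return crit, beta, 1.0 - beta


# Vary n at a fixed effect size to watch beta shrink and power rise.
for n in (10, 25, 50, 100):
    crit, beta, power = error_areas(0.5, n)
    print(f"n={n:>3}  crit={crit:.2f}  beta={beta:.3f}  power={power:.3f}")
```

Increasing either d or n moves the H₁ curve to the right while the critical value stays put, so less of its mass remains inside the acceptance region.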

Common Pitfalls

III. Five Common Misuses of the p-value

"p > 0.05 means H₀ is true": absence of evidence is not evidence of absence. To establish "no difference", use TOST.
p-hacking: trying many models, subgroups, or endpoints and reporting whichever comes out significant, which severely inflates the Type I error rate.
HARKing (Hypothesising After the Results are Known): writing the hypothesis after seeing the data and presenting it as if it had been specified in advance.
Optional stopping: checking p while collecting data and stopping as soon as it is significant. With repeated looks the actual Type I error rate can climb to 20–50%. Remedy: sequential designs (group-sequential, alpha-spending).
Statistical significance ≠ biological significance: across 20,000 genes, tiny effects reach p < 0.05 once n is large. Set an effect-size threshold (|log₂FC| > 1) or use lfcShrink.
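The optional-stopping problem is easy to reproduce. The following sketch simulates experiments in which H₀ is true and compares a single test at n = 100 with a design that peeks after every 10 observations and stops at the first significant result. It assumes a one-sample z-test with known σ = 1 for simplicity; the function name and the batch sizes are hypothetical choices.

```python
import random
from statistics import NormalDist


def false_positive_rate(looks, batch, alpha=0.05, n_sim=2000, seed=7):
    """Type I error rate under H0 when testing after every batch of data."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    false_positives = 0
    for _sim in range(n_sim):
        total, n = 0.0, 0
        for _look in range(looks):
            # Collect one more batch; H0 is true, so the true mean is 0.
            for _obs in range(batch):
                total += rng.gauss(0.0, 1.0)
            n += batch
            # Peek: stop as soon as |z| crosses the unadjusted threshold.
            if abs(total / n ** 0.5) > z_crit:
                false_positives += 1
                break
    return false_positives / n_sim


single_look = false_positive_rate(looks=1, batch=100)   # fixed-n design
ten_looks = false_positive_rate(looks=10, batch=10)     # peek every 10 samples
print(f"one look: {single_look:.3f}; ten looks: {ten_looks:.3f}")
```

Both designs use the same total sample size and the same nominal α, yet the peeking design rejects a true H₀ several times more often, which is the inflation that group-sequential and alpha-spending designs are meant to control.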

🚫

From the ASA 2016 statement: "P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone"; "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold." Best practice: CI + effect size + pre-registration + replication.

Code

Implementation: Power, TOST, and Simulation

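# --- R --- Power and TOST
# 1) Solve for any one of the four power quantities
library(pwr)
pwr.t.test(d=0.5, sig.level=0.05, power=0.8, type="two.sample")   # solve for n per group
pwr.t.test(n=25, d=0.5, sig.level=0.05, type="two.sample")        # solve for power

# 2) RNA-seq sample-size planning (power is left out, so rnapower() solves for it)
library(RNASeqPower)
rnapower(depth=20, n=3, cv=0.4, effect=2, alpha=0.05)

# 3) TOST (two one-sided tests); equivalence bounds are in Cohen's d units
# Note: recent TOSTER releases may flag TOSTtwo() as deprecated in favour of tsum_TOST()
library(TOSTER)
TOSTtwo(m1=10.1, m2=10.0, sd1=0.3, sd2=0.3, n1=30, n2=30,
        low_eqbound_d=-0.4, high_eqbound_d=0.4)

# 4) Check power by simulation
set.seed(1)                                         # reproducible draws
sim_power <- function(n, d, M=5000) {
  # Fraction of M simulated two-sample t-tests with p < 0.05
  ps <- replicate(M, t.test(rnorm(n, d), rnorm(n))$p.value)
  mean(ps < 0.05)
}
sim_power(25, 0.5)                                  # should be close to the pwr.t.test power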

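# --- Python --- Power and TOST
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(0)                                 # seeded for reproducibility

# 1) Solve for any one of the power quantities
ana = TTestIndPower()
ana.solve_power(effect_size=0.5, alpha=0.05, power=0.8)        # solve for n per group (nobs1)
ana.solve_power(effect_size=0.5, nobs1=25, alpha=0.05)         # solve for power

# 2) TOST; low/upp are on the raw scale of the data, not in Cohen's d units
# group1/group2 are placeholder samples with SD = 1 (illustrative, not real data)
group1 = rng.normal(0.1, 1, 30)
group2 = rng.normal(0.0, 1, 30)
p, lower_test, upper_test = ttost_ind(group1, group2, low=-0.4, upp=0.4, usevar="pooled")

# 3) Check power by simulation
def sim_power(n, d, M=5000):
    # Fraction of M simulated two-sample t-tests with p < 0.05
    ps = [stats.ttest_ind(rng.normal(d, 1, n),
                          rng.normal(0, 1, n)).pvalue
          for _ in range(M)]
    return np.mean(np.array(ps) < 0.05)
sim_power(25, 0.5)                                             # should be close to the solve_power result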

📝 Self-Check

1. Why is "p > 0.05" not evidence for H₀?

A. Because α = 0.05 is arbitrary

B. Because p-values are always biased

C. Power may be insufficient: "not detected" does not mean "not present". To establish equivalence, use TOST.

D. Because p-values are Bayesian quantities

2. Power increases with which of the following?

A. Only a larger α

B. Only a larger n

C. Only a larger effect size

D. Any of n, effect size, or α (a smaller variance σ also raises power)

3. What is the difference between a non-significant standard test and a significant equivalence test?

A. They are exactly the same

B. The former only says "no difference was detected"; the latter actively shows "difference < Δ"

C. The former is Bayesian, the latter frequentist

D. The latter cannot be used in clinical trials