STEP 11 / 13

樣本數與檢定力 (Power & Sample Size)

α、β、n、效應量——四個量綁在一起;固定其中三個,第四個就決定了。設計時算,絕不要 post-hoc 算 power

α, β, n, effect size — four quantities locked together; fix any three and the fourth is determined. Compute at design time; never compute post-hoc power.

α、β、n、效應量的四角關係

檢定力分析(power analysis)是統計設計的第一步,不是事後補的程序。它把四個量綁在一張紙上:顯著水準 α(第一型錯誤率)、第二型錯誤率 β(漏掉真效應的機率,power = 1 − β)、樣本數 n效應量 effect size(你想偵測的最小差異)。這四個量像四面牆——固定其中三面,第四面自動站直

研究者最常問「n 要多少?」其實是把 α=0.05、power=0.80、最小有意義效應量這三面牆豎好之後,把 n 解出來。如果 effect size 不講,那不是統計問題,是臨床/生物學問題——你必須先決定「多大的差別才值得偵測」。這就是 SESOI(Smallest Effect Size of Interest)的概念(Lakens 2018; Giner-Sorolla 2024)。

為什麼 post-hoc power(事後 power)是統計禁忌?Hoenig & Heisey (2001 The American Statistician) 證明:用觀察到的 effect size 反算 power,得到的「觀察 power」(observed power)和 p-value 是一對一的單調函數——p < 0.05 一定對應 observed power > 50%,p = 0.5 一定對應 power = 50%。它沒有提供 p-value 以外的任何資訊,卻常被審稿人要求補充。這份教學會教你怎麼禮貌而堅定地拒絕。

Power analysis is the first step of statistical design — not a salvage operation. It pins four quantities to a single sheet of paper: significance level α (type I error rate), type II error rate β (missing a real effect, where power = 1 − β), sample size n, and the effect size you want to detect. The four are walls of a box — fix three and the fourth stands up by itself.

Researchers usually ask "what n do I need?" — that is just solving for n after fixing α = 0.05, power = 0.80, and the minimum meaningful effect. If you cannot state the effect size, that is not a statistics question — it is a clinical or biological one: you have to decide what difference is worth detecting. This is the Smallest Effect Size of Interest (SESOI) idea (Lakens 2018; Giner-Sorolla 2024).

Why is post-hoc power taboo? Hoenig & Heisey (2001 The American Statistician) proved that "observed power" computed from the observed effect is a one-to-one monotone function of the p-value — p < 0.05 always corresponds to observed power > 50%, p = 0.5 always gives power = 50%. It carries zero information beyond the p-value, yet reviewers still request it. This chapter teaches you how to refuse politely and firmly.

💡
三句話帶走:(1) Power = 1 − β = 偵測真實效應的機率;(2) 四量(α, β, n, effect)固定 3 求 1;(3) Post-hoc power = p-value 的單射函數,不要算。設計階段做 a priori,事後做 sensitivity 或 SESOI,永遠不要做 post-hoc。 Take three things home: (1) Power = 1 − β = probability of detecting a true effect; (2) Fix any three of α, β, n, effect; (3) Post-hoc power is a one-to-one function of the p-value — do not compute it. Run a priori at design time, sensitivity afterwards, never post-hoc.

一、四個量的鎖鏈關係

把樣本的均值分布想成兩座山——H₀ 山位於 μ₀,H₁ 山位於 μ₀ + δ。我們在 H₀ 山的右側設立 critical value(α 的閾值);α 是 H₀ 山右側超過閾值的尾巴面積(誤殺率),β 是 H₁ 山左側落在閾值以下的尾巴面積(漏抓率)。Power = 1 − β = H₁ 山右側超過閾值的面積

當 n 增大 → 兩座山變瘦 → 重疊變少 → β 變小 → power 變大。當 effect size 增大 → 兩座山被推開 → 重疊變少 → power 變大。當 α 放寬 → 閾值左移 → β 變小但代價是更多誤殺。這就是四角的物理直覺

Picture the sampling distribution of the mean as two hills — the H₀ hill at μ₀ and the H₁ hill at μ₀ + δ. The critical value sits on the right of H₀ at the α threshold. α is the right tail of H₀ above the threshold (false alarms); β is the left tail of H₁ below the threshold (missed detections). Power = 1 − β = the right tail of H₁ above the threshold.

Increase n → both hills narrow → less overlap → β shrinks → power rises. Increase the effect → the hills move apart → less overlap → power rises. Loosen α → threshold moves left → β shrinks but at the cost of more false alarms. That is the physical intuition behind the four-way lock.

🎯

α — Type I error

誤殺 H₀ 的機率。生物醫學慣用 α = 0.05(單次檢定)。多重檢定要分配 α(見 Step 12 FDR)。GWAS 用 α = 5×10⁻⁸(Bonferroni at ~10⁶ SNPs;Pe'er 2008 Genet Epidemiol)。

Probability of a false alarm. Biomedical convention is α = 0.05 per test. Multiple testing splits α (see Step 12, FDR). GWAS uses α = 5×10⁻⁸ (Bonferroni over ~10⁶ SNPs; Pe'er 2008 Genet Epidemiol).

😴

β — Type II error

漏抓真效應的機率。Power = 1 − β 是「偵測力」。學界慣用 power = 0.80(β = 0.20,1/5 機率漏抓)。臨床三期 RCT 常要求 0.90。Cohen (1988) 提出 α : β = 1 : 4 的成本比例為慣例起源。

Probability of missing a real effect. Power = 1 − β. The standard target is 0.80 (β = 0.20, 1-in-5 miss rate); phase-III RCTs often demand 0.90. Cohen (1988) suggested a 1:4 cost ratio between α and β, the origin of the convention.

👥

n — sample size

每組 n(two-sample)或總 n(regression)。近似公式:兩樣本 t 檢定,每組 n ≈ 2·((z1−α/2 + z1−β)/d)²。當 d = 0.5、α = 0.05、power = 0.80 → 每組 n ≈ 64。Cohen 表 2.4.1 給出大量場景的 n 對照表。

n per group (two-sample) or total n (regression). Approx formula: two-sample t, n per group ≈ 2·((z1−α/2 + z1−β)/d)². For d = 0.5, α = 0.05, power = 0.80 → n ≈ 64 per group. Cohen's Table 2.4.1 lists n for many scenarios.

📏

Effect size

偵測對象的「真實差異」標準化後的尺度。連續資料用 Cohen's d(小 0.2 / 中 0.5 / 大 0.8;Cohen 1988)、ANOVA 用 η² 或 f、相關用 r、二分類用 Cohen's h 或 OR、存活用 HR絕不要用「pilot 觀察到的 effect」算正式 n——pilot 是 noisy 的,會嚴重低估真實 n(Kraemer 2006 Arch Gen Psychiatry)。

The standardised true difference you intend to detect. Continuous → Cohen's d (small 0.2 / medium 0.5 / large 0.8; Cohen 1988); ANOVA → η² or f; correlation → r; binary → Cohen's h or OR; survival → HR. Never plug a pilot's observed effect into the formal n calculation — pilots are noisy and systematically underestimate the required n (Kraemer 2006 Arch Gen Psychiatry).

兩樣本 t(同變異):每組 n ≈ 2·((z1−α/2 + z1−β) / d)²  ·  單樣本比例 z:n ≈ ((z1−α/2 + z1−β) / h)²  ·  存活(Schoenfeld 1981):所需事件數 d = ((z1−α/2 + z1−β)² · (1+k)²) / (k · (log HR)²),k = 配對比 ⌝ z 值用 normal 近似;小樣本應改用 t 分布的 non-central t(更精準),這正是 G*Power 與 R 的 pwr 套件做的事。 Two-sample t (equal variance): n per group ≈ 2·((z1−α/2 + z1−β) / d)²  ·  One-sample z (proportion): n ≈ ((z1−α/2 + z1−β) / h)²  ·  Survival (Schoenfeld 1981): required events d = ((z1−α/2 + z1−β)² · (1+k)²) / (k · (log HR)²), k = allocation ratio ⌝ These are normal approximations; small samples need non-central t — which is exactly what G*Power and R's pwr package use under the hood.

Power 計算器

選擇要求解的量(α、n、effect、power),剩下三個用滑桿給定。下方曲線顯示 power 隨 n 的成長(在當前 d 與 α 下)。注意 power 不是線性——當 n 從 10 → 30 進步神速,從 100 → 300 邊際遞減。這正是「n=50–100 是甜點」的數學原因。

Pick which quantity to solve for (α, n, effect, power); set the other three with sliders. The curve below traces power vs n at the current d and α. Notice power is not linear — n = 10 → 30 buys a lot; n = 100 → 300 has steeply diminishing returns. That is the math behind the rule of thumb "n = 50–100 is the sweet spot".

藍線 = power 隨 n 變化(當前 d, α)· 紅虛線 = 0.80 慣例 · 綠點 = 目前滑桿位置Blue = power vs n at current d, α · Red dashed = 0.80 convention · Green dot = current slider position

二、效應量家族與工具

效應量家族

連續資料Cohen's d = (μ₁−μ₂)/σ,配對 t 用 dzHedges' g = d 的小樣本校正(n < 20 推薦;Hedges 1981)。
ANOVA / 迴歸η²(解釋變異比例)、Cohen's f = √(η²/(1−η²))、(多元迴歸;小 0.02 / 中 0.15 / 大 0.35)。
相關r(Pearson;小 0.1 / 中 0.3 / 大 0.5)。
二分類Cohen's h = 2(arcsin√p₁ − arcsin√p₂);OR(小 1.5 / 中 2 / 大 3);RR 也可。
存活HR(Hazard Ratio)+ 預期事件數。

Continuous: Cohen's d = (μ₁−μ₂)/σ; paired uses dz; Hedges' g = small-n corrected d (recommended for n < 20; Hedges 1981).
ANOVA / regression: η², Cohen's f = √(η²/(1−η²)), for multiple regression (small 0.02 / medium 0.15 / large 0.35).
Correlation: r (Pearson; 0.1 / 0.3 / 0.5).
Binary: Cohen's h = 2(arcsin√p₁ − arcsin√p₂); OR (1.5 / 2 / 3); RR also fine.
Survival: HR + expected events.

工具地圖

GUIG*Power 3(Faul 2007 Behav Res Methods;Faul 2009 BRM 41:1149;引用 80,000+ 次;跨 macOS/Win/Linux 免費)。視覺化中央分布與非中央分布的重疊。

Rpwr(Champely)涵蓋 t / ANOVA / r / χ² / 比例 / GLM f²;longpower(Donohue)做 longitudinal LMM;gsDesign(Anderson)做臨床試驗群體設計;survival::nSurvTrialSize

Pythonstatsmodels.stats.power(TTestIndPower、FTestAnovaPower、NormalIndPower、GofChisquarePower);scipy.stats 無內建 power。

商用PASS(NCSS,涵蓋 1000+ 場景,FDA 認可)、nQueryStata power 指令

GUI: G*Power 3 (Faul 2007 Behav Res Methods; Faul 2009 BRM 41:1149; 80,000+ citations; free on macOS/Win/Linux). Visualises central + noncentral distributions.

R: pwr (Champely) covers t / ANOVA / r / χ² / proportions / GLM f²; longpower (Donohue) for longitudinal LMM; gsDesign (Anderson) for group-sequential trials; survival::nSurv, TrialSize.

Python: statsmodels.stats.power (TTestIndPower, FTestAnovaPower, NormalIndPower, GofChisquarePower); scipy.stats has no built-in power.

Commercial: PASS (NCSS, 1000+ scenarios, FDA-trusted), nQuery, Stata power.

⚠️
SESOI(Smallest Effect Size of Interest):Lakens (2018 Adv Methods Pract Psychol Sci) 與 Giner-Sorolla (2024) 推動:在資料收集,用臨床/科學意義而非統計顯著性決定「多大差別才值得偵測」。例:降血壓藥要超過安慰劑 SBP −3 mmHg 才算有臨床意義(NICE guideline)。SESOI 把效應量從「猜」變成「論證」。 SESOI (Smallest Effect Size of Interest): Lakens (2018 Adv Methods Pract Psychol Sci) and Giner-Sorolla (2024) push this: before data collection, decide on a meaningful difference based on clinical/scientific grounds, not statistical convenience. Example: a new antihypertensive must beat placebo by ≥ 3 mmHg SBP to be clinically relevant (NICE). SESOI turns "guess the effect" into "justify the effect".

特殊設計

🏘️

Cluster RCT

當隨機化單位是「群(學校、診所、村莊)」而非個人時,群內個體不獨立。設計效應 DE = 1 + (m − 1)·ICC,m = 平均群大小,ICC = 群內相關係數(intraclass correlation)。有效 n = ntotal / DE。例:m = 30、ICC = 0.05 → DE = 2.45 → 算出的個人 n 要 × 2.45。Hayes & Moulton 2017 "Cluster Randomised Trials" 2nd ed 是聖經。

When the unit of randomisation is a cluster (school, clinic, village), members within a cluster are not independent. Design effect DE = 1 + (m − 1)·ICC, m = average cluster size, ICC = intraclass correlation. Effective n = ntotal / DE. Example: m = 30, ICC = 0.05 → DE = 2.45 → multiply individual-level n by 2.45. Hayes & Moulton 2017 "Cluster Randomised Trials" 2nd ed is the bible.

⏱️

存活分析

存活的 power 取決於事件數,不是樣本數(Schoenfeld 1981 Biometrics)。先算所需事件 d = ((z1−α/2+z1−β)·(1+k))² / (k·(ln HR)²);再用預期事件率反推 n 與追蹤期。R: survival::nSurv()gsDesign::nSurv;Python: lifelines 沒有直接 power 函式,但可手算。典型錯誤:「我收了 200 人,每組 100」但事件只發生 30 次——power 由 30 決定,不是 200。

Survival power is driven by event count, not sample size (Schoenfeld 1981 Biometrics). First compute required events d = ((z1−α/2+z1−β)·(1+k))² / (k·(ln HR)²); then back out n and follow-up using the expected event rate. R: survival::nSurv(), gsDesign::nSurv. Python's lifelines has no built-in power function — compute by hand. Classic mistake: "I recruited 200 (100 per arm)" but only 30 events accrue — power is driven by 30, not 200.

⚖️

不等配對

2:1 allocation 比 1:1 損失 power——以 2:1 為例,效率只有 88%(n_total 要多 12%)。為何還用?實務上:(1) 倫理(讓更多人接受新療法);(2) 預算(控制組已知資料便宜);(3) 不平衡組變異(heteroscedastic)。算式:pwr.t2n.test(n1, n2, d, sig.level)

2:1 allocation loses power vs 1:1 — at 2:1 you keep only 88% efficiency (need 12% more total n). Why use it? (1) ethics (more patients on the new arm); (2) cost (existing controls are cheap); (3) unequal variance. Use pwr.t2n.test(n1, n2, d, sig.level).

🎯

多 endpoint α

RCT 預設 ≥ 2 個 primary endpoint:α 要分割(Bonferroni: α/k)或 hierarchical testing(Maurer 2011)。Power 算每個 endpoint 用分割後的 α。FDA / EMA 對 confirmatory trial 要求嚴格事先 SAP(statistical analysis plan)。

RCTs with ≥ 2 primary endpoints must split α (Bonferroni α/k) or use hierarchical / gatekeeping testing (Maurer 2011). Power each endpoint at the split α. FDA / EMA require a pre-specified SAP for confirmatory trials.

Post-hoc power 假象

拖動觀察 p-value 滑桿。下方顯示用「觀察到的 effect size」反算的 observed power。畫面上的曲線就是 Hoenig & Heisey (2001) 的關鍵發現:observed power 完全由 p-value 決定——p < 0.05 必然有 observed power > 50%,p = 0.5 必然有 observed power = 50%。它沒有提供任何新資訊。當審稿人要求「補 post-hoc power」,正確回應是「請改要 confidence interval 或 sensitivity analysis」。

Slide the observed p-value. The "observed power" computed from the observed effect is plotted. The curve below is exactly Hoenig & Heisey's (2001) key finding: observed power is completely determined by the p-value — p < 0.05 always implies observed power > 50%, p = 0.5 always gives 50%. It carries no new information. When a reviewer demands "post-hoc power", the right answer is: "let me give you a confidence interval or a sensitivity analysis instead".

紫線 = observed power 隨 p 變化 · 紅點 = p=0.05 對應 power≈50%(Hoenig 2001)Purple = observed power vs p · Red dot = p=0.05 ↔ power ≈ 50% (Hoenig 2001)

關鍵直覺:p-value 衡量「資料和 H₀ 不一致的程度」;observed power 衡量「假設真實 effect 等於觀察 effect,會有多大機率得到 p < α」。兩個算法都用同一個觀察 effect — 所以兩個量必然等價。Hoenig 把這個等價關係寫成單一公式:observed power = Φ(zobs − z1−α/2),zobs = Φ⁻¹(1 − p/2)。 Key intuition: the p-value measures "how inconsistent the data are with H₀"; observed power measures "if the true effect were the observed effect, what is the chance of p < α?". Both use the same observed effect — so they must be equivalent. Hoenig writes the relation as observed power = Φ(zobs − z1−α/2) where zobs = Φ⁻¹(1 − p/2).

三、四種 Power 分析的決策樹

🌳 Power 分析時機

Q1:
還沒收資料→ 是 → A priori(事前):固定 α、power、effect → 求 n。寫 grant、IRB、預先註冊(pre-registration)必備。
Q2:
n 已被限制(資金、時間、可獲樣本)?→ 是 → Sensitivity(敏感度):固定 α、power、n → 求最小可偵測效應 MDE。回答「我這個 n 能偵測到的最小 effect 是多少?」
Q3:
n 與 effect 都固定,但你想知道用什麼 α 才能保持 power→ 是 → Criterion:固定 n、power、effect → 求 α。少用,但 GWAS 與 ML 預測閾值會用。
Q4:
已收完資料,想算「我的 power 是多少」?→ 是 → 停!絕不要做 post-hoc。改報 95% CI(資料和 0 的距離 + 不確定度),或重新做 sensitivity(如果未來會 replicate)。Hoenig 2001 是這個禁忌的經典證明。
Q1:
You haven't collected data yet? → Yes → A priori: fix α, power, effect → solve n. Mandatory for grants, IRB, pre-registration.
Q2:
n is capped (budget, time, available samples)? → Yes → Sensitivity: fix α, power, n → solve minimum detectable effect (MDE). Answers "with this n, what is the smallest effect I can detect?"
Q3:
n and effect fixed, but you want to know what α keeps power high? → Yes → Criterion: fix n, power, effect → solve α. Rare; used for GWAS or ML-prediction thresholds.
Q4:
You've finished data collection and want to know "what was my power?" → Yes → Stop. Never run post-hoc. Report a 95% CI (distance from 0 + uncertainty) instead, or run a sensitivity for a future replication. Hoenig 2001 is the classic proof of this taboo.

四、工具比較

工具 平台 涵蓋面 優點 限制
G*Power 3 (Faul 2007)(Faul 2007)Win / macOS / Linux GUIt / F / χ² / z / 相關,多種 ANOVA / GLMt / F / χ² / z / correlation, many ANOVA / GLM免費、GUI 直觀、有圖示中央 / 非中央分布Free, intuitive GUI, plots central / noncentral無 survival、無 cluster、不能批次No survival, no cluster, no batch
R pwr (Champely)(Champely)R packaget、ANOVA (1-way)、r、χ²、比例、f²t, 1-way ANOVA, r, χ², proportions, f²script 可重現、可組合 simulationScriptable, reproducible, simulation-friendly無 survival(用 longpowergsDesignNo survival (use longpower, gsDesign)
statsmodels (Python)(Python)PythonTTestIndPower / FTestAnovaPower / NormalIndPower / GofChisquarePowerTTestIndPower / FTestAnovaPower / NormalIndPower / GofChisquarePower融入 ML pipeline,solve_power 可求任一量Fits ML pipeline, solve_power back-solves any quantity無 survival、文檔較少No survival, sparser docs
PASS (NCSS)(NCSS)Win GUI1000+ 場景:survival, cluster, equivalence, non-inferiority1000+ scenarios: survival, cluster, equivalence, non-inferiorityFDA 認可、報告品質高、附 PDF referencesFDA-trusted, publication-ready, cites references in PDF商用付費、GUI、不容易自動化Paid, GUI-only, hard to script
Stata powerStata完整:survival、cluster、equivalence、cross-overComprehensive: survival, cluster, equivalence, cross-over語法統一、文件詳盡、官方支援Unified syntax, thorough docs, official support付費;script 與 R/Python 整合差Paid; weaker integration with R/Python pipelines
💡
實務建議:學術論文:G*Power 做初算 + R/Python 重現以利 reproducibility。臨床三期 RCT:PASS 或 Stata(FDA 偏好)。Survival/cluster:gsDesignlongpowernSurv關鍵:把 power 計算的所有參數寫進 protocol / pre-registration / Methods,這是 ARRIVE 2.0 與 CONSORT 都要求的。 Practical recipe: Academic papers: G*Power for the first pass + R/Python to reproduce. Phase-III RCT: PASS or Stata (FDA-preferred). Survival/cluster: gsDesign, longpower, nSurv. Critical: write all power parameters into the protocol / pre-registration / Methods — required by ARRIVE 2.0 and CONSORT.

五、實作範例

# ============= R: pwr + survival/longpower =============
library(pwr)
library(longpower)
library(gsDesign)

# --- 1. Two-sample t-test: a priori (find n) ---
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05,
            type = "two.sample", alternative = "two.sided")
#  n per group ≈ 64

# --- 2. Sensitivity (find min detectable d given n) ---
pwr.t.test(n = 30, power = 0.80, sig.level = 0.05,
            type = "two.sample")
#  d ≈ 0.74  (Note: NOT post-hoc; this is sensitivity)

# --- 3. ANOVA (1-way, k groups, Cohen's f) ---
pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.80)

# --- 4. Correlation r ---
pwr.r.test(r = 0.3, power = 0.80, sig.level = 0.05)

# --- 5. Two-proportion (Cohen's h) ---
h <- ES.h(0.40, 0.30)
pwr.2p.test(h = h, power = 0.80, sig.level = 0.05)

# --- 6. Unequal allocation (2:1) ---
pwr.t2n.test(n1 = 60, n2 = 30, d = 0.5, sig.level = 0.05)

# --- 7. Survival: events required (Schoenfeld 1981) ---
# HR=0.7, two-sided α=0.05, power=0.80, equal allocation
gsDesign::nSurv(lambdaC = 0.05, hr = 0.7, eta = 0.02,
                 T = 36, minfup = 12, alpha = 0.025, beta = 0.20)

# --- 8. Longitudinal LMM ---
longpower::lmmpower(delta = 0.1, t = seq(0, 1.5, 0.25),
                    sig2.i = 55, sig2.s = 24, sig2.e = 10,
                    cor.s.i = 0.5, power = 0.80)

# --- 9. Cluster RCT: inflate by Design Effect ---
ICC <- 0.05; m <- 30
DE  <- 1 + (m - 1) * ICC                # = 2.45
n_indiv <- ceiling(pwr.t.test(d=0.3, power=0.80)$n)
n_cluster_total <- ceiling(n_indiv * DE)
cat("Individuals per arm:", n_cluster_total, "\n")
import numpy as np
from statsmodels.stats.power import (
    TTestIndPower, FTestAnovaPower, NormalIndPower, GofChisquarePower
)
from statsmodels.stats.proportion import proportion_effectsize
from scipy.stats import norm

# --- 1. Two-sample t: a priori (find n) ---
ap = TTestIndPower()
n = ap.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                    alternative='two-sided')
print(f"n per group ≈ {n:.1f}")   # ≈ 63.8

# --- 2. Sensitivity (find min detectable d at n=30) ---
d_min = ap.solve_power(nobs1=30, alpha=0.05, power=0.80)
print(f"Min detectable d ≈ {d_min:.3f}")

# --- 3. ANOVA (1-way) ---
fa = FTestAnovaPower()
n_total = fa.solve_power(effect_size=0.25, k_groups=4,
                          alpha=0.05, power=0.80)
print(f"Total n ≈ {n_total:.1f}")

# --- 4. Two-proportion (using Cohen's h) ---
h = proportion_effectsize(0.40, 0.30)
ni = NormalIndPower().solve_power(effect_size=h, alpha=0.05,
                                       power=0.80)
print(f"n per group ≈ {ni:.1f}")

# --- 5. Survival (Schoenfeld 1981, by hand) ---
def schoenfeld_events(HR, alpha=0.05, power=0.80, k=1):
    """k = allocation ratio (n_treat/n_ctrl). Returns required events."""
    za = norm.ppf(1 - alpha/2)
    zb = norm.ppf(power)
    return ((za + zb) ** 2) * (1 + k) ** 2 / (k * (np.log(HR)) ** 2)
events = schoenfeld_events(HR=0.7)
print(f"Required events: {events:.0f}")   # ≈ 247

# --- 6. Cluster RCT design effect ---
ICC, m = 0.05, 30
DE = 1 + (m - 1) * ICC
n_indiv = ap.solve_power(effect_size=0.3, alpha=0.05, power=0.80)
n_cluster = np.ceil(n_indiv * DE)
print(f"DE = {DE:.2f}, individuals per arm = {n_cluster:.0f}")

# --- 7. NEVER do this: post-hoc power ---
# observed_d = (mean1 - mean2) / pooled_sd  # NO!
# ap.power(effect_size=observed_d, nobs1=n, alpha=0.05)  # NO!
# Report a 95% CI instead — that IS the right "post-hoc" answer.

六、常見錯誤

Post-hoc power

用觀察 effect 反算 power——它和 p-value 是同一個東西的兩種寫法(Hoenig 2001)。審稿人要求時,回信:「Post-hoc power is a deterministic function of the p-value (Hoenig & Heisey 2001 Am Stat 55:19, DOI:10.1198/000313001300339897). We instead report the 95% CI, which directly conveys the precision of our estimate.」

Computing power from the observed effect — it's a deterministic function of the p-value (Hoenig 2001). When reviewers ask, reply: "Post-hoc power is a deterministic function of the p-value (Hoenig & Heisey 2001 Am Stat 55:19, DOI:10.1198/000313001300339897). We instead report the 95% CI, which directly conveys the precision of our estimate."

Pilot effect 算 n

Pilot 樣本太小(典型 n ≤ 30)→ 觀察 effect 變異大且偏向過大("winner's curse")→ 算出來的 n 嚴重低估。Kraemer 2006 Arch Gen Psychiatry:使用 SESOI、文獻 meta-analysis、或 pilot d − 1 SE 作為 effect input。

Pilot n is small (typically ≤ 30) → observed effect is noisy and upwardly biased ("winner's curse") → formal n is severely underestimated. Kraemer 2006 Arch Gen Psychiatry: use SESOI, a meta-analytic effect, or pilot d − 1 SE as input.

忽略多重檢定

RCT 預設 5 個 primary endpoint 全部用 α = 0.05 → family-wise error 大約 0.23。算 n 時必須用調整後 α(Bonferroni α/k 或 hierarchical α-spending;見 Step 12)。GWAS、proteomics 也適用。

An RCT with 5 primary endpoints all at α = 0.05 has family-wise error ≈ 0.23. n must be powered at the adjusted α (Bonferroni α/k or hierarchical α-spending; see Step 12). Same for GWAS and proteomics.

忽略 ICC

群集 RCT 用個人 n 算 power → 嚴重高估。記得乘以 DE = 1 + (m − 1)·ICC。Hayes & Moulton 2017 警告:典型醫療 ICC = 0.01–0.10;忽略 ICC 是 cluster trial 失敗的頭號原因。

Computing cluster-RCT power on individual n → severe overestimate. Multiply by DE = 1 + (m − 1)·ICC. Hayes & Moulton 2017: typical health ICC = 0.01–0.10; ignoring ICC is the top cause of failed cluster trials.

Survival 用 n 不用事件

log-rank 的 power 取決於事件數,不是樣本數。「收 500 人」無關緊要,重點是「事件 ≥ 多少」。Schoenfeld 1981 公式直接算事件 d;再用預期事件率反推 n 與追蹤期。R: gsDesign::nSurv

log-rank power depends on events, not n. "500 enrolled" is irrelevant — only "events accrued" matters. Schoenfeld's 1981 formula gives events d directly; back out n and follow-up from the expected event rate. R: gsDesign::nSurv.

效應量太樂觀

研究者最常為了得到「合理」的 n,把 d 從 0.3 改成 0.8——這叫 effect inflation。論文 Methods 必須寫明 effect size 的依據(meta-analysis、SESOI、臨床指引)。Albers & Lakens 2018 PLOS ONE:超過 70% 的 power analyses 用無根據的 effect。

The most common shortcut: inflate d from 0.3 to 0.8 just to make n "feasible" — effect inflation. Methods sections must justify the effect (meta-analysis, SESOI, clinical guideline). Albers & Lakens 2018 PLOS ONE: > 70% of published power analyses use unjustified effects.

紅旗:論文寫 "post-hoc power = X" 這句話幾乎一定是反模式。正確替代:(1) 95% CI 告訴你估計精度;(2) sensitivity analysis 告訴你「我的 n 可偵測到多大的 effect」;(3) 重做 a priori 用 SESOI(如果是準備下一個研究)。Levine & Ensom 2001 Pharmacotherapy、Lenth 2007 Am Stat 都加強這個禁忌。 That sentence is almost always an anti-pattern. Correct alternatives: (1) a 95% CI tells you the precision; (2) sensitivity analysis tells you "what effect could I detect with this n?"; (3) re-run a priori with SESOI for the next study. Levine & Ensom 2001 Pharmacotherapy and Lenth 2007 Am Stat both reinforce the taboo.

七、實戰場景

💊

情境 A:降血壓藥 RCT

目標:偵測新藥 vs 安慰劑 SBP 差 ≥ 5 mmHg。已知 σ ≈ 12 mmHg → d = 5/12 = 0.42。α = 0.025(雙側 0.05 + 雙臂 0.025)、power = 0.90。

Rpwr.t.test(d=0.42, power=0.9, sig.level=0.025) → n ≈ 152/arm。加 15% 失訪 → 175/arm。Pre-specified in protocol, SAP fixed.

Goal: detect drug vs placebo SBP gap ≥ 5 mmHg. Known σ ≈ 12 → d = 5/12 = 0.42. α = 0.025, power = 0.90.

R: pwr.t.test(d=0.42, power=0.9, sig.level=0.025) → n ≈ 152/arm. Add 15% dropout → 175/arm. Pre-specified in protocol, SAP locked.

🧪

情境 B:劑量反應實驗

目標:偵測 5 個劑量間 IC50 差異(One-way ANOVA)。文獻 Cohen's f = 0.40(大效應)、α = 0.05、power = 0.80。

Rpwr.anova.test(k=5, f=0.40, power=0.8) → n ≈ 16/group → 總 80。技術重複 ≥ 3 wells/group,明確分層分析(Step 13 mixed model)。

Goal: detect IC50 differences across 5 doses (1-way ANOVA). Literature Cohen's f = 0.40 (large), α = 0.05, power = 0.80.

R: pwr.anova.test(k=5, f=0.40, power=0.8) → n ≈ 16/group → total 80. Technical reps ≥ 3 wells/group; analyse hierarchically (Step 13 mixed model).

🧬

情境 C:GWAS

目標:偵測 OR = 1.20 的 SNP(MAF = 0.30)。αGWAS = 5×10⁻⁸(Pe'er 2008)、power = 0.80。

Genetic Power CalculatorQuanto:≈ 12,000 cases + 12,000 controls。為何要這麼大?因為 α 嚴格到 5×10⁻⁸ 對應 z1−α/2 ≈ 5.45(vs α=0.05 的 1.96)——光是 (5.45/1.96)² ≈ 7.7 倍的 n。

Goal: detect SNP at OR = 1.20 (MAF = 0.30). αGWAS = 5×10⁻⁸ (Pe'er 2008), power = 0.80.

Genetic Power Calculator or Quanto: ≈ 12,000 cases + 12,000 controls. Why so big? α = 5×10⁻⁸ ↔ z1−α/2 ≈ 5.45 (vs 1.96 at α = 0.05) — that alone is (5.45/1.96)² ≈ 7.7× the n.

📝 自我檢測

1. 審稿人寫信:「請補上 post-hoc power 分析。」最佳回應是?

1. A reviewer asks: "Please add a post-hoc power analysis." Best response?

A. 立刻算 observed power 並補進論文A. Compute observed power immediately and add it
B. 把 power 算到 99% 給審稿人看B. Compute power at 99% to look strong
C. 引用 Hoenig & Heisey 2001,改報 95% CI 與 sensitivity 分析C. Cite Hoenig & Heisey 2001 and report 95% CI + sensitivity analysis instead
D. 拒絕修改D. Refuse without explanation

2. 你規劃一個 cluster RCT(學校為單位),平均每校 m = 25 名學生,預期 ICC = 0.04。個人層級算出每組需 100 人——實際每組要多少?

2. You plan a cluster RCT (schools as clusters), m = 25 students/school, ICC = 0.04. Individual-level power says 100 per arm — what do you actually need per arm?

A. 100(ICC 可以忽略)A. 100 (ICC negligible)
B. 125B. 125
C. 196(DE = 1 + (25−1)×0.04 = 1.96 → 100 × 1.96)C. 196 (DE = 1 + (25−1)×0.04 = 1.96 → 100 × 1.96)
D. 50(cluster 提升 power)D. 50 (clustering boosts power)

3. 你被預算限制 n = 40/組。最合適的 power 分析類型是?

3. Your budget caps n at 40/group. Which power-analysis type fits?

A. A priori:固定 effect 求 nA. A priori: fix effect, solve n
B. Post-hoc:等收完資料再算 powerB. Post-hoc: compute power after data collection
C. Criterion:求 αC. Criterion: solve α
D. Sensitivity:固定 n、α、power → 求最小可偵測 effect (MDE)D. Sensitivity: fix n, α, power → solve minimum detectable effect (MDE)

4. 存活分析(log-rank)power 主要取決於下列哪一項?

4. Survival (log-rank) power mainly depends on which of the following?

A. 收案人數(總 n)A. Enrolled n
B. 事件數(events / deaths)——Schoenfeld 1981B. Event count (events / deaths) — Schoenfeld 1981
C. 追蹤天數C. Follow-up days
D. SD of survival timeD. SD of survival time