Step 11: Power & Sample Size — Biostatistics Tutorial

總覽

α、β、n、效應量的四角關係

檢定力分析（power analysis）是統計設計的第一步，不是事後補的程序。它把四個量綁在一張紙上：顯著水準 α（第一型錯誤率）、第二型錯誤率 β（漏掉真效應的機率，power = 1 − β）、樣本數 n、效應量 effect size（你想偵測的最小差異）。這四個量像四面牆——固定其中三面，第四面自動站直。

研究者最常問「n 要多少？」其實是把 α=0.05、power=0.80、最小有意義效應量這三面牆豎好之後，把 n 解出來。如果 effect size 不講，那不是統計問題，是臨床／生物學問題——你必須先決定「多大的差別才值得偵測」。這就是 SESOI（Smallest Effect Size of Interest）的概念（Lakens 2018; Giner-Sorolla 2024）。

為什麼 post-hoc power（事後 power）是統計禁忌？Hoenig & Heisey (2001 The American Statistician) 證明：用觀察到的 effect size 反算 power，得到的「觀察 power」（observed power）和 p-value 是一對一的單調函數——p < 0.05 一定對應 observed power > 50%，p = 0.5 一定對應 power = 50%。它沒有提供 p-value 以外的任何資訊，卻常被審稿人要求補充。這份教學會教你怎麼禮貌而堅定地拒絕。

Power analysis is the first step of statistical design — not a salvage operation. It pins four quantities to a single sheet of paper: significance level α (type I error rate), type II error rate β (missing a real effect, where power = 1 − β), sample size n, and the effect size you want to detect. The four are walls of a box — fix three and the fourth stands up by itself.

Researchers usually ask "what n do I need?" — that is just solving for n after fixing α = 0.05, power = 0.80, and the minimum meaningful effect. If you cannot state the effect size, that is not a statistics question — it is a clinical or biological one: you have to decide what difference is worth detecting. This is the Smallest Effect Size of Interest (SESOI) idea (Lakens 2018; Giner-Sorolla 2024).

Why is post-hoc power taboo? Hoenig & Heisey (2001 The American Statistician) proved that "observed power" computed from the observed effect is a one-to-one monotone function of the p-value — p < 0.05 always corresponds to observed power > 50%, p = 0.5 always gives power = 50%. It carries zero information beyond the p-value, yet reviewers still request it. This chapter teaches you how to refuse politely and firmly.

💡

三句話帶走：(1) Power = 1 − β = 偵測真實效應的機率；(2) 四量（α, β, n, effect）固定 3 求 1；(3) Post-hoc power = p-value 的單射函數，不要算。設計階段做 a priori，事後做 sensitivity 或 SESOI，永遠不要做 post-hoc。 Take three things home: (1) Power = 1 − β = probability of detecting a true effect; (2) Fix any three of α, β, n, effect; (3) Post-hoc power is a one-to-one function of the p-value — do not compute it. Run a priori at design time, sensitivity afterwards, never post-hoc.

核心概念

一、四個量的鎖鏈關係

把樣本的均值分布想成兩座山——H₀ 山位於 μ₀，H₁ 山位於 μ₀ + δ。我們在 H₀ 山的右側設立 critical value（α 的閾值）；α 是 H₀ 山右側超過閾值的尾巴面積（誤殺率），β 是 H₁ 山左側落在閾值以下的尾巴面積（漏抓率）。Power = 1 − β = H₁ 山右側超過閾值的面積。

當 n 增大 → 兩座山變瘦 → 重疊變少 → β 變小 → power 變大。當 effect size 增大 → 兩座山被推開 → 重疊變少 → power 變大。當 α 放寬 → 閾值左移 → β 變小但代價是更多誤殺。這就是四角的物理直覺。

Picture the sampling distribution of the mean as two hills — the H₀ hill at μ₀ and the H₁ hill at μ₀ + δ. The critical value sits on the right of H₀ at the α threshold. α is the right tail of H₀ above the threshold (false alarms); β is the left tail of H₁ below the threshold (missed detections). Power = 1 − β = the right tail of H₁ above the threshold.

Increase n → both hills narrow → less overlap → β shrinks → power rises. Increase the effect → the hills move apart → less overlap → power rises. Loosen α → threshold moves left → β shrinks but at the cost of more false alarms. That is the physical intuition behind the four-way lock.

🎯

α — Type I error

誤殺 H₀ 的機率。生物醫學慣用 α = 0.05（單次檢定）。多重檢定要分配 α（見 Step 12 FDR）。GWAS 用 α = 5×10⁻⁸（Bonferroni at ~10⁶ SNPs；Pe'er 2008 Genet Epidemiol）。

Probability of a false alarm. Biomedical convention is α = 0.05 per test. Multiple testing splits α (see Step 12, FDR). GWAS uses α = 5×10⁻⁸ (Bonferroni over ~10⁶ SNPs; Pe'er 2008 Genet Epidemiol).

😴

β — Type II error

漏抓真效應的機率。Power = 1 − β 是「偵測力」。學界慣用 power = 0.80（β = 0.20，1/5 機率漏抓）。臨床三期 RCT 常要求 0.90。Cohen (1988) 提出 α : β = 1 : 4 的成本比例為慣例起源。

Probability of missing a real effect. Power = 1 − β. The standard target is 0.80 (β = 0.20, 1-in-5 miss rate); phase-III RCTs often demand 0.90. Cohen (1988) suggested a 1:4 cost ratio between α and β, the origin of the convention.

👥

n — sample size

每組 n（two-sample）或總 n（regression）。近似公式：兩樣本 t 檢定，每組 n ≈ 2·((z_1−α/2 + z_1−β)/d)²。當 d = 0.5、α = 0.05、power = 0.80 → 每組 n ≈ 64。Cohen 表 2.4.1 給出大量場景的 n 對照表。

n per group (two-sample) or total n (regression). Approx formula: two-sample t, n per group ≈ 2·((z_1−α/2 + z_1−β)/d)². For d = 0.5, α = 0.05, power = 0.80 → n ≈ 64 per group. Cohen's Table 2.4.1 lists n for many scenarios.

📏

Effect size

偵測對象的「真實差異」標準化後的尺度。連續資料用 Cohen's d（小 0.2 / 中 0.5 / 大 0.8；Cohen 1988）、ANOVA 用 η² 或 f、相關用 r、二分類用 Cohen's h 或 OR、存活用 HR。絕不要用「pilot 觀察到的 effect」算正式 n——pilot 是 noisy 的，會嚴重低估真實 n（Kraemer 2006 Arch Gen Psychiatry）。

The standardised true difference you intend to detect. Continuous → Cohen's d (small 0.2 / medium 0.5 / large 0.8; Cohen 1988); ANOVA → η² or f; correlation → r; binary → Cohen's h or OR; survival → HR. Never plug a pilot's observed effect into the formal n calculation — pilots are noisy and systematically underestimate the required n (Kraemer 2006 Arch Gen Psychiatry).

⌜ 兩樣本 t（同變異）：每組 n ≈ 2·((z_1−α/2 + z_1−β) / d)² · 單樣本比例 z：n ≈ ((z_1−α/2 + z_1−β) / h)² · 存活（Schoenfeld 1981）：所需事件數 d = ((z_1−α/2 + z_1−β)² · (1+k)²) / (k · (log HR)²)，k = 配對比 ⌝ z 值用 normal 近似；小樣本應改用 t 分布的 non-central t（更精準），這正是 G*Power 與 R 的 pwr 套件做的事。 ⌜ Two-sample t (equal variance): n per group ≈ 2·((z_1−α/2 + z_1−β) / d)² · One-sample z (proportion): n ≈ ((z_1−α/2 + z_1−β) / h)² · Survival (Schoenfeld 1981): required events d = ((z_1−α/2 + z_1−β)² · (1+k)²) / (k · (log HR)²), k = allocation ratio ⌝ These are normal approximations; small samples need non-central t — which is exactly what G*Power and R's pwr package use under the hood.

互動模擬 ①

Power 計算器

選擇要求解的量（α、n、effect、power），剩下三個用滑桿給定。下方曲線顯示 power 隨 n 的成長（在當前 d 與 α 下）。注意 power 不是線性——當 n 從 10 → 30 進步神速，從 100 → 300 邊際遞減。這正是「n=50–100 是甜點」的數學原因。

Pick which quantity to solve for (α, n, effect, power); set the other three with sliders. The curve below traces power vs n at the current d and α. Notice power is not linear — n = 10 → 30 buys a lot; n = 100 → 300 has steeply diminishing returns. That is the math behind the rule of thumb "n = 50–100 is the sweet spot".

求解

效應量 d 0.50

α 0.050

n / group 64

Power 0.80

藍線 = power 隨 n 變化（當前 d, α）· 紅虛線 = 0.80 慣例 · 綠點 = 目前滑桿位置Blue = power vs n at current d, α · Red dashed = 0.80 convention · Green dot = current slider position

深入討論

二、效應量家族與工具

效應量家族

連續資料：Cohen's d = (μ₁−μ₂)/σ，配對 t 用 d_z；Hedges' g = d 的小樣本校正（n < 20 推薦；Hedges 1981）。
ANOVA / 迴歸：η²（解釋變異比例）、Cohen's f = √(η²/(1−η²))、f²（多元迴歸；小 0.02 / 中 0.15 / 大 0.35）。
相關：r（Pearson；小 0.1 / 中 0.3 / 大 0.5）。
二分類：Cohen's h = 2(arcsin√p₁ − arcsin√p₂)；OR（小 1.5 / 中 2 / 大 3）；RR 也可。
存活：HR（Hazard Ratio）+ 預期事件數。

Continuous: Cohen's d = (μ₁−μ₂)/σ; paired uses d_z; Hedges' g = small-n corrected d (recommended for n < 20; Hedges 1981).
ANOVA / regression: η², Cohen's f = √(η²/(1−η²)), f² for multiple regression (small 0.02 / medium 0.15 / large 0.35).
Correlation: r (Pearson; 0.1 / 0.3 / 0.5).
Binary: Cohen's h = 2(arcsin√p₁ − arcsin√p₂); OR (1.5 / 2 / 3); RR also fine.
Survival: HR + expected events.

工具地圖

GUI：G*Power 3（Faul 2007 Behav Res Methods；Faul 2009 BRM 41:1149；引用 80,000+ 次；跨 macOS/Win/Linux 免費）。視覺化中央分布與非中央分布的重疊。

R：pwr（Champely）涵蓋 t / ANOVA / r / χ² / 比例 / GLM f²；longpower（Donohue）做 longitudinal LMM；gsDesign（Anderson）做臨床試驗群體設計；survival::nSurv、TrialSize。

Python：statsmodels.stats.power（TTestIndPower、FTestAnovaPower、NormalIndPower、GofChisquarePower）；scipy.stats 無內建 power。

商用：PASS（NCSS，涵蓋 1000+ 場景，FDA 認可）、nQuery、Stata power 指令。

GUI: G*Power 3 (Faul 2007 Behav Res Methods; Faul 2009 BRM 41:1149; 80,000+ citations; free on macOS/Win/Linux). Visualises central + noncentral distributions.

R: pwr (Champely) covers t / ANOVA / r / χ² / proportions / GLM f²; longpower (Donohue) for longitudinal LMM; gsDesign (Anderson) for group-sequential trials; survival::nSurv, TrialSize.

Python: statsmodels.stats.power (TTestIndPower, FTestAnovaPower, NormalIndPower, GofChisquarePower); scipy.stats has no built-in power.

Commercial: PASS (NCSS, 1000+ scenarios, FDA-trusted), nQuery, Stata power.

⚠️

SESOI（Smallest Effect Size of Interest）：Lakens (2018 Adv Methods Pract Psychol Sci) 與 Giner-Sorolla (2024) 推動：在資料收集前，用臨床／科學意義而非統計顯著性決定「多大差別才值得偵測」。例：降血壓藥要超過安慰劑 SBP −3 mmHg 才算有臨床意義（NICE guideline）。SESOI 把效應量從「猜」變成「論證」。 SESOI (Smallest Effect Size of Interest): Lakens (2018 Adv Methods Pract Psychol Sci) and Giner-Sorolla (2024) push this: before data collection, decide on a meaningful difference based on clinical/scientific grounds, not statistical convenience. Example: a new antihypertensive must beat placebo by ≥ 3 mmHg SBP to be clinically relevant (NICE). SESOI turns "guess the effect" into "justify the effect".

特殊設計

🏘️

Cluster RCT

當隨機化單位是「群（學校、診所、村莊）」而非個人時，群內個體不獨立。設計效應 DE = 1 + (m − 1)·ICC，m = 平均群大小，ICC = 群內相關係數（intraclass correlation）。有效 n = n_total / DE。例：m = 30、ICC = 0.05 → DE = 2.45 → 算出的個人 n 要 × 2.45。Hayes & Moulton 2017 "Cluster Randomised Trials" 2nd ed 是聖經。

When the unit of randomisation is a cluster (school, clinic, village), members within a cluster are not independent. Design effect DE = 1 + (m − 1)·ICC, m = average cluster size, ICC = intraclass correlation. Effective n = n_total / DE. Example: m = 30, ICC = 0.05 → DE = 2.45 → multiply individual-level n by 2.45. Hayes & Moulton 2017 "Cluster Randomised Trials" 2nd ed is the bible.

⏱️

存活分析

存活的 power 取決於事件數，不是樣本數（Schoenfeld 1981 Biometrics）。先算所需事件 d = ((z_1−α/2+z_1−β)·(1+k))² / (k·(ln HR)²)；再用預期事件率反推 n 與追蹤期。R: survival::nSurv()、gsDesign::nSurv；Python: lifelines 沒有直接 power 函式，但可手算。典型錯誤：「我收了 200 人，每組 100」但事件只發生 30 次——power 由 30 決定，不是 200。

Survival power is driven by event count, not sample size (Schoenfeld 1981 Biometrics). First compute required events d = ((z_1−α/2+z_1−β)·(1+k))² / (k·(ln HR)²); then back out n and follow-up using the expected event rate. R: survival::nSurv(), gsDesign::nSurv. Python's lifelines has no built-in power function — compute by hand. Classic mistake: "I recruited 200 (100 per arm)" but only 30 events accrue — power is driven by 30, not 200.

⚖️

不等配對

2:1 allocation 比 1:1 損失 power——以 2:1 為例，效率只有 88%（n_total 要多 12%）。為何還用？實務上：(1) 倫理（讓更多人接受新療法）；(2) 預算（控制組已知資料便宜）；(3) 不平衡組變異（heteroscedastic）。算式：pwr.t2n.test(n1, n2, d, sig.level)。

2:1 allocation loses power vs 1:1 — at 2:1 you keep only 88% efficiency (need 12% more total n). Why use it? (1) ethics (more patients on the new arm); (2) cost (existing controls are cheap); (3) unequal variance. Use pwr.t2n.test(n1, n2, d, sig.level).

🎯

多 endpoint α

RCT 預設 ≥ 2 個 primary endpoint：α 要分割（Bonferroni: α/k）或 hierarchical testing（Maurer 2011）。Power 算每個 endpoint 用分割後的 α。FDA / EMA 對 confirmatory trial 要求嚴格事先 SAP（statistical analysis plan）。

RCTs with ≥ 2 primary endpoints must split α (Bonferroni α/k) or use hierarchical / gatekeeping testing (Maurer 2011). Power each endpoint at the split α. FDA / EMA require a pre-specified SAP for confirmatory trials.

互動模擬 ②

Post-hoc power 假象

拖動觀察 p-value 滑桿。下方顯示用「觀察到的 effect size」反算的 observed power。畫面上的曲線就是 Hoenig & Heisey (2001) 的關鍵發現：observed power 完全由 p-value 決定——p < 0.05 必然有 observed power > 50%，p = 0.5 必然有 observed power = 50%。它沒有提供任何新資訊。當審稿人要求「補 post-hoc power」，正確回應是「請改要 confidence interval 或 sensitivity analysis」。

Slide the observed p-value. The "observed power" computed from the observed effect is plotted. The curve below is exactly Hoenig & Heisey's (2001) key finding: observed power is completely determined by the p-value — p < 0.05 always implies observed power > 50%, p = 0.5 always gives 50%. It carries no new information. When a reviewer demands "post-hoc power", the right answer is: "let me give you a confidence interval or a sensitivity analysis instead".

觀察 p 0.050

n / group 30

紫線 = observed power 隨 p 變化 · 紅點 = p=0.05 對應 power≈50%（Hoenig 2001）Purple = observed power vs p · Red dot = p=0.05 ↔ power ≈ 50% (Hoenig 2001)

關鍵直覺：p-value 衡量「資料和 H₀ 不一致的程度」；observed power 衡量「假設真實 effect 等於觀察 effect，會有多大機率得到 p < α」。兩個算法都用同一個觀察 effect — 所以兩個量必然等價。Hoenig 把這個等價關係寫成單一公式：observed power = Φ(z_obs − z_1−α/2)，z_obs = Φ⁻¹(1 − p/2)。 Key intuition: the p-value measures "how inconsistent the data are with H₀"; observed power measures "if the true effect were the observed effect, what is the chance of p < α?". Both use the same observed effect — so they must be equivalent. Hoenig writes the relation as observed power = Φ(z_obs − z_1−α/2) where z_obs = Φ⁻¹(1 − p/2).

決策引導

三、四種 Power 分析的決策樹

🌳 Power 分析時機

Q1:

你還沒收資料？→ 是 → A priori（事前）：固定 α、power、effect → 求 n。寫 grant、IRB、預先註冊（pre-registration）必備。

Q2:

n 已被限制（資金、時間、可獲樣本）？→ 是 → Sensitivity（敏感度）：固定 α、power、n → 求最小可偵測效應 MDE。回答「我這個 n 能偵測到的最小 effect 是多少？」

Q3:

n 與 effect 都固定，但你想知道用什麼 α 才能保持 power？→ 是 → Criterion：固定 n、power、effect → 求 α。少用，但 GWAS 與 ML 預測閾值會用。

Q4:

你已收完資料，想算「我的 power 是多少」？→ 是 → 停！絕不要做 post-hoc。改報 95% CI（資料和 0 的距離 + 不確定度），或重新做 sensitivity（如果未來會 replicate）。Hoenig 2001 是這個禁忌的經典證明。

Q1:

You haven't collected data yet? → Yes → A priori: fix α, power, effect → solve n. Mandatory for grants, IRB, pre-registration.

Q2:

n is capped (budget, time, available samples)? → Yes → Sensitivity: fix α, power, n → solve minimum detectable effect (MDE). Answers "with this n, what is the smallest effect I can detect?"

Q3:

n and effect fixed, but you want to know what α keeps power high? → Yes → Criterion: fix n, power, effect → solve α. Rare; used for GWAS or ML-prediction thresholds.

Q4:

You've finished data collection and want to know "what was my power?" → Yes → Stop. Never run post-hoc. Report a 95% CI (distance from 0 + uncertainty) instead, or run a sensitivity for a future replication. Hoenig 2001 is the classic proof of this taboo.

工具比較

四、工具比較

工具	平台	涵蓋面	優點	限制
*GPower 3** (Faul 2007)(Faul 2007)	Win / macOS / Linux GUI	t / F / χ² / z / 相關，多種 ANOVA / GLM	t / F / χ² / z / correlation, many ANOVA / GLM	免費、GUI 直觀、有圖示中央 / 非中央分布	Free, intuitive GUI, plots central / noncentral	無 survival、無 cluster、不能批次	No survival, no cluster, no batch
R pwr (Champely)(Champely)	R package	t、ANOVA (1-way)、r、χ²、比例、f²	t, 1-way ANOVA, r, χ², proportions, f²	script 可重現、可組合 simulation	Scriptable, reproducible, simulation-friendly	無 survival（用 `longpower`、`gsDesign`）	No survival (use `longpower`, `gsDesign`)
statsmodels (Python)(Python)	Python	TTestIndPower / FTestAnovaPower / NormalIndPower / GofChisquarePower	TTestIndPower / FTestAnovaPower / NormalIndPower / GofChisquarePower	融入 ML pipeline，`solve_power` 可求任一量	Fits ML pipeline, `solve_power` back-solves any quantity	無 survival、文檔較少	No survival, sparser docs
PASS (NCSS)(NCSS)	Win GUI	1000+ 場景：survival, cluster, equivalence, non-inferiority	1000+ scenarios: survival, cluster, equivalence, non-inferiority	FDA 認可、報告品質高、附 PDF references	FDA-trusted, publication-ready, cites references in PDF	商用付費、GUI、不容易自動化	Paid, GUI-only, hard to script
Stata power	Stata	完整：survival、cluster、equivalence、cross-over	Comprehensive: survival, cluster, equivalence, cross-over	語法統一、文件詳盡、官方支援	Unified syntax, thorough docs, official support	付費；script 與 R/Python 整合差	Paid; weaker integration with R/Python pipelines

💡

實務建議：學術論文：G*Power 做初算 + R/Python 重現以利 reproducibility。臨床三期 RCT：PASS 或 Stata（FDA 偏好）。Survival/cluster：gsDesign、longpower、nSurv。關鍵：把 power 計算的所有參數寫進 protocol / pre-registration / Methods，這是 ARRIVE 2.0 與 CONSORT 都要求的。 Practical recipe: Academic papers: G*Power for the first pass + R/Python to reproduce. Phase-III RCT: PASS or Stata (FDA-preferred). Survival/cluster: gsDesign, longpower, nSurv. Critical: write all power parameters into the protocol / pre-registration / Methods — required by ARRIVE 2.0 and CONSORT.

程式碼

五、實作範例

# ============= R: pwr + survival/longpower =============
library(pwr)
library(longpower)
library(gsDesign)

# --- 1. Two-sample t-test: a priori (find n) ---
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05,
            type = "two.sample", alternative = "two.sided")
#  n per group ≈ 64

# --- 2. Sensitivity (find min detectable d given n) ---
pwr.t.test(n = 30, power = 0.80, sig.level = 0.05,
            type = "two.sample")
#  d ≈ 0.74  (Note: NOT post-hoc; this is sensitivity)

# --- 3. ANOVA (1-way, k groups, Cohen's f) ---
pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.80)

# --- 4. Correlation r ---
pwr.r.test(r = 0.3, power = 0.80, sig.level = 0.05)

# --- 5. Two-proportion (Cohen's h) ---
h <- ES.h(0.40, 0.30)
pwr.2p.test(h = h, power = 0.80, sig.level = 0.05)

# --- 6. Unequal allocation (2:1) ---
pwr.t2n.test(n1 = 60, n2 = 30, d = 0.5, sig.level = 0.05)

# --- 7. Survival: events required (Schoenfeld 1981) ---
# HR=0.7, two-sided α=0.05, power=0.80, equal allocation
gsDesign::nSurv(lambdaC = 0.05, hr = 0.7, eta = 0.02,
                 T = 36, minfup = 12, alpha = 0.025, beta = 0.20)

# --- 8. Longitudinal LMM ---
longpower::lmmpower(delta = 0.1, t = seq(0, 1.5, 0.25),
                    sig2.i = 55, sig2.s = 24, sig2.e = 10,
                    cor.s.i = 0.5, power = 0.80)

# --- 9. Cluster RCT: inflate by Design Effect ---
ICC <- 0.05; m <- 30
DE  <- 1 + (m - 1) * ICC                # = 2.45
n_indiv <- ceiling(pwr.t.test(d=0.3, power=0.80)$n)
n_cluster_total <- ceiling(n_indiv * DE)
cat("Individuals per arm:", n_cluster_total, "\n")

import numpy as np
from statsmodels.stats.power import (
    TTestIndPower, FTestAnovaPower, NormalIndPower, GofChisquarePower
)
from statsmodels.stats.proportion import proportion_effectsize
from scipy.stats import norm

# --- 1. Two-sample t: a priori (find n) ---
ap = TTestIndPower()
n = ap.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                    alternative='two-sided')
print(f"n per group ≈ {n:.1f}")   # ≈ 63.8

# --- 2. Sensitivity (find min detectable d at n=30) ---
d_min = ap.solve_power(nobs1=30, alpha=0.05, power=0.80)
print(f"Min detectable d ≈ {d_min:.3f}")

# --- 3. ANOVA (1-way) ---
fa = FTestAnovaPower()
n_total = fa.solve_power(effect_size=0.25, k_groups=4,
                          alpha=0.05, power=0.80)
print(f"Total n ≈ {n_total:.1f}")

# --- 4. Two-proportion (using Cohen's h) ---
h = proportion_effectsize(0.40, 0.30)
ni = NormalIndPower().solve_power(effect_size=h, alpha=0.05,
                                       power=0.80)
print(f"n per group ≈ {ni:.1f}")

# --- 5. Survival (Schoenfeld 1981, by hand) ---
def schoenfeld_events(HR, alpha=0.05, power=0.80, k=1):
    """k = allocation ratio (n_treat/n_ctrl). Returns required events."""
    za = norm.ppf(1 - alpha/2)
    zb = norm.ppf(power)
    return ((za + zb) ** 2) * (1 + k) ** 2 / (k * (np.log(HR)) ** 2)
events = schoenfeld_events(HR=0.7)
print(f"Required events: {events:.0f}")   # ≈ 247

# --- 6. Cluster RCT design effect ---
ICC, m = 0.05, 30
DE = 1 + (m - 1) * ICC
n_indiv = ap.solve_power(effect_size=0.3, alpha=0.05, power=0.80)
n_cluster = np.ceil(n_indiv * DE)
print(f"DE = {DE:.2f}, individuals per arm = {n_cluster:.0f}")

# --- 7. NEVER do this: post-hoc power ---
# observed_d = (mean1 - mean2) / pooled_sd  # NO!
# ap.power(effect_size=observed_d, nobs1=n, alpha=0.05)  # NO!
# Report a 95% CI instead — that IS the right "post-hoc" answer.

常見陷阱

六、常見錯誤

❌ Post-hoc power

用觀察 effect 反算 power——它和 p-value 是同一個東西的兩種寫法（Hoenig 2001）。審稿人要求時，回信：「Post-hoc power is a deterministic function of the p-value (Hoenig & Heisey 2001 Am Stat 55:19, DOI:10.1198/000313001300339897). We instead report the 95% CI, which directly conveys the precision of our estimate.」

Computing power from the observed effect — it's a deterministic function of the p-value (Hoenig 2001). When reviewers ask, reply: "Post-hoc power is a deterministic function of the p-value (Hoenig & Heisey 2001 Am Stat 55:19, DOI:10.1198/000313001300339897). We instead report the 95% CI, which directly conveys the precision of our estimate."

❌ Pilot effect 算 n

Pilot 樣本太小（典型 n ≤ 30）→ 觀察 effect 變異大且偏向過大（"winner's curse"）→ 算出來的 n 嚴重低估。Kraemer 2006 Arch Gen Psychiatry：使用 SESOI、文獻 meta-analysis、或 pilot d − 1 SE 作為 effect input。

Pilot n is small (typically ≤ 30) → observed effect is noisy and upwardly biased ("winner's curse") → formal n is severely underestimated. Kraemer 2006 Arch Gen Psychiatry: use SESOI, a meta-analytic effect, or pilot d − 1 SE as input.

❌ 忽略多重檢定

RCT 預設 5 個 primary endpoint 全部用 α = 0.05 → family-wise error 大約 0.23。算 n 時必須用調整後 α（Bonferroni α/k 或 hierarchical α-spending；見 Step 12）。GWAS、proteomics 也適用。

An RCT with 5 primary endpoints all at α = 0.05 has family-wise error ≈ 0.23. n must be powered at the adjusted α (Bonferroni α/k or hierarchical α-spending; see Step 12). Same for GWAS and proteomics.

❌ 忽略 ICC

群集 RCT 用個人 n 算 power → 嚴重高估。記得乘以 DE = 1 + (m − 1)·ICC。Hayes & Moulton 2017 警告：典型醫療 ICC = 0.01–0.10；忽略 ICC 是 cluster trial 失敗的頭號原因。

Computing cluster-RCT power on individual n → severe overestimate. Multiply by DE = 1 + (m − 1)·ICC. Hayes & Moulton 2017: typical health ICC = 0.01–0.10; ignoring ICC is the top cause of failed cluster trials.

❌ Survival 用 n 不用事件

log-rank 的 power 取決於事件數，不是樣本數。「收 500 人」無關緊要，重點是「事件 ≥ 多少」。Schoenfeld 1981 公式直接算事件 d；再用預期事件率反推 n 與追蹤期。R: gsDesign::nSurv。

log-rank power depends on events, not n. "500 enrolled" is irrelevant — only "events accrued" matters. Schoenfeld's 1981 formula gives events d directly; back out n and follow-up from the expected event rate. R: gsDesign::nSurv.

❌ 效應量太樂觀

研究者最常為了得到「合理」的 n，把 d 從 0.3 改成 0.8——這叫 effect inflation。論文 Methods 必須寫明 effect size 的依據（meta-analysis、SESOI、臨床指引）。Albers & Lakens 2018 PLOS ONE：超過 70% 的 power analyses 用無根據的 effect。

The most common shortcut: inflate d from 0.3 to 0.8 just to make n "feasible" — effect inflation. Methods sections must justify the effect (meta-analysis, SESOI, clinical guideline). Albers & Lakens 2018 PLOS ONE: > 70% of published power analyses use unjustified effects.

紅旗：論文寫 "post-hoc power = X" 這句話幾乎一定是反模式。正確替代：(1) 95% CI 告訴你估計精度；(2) sensitivity analysis 告訴你「我的 n 可偵測到多大的 effect」；(3) 重做 a priori 用 SESOI（如果是準備下一個研究）。Levine & Ensom 2001 Pharmacotherapy、Lenth 2007 Am Stat 都加強這個禁忌。 That sentence is almost always an anti-pattern. Correct alternatives: (1) a 95% CI tells you the precision; (2) sensitivity analysis tells you "what effect could I detect with this n?"; (3) re-run a priori with SESOI for the next study. Levine & Ensom 2001 Pharmacotherapy and Lenth 2007 Am Stat both reinforce the taboo.

實戰案例

七、實戰場景

💊

情境 A：降血壓藥 RCT

目標：偵測新藥 vs 安慰劑 SBP 差 ≥ 5 mmHg。已知 σ ≈ 12 mmHg → d = 5/12 = 0.42。α = 0.025（雙側 0.05 + 雙臂 0.025）、power = 0.90。

R：pwr.t.test(d=0.42, power=0.9, sig.level=0.025) → n ≈ 152/arm。加 15% 失訪 → 175/arm。Pre-specified in protocol, SAP fixed.

Goal: detect drug vs placebo SBP gap ≥ 5 mmHg. Known σ ≈ 12 → d = 5/12 = 0.42. α = 0.025, power = 0.90.

R: pwr.t.test(d=0.42, power=0.9, sig.level=0.025) → n ≈ 152/arm. Add 15% dropout → 175/arm. Pre-specified in protocol, SAP locked.

🧪

情境 B：劑量反應實驗

目標：偵測 5 個劑量間 IC50 差異（One-way ANOVA）。文獻 Cohen's f = 0.40（大效應）、α = 0.05、power = 0.80。

R：pwr.anova.test(k=5, f=0.40, power=0.8) → n ≈ 16/group → 總 80。技術重複 ≥ 3 wells/group，明確分層分析（Step 13 mixed model）。

Goal: detect IC50 differences across 5 doses (1-way ANOVA). Literature Cohen's f = 0.40 (large), α = 0.05, power = 0.80.

R: pwr.anova.test(k=5, f=0.40, power=0.8) → n ≈ 16/group → total 80. Technical reps ≥ 3 wells/group; analyse hierarchically (Step 13 mixed model).

🧬

情境 C：GWAS

目標：偵測 OR = 1.20 的 SNP（MAF = 0.30）。α_GWAS = 5×10⁻⁸（Pe'er 2008）、power = 0.80。

Genetic Power Calculator 或 Quanto：≈ 12,000 cases + 12,000 controls。為何要這麼大？因為 α 嚴格到 5×10⁻⁸ 對應 z_1−α/2 ≈ 5.45（vs α=0.05 的 1.96）——光是 (5.45/1.96)² ≈ 7.7 倍的 n。

Goal: detect SNP at OR = 1.20 (MAF = 0.30). α_GWAS = 5×10⁻⁸ (Pe'er 2008), power = 0.80.

Genetic Power Calculator or Quanto: ≈ 12,000 cases + 12,000 controls. Why so big? α = 5×10⁻⁸ ↔ z_1−α/2 ≈ 5.45 (vs 1.96 at α = 0.05) — that alone is (5.45/1.96)² ≈ 7.7× the n.

📝 自我檢測

1. 審稿人寫信：「請補上 post-hoc power 分析。」最佳回應是？

1. A reviewer asks: "Please add a post-hoc power analysis." Best response?

A. 立刻算 observed power 並補進論文A. Compute observed power immediately and add it

B. 把 power 算到 99% 給審稿人看B. Compute power at 99% to look strong

C. 引用 Hoenig & Heisey 2001，改報 95% CI 與 sensitivity 分析C. Cite Hoenig & Heisey 2001 and report 95% CI + sensitivity analysis instead

D. 拒絕修改D. Refuse without explanation

2. 你規劃一個 cluster RCT（學校為單位），平均每校 m = 25 名學生，預期 ICC = 0.04。個人層級算出每組需 100 人——實際每組要多少？

2. You plan a cluster RCT (schools as clusters), m = 25 students/school, ICC = 0.04. Individual-level power says 100 per arm — what do you actually need per arm?

A. 100（ICC 可以忽略）A. 100 (ICC negligible)

B. 125B. 125

C. 196（DE = 1 + (25−1)×0.04 = 1.96 → 100 × 1.96）C. 196 (DE = 1 + (25−1)×0.04 = 1.96 → 100 × 1.96)

D. 50（cluster 提升 power）D. 50 (clustering boosts power)

3. 你被預算限制 n = 40/組。最合適的 power 分析類型是？

3. Your budget caps n at 40/group. Which power-analysis type fits?

A. A priori：固定 effect 求 nA. A priori: fix effect, solve n

B. Post-hoc：等收完資料再算 powerB. Post-hoc: compute power after data collection

C. Criterion：求 αC. Criterion: solve α

D. Sensitivity：固定 n、α、power → 求最小可偵測 effect (MDE)D. Sensitivity: fix n, α, power → solve minimum detectable effect (MDE)

4. 存活分析（log-rank）power 主要取決於下列哪一項？

4. Survival (log-rank) power mainly depends on which of the following?

A. 收案人數（總 n）A. Enrolled n

B. 事件數（events / deaths）——Schoenfeld 1981B. Event count (events / deaths) — Schoenfeld 1981

C. 追蹤天數C. Follow-up days

D. SD of survival timeD. SD of survival time