Step 4: Hypothesis Testing — Biostatistics Tutorial

Overview

Why is hypothesis testing so widely misunderstood?

Hypothesis testing is the 20th century's most successful, and most abused, statistical tool. Fisher (1925) proposed the p-value as weak evidence that the data deserve further study; Neyman & Pearson (1933) later added α, β, and the Type I/II error decision framework. The "hybrid" logic taught today is a hasty splicing of the two, which is why 90 % of researchers answer incorrectly when asked what "p = 0.04" means.

The ASA (American Statistical Association) 2016 Statement (Wasserstein & Lazar 2016, Am Stat) formally acknowledged long-standing misuse of the p-value and listed six principles. Its 2019 sequel, "Moving to a World Beyond 'p < 0.05'", went further and called for abandoning the phrase "statistically significant". This chapter teaches you not just how to run a test, but how to read it and report it honestly.

💡

What Fisher really meant: in Statistical Methods for Research Workers (1925), Fisher described p < 0.05 as a threshold for "worth a closer look", not as a finish line. Were he alive to see the modern habit of "p < 0.05 = submit, p > 0.05 = file drawer", he would be furious.

Core framework

1. The five-step testing framework

Whether you run a t-test, chi-square, ANOVA, or regression, every frequentist hypothesis test follows the same five steps. Memorising them matters ten times more than memorising formulas:

State H₀ and H₁

H₀ (null hypothesis): "no effect / no difference / no association" (μ₁ = μ₂). H₁ (alternative hypothesis): what you want evidence for (μ₁ ≠ μ₂, μ₁ > μ₂, or μ₁ < μ₂). Fix the directionality (one- vs two-tailed) before you see the data; switching afterwards is p-hacking.

Choose α

The traditional α = 0.05 is a historical convention dating to Fisher (1925), not a physical constant. Genomics, with m simultaneous tests, typically uses α = 0.05/m or FDR control; high-stakes clinical trials use one-sided α = 0.025; particle physics uses 5σ (p ≈ 3 × 10⁻⁷). Choose α according to the cost of being wrong.

Compute the test statistic

Compress the data into a single number: z, t, F, χ², and so on. For z and t the general form is statistic = (estimate − null value) / standard error. When H₀ is true, that number follows a known distribution.

Obtain the p-value

p = P(a statistic as extreme as, or more extreme than, the one observed | H₀ true). Three indispensable pieces: (a) "or more extreme": p is a cumulative tail probability, not a point probability; (b) "assuming H₀ is true": p is not the probability of H₀; (c) "under the current test design": if you peeked at the data before picking a direction, the p is invalid.

Make the decision

p < α → "reject H₀"; p ≥ α → "fail to reject H₀" (not "accept H₀", and not "proved there is no effect"). Always also report the effect size (Cohen's d, OR, HR) and the 95 % CI; they carry more information than p itself.

p = P( |Z| ≥ |z_obs| ∣ H₀ )
α = P( reject H₀ ∣ H₀ true )
β = P( fail to reject H₀ ∣ H₁ true )
Power = 1 − β

Note: α and the p-value are different. α is a long-run error rate chosen at the design stage; p is a tail probability computed after seeing the data. Conflating them is one of the most common conceptual errors in the literature (Hubbard 2004 Theory Psychol).
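The five steps can be traced in a few lines of Python. This is a minimal sketch on simulated data: the group means, SD, n = 30, and the seed are illustrative assumptions, not values from a real study. The Welch t-test is written out by hand so that each step stays visible.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two groups of 30 simulated biomarker values.
rng = np.random.default_rng(0)
ctrl = rng.normal(5.0, 1.0, 30)
trt = rng.normal(5.5, 1.0, 30)

# Step 1: H0: mu_trt = mu_ctrl; H1: mu_trt != mu_ctrl (two-sided, fixed in advance).
# Step 2: choose alpha before looking at the data.
alpha = 0.05

# Step 3: statistic = (estimate - null value) / standard error (Welch form).
v_trt = trt.var(ddof=1) / trt.size
v_ctrl = ctrl.var(ddof=1) / ctrl.size
diff = trt.mean() - ctrl.mean()
se = np.sqrt(v_trt + v_ctrl)
t_stat = (diff - 0.0) / se

# Welch-Satterthwaite degrees of freedom for the reference t distribution.
df = (v_trt + v_ctrl) ** 2 / (v_trt ** 2 / (trt.size - 1) + v_ctrl ** 2 / (ctrl.size - 1))

# Step 4: two-sided p-value = tail probability of |t| under H0.
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Step 5: decision, reported together with the estimate and its 95% CI.
t_crit = stats.t.ppf(1 - alpha / 2, df)
ci = (diff - t_crit * se, diff + t_crit * se)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"diff = {diff:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f}); p = {p_value:.3f}; {decision}")
```

The hand-computed statistic and p-value should agree with `scipy.stats.ttest_ind(trt, ctrl, equal_var=False)`, which is the call you would normally use in practice.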

Error types

2. Type I / II errors and one- vs two-tailed tests

❌

Type I (α)

P(reject H₀ ∣ H₀ true). There is no real effect, but you claim one: making something from nothing. α is the long-run false-positive rate: of 100 tests in which H₀ is actually true, on average 5 will mis-fire. Clinically this corresponds to "approving a useless drug", which the FDA controls tightly.

🙈

Type II (β)

P(fail to reject H₀ ∣ H₁ true). There is a real effect, but you missed it. β is determined jointly by sample size, effect size, and α. Power = 1 − β. Convention: power ≥ 0.80 (β ≤ 0.20). Type II error is the curse of underpowered studies; see Step 11.

⚖️

The α vs β tug-of-war

For fixed n, lowering α raises β (a stricter rejection threshold makes real effects harder to detect). For a given effect size and variability, the only way to lower both is to increase n. The Neyman–Pearson lemma: for a fixed α and a simple (point) H₁, the likelihood-ratio test is the most powerful test, i.e. it maximises power.
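The trade-off can be checked numerically. The sketch below assumes a two-sided, two-sample z-test with known unit variance and equal group sizes; the helper name `power_two_sample_z` and the values d = 0.5, n = 30 or 60 are illustrative choices, not part of the tutorial's widget.

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample_z(d, n_per_group, alpha):
    """Power of a two-sided, two-sample z-test (known sigma = 1, equal n)."""
    ncp = d * sqrt(n_per_group / 2)      # expected value of z under H1
    z_crit = norm.ppf(1 - alpha / 2)     # two-sided critical value
    # Probability that z lands beyond either critical value under H1.
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

for n in (30, 60):
    for alpha in (0.05, 0.01):
        beta = 1 - power_two_sample_z(0.5, n, alpha)
        print(f"n = {n:2d}  alpha = {alpha:.2f}  beta = {beta:.3f}")
```

Reading down the output, a smaller α at the same n gives a larger β, while doubling n lowers β at either α.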

	H₀ true	H₁ true
Reject H₀	❌ Type I error (prob. α)	✅ Correct (prob. 1 − β = power)
Fail to reject H₀	✅ Correct (prob. 1 − α)	❌ Type II error (prob. β)

Two-tailed test

H₁: μ ≠ μ₀. α is split equally between the two tails, α/2 each. This is the default choice, because most of the time you do not know the direction.

One-tailed test

H₁: μ > μ₀ (or <). All of α goes in one tail. This requires a rigorous, pre-specified justification that an effect in the opposite direction is "unimportant or impossible" (e.g. bioequivalence, non-inferiority). Otherwise it is p-hacking.

⚠️

"One-tailed = free power"? A common misconception is "a one-tailed test makes it easier to get p < 0.05". The truth: you may use a one-tailed test only when the opposite direction has been reasonably ruled out in advance; switching direction after seeing the data is research misconduct. Bland & Altman 1994 BMJ warn against this explicitly.

Interactive simulation ①

Visualising α, β, and power

The figure shows two normal densities: left (blue) = the sampling distribution of the statistic under H₀; right (green) = the distribution under H₁; the gap between them = effect size d. The red vertical line is the critical value (set by α). Red tail = α (Type I error, the part of the H₀ curve beyond the critical value); orange = β (Type II error, the part of the H₁ curve below the critical value); green = power (the part of the H₁ curve beyond the critical value, 1 − β). Drag the sliders and watch all three react.

Sliders: effect size d (default 0.8) · significance level α (default 0.050)

Legend: Blue = H₀ · Green = H₁ · Red line = critical value · Red tail = α · Orange = β · Green area = power

In-depth discussion

3. The ASA statements, S-values, and the garden of forking paths

On 7 March 2016 the American Statistical Association (ASA) issued the first official position statement in its history about a single statistical concept: "The ASA Statement on p-Values: Context, Process, and Purpose" (Wasserstein & Lazar 2016). The document lists six principles, each addressing a frequent misuse:

Principle	Statement
1	A p-value can indicate how incompatible the data are with a specified statistical model.
2	A p-value does not measure the probability that the hypothesis is true, nor the probability that the data were produced by random chance alone.
3	Scientific conclusions and business or policy decisions should not rest solely on whether p crosses a threshold.
4	Proper inference requires full reporting and transparency (including the analysis process, sample selection, and model assumptions).
5	A p-value does not measure the size of an effect or the importance of a result.
6	By itself a p-value does not provide a good measure of evidence; read it together with the effect size, CI, priors, and study design.

The 2019 sequel, "Moving to a World Beyond 'p < 0.05'" (Wasserstein, Schirm & Lazar 2019, Am Stat), is the editorial to a special issue of 43 papers and takes a more radical position: stop using the phrase "statistically significant", because it had become a switch flipped between "matters" and "doesn't matter". The same week, Amrhein, Greenland & McShane (2019, Nature) published "Retire statistical significance" with 800+ co-signatories.

S-value (surprisal)

Greenland 2019 Am Stat recommends S = −log₂(p), which expresses the evidential strength of p in bits. Intuitively, S is the number of fair-coin flips that would have to land all on the same face to be equally surprising.

p = 0.5 → S = 1 bit (1 flip)
p = 0.05 → S ≈ 4.3 bits (≈ 4 flips all landing the same way)
p = 0.005 → S ≈ 7.6 bits (≈ 8 flips)
p = 0.001 → S ≈ 10 bits

Advantages: (1) the log scale tempers apparent jumps: p moving 0.06 → 0.04 looks dramatic, but S moves only 4.06 → 4.64, about 0.6 bit; (2) it cannot be tricked by the "< vs ≥" dichotomy; (3) it maps directly to an everyday sense of surprise.
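The conversion is a one-liner, so the values quoted above are easy to reproduce. A minimal sketch; the helper name `s_value` is an illustrative choice.

```python
import math

def s_value(p):
    """Surprisal in bits: S = -log2(p)."""
    return -math.log2(p)

for p in (0.5, 0.05, 0.005, 0.001):
    print(f"p = {p:<6} S = {s_value(p):.2f} bits")

# Evidence gap between two p-values that straddle the 0.05 threshold.
gap = s_value(0.049) - s_value(0.051)
print(f"S(0.049) - S(0.051) = {gap:.3f} bits")
```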

The garden of forking paths (2014)

Gelman & Loken 2014 Am Sci's "Garden of Forking Paths" describes a problem more insidious than p-hacking: the analyst need not deliberately run many tests. Every small analytic decision (exclude outliers? log-transform? analyse subgroups?) would have been made differently on different data, so the overall procedure is itself a multiple comparison.

Result: what looks like a single "pre-specified" test in a paper is really the one significant survivor of the many implicit tests the analyst could have run. This is harder to detect than overt p-hacking, and far more common. Borges' short story "The Garden of Forking Paths" inspired the name.

⚠️

The remedy: preregistration. Before collecting data, lock down the hypotheses, tests, α, and stopping rules and upload them to a third-party platform (OSF Registries, AsPredicted, ClinicalTrials.gov). Registered Reports (accepted by Cortex, Nature Human Behaviour, eLife) go further: peer reviewers assess the protocol first, then the experiment is run, and the paper is published regardless of outcome. Nosek et al. 2018 PNAS estimate that Registered Reports cut false-positive rates 5–10×.

Advanced test types

4. Equivalence and non-inferiority

Classical NHST can only reject "no difference"; it cannot prove "no difference". This is where it is most often misused: concluding "the two groups are the same" from p > 0.05 is a serious error. When the goal is to show that two things are similar enough (generic vs brand-name drug) or that one is not meaningfully worse (a new therapy non-inferior to the current standard), switch to one of these two designs:

🎯 Equivalence — TOST

Schuirmann's (1987, J Pharmacokinet Biopharm) "Two One-Sided Tests" procedure: define an equivalence margin δ (e.g. ±20 %) and run two one-sided tests:

H₀: μ₁ − μ₂ ≤ −δ or μ₁ − μ₂ ≥ +δ vs H₁: −δ < μ₁ − μ₂ < +δ

If both one-sided p-values are < α (equivalently, for α = 0.05, the 90 % CI lies entirely inside (−δ, +δ)), you may claim equivalence. The FDA bioequivalence guidance (2003; updated 2022) requires the 90 % CI of the GMR (geometric mean ratio) to lie within (0.80, 1.25). Lakens 2017 Soc Psychol Personal Sci is the canonical introductory tutorial, with the companion R package TOSTER.

⚖️ Non-Inferiority

The new therapy need only be not meaningfully worse than the standard; this design is common in clinical trials where a placebo arm would be unethical. CONSORT-NI 2012 JAMA (Piaggio et al.) sets the reporting standard. With higher values meaning better outcomes:

H₀: μ_new − μ_std ≤ −δ vs H₁: μ_new − μ_std > −δ

The non-inferiority margin δ must be pre-specified on clinical grounds, not statistical convenience. Reports should show the CI alongside δ and explicitly label the "non-inferiority margin". The FDA Non-Inferiority Guidance (2016) provides examples.
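A non-inferiority test can be sketched as a one-sided Welch test shifted by the margin. The code below is an illustration under stated assumptions: higher outcome values mean better, the data are simulated, and δ = 0.5 with one-sided α = 0.025 are arbitrary choices rather than recommended values. The function name `non_inferiority` is hypothetical.

```python
import numpy as np
from scipy import stats

def non_inferiority(new, std, delta, alpha=0.025):
    """One-sided Welch test of H0: mu_new - mu_std <= -delta (higher = better).

    Returns the estimated difference, the one-sided lower confidence bound,
    and the p-value.
    """
    new = np.asarray(new, dtype=float)
    std = np.asarray(std, dtype=float)
    v_new = new.var(ddof=1) / new.size
    v_std = std.var(ddof=1) / std.size
    se = np.sqrt(v_new + v_std)
    # Welch-Satterthwaite degrees of freedom.
    df = (v_new + v_std) ** 2 / (v_new ** 2 / (new.size - 1) + v_std ** 2 / (std.size - 1))
    diff = new.mean() - std.mean()
    p = stats.t.sf((diff + delta) / se, df)           # upper-tail test against -delta
    lower = diff - stats.t.ppf(1 - alpha, df) * se    # one-sided lower bound
    return diff, lower, p

# Hypothetical trial: both arms simulated with the same true mean.
rng = np.random.default_rng(7)
new_arm = rng.normal(5.0, 1.0, 100)
std_arm = rng.normal(5.0, 1.0, 100)
delta = 0.5
diff, lower, p = non_inferiority(new_arm, std_arm, delta)
print(f"diff = {diff:.2f}, lower bound = {lower:.2f}, margin = {-delta}, p = {p:.4f}")
```

Non-inferiority is supported when the lower confidence bound lies above −δ, which is the same condition as p < α; reporting the bound next to the margin makes the conclusion visible at a glance.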

Common error: "p = 0.31, no significant difference → the two groups are the same." Wrong. p > 0.05 only means "not enough evidence to claim a difference": there may truly be no effect, or the study may be underpowered. To claim equality, run an equivalence test. Greene et al. 2008 BMJ estimate that ~60 % of papers with non-significant results wrongly conclude "no difference".

Interactive simulation ②

What the distribution of p-values looks like

A fact many people do not know: when H₀ is true, the p-value is uniformly distributed on (0,1), so a p near 0.04 is exactly as likely as a p near 0.96. Only when H₁ is true does the distribution pile up toward 0 (small p's become more common). The widget simulates 5000 two-sample z-tests and draws a histogram of all the p's. Try d = 0 (H₀ true) vs d = 0.5 (medium effect).

Sliders: true effect size d (default 0.0) · sample size n (default 30)

Legend: Red dashed line = α = 0.05 · Red bin = fraction of simulations with p < 0.05 (the Type I error rate when d = 0, the empirical power when d ≠ 0)
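The same experiment can be run offline. This sketch mirrors the description above (5000 two-sample z-tests, n = 30 per group) but is not the widget's own code; the known σ = 1, the seed, and the function name `simulate_p_values` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def simulate_p_values(d, n, n_sims=5000, seed=1):
    """Two-sided p-values from repeated two-sample z-tests with known sigma = 1."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n_sims, n))   # control group, one row per simulated study
    b = rng.normal(d, 1.0, size=(n_sims, n))     # treated group, shifted by d
    z = (b.mean(axis=1) - a.mean(axis=1)) / np.sqrt(2.0 / n)
    return 2 * norm.sf(np.abs(z))

p_null = simulate_p_values(d=0.0, n=30)
p_alt = simulate_p_values(d=0.5, n=30)

for label, p in (("d = 0.0", p_null), ("d = 0.5", p_alt)):
    counts, _ = np.histogram(p, bins=10, range=(0.0, 1.0))
    print(label, "bin counts:", counts.tolist(),
          "fraction p < 0.05:", round(float(np.mean(p < 0.05)), 3))
```

With d = 0 the ten bins should be roughly equal and the rejection fraction should sit near α; with d = 0.5 the first bin dominates and the rejection fraction estimates the power.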

Decision guide

5. Decision tree for test design

Q1:

"Is there a difference?" (exploratory / comparative study) → Yes → classical significance test (t, ANOVA, regression), two-tailed by default; always report effect size + 95 % CI alongside.

Q2:

"Are the two effectively the same?" (generic drug, cross-validation of different instruments) → Yes → Equivalence test (TOST); pre-specify δ; run two one-sided tests.

Q3:

"Is the new therapy not meaningfully worse than the current one?" (active-control clinical trial) → Yes → Non-inferiority test with a pre-specified δ; follow CONSORT-NI.

Q4:

Many hypotheses tested at once (genomics, imaging)? → Yes → add FDR or Bonferroni correction (see Step 12); never rely on raw p alone.

Q5:

Very small sample (n < 10) or a wildly non-normal distribution? → Yes → use permutation, exact, or rank-based tests (Fisher's exact, Wilcoxon) or the bootstrap rather than forcing a normal-theory test.

Q6:

Is the direction of interest strictly pre-specified? → Yes → only then may you use a one-tailed test; otherwise always two-tailed.

6. p-value misreadings: a comparison table

Scenario	Common misreading	Correct reading
p = 0.03	"There's a 3 % chance H₀ is true"	Given that H₀ is true, the probability of a statistic this extreme or more extreme is 3 %.
p = 0.03	"There's a 3 % chance the result is due to chance"	p is the conditional probability P(data | model); it says nothing directly about the posterior split between "chance" and "real effect".
p > 0.05	"The two groups are the same"	Insufficient evidence to reject H₀: there may be no effect, or the study may be underpowered. Use TOST to claim equality.
p < 0.001	"Large effect"	A tiny p may come from a large n rather than a large effect. Read the effect size as well.
p = 0.049 vs p = 0.051	"One significant, one not: a big difference"	The evidence strength is essentially identical (S differs by ≈ 0.06 bit). The "< vs ≥" split is artificial.
Replication with p = 0.06	"Failed to replicate the original p = 0.04"	Two similar small p's in the same direction are consistent evidence. "Significance ≠ replication" (McShane & Gal 2017).
p = 0.04, n = 12	"Publishable"	A tiny n with a p that barely crosses the threshold is the textbook "winner's curse"; the effect size is very likely overestimated. Needs replication.

💡

Modern reporting language: instead of "p = 0.03, significant", write "the estimated mean difference was 0.42 (95 % CI 0.05 to 0.79), corresponding to p = 0.03 (S ≈ 5.1 bits of evidence against H₀)". Three times the information, and no implied dichotomy.

Code

7. Worked examples

# R: Welch t-test, TOST equivalence, effect size + CI
library(TOSTER)        # equivalence (Lakens 2017)
library(effectsize)    # Cohen's d + 95% CI

# --- Synthetic biomarker example ---
set.seed(42)
ctrl <- rnorm(30, mean = 5.0, sd = 1.0)
trt  <- rnorm(30, mean = 5.5, sd = 1.0)

# --- (1) Classical two-sided test ---
res <- t.test(trt, ctrl, var.equal = FALSE)   # Welch by default
res$p.value
res$conf.int
effectsize::cohens_d(trt, ctrl)         # d + 95% CI

# --- (2) Report S-value (Greenland 2019) ---
s_val <- -log2(res$p.value)             # bits of surprise

# --- (3) TOST equivalence test (δ = 0.4) ---
TOSTER::tsum_TOST(
  m1 = mean(trt), sd1 = sd(trt), n1 = 30,
  m2 = mean(ctrl), sd2 = sd(ctrl), n2 = 30,
  low_eqbound = -0.4, high_eqbound = 0.4,
  eqbound_type = "raw", alpha = 0.05
)
# Equivalence within ±0.4 units is supported only if BOTH one-sided p-values are < 0.05

# --- (4) Recommended sentence ---
sprintf("Mean diff = %.2f (95%% CI %.2f to %.2f); p = %.3f (S = %.1f bits)",
        mean(trt) - mean(ctrl),
        res$conf.int[1], res$conf.int[2],   # t.test(trt, ctrl) already gives the CI for trt - ctrl
        res$p.value, s_val)

# Python: Welch t-test, Cohen's d, S-value, TOST equivalence
import numpy as np
from scipy import stats
import math

# Synthetic biomarker example
rng = np.random.default_rng(42)
ctrl = rng.normal(5.0, 1.0, 30)
trt  = rng.normal(5.5, 1.0, 30)

# (1) Welch t-test (two-sided)
res = stats.ttest_ind(trt, ctrl, equal_var=False)
mean_diff = trt.mean() - ctrl.mean()

# 95% CI for mean difference
se = math.sqrt(trt.var(ddof=1)/len(trt) + ctrl.var(ddof=1)/len(ctrl))
df = res.df   # Welch-Satterthwaite df; the df attribute needs a recent SciPy release
tcrit = stats.t.ppf(0.975, df)
ci = (mean_diff - tcrit*se, mean_diff + tcrit*se)

# (2) Cohen's d with pooled SD (no Hedges small-sample correction applied)
pooled = math.sqrt(((len(trt)-1)*trt.var(ddof=1) +
                   (len(ctrl)-1)*ctrl.var(ddof=1)) / (len(trt)+len(ctrl)-2))
d = mean_diff / pooled

# (3) S-value (Greenland 2019)
s_val = -math.log2(res.pvalue)

# (4) TOST equivalence (δ = ±0.4)
delta = 0.4
t1 = (mean_diff - (-delta)) / se   # H0: diff <= -delta
t2 = (mean_diff -   delta ) / se   # H0: diff >= +delta
p1 = 1 - stats.t.cdf(t1, df)
p2 = stats.t.cdf(t2, df)
p_tost = max(p1, p2)              # larger of the two 1-sided p's

print(f"diff = {mean_diff:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
print(f"p = {res.pvalue:.3f}  S = {s_val:.1f} bits  d = {d:.2f}")
print(f"TOST p = {p_tost:.3f} → equivalent within ±{delta}: {p_tost < 0.05}")

💡

Exercise: take a "p = 0.0X" line from one of your past papers and rewrite it in the format "diff (95 % CI); p (S bits)". Often you will discover that one end of the CI sits barely away from 0, or that the effect is too small to matter clinically. This is exactly the information-rich reporting that ASA 2019 advocates.

Common pitfalls

8. Six common pitfalls

❌ The p < 0.05 cliff

p = 0.049 is not "right" and p = 0.051 is not "wrong": the evidence strength is essentially equal (S ≈ 4.35 vs 4.29 bits). Treating 0.05 as a cliff is explicitly rejected by ASA 2016 Principle 3. Fix: report the continuous p, the effect size, and the 95 % CI, and let the reader judge clinical importance.

❌ Cherry-picking after the fact

"I ran 20 subgroup analyses and reported the two most significant." This is plain p-hacking. Even if every H₀ is true, the chance that at least one of 20 independent tests has p < 0.05 is 1 − (1 − 0.05)²⁰ ≈ 64 %. Fix: preregistration + FDR / Bonferroni correction (Step 12).

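The 64 % figure is plain arithmetic and can be verified directly. A minimal sketch, assuming independent tests; the helper name `familywise_error` is illustrative, and the Bonferroni line simply shows the effect of testing each hypothesis at α/m.

```python
def familywise_error(m, alpha=0.05):
    """P(at least one p < alpha among m independent tests when every H0 is true)."""
    return 1 - (1 - alpha) ** m

m = 20
uncorrected = familywise_error(m)              # each test run at alpha = 0.05
bonferroni = familywise_error(m, 0.05 / m)     # each test run at alpha / m

print(f"uncorrected: {uncorrected:.3f}")
print(f"Bonferroni:  {bonferroni:.3f}")
```

With the per-test threshold lowered to 0.05/20 = 0.0025, the chance of any false positive across the family drops back below 5 %.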
❌ "Not significant" ≠ "no effect"

Altman & Bland 1995 BMJ: "Absence of evidence is not evidence of absence." p > 0.05 may reflect a true null, a sample that is too small, too much variance, or a poor design. Fix: read the 95 % CI. If it crosses 0 but is as wide as (−5, +5), the result is "uncertain", not "equal to 0". To claim equality, use TOST.

❌ The base-rate fallacy

"p = 0.05 means I can trust the result 95 %" is wrong. Suppose only 10 % of the hypotheses being tested are true. By Bayes' rule (Ioannidis 2005 PLoS Med: "Why Most Published Research Findings Are False"), at α = 0.05 and power = 0.80 the PPV (positive predictive value) is 0.08 / (0.08 + 0.045) ≈ 64 %, so roughly 36 % of p < 0.05 results are false positives. With low power (say 0.20) the PPV falls to about 31 %, and most significant findings are false. Fix: raise power, preregister, replicate.

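The base-rate calculation is short enough to check by hand or in code. A minimal sketch, assuming a single test per hypothesis and no bias; the function name `ppv` and the low-power scenario (power = 0.20) are illustrative additions.

```python
def ppv(prior, alpha, power):
    """Positive predictive value: P(effect is real | p < alpha)."""
    true_pos = prior * power           # real effects that reach significance
    false_pos = (1 - prior) * alpha    # true nulls that reach significance
    return true_pos / (true_pos + false_pos)

well_powered = ppv(prior=0.10, alpha=0.05, power=0.80)
under_powered = ppv(prior=0.10, alpha=0.05, power=0.20)

print(f"power 0.80: PPV = {well_powered:.2f}, false-positive share = {1 - well_powered:.2f}")
print(f"power 0.20: PPV = {under_powered:.2f}, false-positive share = {1 - under_powered:.2f}")
```

The share of significant results that are false positives is 1 − PPV, and it grows quickly as the prior or the power drops.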
❌ Missing effect size + CI

With large n, p = 0.0001 may correspond to a trivial effect of d = 0.05: statistically significant but clinically meaningless. CONSORT 2010 and SAMPL 2015 both require the effect size and 95 % CI to be reported alongside p.

❌ HARKing

Going back after seeing the results to "design" a hypothesis that makes p significant. Kerr 1998 Pers Soc Psychol Rev named it HARKing (Hypothesising After the Results are Known). It is one of the core forms of research misconduct. Fix: preregistration, Registered Reports, and a clear separation of exploratory and confirmatory analyses.

📝 Self-check

1. A paper writes "p = 0.02, so the probability that H₀ is true is 2 %." What is the best response?

A. Fine, that's what p means

B. Rephrase as "98 % chance H₁ is true" for more precision

C. Wrong: p is P(data | H₀), not P(H₀ | data); the latter requires a Bayesian analysis

D. Doesn't matter, all statistics papers write it this way

2. To "prove equivalence" of a generic to a brand-name drug during development, the most appropriate design is:

A. Run a classical two-sample t-test; declare equivalence if p > 0.05

B. TOST equivalence test: pre-specify δ (e.g. ±20 %), then verify that the 90 % CI lies within (−δ, +δ)

C. One-tailed t-test in whichever direction is convenient

D. Run 100 tests and report the least significant one

3. Gelman & Loken's "garden of forking paths" describes:

A. Multicollinearity

B. Heterogeneity testing

C. Even without deliberate p-hacking, each small analytic choice would vary with the data, creating implicit multiple comparisons

D. Non-independent samples

4. For Greenland's (2019) S-value, p = 0.05 corresponds to approximately:

A. 0.05 bits

B. 0.5 bits

C. 20 bits

D. ≈ 4.3 bits (about 4 fair coin flips all landing the same way)