STEP 4 / 13

假設檢定 (Hypothesis Testing)

五步框架、H₀ vs H₁、Type I/II 錯誤、α/β、p-value 真意——所有推論統計的中樞神經,也是被誤解最深的一章。

The five-step framework, H₀ vs H₁, Type I/II errors, α/β, and what a p-value really means — the central nervous system of inferential statistics, and the most misunderstood chapter in the book.

為什麼假設檢定被廣泛誤解? (Why Is Hypothesis Testing So Widely Misunderstood?)

假設檢定是 20 世紀統計學「最成功」也「最被濫用」的工具。Fisher(1925)原本提出 p-value 作為「資料是否值得進一步研究」的弱證據指標;Neyman & Pearson(1933)後來補上 α、β、Type I/II error 的決策框架。今天的「混合派」(hybrid logic)是兩者倉促拼接的產物——這就是為什麼 90% 的研究者在被問到「p = 0.04 是什麼意思?」時答錯。

ASA(American Statistical Association)2016 統計聲明(Wasserstein & Lazar 2016, Am Stat)正式承認:研究界長期誤用 p-value,列出六大原則。2019 年的續篇〈Moving to a World Beyond "p < 0.05"〉更直接呼籲:停止用「統計顯著」這個詞。本章不只教你怎麼「跑檢定」,更要教你怎麼讀懂並誠實報告

Hypothesis testing is the 20th century's most successful — and most abused — statistical tool. Fisher (1925) proposed p-values as weak evidence that the data deserve further study; Neyman & Pearson (1933) later added α, β, and Type I/II decision theory. The "hybrid" logic taught today is a hasty splicing of the two — which is exactly why 90 % of researchers misread "p = 0.04" when asked.

The ASA 2016 Statement (Wasserstein & Lazar 2016, Am Stat) formally acknowledged decades of misuse and listed six principles. Its 2019 sequel — "Moving to a World Beyond p < 0.05" — went further: stop using the phrase "statistically significant" altogether. This chapter teaches you not just how to run a test, but how to read it and report it honestly.

💡
Fisher 的本意:在《Statistical Methods for Research Workers》(1925),Fisher 把 p < 0.05 描述為「值得繼續追問」的門檻,不是下結論的終點。他若見今日「p < 0.05 就投稿、p > 0.05 就丟抽屜」的習慣,一定會大發雷霆。

What Fisher really meant: in Statistical Methods for Research Workers (1925), Fisher described p < 0.05 as "worth a closer look", not as a finish line. Were he alive to see the modern habit of "p < 0.05 = submit, p > 0.05 = file-drawer," he would be furious.

一、五步檢定框架 (The Five-Step Testing Framework)

不論你跑 t 檢定、卡方、ANOVA、迴歸——所有頻率學派假設檢定都走同樣的五步。背下這五步比背公式重要十倍:

Whether you run a t-test, chi-square, ANOVA, or regression — every frequentist test follows the same five steps. Memorising them matters ten times more than memorising formulas:

1. 陳述 H₀ 與 H₁ (State H₀ and H₁)

H₀(虛無假設):「沒有效應 / 沒有差異 / 沒有關聯」(μ₁ = μ₂)。H₁(對立假設):你想證據支持的東西(μ₁ ≠ μ₂、μ₁ > μ₂、或 μ₁ < μ₂)。寫下方向(單尾 vs 雙尾)必須在看資料前定,事後改是 p-hacking。

H₀ (null): "no effect / no difference / no association" (μ₁ = μ₂). H₁ (alternative): what you want evidence for (μ₁ ≠ μ₂, μ₁ > μ₂, or μ₁ < μ₂). Pick directionality (one- vs two-tailed) before you see the data — switching after is p-hacking.

2. 決定 α (Choose α)

傳統 α = 0.05 是 Fisher 1925 的歷史慣例而非物理常數。基因體研究(多重比較)通常用 α = 0.05/m(Bonferroni)或 FDR 控制;高利害臨床試驗用 α = 0.025(單尾);高能物理用 5σ(單尾 p ≈ 3×10⁻⁷)。α 應依據錯誤成本選擇。

The traditional α = 0.05 is Fisher's 1925 convention, not a physical constant. Genomics studies (many simultaneous tests) typically use α = 0.05/m (Bonferroni) or FDR control; pivotal clinical trials use one-sided α = 0.025; particle physics uses 5σ (one-sided p ≈ 3 × 10⁻⁷). α should be chosen by the cost of being wrong.

3. 計算檢定統計量 (Compute the test statistic)

把資料壓成一個數字:z、t、F、χ² 等。z 與 t 的一般形式:統計量 = (估計值 − 虛無值) / 標準誤。這個數字在 H₀ 為真時有已知分布。

Compress the data into a single number: z, t, F, or χ². For z and t the general form is statistic = (estimate − null value) / standard error. Under H₀ that number follows a known distribution.

4. 得 p-value (Obtain the p-value)

p = P(觀察到目前或更極端的統計量 | H₀ 為真)。注意三個必要條件:(a) 「或更極端」——所以 p 是尾部累積機率,不是點機率;(b) 「假設 H₀ 為真」——p 不是 H₀ 的機率;(c) 「目前的檢定設計」——若你偷看資料才決定方向,這個 p 是錯的。

p = P(statistic this extreme or more | H₀ true). Three indispensable pieces: (a) "or more extreme" — p is a tail probability, not a point one; (b) "assuming H₀ is true" — p is not the probability of H₀; (c) "under the current design" — peeking before picking a direction invalidates it.

5. 做決定 (Make the decision)

p < α → 「拒絕 H₀」;p ≥ α → 「未能拒絕 H₀」(不是「接受 H₀」、也不是「證明沒效應」)。同時必須報告 效應量(Cohen's d、OR、HR)與 95% CI——這比 p 本身更有資訊量。

p < α → "reject H₀"; p ≥ α → "fail to reject H₀" (not "accept H₀", not "no effect proven"). Always also report the effect size (Cohen's d, OR, HR) and 95 % CI — they carry more information than p itself.

p = P( |Z| ≥ |z_obs| ∣ H₀ )   ·   α = P( reject H₀ ∣ H₀ true )   ·   β = P( fail to reject H₀ ∣ H₁ true )   ·   Power = 1 − β

注意 α 與 p-value 不同:α 是檢定設計階段選的長期錯誤率;p 是觀察到資料後計算的尾部機率。混淆兩者是文獻中最常見的概念錯誤之一(Hubbard 2004 Theory Psychol)。

Note: α and the p-value are different. α is a long-run error rate chosen at design time; p is a tail probability computed after seeing the data. Conflating them is one of the most common conceptual errors in the literature (Hubbard 2004 Theory Psychol).
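The five steps can be traced end to end on a toy one-sample z-test. This is a minimal sketch, not a recommended analysis: the 25 measurements are invented for illustration, and the population SD is assumed known (σ = 1), which is what makes a z-test rather than a t-test appropriate.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data (invented for illustration): 25 biomarker measurements.
sample = [5.4, 4.9, 5.8, 5.1, 6.0, 5.3, 4.7, 5.6, 5.9, 5.2,
          5.5, 4.8, 5.7, 5.0, 6.1, 5.3, 5.4, 4.6, 5.8, 5.2,
          5.6, 5.1, 5.9, 5.0, 5.5]

# Step 1: state H0: mu = 5.0 vs H1: mu != 5.0 (two-tailed, fixed in advance).
mu0 = 5.0
# Step 2: choose alpha before looking at the data.
alpha = 0.05
# Assumption: population SD is known, so the statistic is a z-score.
sigma = 1.0

# Step 3: statistic = (estimate - null value) / standard error.
estimate = sum(sample) / len(sample)
se = sigma / sqrt(len(sample))
z_obs = (estimate - mu0) / se

# Step 4: p = P(|Z| >= |z_obs| | H0), a two-sided tail probability.
std_normal = NormalDist()
p_value = 2 * (1 - std_normal.cdf(abs(z_obs)))

# Step 5: decide, and report the estimate with its 95% CI alongside p.
z_crit = std_normal.inv_cdf(1 - alpha / 2)
ci = (estimate - z_crit * se, estimate + z_crit * se)
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"mean = {estimate:.3f}, z = {z_obs:.2f}, p = {p_value:.3f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}) -> {decision}")
```

Note that the decision and the CI agree by construction: the two-sided test at α = 0.05 rejects exactly when the 95 % CI excludes the null value.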

二、Type I / II 與單尾 vs 雙尾 (Type I / II Errors and One- vs Two-Tailed Tests)

Type I (α)

P(reject H₀ ∣ H₀ true)。明明沒效應,卻說有,「無中生有」。α 是長期偽陽性率:α = 0.05 時,在 H₀ 確實為真的情況下跑 100 次檢定,平均 5 次會誤報。臨床上對應「批准了無效藥」,FDA 嚴格控制。

P(reject H₀ ∣ H₀ true). No real effect, but you call one: making something from nothing. α is the long-run false-positive rate: at α = 0.05, of 100 tests in which H₀ is actually true, on average 5 misfire. Clinically: "approving a useless drug." Tightly controlled by the FDA.

🙈

Type II (β)

P(fail to reject H₀ ∣ H₁ true)。明明有效應,卻沒偵測到——「眼前漏看」。β 由樣本大小、效應量、α 共同決定。Power = 1 − β。傳統慣例:power ≥ 0.80(β ≤ 0.20)。Type II 是 underpowered 研究的禍源——見 Step 11。

P(fail to reject H₀ ∣ H₁ true). A real effect, but you missed it. β depends on n, effect size, and α together. Power = 1 − β. Convention: power ≥ 0.80 (β ≤ 0.20). Type II is the curse of underpowered studies — see Step 11.

⚖️

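α vs β 拉鋸 (The α vs β trade-off)

固定 n、效應量與變異時,降低 α 會升高 β(拒絕門檻變嚴 → 更難偵測真效應)。在效應量與測量雜訊不變的前提下,要同時降低兩者就得增加 n。Neyman-Pearson lemma:對簡單假設(點 H₀ 對點 H₁),在固定 α 下概似比檢定的 power 最大;若同一個檢定對 H₁ 的每個點都最強,稱為 UMP(一致最強檢定)。

For fixed n, effect size, and variance, lowering α raises β (stricter threshold ⇒ harder to detect real effects). With the effect size and measurement noise held fixed, lowering both requires a larger n. The Neyman–Pearson lemma: for simple hypotheses (point H₀ vs point H₁), the likelihood-ratio test maximises power at a fixed α; a test that is most powerful for every point in H₁ is called UMP (uniformly most powerful).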
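The trade-off can be checked numerically. The sketch below assumes a two-sided, two-sample z-test with equal group sizes and unit variance; `power_two_sample_z` is an illustrative helper, not a library function, and it ignores the negligible chance of rejecting in the wrong tail.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample_z(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    d is the standardised effect size (mean difference / SD) and
    n_per_group the size of each arm. Rejections in the tail opposite
    to the true effect are ignored (they are vanishingly rare).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value set by alpha
    shift = d * sqrt(n_per_group / 2)            # mean of the statistic under H1
    return 1 - std_normal.cdf(z_crit - shift)    # area of H1 beyond the critical value

# Fixed effect size d = 0.5: vary alpha and n, and watch beta = 1 - power.
for alpha in (0.05, 0.01):
    for n in (30, 64, 128):
        power = power_two_sample_z(0.5, n, alpha)
        print(f"alpha = {alpha:<5} n = {n:<4} power = {power:.3f}  beta = {1 - power:.3f}")
```

Within each n, the stricter α gives the larger β; within each α, a larger n shrinks β.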

                  | H₀ true | H₁ true
Reject H₀         | ❌ Type I error(機率 α)/ Type I error (prob. α) | ✅ 正確(機率 1 − β = power)/ Correct (prob. 1 − β = power)
Fail to reject H₀ | ✅ 正確(機率 1 − α)/ Correct (prob. 1 − α) | ❌ Type II error(機率 β)/ Type II error (prob. β)

雙尾檢定 (Two-tailed test)

H₁: μ ≠ μ₀。α 分到兩端各 α/2。默認首選——大多數情況不知道方向。

H₁: μ ≠ μ₀. α split equally between both tails. Default choice — most of the time you do not know the direction.

單尾檢定 (One-tailed test)

H₁: μ > μ₀(或 <)。α 全放一邊。須事前嚴格證明反方向的效應「不重要或不可能」(如生物等價性、非劣性),否則就是 p-hacking。

H₁: μ > μ₀ (or <). All α in one tail. Requires pre-specified justification that the opposite direction is "uninteresting or impossible" (e.g. equivalence, non-inferiority). Otherwise it is p-hacking.

⚠️
單尾 = 偷一倍 power?常見誤解:「我用單尾 p 比較容易 < 0.05」。真相是:你只能在事前合理排除反方向的情況才能用單尾;事後改方向是學術不端。Bland & Altman 1994 BMJ 明確警告。

"One-tailed = free power"? A common myth. The truth: you may use a one-tailed test only when the opposite direction has been pre-specified as uninteresting; switching after seeing the data is research misconduct. Bland & Altman 1994 BMJ are explicit.

α、β、Power 視覺化 (Visualising α, β, and Power)

下方顯示兩個常態分布:左邊(藍)= H₀ 真實時統計量的分布;右邊(綠)= H₁ 真實時的分布,差距 = 效應量 d。紅色垂直線是臨界值(由 α 決定)。紅尾 = α(Type I error,H₀ 區域中超過臨界值的部分);橙色區 = β(Type II error,H₁ 區域中未超過臨界值的部分);綠色區 = power(H₁ 區域中超過臨界值的部分,1 − β)。拖滑桿觀察三者怎麼動。

Two normal densities below: left (blue) = sampling distribution under H₀; right (green) = under H₁, gap = effect size d. The red vertical line is the critical value (set by α). Red tail = α (Type I, the slice of H₀ beyond critical); orange = β (Type II, slice of H₁ below critical); green = power (slice of H₁ above critical, 1 − β). Drag the sliders and watch all three react.

藍 = H₀ · 綠 = H₁ · 紅線 = 臨界值 · 紅尾 = α · 橙 = β · 綠區 = power

Blue = H₀ · Green = H₁ · Red line = critical value · Red tail = α · Orange = β · Green = power

三、ASA、S-value、分岔花園 (ASA Statements, S-values, and the Garden of Forking Paths)

2016 年 3 月 7 日,美國統計學會(ASA)發表史上第一份對單一統計概念的官方立場聲明——〈The ASA Statement on p-Values: Context, Process, and Purpose〉(Wasserstein & Lazar 2016)。文件列出六大原則,每一條都對應一個常見誤用:

On 7 March 2016 the American Statistical Association issued the first official position statement in its history about a single statistical concept — "The ASA Statement on p-Values: Context, Process, and Purpose" (Wasserstein & Lazar 2016). The document lists six principles, each addressing a frequent misuse:

原則 (Principle) | 說明 (Statement)
1 | p-value 衡量「資料與某個指定統計模型的不相容程度」。/ p-values can indicate how incompatible the data are with a specified statistical model.
2 | p-value 不是「假設為真的機率」,也不是「結果是偶然產生的機率」。/ p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3 | 科學結論與商業/政策決定,不應僅以 p 是否越過某個門檻來決定。/ Scientific conclusions and business or policy decisions should not rest solely on whether p crosses a threshold.
4 | 適當的推論需要完整報告與透明化(含分析過程、樣本選擇、模型假設)。/ Proper inference requires full reporting and transparency.
5 | p-value 不衡量效應大小或結果重要性。/ A p-value does not measure the size or importance of an effect.
6 | 單一 p-value 不能提供完整證據,須配合效應量、CI、先驗、研究設計一起讀。/ By itself a p-value does not provide a good measure of evidence; pair it with effect size, CI, design, and priors.

2019 年的續篇〈Moving to a World Beyond "p < 0.05"〉(Wasserstein, Schirm & Lazar 2019, Am Stat)是一篇社論,為收錄 43 篇文章的專刊作導言,立場更激進:建議停用「statistically significant」這個詞彙,因為它已被當成「結果重要 vs 不重要」的開關來濫用。同期 Amrhein, Greenland & McShane (2019, Nature) 也發起 800+ 學者連署的「Retire statistical significance」運動。

The 2019 sequel "Moving to a World Beyond p < 0.05" (Wasserstein, Schirm & Lazar 2019, Am Stat), an editorial introducing a special issue of 43 papers, went further: stop using the phrase "statistically significant", because it had become a switch flipped between "matters" and "doesn't". The same week, Amrhein, Greenland & McShane (2019, Nature) gathered 800+ scientists to "retire statistical significance".

S-value(驚奇值)

Greenland 2019 Am Stat 建議用 S = −log₂(p)——以「位元」(bit)為單位表達 p 的證據強度。直觀:S 是「擲幾次公平硬幣全部同面」的等價驚奇度。

  • p = 0.5 → S = 1 bit(1 次硬幣)
  • p = 0.05 → S ≈ 4.3 bits(≈ 4 次硬幣全朝同一面)
  • p = 0.005 → S ≈ 7.6 bits(≈ 8 次)
  • p = 0.001 → S ≈ 10 bits

S-value 的優點:(1) 線性—— p 從 0.06 到 0.04 看似巨變,S 從 4.06 到 4.64 只移 0.6 bit;(2) 不會被「< vs ≥」二分法騙;(3) 直接對應日常驚奇感。

Greenland 2019 Am Stat recommends S = −log₂(p) — measuring p in bits of surprise: how many fair-coin flips all landing the same way is equally surprising.

  • p = 0.5 → S = 1 bit (1 flip)
  • p = 0.05 → S ≈ 4.3 bits (≈ 4 flips all same)
  • p = 0.005 → S ≈ 7.6 bits (≈ 8 flips)
  • p = 0.001 → S ≈ 10 bits

Advantages: (1) linear — p moving 0.06 → 0.04 looks dramatic, but S moves only 4.06 → 4.64; (2) cannot be tricked by the < vs ≥ dichotomy; (3) maps to everyday intuition.
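The conversion is a one-liner, so the values in the list above are easy to reproduce. A minimal sketch (`s_value` is an illustrative helper name, not part of any package):

```python
from math import log2

def s_value(p):
    """Shannon surprisal of a p-value in bits: S = -log2(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return -log2(p)

# Reproduce the examples, plus the 0.06 -> 0.04 comparison.
for p in (0.5, 0.06, 0.05, 0.04, 0.005, 0.001):
    print(f"p = {p:<6} S = {s_value(p):.2f} bits")
```

Because the scale is logarithmic, halving p always adds exactly one bit, wherever p sits relative to 0.05.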

分岔花園 (The Garden of Forking Paths, 2014)

Gelman & Loken 2014 Am Sci「The Garden of Forking Paths」指出一個比 p-hacking 更陰險的問題:研究者不需故意多次測試——只要在分析時做的每個小選擇(要不要剔除 outlier?要不要 log 變換?亞群分析?)都會在不同資料下不同,整體就成了多重比較。

結果:單一論文裡看似「事先設計」的單一檢定,其實是研究者腦中「眾多檢定」的選一個顯著的呈現。這比赤裸的 p-hacking 更難偵測,也更普遍。Borges 的「岔路花園」短篇就是論文標題的靈感。

Gelman & Loken 2014 Am Sci's "Garden of Forking Paths" describes something subtler than p-hacking: the analyst need not consciously run many tests — every small decision (exclude outliers? log-transform? subgroup?) would differ on different data, so the overall procedure is itself a multiple comparison.

Result: a single "pre-specified" test in a paper is the surviving member of many implicit tests the analyst's mind ran. Harder to detect than overt p-hacking — and far more common. Borges' "Garden of Forking Paths" is the inspiration for the name.

⚠️
解方:preregistration(預先註冊)。在收資料前把假設、檢定、α、停止規則寫死並上傳到第三方平台(OSF Registries、AsPredicted、ClinicalTrials.gov)。Registered Reports(《Cortex》《Nature Human Behaviour》《eLife》接受)更進一步:同行評審先審 protocol、再做實驗,無論結果都發表。Nosek et al. 2018 PNAS 估計 RR 能把偽陽性率降低 5–10 倍。

The remedy: preregistration. Before data collection, lock down hypotheses, tests, α, and stopping rules on a third-party platform (OSF Registries, AsPredicted, ClinicalTrials.gov). Registered Reports (offered by Cortex, Nature Human Behaviour, eLife) go further: peer review the protocol, then run the experiment, publish regardless of outcome. Nosek et al. 2018 PNAS estimate RRs cut false-positive rates 5–10×.

四、等效與非劣性 (Equivalence and Non-Inferiority)

傳統 NHST 只能否決「無差異」,無法證明「無差異」——這是它最常被誤用的地方(看到 p > 0.05 就說「兩組相同」是嚴重錯誤)。當研究目的是展示差不多(如學名藥 vs 原廠藥)或不會更差(新療法非劣於現有標準),應改用以下兩種設計:

Classical NHST can only reject "no difference"; it cannot prove "no difference" — its single most abused property (concluding "equal" from p > 0.05 is a serious error). When the goal is to show similar enough (generic vs brand drug) or not meaningfully worse (new therapy non-inferior to standard), switch to one of these designs:

🎯 Equivalence — TOST

Schuirmann 1987 J Pharmacokinet Biopharm 的「Two One-Sided Tests」:定義等效邊界 δ(如 ±20%),然後做兩個單尾檢定

H₀: μ₁ − μ₂ ≤ −δ 或 ≥ +δ   vs   H₁: −δ < μ₁ − μ₂ < +δ

兩個 p 都 < α(α = 0.05 時等價於 90% CI,即 (1 − 2α) CI,完全落在 (−δ, +δ) 內),即可宣告等效。FDA 生物等效性指引(2003、2022 更新)要求 GMR 的 90% CI 必須落在 (0.80, 1.25) 內。Lakens 2017 Soc Psychol Personal Sci 是入門 TOST 必讀,配套 R 套件 TOSTER。

Schuirmann 1987 J Pharmacokinet Biopharm's Two One-Sided Tests: define an equivalence margin δ (e.g. ±20 %) and run two one-sided tests:

H₀: μ₁ − μ₂ ≤ −δ or ≥ +δ   vs   H₁: −δ < μ₁ − μ₂ < +δ

If both p-values are < α (equivalently, at α = 0.05, the 90 % CI, i.e. the (1 − 2α) CI, lies entirely inside (−δ, +δ)) you may claim equivalence. FDA Bioequivalence Guidance (2003; updated 2022) requires the 90 % CI of the GMR to lie within (0.80, 1.25). Lakens 2017 Soc Psychol Personal Sci is the canonical tutorial; R package TOSTER.

⚖️ Non-Inferiority

新療法只需證明不比標準療法差超過臨床可接受的幅度 δ 即可,常見於有道德問題不能放安慰劑的臨床試驗。CONSORT-NI 2012 JAMA(Piaggio et al.)規範報告:

H₀: μ_new − μ_std ≤ −δ   vs   H₁: μ_new − μ_std > −δ

δ(非劣性邊界)必須事先決定,並基於臨床意義、不是統計方便。報告應同時呈現 CI 與 δ,並明確標示「non-inferiority margin」。FDA Non-Inferiority Guidance 2016 提供範例。

The new therapy need only be not meaningfully worse than the standard — common in clinical trials where placebo is unethical. CONSORT-NI 2012 JAMA (Piaggio et al.) regulates reporting:

H₀: μ_new − μ_std ≤ −δ   vs   H₁: μ_new − μ_std > −δ

The margin δ must be pre-specified on clinical, not statistical, grounds. Reports must show the CI alongside δ and explicitly label the "non-inferiority margin". FDA Non-Inferiority Guidance 2016 provides templates.
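A non-inferiority check reduces to comparing the lower confidence bound of the difference with −δ. The sketch below is a z-approximation from summary statistics, assuming higher outcomes are better and a one-sided α of 0.025; the function name and the example numbers are hypothetical, chosen only to illustrate the mechanics.

```python
from math import sqrt
from statistics import NormalDist

def non_inferiority_z(mean_new, mean_std, sd_new, sd_std, n_new, n_std,
                      margin, alpha=0.025):
    """One-sided z-approximation for H0: mu_new - mu_std <= -margin.

    Assumes higher outcomes are better and samples large enough for the
    normal approximation. Returns the difference, the lower one-sided
    confidence bound, the p-value, and the non-inferiority verdict.
    """
    std_normal = NormalDist()
    diff = mean_new - mean_std
    se = sqrt(sd_new ** 2 / n_new + sd_std ** 2 / n_std)
    z_obs = (diff + margin) / se                  # distance from the margin in SE units
    p_value = 1 - std_normal.cdf(z_obs)
    lower = diff - std_normal.inv_cdf(1 - alpha) * se
    return {"diff": diff, "lower_bound": lower,
            "p_value": p_value, "non_inferior": lower > -margin}

# Hypothetical trial: new therapy scores slightly lower, margin delta = 1.0.
result = non_inferiority_z(9.8, 10.0, 2.0, 2.0, 200, 200, margin=1.0)
print(f"diff = {result['diff']:.2f}, lower bound = {result['lower_bound']:.3f}, "
      f"margin = -1.0, non-inferior: {result['non_inferior']}")
```

The p-value rule and the CI rule are the same decision: p < α exactly when the lower bound lies above −δ. Reporting the bound next to δ is what lets a reader see how much room was left.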

常見錯誤:「p = 0.31,兩組沒有顯著差異 → 兩組相同」← 大錯!p > 0.05 只代表「沒有足夠證據說有差異」,可能是真的沒差、也可能是 power 不夠。要證明「相同」,請跑等效檢定。Greene et al. 2008 BMJ 估計約 60% 的非顯著結果論文錯誤地宣稱「無差異」。

Common error: "p = 0.31, no significant difference → the groups are equal." Wrong. p > 0.05 only means "insufficient evidence of a difference": there may be no effect, or the study may be underpowered. To claim equality, run an equivalence test. Greene et al. 2008 BMJ estimate ~60 % of non-significant-result papers wrongly conclude "no difference".

p-value 的分布長相 (What the Distribution of p-values Looks Like)

很多人不知道:當 H₀ 為真時,p-value 是均勻分布 U(0,1)——所以你看到 p = 0.04 跟 p = 0.96 的機率一樣大!只有當 H₁ 為真時,p 才會偏向 0(小 p 變多)。下方模擬 5000 次雙樣本 z 檢定,把所有 p 畫成 histogram。試試 d = 0(H₀ 真)vs d = 0.5(中等效應)。

A surprising fact: under H₀ the p-value is uniformly distributed on (0,1) — so p = 0.04 and p = 0.96 are equally likely. Only under H₁ does the distribution skew toward 0 (small p's become more common). The widget below simulates 5000 two-sample z-tests and histograms the p's. Try d = 0 (H₀ true) vs d = 0.5 (medium effect).

紅虛線 = α = 0.05 · 紅色 bin = p < 0.05 的比例(H₁ 為真時 ≈ 觀察 power;H₀ 為真時 ≈ α)

Red dashed = α = 0.05 · Red bin = fraction with p < 0.05 (≈ observed power when H₁ is true; ≈ α when H₀ is true)
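The same experiment can be run without the widget. This is a rough sketch using only the standard library, assuming two groups of 30 with known unit variance (so a z-test applies); the group size, seed, and helper name are arbitrary choices, not taken from the page.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_p_values(d, n_per_group=30, n_sims=5000, seed=1):
    """Simulate two-sided two-sample z-test p-values for a true effect d.

    Both groups are drawn from normals with SD = 1 (assumed known).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = sqrt(2 / n_per_group)
    p_values = []
    for _ in range(n_sims):
        ctrl = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        trt = sum(rng.gauss(d, 1.0) for _ in range(n_per_group)) / n_per_group
        z_obs = (trt - ctrl) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z_obs))))
    return p_values

for d in (0.0, 0.5):
    p_values = simulate_p_values(d)
    # Ten equal-width bins: flat under H0, piled up near 0 under H1.
    bins = [0] * 10
    for p in p_values:
        bins[min(int(p * 10), 9)] += 1
    frac_sig = sum(p < 0.05 for p in p_values) / len(p_values)
    print(f"d = {d}: bin counts = {bins}, fraction p < 0.05 = {frac_sig:.3f}")
```

With d = 0 the fraction below 0.05 should hover near α; with d = 0.5 it estimates the power of this particular design.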

五、檢定設計決策樹 (Decision Tree for Test Design)

🌳 檢定設計決策樹

Q1: 你的研究問題是「有沒有差異?」(探索性 / 比較性研究)→ 是 → 傳統 significance test(t、ANOVA、迴歸),雙尾為主;務必同時報 effect size + 95% CI。
Q2: 你的問題是「兩者效果差不多嗎?」(學名藥、不同儀器交叉驗證)→ 是 → Equivalence test (TOST),事前定 δ,跑兩個單尾。
Q3: 「新療法不會比現有更差嗎?」(活性對照臨床試驗)→ 是 → Non-inferiority test,事前定 δ;遵循 CONSORT-NI。
Q4: 多個假設要同時檢定(基因體、imaging)?→ 是 → 加上 FDR 或 Bonferroni 校正(見 Step 12);不要單獨用 raw p。
Q5: 樣本數很小(n < 10)或分布極不正常?→ 是 → permutation / exact test(Fisher's exact、Wilcoxon、bootstrap),別硬套常態檢定。
Q6: 關心的方向已在事前嚴格確定?→ 是 → 才能用單尾;否則一律雙尾。

Q1: "Is there a difference?" (exploratory / comparative) → Yes → classical significance test (t, ANOVA, regression), two-tailed by default; report effect size + 95 % CI alongside.
Q2: "Are the two effectively equal?" (generic drug, cross-device validation) → Yes → Equivalence (TOST); pre-specify δ; run two one-sided tests.
Q3: "Is the new therapy not meaningfully worse?" (active-control trial) → Yes → Non-inferiority with a pre-specified δ; follow CONSORT-NI.
Q4: Many hypotheses at once (genomics, imaging)? → Yes → add FDR or Bonferroni correction (Step 12); never report raw p alone.
Q5: Very small n (< 10) or wildly non-normal? → Yes → use permutation / exact tests (Fisher's exact, Wilcoxon, bootstrap) rather than forcing normality.
Q6: Direction is strictly pre-specified? → Yes → only then may you use a one-tailed test; otherwise two-tailed.
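Q4 asks for a correction whenever many hypotheses are tested together. A minimal sketch of why, assuming m independent tests in which every H₀ is true (the helper names are illustrative, not from any library):

```python
def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests, all H0 true."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(m, alpha=0.05):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / m

for m in (1, 5, 20, 100):
    raw = familywise_error_rate(m)
    corrected = familywise_error_rate(m, bonferroni_alpha(m))
    print(f"m = {m:<4} uncorrected FWER = {raw:.3f}   Bonferroni FWER = {corrected:.3f}")
```

Bonferroni is the bluntest option and costs power as m grows; FDR procedures control a different, more lenient error rate.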

七、實作範例 (Worked Examples)

# R: Welch t-test, TOST equivalence, effect size + CI
library(tidyverse)
library(TOSTER)        # equivalence (Lakens 2017)
library(effectsize)    # Cohen's d + 95% CI

# --- Synthetic biomarker example ---
set.seed(42)
ctrl <- rnorm(30, mean = 5.0, sd = 1.0)
trt  <- rnorm(30, mean = 5.5, sd = 1.0)

# --- (1) Classical two-sided test ---
res <- t.test(trt, ctrl, var.equal = FALSE)   # Welch by default
res$p.value
res$conf.int
effectsize::cohens_d(trt, ctrl)         # d + 95% CI

# --- (2) Report S-value (Greenland 2019) ---
s_val <- -log2(res$p.value)             # bits of surprise

# --- (3) TOST equivalence test (δ = 0.4) ---
TOSTER::tsum_TOST(
  m1 = mean(trt), sd1 = sd(trt), n1 = 30,
  m2 = mean(ctrl), sd2 = sd(ctrl), n2 = 30,
  low_eqbound = -0.4, high_eqbound = 0.4,
  eqbound_type = "raw", alpha = 0.05
)
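# If both one-sided p < 0.05, conclude equivalence within ±0.4 units.
# (The simulated true difference is 0.5, so equivalence is unlikely to be shown here.)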

# --- (4) Recommended sentence ---
sprintf("Mean diff = %.2f (95%% CI %.2f to %.2f); p = %.3f (S = %.1f bits)",
        mean(trt) - mean(ctrl),
        res$conf.int[1], res$conf.int[2],   # CI is for trt - ctrl, same sign as the reported diff
        res$p.value, s_val)

# Python: Welch t-test, Cohen's d, S-value, TOST equivalence
import numpy as np
from scipy import stats
import math

# Synthetic biomarker example
rng = np.random.default_rng(42)
ctrl = rng.normal(5.0, 1.0, 30)
trt  = rng.normal(5.5, 1.0, 30)

# (1) Welch t-test (two-sided)
res = stats.ttest_ind(trt, ctrl, equal_var=False)
mean_diff = trt.mean() - ctrl.mean()

# 95% CI for mean difference
se = math.sqrt(trt.var(ddof=1)/len(trt) + ctrl.var(ddof=1)/len(ctrl))
df = res.df
tcrit = stats.t.ppf(0.975, df)
ci = (mean_diff - tcrit*se, mean_diff + tcrit*se)

# (2) Cohen's d (pooled SD; no Hedges small-sample correction applied)
pooled = math.sqrt(((len(trt)-1)*trt.var(ddof=1) +
                   (len(ctrl)-1)*ctrl.var(ddof=1)) / (len(trt)+len(ctrl)-2))
d = mean_diff / pooled

# (3) S-value (Greenland 2019)
s_val = -math.log2(res.pvalue)

# (4) TOST equivalence (δ = ±0.4)
delta = 0.4
t1 = (mean_diff - (-delta)) / se   # H0: diff <= -delta
t2 = (mean_diff -   delta ) / se   # H0: diff >= +delta
p1 = 1 - stats.t.cdf(t1, df)
p2 = stats.t.cdf(t2, df)
p_tost = max(p1, p2)              # larger of the two 1-sided p's

print(f"diff = {mean_diff:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
print(f"p = {res.pvalue:.3f}  S = {s_val:.1f} bits  d = {d:.2f}")
print(f"TOST p = {p_tost:.3f} → equivalent within ±{delta}: {p_tost < 0.05}")
💡
練習:把過去一篇你寫的論文裡的「p = 0.0X」改寫成「diff (95% CI); p (S bits)」格式。多半你會發現 CI 的一端其實很靠近 0,或效應量小到沒臨床意義。這正是 ASA 2019 提倡的「資訊密度報告」。

Exercise: rewrite a "p = 0.0X" line from your last paper as "diff (95 % CI); p (S bits)". Often you'll discover that one end of the CI sits close to 0, or that the effect is too small to matter clinically. This is exactly the information-rich reporting ASA 2019 advocates.

八、六大陷阱 (Six Major Pitfalls)

p < 0.05 懸崖 (The p < 0.05 cliff)

p = 0.049 不是「正確」、p = 0.051 不是「錯誤」,兩者證據強度幾乎相同(S ≈ 4.35 vs 4.29 bits)。把 0.05 當二分懸崖是 ASA 2016 第 3 條原則明文反對的。解方:報告連續的 p、effect size、95% CI,讓讀者自己判斷臨床重要性。

p = 0.049 is not "right"; p = 0.051 is not "wrong". The evidence strength is essentially equal (S ≈ 4.35 vs 4.29 bits). Treating 0.05 as a cliff is explicitly rejected by ASA 2016 Principle 3. Fix: report continuous p, effect size, and 95 % CI; let the reader judge clinical importance.

事後挑選 (Cherry-picking after the fact)

「我跑了 20 個亞群分析,挑了最顯著的兩個發表。」這是赤裸的 p-hacking。即使 H₀ 全部為真,跑 20 個獨立檢定中至少一個 p < 0.05 的機率是 1 − (1−0.05)²⁰ ≈ 64%。解方:preregistration + FDR / Bonferroni 校正(Step 12)。

"I ran 20 subgroup analyses and reported the two most significant." Plain p-hacking. Even if all H₀ are true, the chance at least one of 20 independent tests has p < 0.05 is 1 − (1 − 0.05)²⁰ ≈ 64 %. Fix: preregistration + FDR / Bonferroni (Step 12).

「不顯著」≠「沒效應」 ("Not significant" ≠ "no effect")

Altman & Bland 1995 BMJ: 〈Absence of evidence is not evidence of absence〉。p > 0.05 可能來自真的沒效應、樣本太小、變異太大、設計不良。解方:看 95% CI——若 CI 跨 0 但寬到 (−5, +5),就是「不確定」,不是「等於 0」。要證明「相同」改用 TOST。

Altman & Bland 1995 BMJ: "Absence of evidence is not evidence of absence." p > 0.05 may mean no effect, small n, big variance, or bad design. Fix: read the 95 % CI — if it crosses 0 but spans (−5, +5), that's "uncertain", not "zero". Use TOST to claim equality.

基率謬誤 (Base-rate fallacy)

「p = 0.05 就 95% 信任結果」是錯的。若被檢定的假設只有 10% 為真,依貝氏分析(Ioannidis 2005 PLoS Med「Why Most Published Research Findings Are False」的邏輯):在 α = 0.05、power = 0.80 下,PPV(陽性預測值)= 0.08 / (0.08 + 0.045) ≈ 64%,也就是約 36% 的 p < 0.05 結果是偽陽性;若 power 只有 0.20,PPV 會掉到約 31%,此時大半的顯著結果是偽陽性。解方:提高 power、preregister、複現研究。

"p = 0.05, 95 % trust" is wrong. If only 10 % of tested hypotheses are true, Bayes' rule (the logic of Ioannidis 2005 PLoS Med: "Why Most Published Research Findings Are False") gives, at α = 0.05 and power = 0.80, a PPV of 0.08 / (0.08 + 0.045) ≈ 64 %, so about 36 % of p < 0.05 results are false positives. If power is only 0.20, the PPV drops to about 31 % and most significant results are false positives. Fix: raise power, preregister, replicate.

缺 effect size + CI (Missing effect size + CI)

大 n 下,p = 0.0001 也可能對應 d = 0.05 的微小效應——統計顯著但臨床無意義。CONSORT 2010 / SAMPL 2015 都要求同時報告效應量 + 95% CI。

With large n, p = 0.0001 may correspond to a trivial d = 0.05 — statistically significant but clinically meaningless. CONSORT 2010 / SAMPL 2015 both require effect size + 95 % CI alongside p.

HARKing

看完結果再回頭「事後設計」假設讓 p 顯著。Kerr 1998 Pers Soc Psychol Rev 命名為 HARKing(Hypothesizing After the Results are Known)。典型的可疑研究行為(questionable research practice)之一。解方:preregistration、Registered Reports、分離「探索 vs 確證」分析。

Reverse-engineering a hypothesis to fit the significant finding. Kerr 1998 Pers Soc Psychol Rev named it HARKing (Hypothesizing After the Results are Known). A canonical questionable research practice. Fix: preregistration, Registered Reports, separating exploratory vs confirmatory analyses.
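Returning to the base-rate pitfall above: the PPV arithmetic is short enough to check directly. A minimal sketch of the Bayes calculation, ignoring bias and multiple testing (which the full Ioannidis model adds); the function name is illustrative.

```python
def positive_predictive_value(prior, alpha=0.05, power=0.80):
    """P(H1 true | p < alpha), given the share of tested hypotheses that are true."""
    true_positives = power * prior           # real effects that get detected
    false_positives = alpha * (1 - prior)    # true nulls that cross the threshold
    return true_positives / (true_positives + false_positives)

# Share of true hypotheses = 10 %, at conventional and at low power.
for power in (0.80, 0.50, 0.20):
    ppv = positive_predictive_value(0.10, alpha=0.05, power=power)
    print(f"power = {power:.2f}: PPV = {ppv:.2f}, "
          f"false positives among significant results = {1 - ppv:.2f}")
```

The complement of the PPV is the share of significant results that are false positives, which is the number that grows as power or the prior falls.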

📝 自我檢測 (Self-Check)

1. 你看到一篇論文寫「p = 0.02,所以 H₀ 為真的機率是 2%」,最佳回應是?

1. A paper writes "p = 0.02, so the probability H₀ is true is 2 %." Best response?

A. 沒問題,這就是 p 的意思 / Fine, that's what p means
B. 改寫成「H₁ 為真的機率是 98%」更精確 / Rephrase as "98 % chance H₁ is true"
C. 錯誤:p 是 P(data | H₀),不是 P(H₀ | data);要算後者得用貝氏分析 / Wrong: p is P(data | H₀), not P(H₀ | data); the latter needs Bayes
D. 沒差,本來統計都是這樣寫 / Doesn't matter, all stats papers write this way

2. 在學名藥研發中要「證明等效於原廠藥」,最合適的設計是?

2. To "prove equivalence" of a generic to a brand-name drug, the best design is:

A. 跑傳統雙樣本 t 檢定,若 p > 0.05 就宣告等效 / Run classical two-sample t-test; declare equal if p > 0.05
B. TOST 等效檢定:事前定 δ(如 ±20%),確認 90% CI 落在 (−δ, +δ) 內 / TOST equivalence test: pre-specify δ (e.g. ±20 %), verify 90 % CI lies in (−δ, +δ)
C. 單尾 t 檢定,方向取對自己有利的 / One-tailed t-test in the convenient direction
D. 跑 100 次然後挑最不顯著的那次 / Run 100 tests and report the least significant one

3. Gelman & Loken 的「分岔花園」(garden of forking paths)描述的問題是?

3. Gelman & Loken's "garden of forking paths" describes:

A. 多重共線性 (multicollinearity) / Multicollinearity
B. 異質性檢驗 / Heterogeneity testing
C. 即使沒有主動 p-hacking,分析中每個小選擇都會隨資料而變,整體成為隱含的多重比較 / Even without deliberate p-hacking, each analyst choice would vary with the data, creating implicit multiple comparisons
D. 樣本不獨立的問題 / Non-independent samples

4. Greenland 2019 提出的 S-value,p = 0.05 對應的 S 約為?

4. Greenland's S-value: p = 0.05 corresponds to approximately:

A. 0.05 bits
B. 0.5 bits
C. 20 bits
D. ≈ 4.3 bits(相當於 4 次公平硬幣全朝同一面)/ ≈ 4.3 bits (about 4 fair coin flips landing the same way)