Why is hypothesis testing so widely misunderstood?
Hypothesis testing is the 20th century's most successful, and most abused, statistical tool. Fisher (1925) proposed the p-value as weak evidence that the data deserve further study; Neyman & Pearson (1933) later added α, β, and the Type I/II error decision framework. The "hybrid" logic taught today is a hasty splice of the two, which helps explain why so many researchers answer incorrectly when asked what "p = 0.04" means.
The ASA (American Statistical Association) 2016 Statement (Wasserstein & Lazar 2016, Am Stat) formally acknowledged that the research community has long misused p-values, and it listed six principles. Its 2019 sequel, "Moving to a World Beyond p < 0.05", went further: stop using the phrase "statistically significant" altogether. This chapter teaches you not just how to run a test, but how to read it and report it honestly.
1. The five-step testing framework
Whether you run a t-test, chi-square, ANOVA, or regression, every frequentist hypothesis test follows the same five steps. Memorising these steps matters ten times more than memorising formulas:
State H₀ and H₁
H₀ (null hypothesis): "no effect / no difference / no association" (μ₁ = μ₂). H₁ (alternative hypothesis): what you want the evidence to support (μ₁ ≠ μ₂, μ₁ > μ₂, or μ₁ < μ₂). Fix the direction (one- vs two-tailed) before you see the data; switching afterwards is p-hacking.
Choose α
The traditional α = 0.05 is a historical convention dating to Fisher (1925), not a physical constant. Genomics (many simultaneous tests) typically uses α = 0.05/m for m tests, or FDR control; pivotal clinical trials use one-sided α = 0.025; particle physics uses 5σ (one-sided p ≈ 3 × 10⁻⁷). Choose α according to the cost of being wrong.
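The thresholds quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch using only the Python standard library; the number of tests `m` and the helper name `familywise_error` are illustrative assumptions, not values from any particular study.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, sd 1

# One-sided tail probability beyond 5 standard deviations (the "5 sigma" rule).
p_five_sigma = 1.0 - std_normal.cdf(5.0)

# Bonferroni: per-test threshold that keeps the family-wise alpha at 0.05.
family_alpha = 0.05
m = 20_000  # hypothetical number of simultaneous tests (e.g. genes)
per_test_alpha = family_alpha / m


def familywise_error(alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** k


print(f"5 sigma, one-sided p        = {p_five_sigma:.2e}")
print(f"Bonferroni per-test alpha   = {per_test_alpha:.1e}")
print(f"FWER, 20 tests at 0.05      = {familywise_error(0.05, 20):.3f}")
print(f"FWER, 20 tests at 0.05 / 20 = {familywise_error(0.05 / 20, 20):.3f}")
```

The last two lines show why a per-test α of 0.05 is too lenient when many tests are run, and how dividing by the number of tests brings the family-wise rate back under 0.05.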
Compute the test statistic
Compress the data into a single number: z, t, F, χ², and so on. For z and t the general form is statistic = (estimate − null value) / standard error. Under H₀ that number follows a known distribution.
Obtain the p-value
p = P(a statistic this extreme or more extreme | H₀ true). Three indispensable pieces: (a) "or more extreme": p is a cumulative tail probability, not a point probability; (b) "assuming H₀ is true": p is not the probability of H₀; (c) "under the current test design": if you peek at the data before choosing a direction, the p-value is invalid.
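To make the last two steps concrete, here is a small sketch of a one-sample z-test. It assumes the population standard deviation is known (an idealisation; real analyses usually need a t-test), and the summary numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test(estimate: float, null_value: float, sd: float, n: int):
    """One-sample z-test; returns (z, two-sided p, one-sided p in observed direction)."""
    se = sd / sqrt(n)                    # standard error of the mean
    z = (estimate - null_value) / se     # (estimate - null value) / standard error
    tail = 1.0 - NormalDist().cdf(abs(z))  # area beyond |z| in ONE tail
    return z, 2.0 * tail, tail


# Hypothetical data summary: sample mean 5.4, H0 mean 5.0, known sd 1.0, n = 25.
z, p_two_sided, p_one_sided = z_test(5.4, 5.0, 1.0, 25)
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

The p-value is a tail area ("this extreme or more"), which is why the two-sided value is exactly twice the one-sided value for a symmetric null distribution.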
Make the decision
p < α → "reject H₀"; p ≥ α → "fail to reject H₀" (not "accept H₀", and not "proved there is no effect"). Always report the effect size (Cohen's d, OR, HR) and its 95% CI as well; they carry more information than p itself.
2. Type I / II errors and one- vs two-tailed tests
Type I (α)
P(reject H₀ ∣ H₀ true). There is no real effect, but you claim one. α is the long-run false-positive rate: of 100 tests in which H₀ is true, on average 5 raise a false alarm at α = 0.05. The clinical counterpart is approving an ineffective drug, which the FDA controls tightly.
Type II (β)
P(fail to reject H₀ ∣ H₁ true). There is a real effect, but you failed to detect it. β is determined jointly by sample size, effect size, and α. Power = 1 − β. Convention: power ≥ 0.80 (β ≤ 0.20). Type II error is the curse of underpowered studies; see Step 11.
The α vs β tug-of-war
For fixed n, lowering α raises β (a stricter rejection threshold makes real effects harder to detect). For a given design and effect size, the only way to lower both is to increase n. The Neyman–Pearson lemma: for a simple H₀ against a simple (point) H₁ at fixed α, the likelihood-ratio test is the most powerful test.
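The trade-off can be checked numerically. This is a sketch under simplifying assumptions: a one-sided two-sample z-test with equal group sizes and a known unit variance, so that the effect size d is the standardised mean difference. The function name `power_one_sided` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def power_one_sided(d: float, n_per_group: int, alpha: float) -> float:
    """Power of a one-sided two-sample z-test (known unit variance assumed)."""
    z_crit = Z.inv_cdf(1.0 - alpha)       # critical value fixed by alpha
    shift = d * sqrt(n_per_group / 2.0)   # mean of the z statistic under H1
    return 1.0 - Z.cdf(z_crit - shift)    # P(statistic > z_crit | H1 true)


for alpha in (0.05, 0.01, 0.001):
    for n in (30, 120):
        pw = power_one_sided(0.5, n, alpha)
        print(f"alpha = {alpha:<5}  n = {n:<3}  power = {pw:.3f}  beta = {1.0 - pw:.3f}")
```

Reading down the output at fixed n shows β rising as α falls; reading across shows that a larger n is what buys back power.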
| | H₀ true | H₁ true |
|---|---|---|
| Reject H₀ | ❌ Type I error (prob. α) | ✅ Correct (prob. 1 − β = power) |
| Fail to reject H₀ | ✅ Correct (prob. 1 − α) | ❌ Type II error (prob. β) |
Two-tailed test
H₁: μ ≠ μ₀. α is split equally, α/2 in each tail. This is the default choice, because in most cases the direction is not known in advance.
One-tailed test
H₁: μ > μ₀ (or <). All of α goes in one tail. This requires a pre-specified justification that an effect in the opposite direction is "unimportant or impossible" (e.g. bioequivalence, non-inferiority); otherwise it is p-hacking.
Visualising α, β, and power
The figure shows two normal densities: left (blue) = the distribution of the statistic when H₀ is true; right (green) = its distribution when H₁ is true, with the gap equal to the effect size d. The red vertical line is the critical value (set by α). Red tail = α (Type I error, the part of the H₀ curve beyond the critical value); orange area = β (Type II error, the part of the H₁ curve below the critical value); green area = power (the part of the H₁ curve beyond the critical value, 1 − β). Drag the sliders and watch how all three move.
Blue = H₀ · Green = H₁ · Red line = critical value · Red tail = α · Orange = β · Green area = power
3. The ASA statement, S-values, and the garden of forking paths
On 7 March 2016 the American Statistical Association (ASA) issued the first official position statement in its history on a single statistical concept: "The ASA Statement on p-Values: Context, Process, and Purpose" (Wasserstein & Lazar 2016). The document lists six principles, each addressing a common misuse:
| Principle | Statement |
|---|---|
| 1 | p-values can indicate how incompatible the data are with a specified statistical model. |
| 2 | p-values do not measure the probability that the hypothesis is true, nor the probability that the data were produced by random chance alone. |
| 3 | Scientific conclusions and business or policy decisions should not rest solely on whether p crosses a threshold. |
| 4 | Proper inference requires full reporting and transparency (including the analysis process, sample selection, and model assumptions). |
| 5 | A p-value does not measure the size of an effect or the importance of a result. |
| 6 | By itself a p-value does not provide a good measure of evidence; read it together with effect size, CI, priors, and study design. |
The 2019 sequel "Moving to a World Beyond p < 0.05" (Wasserstein, Schirm & Lazar 2019, Am Stat), the editorial that introduced a special issue of 43 papers, took a more radical position: stop using the phrase "statistically significant", because it had come to be abused as a switch between "important" and "unimportant". In the same period, Amrhein, Greenland & McShane (2019, Nature) launched the "retire statistical significance" campaign, co-signed by more than 800 researchers.
S-value (surprisal)
Greenland (2019, Am Stat) recommends S = −log₂(p), which expresses the strength of evidence in p in bits. Intuition: S is the number of fair-coin tosses that would all have to land heads to be equally surprising.
- p = 0.5 → S = 1 bit (1 toss)
- p = 0.05 → S ≈ 4.3 bits (about 4 tosses, all heads)
- p = 0.005 → S ≈ 7.6 bits (about 8 tosses)
- p = 0.001 → S ≈ 10 bits
Advantages: (1) an evenly spaced log scale: p moving from 0.06 to 0.04 looks dramatic, yet S moves only from 4.06 to 4.64, about 0.6 bit; (2) it cannot be fooled by the "< vs ≥" dichotomy; (3) it maps directly onto an everyday sense of surprise.
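The conversion is a one-line formula. A minimal sketch follows (the function name `s_value` is an illustrative choice) that reproduces the values listed above.

```python
from math import log2


def s_value(p: float) -> float:
    """Surprisal in bits, S = -log2(p), for a p-value in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -log2(p)


for p in (0.5, 0.05, 0.005, 0.001):
    print(f"p = {p:<6} ->  S = {s_value(p):.1f} bits")

# Crossing the 0.05 line from 0.06 to 0.04 adds well under one bit of evidence.
step = s_value(0.04) - s_value(0.06)
print(f"S(0.04) - S(0.06) = {step:.2f} bits")
```

Because S is a logarithm, the same ratio between two p-values always corresponds to the same difference in bits, wherever the 0.05 line happens to fall.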
The garden of forking paths (2014)
Gelman & Loken (2014, Am Sci) use the image of a "garden of forking paths" to describe a problem more insidious than p-hacking: researchers need not deliberately run many tests. It is enough that each small analytic choice (exclude outliers? log-transform? analyse subgroups?) would have been made differently had the data come out differently; the overall procedure then amounts to a multiple comparison.
The result: what looks like a single "pre-specified" test in a paper is really the one that survived from the many implicit tests in the analyst's mind. This is harder to detect than overt p-hacking, and more common. The name is borrowed from Borges' short story "The Garden of Forking Paths".
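A toy simulation can illustrate the mechanism. The setup below is an assumption made purely for illustration: every study has two independent outcomes with no true effect, a known unit variance, and an analyst who runs exactly one test, but chooses which outcome to test after looking at the data.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

Z = NormalDist()


def null_study_p(rng: random.Random, n: int = 20, n_outcomes: int = 2) -> float:
    """p-value reported by one study in which no outcome has a true effect."""
    diffs = []
    for _ in range(n_outcomes):
        treated = [rng.gauss(0.0, 1.0) for _ in range(n)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        diffs.append(fmean(treated) - fmean(control))
    # The fork: only the outcome with the largest observed gap gets tested.
    chosen = max(diffs, key=abs)
    z = chosen / sqrt(2.0 / n)  # sd assumed known and equal to 1
    return 2.0 * (1.0 - Z.cdf(abs(z)))


def false_positive_rate(n_outcomes: int, n_studies: int = 4000, seed: int = 2014) -> float:
    """Share of null studies that report p < 0.05."""
    rng = random.Random(seed)
    hits = sum(null_study_p(rng, n_outcomes=n_outcomes) < 0.05 for _ in range(n_studies))
    return hits / n_studies


rate_fixed = false_positive_rate(n_outcomes=1)   # outcome fixed in advance
rate_forked = false_positive_rate(n_outcomes=2)  # outcome chosen after looking
print(f"false-positive rate, outcome fixed in advance: {rate_fixed:.3f}")
print(f"false-positive rate, outcome chosen post hoc:  {rate_forked:.3f}")
```

Only one test is computed per study in both cases, yet the data-dependent choice pushes the false-positive rate well above the nominal 5% (in theory to about 1 − 0.95² ≈ 9.75% for two independent outcomes).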
4. Equivalence and non-inferiority
Classical NHST can only reject "no difference"; it cannot demonstrate "no difference". This is where it is most often misused (concluding "the two groups are the same" from p > 0.05 is a serious error). When the goal is to show that two treatments are similar enough (generic vs brand-name drug) or that a new therapy is not meaningfully worse than the current standard, use one of the following two designs instead:
🎯 Equivalence: TOST
Schuirmann's (1987, J Pharmacokinet Biopharm) Two One-Sided Tests procedure: define an equivalence margin δ (e.g. ±20%), then run two one-sided tests:
H₀: μ₁ − μ₂ ≤ −δ or μ₁ − μ₂ ≥ +δ vs H₁: −δ < μ₁ − μ₂ < +δ
If both one-sided p-values are < α (equivalently, at α = 0.05, the 90% CI lies entirely inside (−δ, +δ)), you may claim equivalence. The FDA bioequivalence guidance (2003; updated 2022) requires the 90% CI of the geometric mean ratio (GMR) to lie within (0.80, 1.25). Lakens (2017, Soc Psychol Personal Sci) is the essential introductory tutorial, with the companion R package TOSTER.
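The equivalence between the two one-sided tests and the confidence-interval check can be verified directly. The sketch below uses a normal approximation with a known standard error (an assumption; real analyses normally use the t distribution), and the inputs are hypothetical summary numbers.

```python
from statistics import NormalDist

Z = NormalDist()


def tost_z(diff: float, se: float, delta: float, alpha: float = 0.05):
    """TOST for a mean difference, normal approximation with a known SE."""
    p_lower = 1.0 - Z.cdf((diff + delta) / se)  # tests H0: diff <= -delta
    p_upper = Z.cdf((diff - delta) / se)        # tests H0: diff >= +delta
    z_crit = Z.inv_cdf(1.0 - alpha)
    ci = (diff - z_crit * se, diff + z_crit * se)  # 100(1 - 2*alpha)% CI
    equivalent_by_p = max(p_lower, p_upper) < alpha
    equivalent_by_ci = -delta < ci[0] and ci[1] < delta
    return p_lower, p_upper, ci, equivalent_by_p, equivalent_by_ci


# Hypothetical results: a small observed difference vs a large one, margin 0.4.
close = tost_z(diff=0.05, se=0.10, delta=0.4)
far = tost_z(diff=0.50, se=0.26, delta=0.4)

for label, (p_lo, p_hi, ci, by_p, by_ci) in (("close", close), ("far", far)):
    print(f"{label}: p_lower = {p_lo:.4f}, p_upper = {p_hi:.4f}, "
          f"90% CI = ({ci[0]:.2f}, {ci[1]:.2f}), "
          f"equivalent by p: {by_p}, by CI: {by_ci}")
```

The two decision rules always agree, which is why bioequivalence results are usually reported as a 90% CI against the margin rather than as a pair of p-values.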
⚖️ Non-Inferiority
The new therapy need only be not meaningfully worse than the standard therapy. This design is common in clinical trials where a placebo arm would be unethical. CONSORT-NI (Piaggio et al. 2012, JAMA) sets the reporting standard:
H₀: μnew ≤ μstd − δ vs H₁: μnew > μstd − δ (higher values taken as better)
The non-inferiority margin δ must be pre-specified and based on clinical meaning, not statistical convenience. Reports should present the CI alongside δ and explicitly label the "non-inferiority margin". The FDA Non-Inferiority Guidance (2016) provides worked examples.
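A non-inferiority check reduces to comparing the lower confidence bound with −δ. Here is a minimal sketch using a normal approximation with a known standard error and higher values taken as better; the trial summary numbers and the function name are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()


def non_inferiority_z(mean_new: float, mean_std: float, se_diff: float,
                      delta: float, alpha: float = 0.025):
    """One-sided non-inferiority test of H0: mean_new - mean_std <= -delta."""
    diff = mean_new - mean_std
    p = 1.0 - Z.cdf((diff + delta) / se_diff)
    # Lower bound of the two-sided 100(1 - 2*alpha)% CI (95% when alpha = 0.025).
    lower = diff - Z.inv_cdf(1.0 - alpha) * se_diff
    return diff, lower, p, lower > -delta


# Hypothetical trial summary: new arm slightly worse, margin 0.10.
diff, lower, p, non_inferior = non_inferiority_z(0.78, 0.80, se_diff=0.03, delta=0.10)
print(f"diff = {diff:.3f}, lower 95% bound = {lower:.3f}, p = {p:.4f}, "
      f"non-inferior at margin 0.10: {non_inferior}")

# The same data fail against a stricter margin, so the margin must be fixed in advance.
_, _, _, non_inferior_strict = non_inferiority_z(0.78, 0.80, se_diff=0.03, delta=0.05)
print(f"non-inferior at margin 0.05: {non_inferior_strict}")
```

The second call shows how strongly the conclusion depends on δ, which is the reason the margin has to be justified clinically before the data are seen.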
What the p-value distribution looks like
A fact many people miss: when H₀ is true (and the test statistic is continuous), the p-value is uniformly distributed on (0, 1), so a p between 0.03 and 0.05 is exactly as likely as one between 0.95 and 0.97. Only when H₁ is true does the distribution pile up toward 0 (small p-values become more common). The simulation runs 5000 two-sample z-tests and plots all the p-values as a histogram. Try d = 0 (H₀ true) vs d = 0.5 (medium effect).
Red dashed line = α = 0.05 · Red bin = fraction with p < 0.05 (≈ the Type I error rate when d = 0, ≈ observed power when d > 0)
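The same experiment can be run offline. The sketch below is one possible re-creation of the simulation, not the page's own code: it assumes 30 observations per group, a known unit variance, and a fixed seed, and prints decile counts instead of drawing a histogram.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

Z = NormalDist()


def simulate_p_values(d: float, n: int = 30, n_sims: int = 5000, seed: int = 1):
    """Two-sided p-values from repeated two-sample z-tests with true effect d."""
    rng = random.Random(seed)
    se = sqrt(2.0 / n)  # SE of the mean difference, unit variance assumed known
    p_values = []
    for _ in range(n_sims):
        ctrl = [rng.gauss(0.0, 1.0) for _ in range(n)]
        trt = [rng.gauss(d, 1.0) for _ in range(n)]
        z = (fmean(trt) - fmean(ctrl)) / se
        p_values.append(2.0 * (1.0 - Z.cdf(abs(z))))
    return p_values


def decile_counts(p_values):
    """Histogram with ten equal-width bins on [0, 1]."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return counts


p_null = simulate_p_values(d=0.0)
p_alt = simulate_p_values(d=0.5)
frac_null = sum(p < 0.05 for p in p_null) / len(p_null)
frac_alt = sum(p < 0.05 for p in p_alt) / len(p_alt)

print("d = 0.0 deciles:", decile_counts(p_null))
print("d = 0.5 deciles:", decile_counts(p_alt))
print(f"fraction p < 0.05: d = 0 -> {frac_null:.3f}, d = 0.5 -> {frac_alt:.3f}")
```

Under d = 0 the ten bins come out roughly equal and the fraction below 0.05 sits near α; under d = 0.5 the first bin dominates and the fraction below 0.05 estimates the power.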
5. Test-design decision tree
6. p-value misreadings at a glance
| Scenario | Common misreading | Correct reading |
|---|---|---|
| p = 0.03 | "There's a 3% chance H₀ is true" | Given that H₀ is true, the probability of a statistic this extreme or more extreme is 3%. |
| p = 0.03 | "There's a 3% chance the result is due to chance" | p is the conditional probability P(data ∣ model); it is not the posterior split between "chance" and "real effect". |
| p > 0.05 | "The two groups are the same" | Insufficient evidence to reject H₀: there may be no effect, or the study may be underpowered. To demonstrate "the same", use TOST. |
| p < 0.001 | "Large effect" | A tiny p may come from a large n rather than a large effect. Read the effect size as well. |
| p = 0.049 vs p = 0.051 | "One significant, one not: a big difference" | The evidence strength is essentially identical (S differs by about 0.06 bit). The "< vs ≥" split is artificial. |
| Replication gives p = 0.06 | "Failed to replicate the original p = 0.04" | Two similar small p-values in the same direction are consistent evidence. "Significance ≠ replication" (McShane & Gal 2017). |
| p = 0.04, n = 12 | "Publishable" | A small n with a p that barely crosses the threshold is a textbook "winner's curse": the effect size is very likely overestimated. Needs replication. |
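The first two rows turn on the gap between P(data ∣ H₀) and P(H₀ ∣ data). Bridging that gap takes Bayes' rule and a prior, as the sketch below shows. All inputs are hypothetical: the share of tested hypotheses that are true, α, and power are illustrative values, and `ppv` is an illustrative function name.

```python
def ppv(prior_true: float, alpha: float, power: float) -> float:
    """P(effect is real | p < alpha), i.e. the positive predictive value."""
    true_positives = power * prior_true           # real effects that reach p < alpha
    false_positives = alpha * (1.0 - prior_true)  # true nulls that reach p < alpha
    return true_positives / (true_positives + false_positives)


for prior, power in ((0.10, 0.80), (0.10, 0.20), (0.50, 0.80)):
    value = ppv(prior, alpha=0.05, power=power)
    print(f"prior = {prior:.2f}, power = {power:.2f} -> PPV = {value:.2f}, "
          f"P(H0 | significant) = {1.0 - value:.2f}")
```

In none of these scenarios does P(H₀ ∣ significant) equal α or the observed p-value, which is the point of the table's first two rows.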
7. Worked examples
```r
# R: Welch t-test, S-value, TOST equivalence, effect size + CI
library(TOSTER)      # equivalence testing (Lakens 2017)
library(effectsize)  # Cohen's d + 95% CI

# --- Synthetic biomarker example ---
set.seed(42)
ctrl <- rnorm(30, mean = 5.0, sd = 1.0)
trt  <- rnorm(30, mean = 5.5, sd = 1.0)

# --- (1) Classical two-sided test ---
res <- t.test(trt, ctrl, var.equal = FALSE)  # Welch t-test (R's default)
res$p.value
res$conf.int                     # 95% CI for mean(trt) - mean(ctrl)
effectsize::cohens_d(trt, ctrl)  # d + 95% CI

# --- (2) Report S-value (Greenland 2019) ---
s_val <- -log2(res$p.value)      # bits of surprise

# --- (3) TOST equivalence test (δ = 0.4) ---
TOSTER::tsum_TOST(
  m1 = mean(trt),  sd1 = sd(trt),  n1 = 30,
  m2 = mean(ctrl), sd2 = sd(ctrl), n2 = 30,
  low_eqbound = -0.4, high_eqbound = 0.4,
  eqbound_type = "raw", alpha = 0.05
)
# Equivalence within ±0.4 units is claimed only if BOTH one-sided p < 0.05.
# These data were simulated with a true difference of 0.5, so that is not expected here.

# --- (4) Recommended sentence ---
# conf.int is already on the trt - ctrl scale, so it is reported as is.
sprintf("Mean diff = %.2f (95%% CI %.2f to %.2f); p = %.3f (S = %.1f bits)",
        mean(trt) - mean(ctrl), res$conf.int[1], res$conf.int[2],
        res$p.value, s_val)
```
```python
import math

import numpy as np
from scipy import stats

# Synthetic biomarker example
rng = np.random.default_rng(42)
ctrl = rng.normal(5.0, 1.0, 30)
trt = rng.normal(5.5, 1.0, 30)

# (1) Welch t-test (two-sided)
res = stats.ttest_ind(trt, ctrl, equal_var=False)
mean_diff = trt.mean() - ctrl.mean()

# 95% CI for the mean difference (trt - ctrl)
se = math.sqrt(trt.var(ddof=1) / len(trt) + ctrl.var(ddof=1) / len(ctrl))
df = res.df  # Welch-Satterthwaite df; this attribute needs a recent SciPy release
tcrit = stats.t.ppf(0.975, df)
ci = (mean_diff - tcrit * se, mean_diff + tcrit * se)

# (2) Cohen's d with the pooled SD (no Hedges small-sample correction applied)
pooled = math.sqrt(((len(trt) - 1) * trt.var(ddof=1)
                    + (len(ctrl) - 1) * ctrl.var(ddof=1))
                   / (len(trt) + len(ctrl) - 2))
d = mean_diff / pooled

# (3) S-value (Greenland 2019)
s_val = -math.log2(res.pvalue)

# (4) TOST equivalence (δ = ±0.4)
delta = 0.4
t1 = (mean_diff - (-delta)) / se  # tests H0: diff <= -delta
t2 = (mean_diff - delta) / se     # tests H0: diff >= +delta
p1 = stats.t.sf(t1, df)           # upper-tail p for the lower bound
p2 = stats.t.cdf(t2, df)          # lower-tail p for the upper bound
p_tost = max(p1, p2)              # larger of the two one-sided p-values

print(f"diff = {mean_diff:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
print(f"p = {res.pvalue:.3f}  S = {s_val:.1f} bits  d = {d:.2f}")
print(f"TOST p = {p_tost:.3f} → equivalent within ±{delta}: {p_tost < 0.05}")
```
8. Six pitfalls
❌ The p < 0.05 cliff
p = 0.049 is not "right" and p = 0.051 is not "wrong": the evidence strength is nearly identical (S ≈ 4.35 vs 4.29 bits). Treating 0.05 as a dichotomous cliff is exactly what Principle 3 of the ASA 2016 statement opposes. Fix: report the continuous p, the effect size, and the 95% CI, and let readers judge clinical importance for themselves.
❌ Cherry-picking after the fact
"I ran 20 subgroup analyses and reported the two most significant." This is plain p-hacking. Even if every H₀ is true, the probability that at least one of 20 independent tests gives p < 0.05 is 1 − (1 − 0.05)²⁰ ≈ 64%. Fix: preregistration plus FDR / Bonferroni correction (Step 12).
❌ "Not significant" ≠ "no effect"
Altman & Bland (1995, BMJ): "Absence of evidence is not evidence of absence." p > 0.05 may reflect a genuine lack of effect, a sample that is too small, too much variability, or a poor design. Fix: read the 95% CI. If it crosses 0 but is as wide as (−5, +5), the result is "uncertain", not "equal to 0". To demonstrate "the same", use TOST.
❌ The base-rate fallacy
"p = 0.05 means I can trust the result 95%" is wrong. Suppose only 10% of the hypotheses being tested are true. A Bayesian calculation in the style of Ioannidis (2005, PLoS Med, "Why Most Published Research Findings Are False") with α = 0.05 and power = 0.80 gives a PPV (positive predictive value) of 0.08 / (0.08 + 0.045) = 64%, so about 36% of p < 0.05 results are false positives. At lower power (say 0.20) the PPV falls to about 31%, and most "significant" findings are then false. Fix: raise power, preregister, replicate.
❌ Missing effect size + CI
With a large n, p = 0.0001 can correspond to a tiny effect of d = 0.05: statistically significant but clinically meaningless. CONSORT 2010 and SAMPL 2015 both require reporting the effect size and 95% CI alongside p.
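How large is "large n"? The back-of-envelope sketch below assumes a two-sided two-sample z-test with equal groups, a known unit variance, and an observed standardised difference exactly equal to d; the function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()


def two_sided_p(d: float, n_per_group: int) -> float:
    """p-value when the observed standardised difference equals d."""
    z = d * sqrt(n_per_group / 2.0)
    return 2.0 * (1.0 - Z.cdf(abs(z)))


def n_per_group_for_p(d: float, p_target: float) -> int:
    """Smallest per-group n at which an observed difference d reaches p_target."""
    z_needed = Z.inv_cdf(1.0 - p_target / 2.0)
    return ceil(2.0 * (z_needed / d) ** 2)


n_needed = n_per_group_for_p(d=0.05, p_target=1e-4)
print(f"d = 0.05 reaches p = 0.0001 at about {n_needed} observations per group")
print(f"the same d = 0.05 with n = 30 per group gives p = {two_sided_p(0.05, 30):.2f}")
```

The effect is identical in both lines; only the sample size changes, which is why p alone cannot stand in for an effect size and its interval.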
❌ HARKing
Looking at the results first and then constructing a hypothesis to fit the significant finding. Kerr (1998, Pers Soc Psychol Rev) named this HARKing (Hypothesizing After the Results are Known). It is one of the core questionable research practices. Fix: preregistration, Registered Reports, and separating exploratory from confirmatory analyses.
📝 Self-test
1. A paper writes "p = 0.02, so the probability that H₀ is true is 2%." What is the best response?
2. To "prove equivalence" of a generic to a brand-name drug, which design is most appropriate?
3. What problem does Gelman & Loken's "garden of forking paths" describe?
4. Under Greenland's (2019) S-value, p = 0.05 corresponds to approximately how many bits?