Why does categorical data need its own chapter?
Continuous data go through t-tests and ANOVA, but the central questions in clinical, epidemiologic, and genetic research are often about proportions and counts: mortality in treatment vs control, disease frequency among AA / Aa / aa genotypes, Mendel's 9:3:3:1 offspring split. Running a t-test on data like these isn't merely "imprecise"; it is conceptually wrong. t-tests assume normality and constant variance, but for a binary variable the variance is locked to the mean (Var = p(1−p)).
The story starts with Karl Pearson (1900) in Philosophical Magazine, who turned "observed vs expected" into a single additive distance: the χ² goodness-of-fit test. R. A. Fisher (1922) fixed the degrees of freedom for contingency tables (df = (r−1)(c−1), not rc−1) and, in the mid-1930s, introduced Fisher's exact test for small samples. McNemar (1947) handled before/after pairing in the same subjects; Mantel-Haenszel (1959) handled stratified 2×2 tables. The whole skeleton of categorical-data analysis was built in the six decades between 1900 and 1959.
1. Contingency tables, expected counts, and the chi-square statistic
Lay the data out as an r × c contingency table: rows are one categorical variable (treatment / control), columns are another (event / no event). Under the null hypothesis of independence, the expected count in each cell equals the product of the two marginal proportions times N, which simplifies to E_ij = (row i total × column j total) / N. The test statistic sums the scaled squared deviations over all cells: χ² = Σ (O_ij − E_ij)² / E_ij, with df = (r−1)(c−1).
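The expected-count rule can be checked by hand. The sketch below uses hypothetical counts (40/60 vs 20/80) and plain NumPy; the variable names are illustrative only, and `scipy.stats.chi2.sf` supplies the upper-tail p value.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 counts: rows = treatment / control, columns = event / no event
observed = np.array([[40, 60],
                     [20, 80]])

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# E_ij = (row i total * column j total) / N
expected = row_totals @ col_totals / n

# Pearson chi-square: sum of (O - E)^2 / E over all cells
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2, dof)

print(expected)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p_value:.4f}")
```

The same numbers should come back from `stats.chi2_contingency(observed, correction=False)`, which is a convenient cross-check.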
Independence
Are two categorical variables associated? Examples: drug × outcome, genotype × disease, smoking × lung cancer. H₀: the two variables are independent; E is computed from the marginals.
Goodness of fit
Do observed counts match a theoretical distribution? Examples: Mendel 9:3:3:1, Hardy-Weinberg p², 2pq, q², uniform (is the die fair?). E = theoretical probability × N. df = k − 1 − m, where k is the number of categories and m is the number of parameters estimated from the data.
Homogeneity
Do multiple independent samples come from the same population? The test is mathematically identical to the independence test (same χ² formula); the difference is purely in sampling design. Homogeneity fixes the row margins (a fixed n is sampled from each group), whereas independence fixes only the total N. See Agresti 2018, Ch. 2.
2×2 contingency-table calculator
Enter four counts: top-left a (treatment + event), top-right b (treatment + no event), bottom-left c (control + event), bottom-right d (control + no event). The panel shows the expected counts E, the χ² statistic (with and without Yates), the p value, Fisher's exact p, and the three effect sizes OR / RR / RD with 95% CIs. If any expected count is below 5, a warning appears recommending Fisher's exact test.
Legend: dark = observed O; light = expected E.
2. Four variants you must recognize
🎲 Fisher's exact
When expected counts are small, the chi-square approximation breaks down. Cochran's (1954) rule of thumb: for a 2×2 table every E should be at least 5; for larger tables no E should be below 1 and at least 80% of cells should have E ≥ 5. Fisher's test uses the hypergeometric distribution to enumerate every table at least as extreme as the observed one, conditional on fixed margins.
Used for: small clinical trials, rare-variant GWAS subsets, and single-cell cluster-vs-marker overlap tables (run separately, e.g. with R's fisher.test on the overlap counts).
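To make the enumeration concrete, here is a small Python sketch that rebuilds the two-sided Fisher p value from the hypergeometric distribution and compares it with `scipy.stats.fisher_exact`. The counts (3, 27, 1, 29) are illustrative; the "sum all tables no more probable than the observed one" convention is the one SciPy and R use, but other two-sided conventions exist.

```python
import numpy as np
from scipy import stats

table = np.array([[3, 27],
                  [1, 29]])

a = table[0, 0]
row1 = table[0].sum()        # size of the first row (number "drawn")
col1 = table[:, 0].sum()     # size of the first column (number of "successes")
n = table.sum()

# With both margins fixed, the top-left cell is hypergeometric
rv = stats.hypergeom(M=n, n=col1, N=row1)
support = np.arange(max(0, row1 + col1 - n), min(row1, col1) + 1)
probs = rv.pmf(support)

# Two-sided p: total probability of tables no more likely than the observed one
# (small relative tolerance so exact ties are not lost to rounding)
p_obs = rv.pmf(a)
p_two_sided = probs[probs <= p_obs * (1 + 1e-7)].sum()

_, p_scipy = stats.fisher_exact(table, alternative="two-sided")
print(f"by hand: {p_two_sided:.4f}   scipy: {p_scipy:.4f}")
```

Here the smallest expected count is 30 × 4 / 60 = 2, so this is exactly the situation where the exact test is preferred over the χ² approximation.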
⚙️ Yates correction
For 2×2 tables: subtract 0.5 from each |O−E| before squaring. Motivation: χ² is a continuous distribution but counts are integers, so the correction "smooths" the discreteness gap.
Modern verdict: over-conservative. Simulations by Camilli & Hopkins (1979), Camilli 1995 Psychol Bull, and Sokal & Rohlf (2012) all show that Yates pushes the Type I error rate well below the nominal α. R defaults to correct = TRUE; set it to FALSE, or just switch to Fisher's exact.
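A quick numerical illustration of how much the correction can move a borderline result, using hypothetical counts and hand-computed statistics (SciPy's `chi2_contingency` is used only as a cross-check):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table with moderate counts
observed = np.array([[12, 8],
                     [5, 15]])

n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
abs_dev = np.abs(observed - expected)

# Uncorrected Pearson statistic
chi2_plain = (abs_dev ** 2 / expected).sum()

# Yates: shrink every |O - E| by 0.5 (never past zero) before squaring
chi2_yates = (np.maximum(abs_dev - 0.5, 0) ** 2 / expected).sum()

p_plain = stats.chi2.sf(chi2_plain, 1)
p_yates = stats.chi2.sf(chi2_yates, 1)
print(f"plain: chi2 = {chi2_plain:.3f}, p = {p_plain:.4f}")
print(f"Yates: chi2 = {chi2_yates:.3f}, p = {p_yates:.4f}")
```

With these made-up counts the uncorrected p falls below 0.05 while the Yates-corrected p does not, which is the kind of flip the simulation literature cited above is concerned with.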
🔄 McNemar: paired binary
Binary outcomes before vs after on the same subjects, or matched case-control pairs. The table cells are (pre+/post+, pre+/post−, pre−/post+, pre−/post−); only the two discordant cells, b (pre+/post−) and c (pre−/post+), matter:
χ²McN = (b − c)² / (b + c) · df = 1
Example: hypertension status in 100 patients before vs after a drug. Running a vanilla chi-square treats the paired data as independent, which is wrong. Bennett (2017 BMJ): "Using an unpaired test on paired data throws away SD/√n of power."
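The McNemar statistic needs only the two discordant counts, which a few lines of Python make explicit. The counts b = 12 and c = 25 are illustrative; the exact binomial variant shown alongside is a common companion for small b + c, not something the formula above requires.

```python
from scipy import stats

# Discordant pairs only: b = (pre+, post-), c = (pre-, post+)
b, c = 12, 25

# Asymptotic McNemar statistic, no continuity correction
chi2_mcn = (b - c) ** 2 / (b + c)
p_asymptotic = stats.chi2.sf(chi2_mcn, df=1)

# Exact version: under H0 each discordant pair falls in b or c with probability 0.5
p_exact = stats.binomtest(min(b, c), n=b + c, p=0.5).pvalue

print(f"chi2 = {chi2_mcn:.3f}, asymptotic p = {p_asymptotic:.4f}, exact p = {p_exact:.4f}")
```

Note that the concordant cells never enter the calculation: they carry no information about change within pairs.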
🧭 CMH: stratified 2×2
Stratify the 2×2 tables by a confounder (age band, sex, study site) and pool a common OR. CMH lets you (1) control for the confounder and (2) check, via the Breslow-Day test, whether the OR is constant across strata (if the OR varies by stratum there is an interaction, and a single pooled OR is inappropriate).
Example: multi-center trials, stratified epidemiologic analyses. The standard antidote to Simpson's paradox.
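As a rough sketch of what the pooling does, the Mantel-Haenszel common OR and the CMH statistic (without continuity correction) can be computed directly from the stratum tables. The three strata below are hypothetical, and the formulas are the textbook ones: OR_MH = Σ(a·d/n) / Σ(b·c/n), and CMH = [Σ(a − E[a])]² / Σ Var(a) with the hypergeometric mean and variance of a in each stratum.

```python
import numpy as np
from scipy import stats

# Hypothetical strata, each [[a, b], [c, d]] = [[trt event, trt none], [ctl event, ctl none]]
strata = np.array([[[12, 88], [5, 95]],
                   [[28, 72], [15, 85]],
                   [[35, 65], [20, 80]]], dtype=float)

a, b = strata[:, 0, 0], strata[:, 0, 1]
c, d = strata[:, 1, 0], strata[:, 1, 1]
n = a + b + c + d

# Stratum-specific odds ratios, then the Mantel-Haenszel pooled estimate
or_by_stratum = (a * d) / (b * c)
or_mh = (a * d / n).sum() / (b * c / n).sum()

# CMH statistic: compare a with its hypergeometric expectation in every stratum
row1, row2 = a + b, c + d
col1, col2 = a + c, b + d
expected_a = row1 * col1 / n
var_a = row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
cmh = (a - expected_a).sum() ** 2 / var_a.sum()
p_cmh = stats.chi2.sf(cmh, df=1)

print("stratum ORs:", np.round(or_by_stratum, 2))
print(f"pooled OR_MH = {or_mh:.3f}, CMH chi2 = {cmh:.2f}, p = {p_cmh:.5f}")
```

Printing the stratum-specific ORs next to the pooled value is a cheap first look at whether a common OR is plausible before running a formal homogeneity test.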
OR vs RR vs RD comparator
Pick a relative risk (say RR = 2) and drag the baseline risk p₀ from 0.01 to 0.5. Notice: at low p₀ (< 10%), OR ≈ RR; but as p₀ grows, the OR balloons far past the RR. This is the well-known "common outcome bias" of the OR. Epidemiology and clinical journals advise: rare outcomes (< 10%) can be reported as OR, while common outcomes (≥ 10%) should be reported as RR or RD (Zhang & Yu 1998 JAMA, Pearce 2024 Int J Epidemiol).
Plot axes: x = baseline risk p₀; red = OR, blue = RR, green = RD.
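The divergence between OR and RR follows from simple algebra: with exposed risk p₁ = RR × p₀, the OR is [p₁/(1−p₁)] / [p₀/(1−p₀)]. A minimal Python sketch (the helper name `odds_ratio_from_rr` is made up for this illustration):

```python
def odds_ratio_from_rr(rr, p0):
    """Odds ratio implied by relative risk `rr` at baseline risk `p0`."""
    p1 = rr * p0                      # risk in the exposed group
    if not (0 < p0 < 1 and 0 < p1 < 1):
        raise ValueError("both risks must lie strictly between 0 and 1")
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

rr = 2.0
for p0 in (0.01, 0.05, 0.10, 0.20, 0.40):
    odds_ratio = odds_ratio_from_rr(rr, p0)
    risk_diff = rr * p0 - p0
    print(f"p0 = {p0:.2f}  RR = {rr:.1f}  OR = {odds_ratio:.2f}  RD = {risk_diff:.2f}")
```

At p₀ = 0.5 and RR = 2 the exposed risk reaches 1 and the OR is infinite, which is why the helper rejects that input.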
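3. Which test to choose?
🌳 Decision tree for categorical-data tests
(Decision-tree figure; surviving fragment: fisher.test(simulate.p.value = TRUE), R's Monte Carlo option for Fisher's exact test.)
4. Properties of OR / RR / RD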
| Effect size | Formula | Range | Suitable designs | Additive / multiplicative | Pitfall |
|---|---|---|---|---|---|
| OR (odds ratio) | ad / bc | (0, ∞) | case-control, logistic regression, rare outcomes | multiplicative (log OR is additive) | overstates RR when the outcome is common |
| RR (relative risk) | [a/(a+b)] / [c/(c+d)] | (0, ∞) | cohort, RCT, prospective epidemiology | multiplicative (log RR is additive) | cannot be computed directly in case-control designs (no denominator) |
| RD (risk difference) | a/(a+b) − c/(c+d) | [−1, 1] | RCT, absolute-risk communication, NNT | additive (direct subtraction) | insensitive at very low / very high p₀ |
| NNT (number needed to treat) | 1 / abs(RD) | [1, ∞) | RCT clinical decision communication | (derived quantity) | meaningless when the RD interval spans 0; report RD + 95% CI instead |
95% CIs: OR and RR are heavily skewed on the raw scale, so compute their CIs on the log scale and then exponentiate; RD stays on the raw scale.
· log(OR) ± 1.96 × √(1/a + 1/b + 1/c + 1/d)
· log(RR) ± 1.96 × √[(1/a − 1/(a+b)) + (1/c − 1/(c+d))]
· RD ± 1.96 × √[p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂], where p₁ = a/(a+b), p₂ = c/(c+d), n₁ = a+b, n₂ = c+d.
R: epitools::oddsratio(tab, method = "wald") and epitools::riskratio(tab, method = "wald") return the point estimate with its Wald CI. Other method values give alternative intervals; oddsratio() defaults to a mid-p interval rather than Wald.
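The three interval formulas are short enough to compute by hand. The following Python sketch applies them to hypothetical counts a, b, c, d = 40, 60, 20, 80; the function name `wald_cis` is illustrative, and zero cells (which break the log-scale formulas) are not handled.

```python
import math

def wald_cis(a, b, c, d, z=1.96):
    """Wald 95% CIs for OR, RR (log scale) and RD (raw scale) from a 2x2 table."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2

    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    or_ci = (math.exp(math.log(odds_ratio) - z * se_log_or),
             math.exp(math.log(odds_ratio) + z * se_log_or))

    risk_ratio = p1 / p2
    se_log_rr = math.sqrt((1 / a - 1 / n1) + (1 / c - 1 / n2))
    rr_ci = (math.exp(math.log(risk_ratio) - z * se_log_rr),
             math.exp(math.log(risk_ratio) + z * se_log_rr))

    risk_diff = p1 - p2
    se_rd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    rd_ci = (risk_diff - z * se_rd, risk_diff + z * se_rd)

    return {"OR": (odds_ratio, or_ci), "RR": (risk_ratio, rr_ci), "RD": (risk_diff, rd_ci)}

results = wald_cis(40, 60, 20, 80)
for name, (estimate, (low, high)) in results.items():
    print(f"{name} = {estimate:.3f}  95% CI ({low:.3f}, {high:.3f})")

# NNT is the reciprocal of the absolute risk difference
print(f"NNT = {1 / abs(results['RD'][0]):.1f}")
```

Because the OR and RR intervals are built on the log scale, they come out asymmetric around the point estimate once exponentiated; the RD interval is symmetric.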
5. The chi-square tradition in genetics
Mendel 9:3:3:1
F₂ counts (round/yellow, round/green, wrinkled/yellow, wrinkled/green) = 315, 108, 101, 32; total N = 556. Expected under 9:3:3:1:
E = (556 × 9/16, 556 × 3/16, 556 × 3/16, 556 × 1/16) = 312.75, 104.25, 104.25, 34.75.
χ² = Σ(O−E)²/E ≈ 0.47, df = 3, p ≈ 0.93: the observed counts agree with theory. The famous Fisher (1936) reanalysis argued that Mendel's aggregated χ² is "too good a fit" (suspiciously close to expectation), suggesting the data may have been polished. A classic case where the fit is too good to be true.
HWE test
For a SNP with genotypes AA, Aa, aa, HWE expects proportions p², 2pq, q² (p = frequency of allele A, q = 1 − p). Estimate p̂ from the observed counts, compute E, then χ². df = 1 (k = 3 genotypes − 1 − 1 estimated parameter p̂).
GWAS QC convention: SNPs with HWE p < 1e−6 in controls are typically dropped (deviation from HWE in cases can be real signal). Wigginton et al. 2005 AJHG introduced an exact HWE test, which avoids the failure of the χ² approximation at low MAF; PLINK's --hardy uses the exact test by default.
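The χ² version of the HWE check is a goodness-of-fit test with one estimated parameter, which the following Python sketch spells out. The genotype counts are illustrative, and this is the large-sample approximation, not the exact test of Wigginton et al.

```python
from scipy import stats

# Illustrative genotype counts
n_AA, n_Aa, n_aa = 298, 489, 213
n = n_AA + n_Aa + n_aa

# Allele frequency of A estimated from the data (this costs one degree of freedom)
p_hat = (2 * n_AA + n_Aa) / (2 * n)
q_hat = 1 - p_hat

observed = [n_AA, n_Aa, n_aa]
expected = [n * p_hat ** 2, n * 2 * p_hat * q_hat, n * q_hat ** 2]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1 - 1          # k - 1 - m with k = 3, m = 1
p_value = stats.chi2.sf(chi2, dof)

print(f"p_hat = {p_hat:.4f}, chi2 = {chi2:.4f}, df = {dof}, p = {p_value:.3f}")
```

`stats.chisquare(observed, expected, ddof=1)` should return the same statistic and p value; the `ddof=1` argument is what removes the extra degree of freedom for the estimated p̂.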
6. Worked examples
```r
# R: chi-square family + effect sizes
library(epitools)   # oddsratio(), riskratio()
library(vcd)        # mosaic plots, assocstats()

# --- 2x2 table: drug vs outcome ---
tab <- matrix(c(40, 60,
                20, 80), nrow = 2, byrow = TRUE,
              dimnames = list(treat = c("drug", "placebo"),
                              event = c("yes", "no")))

# Pearson chi-square with Yates turned OFF (R applies it to 2x2 tables by default)
chisq.test(tab, correct = FALSE)
chisq.test(tab)$expected            # inspect E_ij

# Fisher's exact (any expected < 5, or just safer at small n)
fisher.test(tab)

# Effect sizes with 95% CI.
# epitools expects the reference group in row 1 and the non-event in column 1,
# so rev = "both" reorders this table to placebo/drug x no/yes.
epitools::oddsratio(tab, method = "wald", rev = "both")$measure   # OR + Wald CI
epitools::riskratio(tab, method = "wald", rev = "both")$measure   # RR + Wald CI
prop.test(c(40, 20), c(100, 100), correct = FALSE)$conf.int       # RD CI (base R)

# --- McNemar: paired binary (before/after) ---
paired <- matrix(c(30, 12, 25, 33), 2,
                 dimnames = list(pre = c("+", "-"), post = c("+", "-")))
mcnemar.test(paired, correct = FALSE)

# --- Cochran-Mantel-Haenszel: stratify by department ---
data(UCBAdmissions)                 # classic Simpson's paradox (2 x 2 x 6 array)
mantelhaen.test(UCBAdmissions)      # pools the OR across departments

# --- Goodness-of-fit: Mendel 9:3:3:1 ---
obs <- c(315, 108, 101, 32)
chisq.test(obs, p = c(9, 3, 3, 1) / 16)

# --- HWE exact via HardyWeinberg pkg ---
# HardyWeinberg::HWExact(c(AA = 298, AB = 489, BB = 213))
```
```python
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import Table2x2, mcnemar, StratifiedTable

# --- 2x2 table ---
tab = np.array([[40, 60],
                [20, 80]])

# Pearson chi-square; correction=False turns Yates off
chi2, p, dof, expected = stats.chi2_contingency(tab, correction=False)

# Fisher's exact (alternative: "two-sided", "less", or "greater")
odds, p_fisher = stats.fisher_exact(tab, alternative="two-sided")

# Effect sizes + 95% CI from statsmodels
# (oddsratio / riskratio are attributes; the *_confint names are methods)
t = Table2x2(tab)
print(t.oddsratio, t.oddsratio_confint())
print(t.riskratio, t.riskratio_confint())
print(t.summary())                   # summary table of OR and RR with CIs

# --- McNemar paired ---
paired = np.array([[30, 12],
                   [25, 33]])
res = mcnemar(paired, exact=False, correction=False)
print(res.statistic, res.pvalue)

# --- Cochran-Mantel-Haenszel: a list of 2x2 tables, one per stratum ---
# (a bare ndarray must be shaped 2 x 2 x K, so a list is the simpler input)
strata = [np.array([[12, 88], [5, 95]]),
          np.array([[28, 72], [15, 85]]),
          np.array([[35, 65], [20, 80]])]
st = StratifiedTable(strata)
print(st.test_null_odds())           # CMH test of a common OR = 1
print(st.test_equal_odds())          # Breslow-Day homogeneity test
print(st.oddsratio_pooled, st.oddsratio_pooled_confint())

# --- Goodness-of-fit: Mendel 9:3:3:1 ---
obs = np.array([315, 108, 101, 32])
exp = obs.sum() * np.array([9, 3, 3, 1]) / 16
print(stats.chisquare(obs, exp))
```
One detail that bites everyone: R's chisq.test() defaults to correct = TRUE on 2×2 tables (Yates); set it to FALSE in most cases. Python's scipy.stats.chi2_contingency defaults to correction = True and needs the same fix. Both languages, same trap.
7. The six most common errors in papers
❌ Yates left on by default
R's chisq.test() applies Yates to 2×2 tables by default, which systematically inflates the p value. Camilli (1995) and Sokal & Rohlf (2012) both recommend turning it off. Practical fix: chisq.test(tab, correct = FALSE), or just use Fisher's exact.
❌ Reading an OR as an RR
When the outcome is common (> 10–20%), the OR overstates the RR substantially. See Greenland (1987 AJE) and Zhang & Yu (1998 JAMA). For clinical communication, report RR or RD + NNT, and disclose both in Methods.
❌ Running χ² with E < 5
Violating Cochran's (1954) conditions makes the χ² approximation unreliable, especially for borderline p values (0.01–0.10). R prints the warning "Chi-squared approximation may be incorrect"; don't ignore it. scipy.stats.chi2_contingency returns the expected counts but does not warn, so inspect them yourself. In either case, switch to Fisher's exact.
❌ Simpson's paradox
Simpson 1951 JRSS-B: the direction of an association in pooled data can flip after stratification. Classic case: UC Berkeley 1973 admissions (lower female admission rate overall, higher within most departments). Fix: CMH + Breslow-Day, or logistic regression with the confounders as covariates.
❌ Independent-sample tests on paired data
Same subject before/after, or matched case-control pairs → use McNemar. Treating paired observations as independent ignores the within-pair correlation and loses power. Bennett (2017 BMJ).
❌ No correction for multiple comparisons
GWAS, scRNA marker tests, drug screens: running a chi-square per variable means p < 0.05 is swallowed by multiplicity long before you notice. See Step 12; at minimum apply Bonferroni (strictest) or BH-FDR (most common).
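For orientation, here is a minimal NumPy sketch of the two adjustments applied to a handful of made-up p values. The function `benjamini_hochberg` is written out by hand for illustration; in practice a library routine would normally be used.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """BH-adjusted p values (step-up procedure), returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p value by m / i
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p value
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0, 1)
    return out

# Hypothetical p values from five separate chi-square tests
pvals = np.array([0.001, 0.008, 0.02, 0.03, 0.20])

bonferroni = np.clip(pvals * pvals.size, 0, 1)
bh = benjamini_hochberg(pvals)

print("Bonferroni:", np.round(bonferroni, 4), "-> significant:", int((bonferroni < 0.05).sum()))
print("BH-FDR:    ", np.round(bh, 4), "-> significant:", int((bh < 0.05).sum()))
```

With these toy values Bonferroni keeps two findings and BH keeps four, reflecting the different error rates the two methods control (family-wise error vs false discovery rate).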
📝 Self-check
1. Your 2×2 table has cells 3, 27, 1, 29 with N = 60. Which test is the best choice?
2. RCT event rates: treatment 30%, control 50%. Which statement is WRONG?
3. You want to test whether a drug changes BP control status (yes/no) before vs after treatment in 100 hypertensive patients. Which test is most appropriate?
4. Pooled across three hospitals, OR = 1.5 (against the drug); within each hospital, OR < 1 (the drug helps). What is happening, and what analysis should you run?