Why does the distribution "determine" your statistical test?
A probability distribution is the generative model behind your data. When you call t.test(), you assume each sample comes from a normal distribution (and, with var.equal = TRUE, that both groups share one variance; R's default is the Welch test, which drops that second assumption). When you run DESeq2 for RNA-seq differential expression, you are fitting a Negative Binomial GLM (Love 2014). Choose the wrong distribution and both the p-value and the CI are distorted: a normal approximation to count data misstates the variance, and ignoring RNA-seq overdispersion can push the Type I error rate from the nominal 5% to 40% or more.
This chapter walks through the eight workhorse distributions of biostatistics: five continuous (Normal, t, F, χ², log-normal) and three discrete (Binomial, Poisson, Negative Binomial). Each has a biological origin story and a "trap" you have to recognise.
1. PMF, PDF, CDF
A distribution can be described pointwise, by a PMF for discrete data or a PDF for continuous data, and cumulatively by a CDF, which exists for both. Confusing the PMF with the PDF is among the most common beginner errors in published papers.
PMF (discrete)
Probability mass function P(X=k): for a discrete variable, it gives the probability of exactly the value k. Example: 3 heads in 10 fair coin flips, P(X=3) = C(10,3)·0.5¹⁰.
Properties: Σ P(X=k) = 1; each P(X=k) ∈ [0, 1].
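The coin-flip example and the sum-to-one property can be checked directly. This is a minimal sketch using only the Python standard library; the helper name `binom_pmf` is illustrative and not part of any package.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p). Illustrative helper."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Three heads in ten fair flips: C(10,3) * 0.5**10
p_three_heads = binom_pmf(3, 10, 0.5)

# A PMF sums to 1 over the whole support {0, ..., n}
pmf_total = sum(binom_pmf(k, 10, 0.5) for k in range(11))

print(f"P(X=3) = {p_three_heads:.4f}")
print(f"sum of PMF = {pmf_total:.4f}")
```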
PDF (continuous)
Probability density function f(x): for a continuous variable, f(x) is a density, not a probability. The probability of any single point is 0; you must integrate over an interval to get a probability: P(a<X<b) = ∫f(x)dx, taken from a to b.
Properties: ∫ f(x)dx = 1 over the whole support; f(x) ≥ 0 (but may exceed 1).
CDF (both)
Cumulative distribution function F(x) = P(X ≤ x): always non-decreasing, rising from 0 to 1. Its inverse is the quantile function Q(p), which is exactly what R's qnorm(0.975) = 1.96 computes.
Uses: p-values (CDF tails), critical values (Q), QQ-plots.
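The relationship between the CDF, the quantile function and a p-value can be sketched with the standard library's `statistics.NormalDist`, whose `inv_cdf` plays the role of qnorm. The test statistic `z_obs` is a hypothetical value chosen for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quantile function: the x whose CDF equals 0.975 (R: qnorm(0.975))
z_crit = std_normal.inv_cdf(0.975)

# The CDF undoes the quantile function (R: pnorm)
round_trip = std_normal.cdf(z_crit)

# Two-sided p-value for an observed z statistic: the area in both tails
z_obs = 2.5  # hypothetical test statistic
p_two_sided = 2 * (1 - std_normal.cdf(abs(z_obs)))

print(f"critical value = {z_crit:.3f}")
print(f"two-sided p = {p_two_sided:.4f}")
```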
Discrete PMF: P(X=k) · Continuous PDF: f(x) · CDF: F(x) = P(X ≤ x) · Quantile: Q(p) = F⁻¹(p)
In R: d* = density/PMF, p* = CDF, q* = quantile, r* = random sample (dnorm/pnorm/qnorm/rnorm). In Python (scipy.stats): .pdf/.pmf, .cdf, .ppf, .rvs.
Common confusion: a PDF value is not a probability. dnorm(0)≈0.399 — but the "probability" of x = 0 under a normal is 0; 0.399 is a density that must be multiplied by an infinitesimal dx to approach probability. "P(height = 170.000 cm)" does not exist — only P(169.5 < height < 170.5) does.
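The density-versus-probability distinction above can be made concrete with a few numbers. This is a sketch using the standard library only; the height parameters (µ = 175 cm, σ = 7 cm) are the values this chapter quotes for adult male height, and the interval width of 0.01 is an arbitrary choice.

```python
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)
density_at_zero = std_normal.pdf(0.0)  # about 0.399: a density, not a probability

# Probability of a narrow interval around 0 is roughly density * width
width = 0.01
p_interval = std_normal.cdf(width / 2) - std_normal.cdf(-width / 2)
approx = density_at_zero * width

# A narrow distribution has density above 1, yet still integrates to 1
narrow_density = NormalDist(0.0, 0.1).pdf(0.0)

# Height example: only an interval has a probability
height = NormalDist(175.0, 7.0)
p_height = height.cdf(170.5) - height.cdf(169.5)

print(f"density at 0 = {density_at_zero:.3f}")
print(f"P(-0.005 < X < 0.005) = {p_interval:.5f} vs density*width = {approx:.5f}")
print(f"density at 0 for sigma = 0.1: {narrow_density:.2f}")
print(f"P(169.5 < height < 170.5) = {p_height:.4f}")
```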
Distribution shape playground
Pick a distribution, drag its key parameter and watch the shape change. The panel recomputes the mean, variance and skewness live and shows P(X ≤ x) (the CDF), the same machinery your p-value calculations rely on.
Green = PDF/PMF · orange dashed = CDF · red = mean
2. Five continuous distributions
Normal N(µ, σ²)
Shape: symmetric bell; µ sets the centre, σ the width. The 68-95-99.7 rule (Galton 1886): ±1σ holds 68.27%, ±2σ 95.45%, ±3σ 99.73%.
Why is it everywhere? The central limit theorem (CLT, Step 3): the sum of many small, independent effects with finite variance tends towards a normal distribution. Human height, instrument noise and summed immunostain intensity are therefore all approximately normal.
Biology: adult male height (µ≈175 cm, σ≈7 cm), HbA1c, Z-score standardisation.
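The 68-95-99.7 figures follow from the standard normal CDF and can be reproduced in a few lines. This is a sketch with the standard library; the helper name `coverage` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()

def coverage(k: float) -> float:
    """P(mu - k*sigma < X < mu + k*sigma), identical for every normal distribution."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverages = {k: coverage(k) for k in (1, 2, 3)}
for k, c in coverages.items():
    print(f"within ±{k} sigma: {c:.2%}")
```

Because the result depends only on k, the same percentages apply to any µ and σ after standardising to a Z-score.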
Student's t (df)
Origin: Gosset (1908, pen name "Student"), working at the Guinness brewery on small samples (n < 30), found that replacing σ with the sample estimate s makes the statistic no longer normal: its tails are heavier. That is the t distribution.
Shape: symmetric, zero-mean, heavier tails than the normal. As df → ∞, t → N(0,1); by df = 30 the two are practically indistinguishable.
Biology: every small-sample test of a difference in means (tumour volumes in two mouse groups, paired comparisons) rests on the t distribution (Step 5).
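Gosset's observation can be reproduced by simulation: standardise small-sample means with s instead of σ and count how often the result exceeds the normal cutoff of 1.96. This is a rough sketch with the standard library; the sample size, seed and number of simulations are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)
n = 5             # small sample, so df = n - 1 = 4
n_sim = 20_000
z_cutoff = 1.96   # two-sided 5% cutoff under the normal

exceed = 0
for _ in range(n_sim):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    # One-sample t statistic: s (sample SD) stands in for the unknown sigma
    t_stat = mean(sample) / (stdev(sample) / sqrt(n))
    if abs(t_stat) > z_cutoff:
        exceed += 1

tail_fraction = exceed / n_sim
print(f"P(|t| > 1.96) with n = 5: {tail_fraction:.3f} (a normal would give 0.05)")
```

The fraction lands well above 5%, which is the heavier tail: using the normal cutoff with a small sample would reject too often.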
F (df₁, df₂)
Origin: R.A. Fisher (1924): the ratio of two independent chi-square variables, each divided by its df. Snedecor (1934) named it "F" in Fisher's honour.
Use: variance ratios (ANOVA, Step 7); F = MSbetween / MSwithin. Always ≥ 0 and right-skewed; as df₁ and df₂ both → ∞ the distribution concentrates at 1.
Biology: multi-dose comparisons (5 treatment arms in mice), factorial designs (genotype × diet). A large F means between-group differences ≫ within-group variation.
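The ratio F = MSbetween / MSwithin can be computed by hand for a one-way layout. The sketch below uses made-up measurements for three hypothetical groups and only the standard library; it stops at the statistic and leaves the p-value to an F table or a statistics package.

```python
from statistics import mean

# Hypothetical measurements for three treatment groups
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = mean([x for g in groups for x in g])

# Between-group sum of squares: group means around the grand mean
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)        # df1 = k - 1
ms_within = ss_within / (n_total - k)    # df2 = N - k
f_stat = ms_between / ms_within

print(f"F({k - 1}, {n_total - k}) = {f_stat:.2f}")
```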
Chi-square χ²(df)
Definition: the sum of squares of df independent N(0,1) variables. Highly right-skewed for df = 1, 2; approaches a normal shape as df → ∞.
Use: (1) categorical goodness-of-fit tests and contingency-table independence tests (Step 6); (2) CI for a variance: (n−1)s² / σ² ~ χ²(n−1); (3) the standard asymptotic distribution of likelihood-ratio tests (Wilks 1938).
Biology: Hardy-Weinberg equilibrium tests, 2×3 genotype × phenotype contingency tables, LRTs in GWAS.
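The definition can be checked by simulation: summing df squared standard normals should give draws with mean close to df and variance close to 2df. This is a sketch with the standard library; the seed, df = 3 and the number of draws are arbitrary.

```python
import random
from statistics import mean, variance

random.seed(2)
df = 3
n_sim = 20_000

# Each draw is the sum of df squared independent N(0,1) variables
draws = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(n_sim)]

sim_mean = mean(draws)
sim_var = variance(draws)

print(f"simulated mean = {sim_mean:.2f} (theory: {df})")
print(f"simulated variance = {sim_var:.2f} (theory: {2 * df})")
```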
Log-normal
Definition: X is log-normal ⟺ ln(X) ~ Normal. Strictly positive, right-skewed.
Why so common? When the data result from many multiplicative factors (each step a proportional change), taking logs converts multiplication to addition → CLT → log-normal. Limpert et al. (2001, BioScience) documented that many biomedical measurements are better described as log-normal than as normal.
Biology: blood cytokine levels, CRP, IgE, antibody titres, single-cell RNA-seq expression levels, bacterial CFU.
Handling: log-transform before analysis (or use a log link in a GLM). When reporting, back-transforming mean(log X) gives the geometric mean, not the arithmetic mean.
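The back-transformation step can be illustrated on a handful of values. The titres below are hypothetical; the sketch uses the standard library and cross-checks the result against `statistics.geometric_mean`.

```python
from math import exp, log
from statistics import geometric_mean, mean

# Hypothetical antibody titres: right-skewed, one large value
titres = [10, 20, 40, 80, 1280]

log_titres = [log(x) for x in titres]
geo_mean = exp(mean(log_titres))   # back-transform of the mean on the log scale
arith_mean = mean(titres)          # pulled upwards by the single large titre

print(f"geometric mean  = {geo_mean:.1f}")
print(f"arithmetic mean = {arith_mean:.1f}")
print(f"library check   = {geometric_mean(titres):.1f}")
```

The geometric mean sits near the bulk of the data, while the arithmetic mean is dominated by the tail, which is why the log scale is the natural reporting scale here.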
3. Three discrete distributions
Binomial Bin(n, p)
Story: n independent Bernoulli trials, each with success probability p; X counts the successes, X ∈ {0,...,n}.
Mean / variance: E[X] = np, Var[X] = np(1−p).
Normal approximation: when np ≥ 5 and n(1−p) ≥ 5, Bin ≈ N(np, np(1−p)), so a Z-test can be used.
Biology: cancer-screening positive rate (X positives in n people), PCR success rate, genotype frequency, CRISPR editing efficiency.
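How good the normal approximation is can be checked by comparing an exact binomial tail with its normal counterpart. In this sketch n = 20, p = 0.5 and k = 12 are arbitrary illustrative values, and the half-unit continuity correction is an addition not discussed in the text above.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 20, 0.5   # np = n(1-p) = 10, so the rule of thumb is satisfied
k = 12

# Exact P(X <= k) from the binomial PMF
exact = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Normal approximation N(np, np(1-p)), evaluated at k + 0.5
# (continuity correction: an assumption of this sketch)
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
approx = approx_dist.cdf(k + 0.5)

print(f"exact  P(X <= {k}) = {exact:.4f}")
print(f"normal P(X <= {k}) = {approx:.4f}")
```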
Poisson Pois(λ)
Story: the number of events per unit of time or space when events are independent and occur at a constant rate λ.
Key: E[X] = Var[X] = λ, the Poisson "ID badge".
History: Bortkiewicz (1898) showed that deaths by horse kick in the Prussian cavalry follow a Poisson distribution. Luria & Delbrück (1943) used the Poisson as a reference model: counts of resistant mutants across parallel cultures fluctuated far more than Poisson variation allows, evidence that bacterial mutations pre-exist selection rather than being induced by it (Nobel Prize 1969).
Biology: cell divisions per unit time, low-expression RNA-seq counts (high-expression genes are typically overdispersed → NB), PCR template numbers, radioactive decay, photon counts.
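The "ID badge" E[X] = Var[X] = λ can be verified numerically from the PMF. This is a sketch with the standard library; λ = 3 and the truncation of the infinite support at 60 are illustrative choices (the neglected tail is negligible for this λ).

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam). Illustrative helper."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3.0
support = range(60)  # truncated support; the tail beyond 60 is negligible here

mean_x = sum(k * poisson_pmf(k, lam) for k in support)
var_x = sum((k - mean_x) ** 2 * poisson_pmf(k, lam) for k in support)

print(f"mean = {mean_x:.4f}, variance = {var_x:.4f}")
```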
Negative Binomial
Story: the "loosened" Poisson, used when the variance exceeds the mean (overdispersion). It can be read as a Gamma-Poisson mixture: a Poisson whose rate λ is itself Gamma-distributed.
Key: Var[X] = µ + αµ², where α > 0 is the dispersion parameter; as α → 0 it reduces to the Poisson.
Why the workhorse of RNA-seq? Biological plus technical variation across samples gives Var ≫ Mean. edgeR (Robinson 2010) and DESeq2 (Love 2014, Genome Biology) both rest on NB GLMs; using a Poisson model instead dramatically inflates false positives.
Biology: bulk RNA-seq, ATAC-seq, ChIP-seq peak counts, single-cell UMI counts (often modelled as NB or ZINB).
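The Gamma-Poisson reading can be simulated directly: draw a rate from a Gamma distribution, then a count from a Poisson with that rate, and compare the sample variance with µ + αµ². This is a sketch with the standard library; µ = 10, α = 0.5, the seed and the helper `poisson_draw` (Knuth's multiplication method) are assumptions made for illustration.

```python
import random
from math import exp
from statistics import mean, variance

def poisson_draw(lam: float) -> int:
    """One Poisson(lam) draw via Knuth's multiplication method (fine for moderate lam)."""
    threshold = exp(-lam)
    k, prod = 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

random.seed(3)
mu, alpha = 10.0, 0.5
shape, scale = 1 / alpha, alpha * mu   # Gamma with mean mu and variance alpha * mu**2

# Mixture: lambda ~ Gamma(shape, scale), then X | lambda ~ Poisson(lambda)
counts = [poisson_draw(random.gammavariate(shape, scale)) for _ in range(20_000)]

sim_mean = mean(counts)
sim_var = variance(counts)
theory_var = mu + alpha * mu**2

print(f"mean = {sim_mean:.2f} (theory {mu})")
print(f"variance = {sim_var:.1f} (theory {theory_var})")
print(f"variance / mean = {sim_var / sim_mean:.2f} (Poisson would give about 1)")
```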
Overdispersion lab
For the same mean µ, the Poisson has variance fixed at µ, whereas NB has variance µ + αµ², which can be far larger. Drag α and watch NB "fatten" past the Poisson; this is exactly why DESeq2 / edgeR use NB rather than Poisson for RNA-seq. The panel reports the sample variance/mean ratio; as a rule of thumb, a ratio above 1.5 suggests you need NB.
Green = Poisson · Orange = NB (same µ)
4. Which distribution?
🌳 Distribution decision tree (interactive figure)
5. The eight distributions side by side
| Distribution | Type | Parameters | Mean / Variance | Typical use |
|---|---|---|---|---|
| Normal | Continuous | µ, σ | µ / σ² | Post-CLT summaries, Z-score, population assumption for t/ANOVA |
| t (df) | Continuous | df | 0 / df/(df−2) (variance needs df > 2) | Small-sample mean differences, regression coefficient tests |
| F (df₁, df₂) | Continuous | df₁, df₂ | df₂/(df₂−2) (mean only; needs df₂ > 2) | ANOVA variance ratios, mixed models |
| Chi² (df) | Continuous | df | df / 2df | Categorical data, variance CI, LRT |
| Log-normal | Continuous | µ_log, σ_log | e^(µ_log + σ_log²/2) (mean only) | Concentrations, titres, CFU, scRNA expression |
| Binomial | Discrete | n, p | np / np(1−p) | Positive rate, editing efficiency, genotype frequency |
| Poisson | Discrete | λ | λ / λ | Rare event counts, mutations, photons, low-expression genes |
| Neg. Binomial | Discrete | µ, α | µ / µ+αµ² | RNA-seq / ATAC / ChIP (DESeq2, edgeR) |
6. Worked examples

```r
# R: d* = density/PMF, p* = CDF, q* = quantile, r* = random

# --- Normal ---
dnorm(1.96); pnorm(1.96)          # 0.0584 ; 0.975
qnorm(0.975); rnorm(5, 0, 1)      # 1.96 ; sample

# --- Binomial: P(X = 7) when n=10, p=0.5 ---
dbinom(7, size=10, prob=0.5)
pbinom(7, 10, 0.5)                # P(X ≤ 7)

# --- Poisson: lambda = 3 mutations / generation ---
dpois(2, lambda=3); ppois(2, 3)

# --- t (df = 10) ---
qt(0.975, df=10)                  # critical value 2.228

# --- Chi-square & F critical values ---
qchisq(0.95, df=3); qf(0.95, df1=2, df2=20)

# --- Fit a negative binomial to (simulated) RNA-seq-like counts ---
library(MASS)
counts <- rnbinom(n=200, mu=10, size=2)   # size = 1/alpha
fit <- MASS::fitdistr(counts, "negative binomial")
fit$estimate                      # estimated size and mu

# --- Goodness-of-fit: chi-square ---
# 'expected' rather than 'exp', to avoid masking base::exp()
obs <- c(120, 230, 150); expected <- c(100, 250, 150)
chisq.test(obs, p=expected/sum(expected))

# --- DESeq2 idiom: NB GLM behind the curtain ---
# dds <- DESeqDataSetFromMatrix(counts, coldata, ~ condition)
# dds <- DESeq(dds)   # fits NB with shrinkage (Love 2014)
```

```python
import numpy as np
from scipy import stats

# --- Normal ---
stats.norm.pdf(1.96); stats.norm.cdf(1.96)        # 0.0584 ; 0.975
stats.norm.ppf(0.975); stats.norm.rvs(size=5)

# --- Binomial ---
stats.binom.pmf(7, n=10, p=0.5)
stats.binom.cdf(7, 10, 0.5)

# --- Poisson ---
stats.poisson.pmf(2, mu=3); stats.poisson.cdf(2, 3)

# --- t / Chi² / F critical values ---
stats.t.ppf(0.975, df=10)                         # 2.228
stats.chi2.ppf(0.95, df=3)
stats.f.ppf(0.95, dfn=2, dfd=20)

# --- Fit NB to counts (statsmodels) ---
import statsmodels.api as sm
counts = stats.nbinom.rvs(n=2, p=2/(2+10), size=200)   # mean 10, size 2
res = sm.NegativeBinomial(counts, np.ones(200)).fit(disp=0)
res.params                                        # intercept (log mu) + alpha

# --- Goodness-of-fit chi-square ---
stats.chisquare([120, 230, 150], f_exp=[100, 250, 150])

# --- pyDESeq2 idiom (NB GLM for RNA-seq DE) ---
# from pydeseq2.dds import DeseqDataSet
# dds = DeseqDataSet(counts=counts_df, metadata=meta, design_factors="condition")
# dds.deseq2()   # NB with empirical Bayes shrinkage
```
Quick recipe: for any count data, compute the variance-to-mean ratio first: var(x)/mean(x) in R, x.var(ddof=1)/x.mean() in Python (NumPy's default ddof=0 divides by n, whereas R's var() divides by n−1). As a rule of thumb: ratio > 1.5 → use NB; near 1 → Poisson; many zeros → also consider ZINB (pscl::zeroinfl()).
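The recipe can be wrapped in a small helper. This is a sketch with the standard library; the function names and both example vectors are hypothetical, and the 1.5 cutoff is this chapter's rule of thumb rather than a formal test for overdispersion.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance (n - 1 denominator, as in R's var) divided by the mean."""
    return variance(counts) / mean(counts)

def suggest_count_model(counts, threshold=1.5):
    """Rule-of-thumb suggestion only; the threshold is the chapter's heuristic."""
    return "negative binomial" if dispersion_index(counts) > threshold else "Poisson"

def zero_fraction(counts):
    """Share of zeros; a large share is the cue to also look at ZINB."""
    return sum(1 for c in counts if c == 0) / len(counts)

# Hypothetical count vectors for illustration
tight = [2, 3, 4, 3, 2, 4, 3, 3]
bursty = [0, 1, 0, 25, 3, 0, 40, 2]

for name, x in (("tight", tight), ("bursty", bursty)):
    print(f"{name}: ratio = {dispersion_index(x):.2f}, "
          f"zeros = {zero_fraction(x):.0%}, suggestion = {suggest_count_model(x)}")
```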
7. Six distribution mistakes
❌ t-test on count data
Treating cell-division counts, CFU or read counts as normal in a t-test breaks down when the mean is small (λ < 5): the normal approximation fails and the Type I error rate inflates noticeably. Use a Poisson or NB GLM instead, or apply a log or square-root transform before the t-test.
❌ Poisson for RNA-seq
Biological variation across samples gives Var ≫ Mean, so Poisson p-values are far too small. A gene with Var/Mean = 5 analysed as Poisson can see its false-positive rate jump from 5% to 40%. Use the NB GLM in DESeq2 / edgeR.
❌ Confusing Binomial with Poisson
"X positives in 100 people" is Binomial (bounded above by n); "X mutations in a 1 kb region of a gene" is Poisson (no upper bound). Confusing the two mixes up the denominator (sample size) with the exposure (time or length), and every CI built on it is wrong.
❌ Forcing proportions into a normal
Treating the positive rate p̂ as N(p̂, p̂(1−p̂)/n) to get a 95% CI fails when p̂ is near 0 or 1: the interval can extend outside [0, 1]. Use Wilson or Clopper-Pearson intervals instead (binom::binom.confint() in R, statsmodels.stats.proportion.proportion_confint() in Python).
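The leak outside [0, 1] is easy to reproduce. The sketch below implements the textbook Wald and Wilson score formulas by hand with the standard library; the function names and the example of 1 positive out of 20 are hypothetical, and in practice the library routines named above are the safer choice.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x: int, n: int, level: float = 0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_ci(x: int, n: int, level: float = 0.95):
    """Wilson score interval; always stays inside [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical screen: 1 positive out of 20
wald = wald_ci(1, 20)
wilson = wilson_ci(1, 20)

print(f"Wald:   ({wald[0]:.3f}, {wald[1]:.3f})")
print(f"Wilson: ({wilson[0]:.3f}, {wilson[1]:.3f})")
```

With so few positives the Wald lower limit is negative, while the Wilson interval stays within the valid range for a proportion.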
❌ Log-normal data left unlogged
Cytokine levels and antibody titres are strongly right-skewed on the raw scale. A t-test run directly on them is dominated by the tail, and power drops sharply. Log-transform first and then run the t-test, or use the Mann-Whitney U test, which makes no assumption about the shape of the distribution.
❌ Reporting mean ± SD for discrete data
Poisson SD = √λ and Binomial SD = √(np(1−p)), but for small counts "mean ± SD" often extends below zero, which is meaningless for a count. Report a 95% CI instead (Wilson for Binomial, exact Poisson CI).
📝 Self-check
1. In an RNA-seq differential expression analysis, a gene's sample variance / mean ≈ 6. Which model is most appropriate?
2. For a continuous variable X, which statement is correct?
3. "X positives out of 200 people screened" and "X SNVs in a 1 kb chromosomal region": which distribution does each follow?
4. You have tumour volumes from n = 8 mice to compare between two groups. Which sampling distribution is most appropriate for computing the p-value?