STEP 2 / 13

Probability Distributions

Every inferential test assumes a distribution behind your data; pick the wrong one and the p-values and CIs go wrong with it. Eight distributions, eight biology stories.

Why does the distribution "decide" your statistical test?

A probability distribution is the generative model behind your data. When you call t.test(), you assume both samples come from normal distributions (with equal variances only if you set var.equal = TRUE; R's default is the Welch test, which drops that assumption). When you run DESeq2 on RNA-seq counts, you are fitting a Negative Binomial GLM (Love 2014). Choose the wrong distribution and the p-value and CI mislead you: treating count data as normal misstates its variance, and ignoring RNA-seq overdispersion can push the Type I error rate far above the nominal 5% (the examples later in this chapter reach 30–50%).

This chapter walks through the eight workhorse distributions of biostatistics: five continuous (Normal, t, F, χ², log-normal) and three discrete (Binomial, Poisson, Negative Binomial). Each has a biological origin story and a "trap" you have to recognise.

💡
One slogan to remember: a distribution comes from the process that generated the data, not from what the data "looks like". Don't ask "which distribution does my data resemble?"; ask "what process produced it: a count? a proportion? a ratio of variances?" The mechanism dictates the distribution, and the distribution dictates the test.

1. PMF, PDF, CDF

A distribution can be described pointwise in two ways (a PMF for discrete data, a PDF for continuous data) and cumulatively through its CDF: three functions in total. Mixing them up is among the most common beginner errors in published papers.

🎯

PMF (discrete)

Probability mass function P(X=k) — for discrete variables, gives the probability of exactly value k. Example: 3 heads in 10 flips, P(X=3) = C(10,3)·0.5¹⁰.

Properties: Σ P(X=k) = 1; each P(X=k) ∈ [0, 1].

📐

PDF (continuous)

Probability density function f(x) — for continuous variables, f(x) is a density, not a probability. The probability of a single point is 0; you need to integrate over an interval: P(a<X<b) = ∫f(x)dx.

Properties: ∫ f(x)dx = 1; f(x) ≥ 0 (but may exceed 1).

📈

CDF (both)

Cumulative distribution function F(x) = P(X ≤ x) — always non-decreasing from 0 to 1. Its inverse is the quantile function Q(p) — that is exactly what qnorm(0.975)=1.96 computes.

Uses: p-values (CDF tails), critical values (Q), QQ-plots.

Discrete PMF: P(X=k) · Continuous PDF: f(x) · CDF: F(x) = P(X ≤ x) · Quantile: Q(p) = F⁻¹(p)
In R: d* = density/PMF, p* = CDF, q* = quantile, r* = random sample (dnorm/pnorm/qnorm/rnorm). In Python (scipy.stats): .pdf/.pmf, .cdf, .ppf, .rvs.
⚠️
Common confusion: a PDF value is not a probability. dnorm(0) ≈ 0.399, but the probability that a normal variable equals exactly 0 is 0; 0.399 is a density, and only density × a small interval width dx approximates a probability. There is no meaningful "P(height = 170.000 cm)", only something like P(169.5 < height < 170.5).
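The PMF / PDF / CDF distinction can be checked numerically. The sketch below is one way to do it, using only Python's standard library (`statistics.NormalDist` and `math.comb`) in place of the R and scipy calls named above; all variable names are illustrative.

```python
from math import comb
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# PMF: probability of exactly 3 heads in 10 fair coin flips.
pmf_3_heads = comb(10, 3) * 0.5**10

# PDF: a density, not a probability.
density_at_0 = std_normal.pdf(0.0)

# A probability needs an interval: P(-0.005 < X < 0.005) is roughly density * width.
width = 0.01
interval_prob = std_normal.cdf(width / 2) - std_normal.cdf(-width / 2)

# A narrow normal (sigma = 0.1) has a density above 1 at its centre.
narrow_density = NormalDist(mu=0.0, sigma=0.1).pdf(0.0)

# Quantile function: the inverse of the CDF.
z_975 = std_normal.inv_cdf(0.975)

print(f"P(X = 3 heads)            = {pmf_3_heads:.4f}")
print(f"density at x = 0          = {density_at_0:.4f}")
print(f"P(-0.005 < X < 0.005)     = {interval_prob:.6f}")
print(f"density at 0, sigma = 0.1 = {narrow_density:.3f}")
print(f"Q(0.975)                  = {z_975:.3f}")
```

The interval probability is tiny even though the density is about 0.4, and the narrow normal shows a density well above 1 while its total area is still 1.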

Distribution shape playground

Pick a distribution, drag its key parameter, and watch the shape change. The panel below recomputes the mean, variance and skewness live and shows P(X ≤ x) (the CDF), the same machinery your p-value calculations rely on.

Green = PDF/PMF · orange dashed = CDF · red = mean

2. Five continuous distributions

🔔

Normal N(µ, σ²)

Shape: symmetric bell; µ sets the centre, σ the width. The 68-95-99.7 rule (Galton 1886): ±1σ holds 68.27%, ±2σ 95.45%, ±3σ 99.73%.

Why is it everywhere? The central limit theorem (Step 3): the sum of many small, independent effects with finite variance converges to a normal distribution. Human height, instrument noise and summed immunostain intensity are all approximately normal.

Biology: adult male height (µ≈175 cm, σ≈7 cm), HbA1c, and the standardisation behind Z-scores.
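The 68-95-99.7 figures follow directly from the normal CDF. A minimal check, assuming only Python's standard library (the name `coverage` is illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability mass within k standard deviations of the mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within +/-{k} sd: {prob:.4%}")
```

Because the rule depends only on the standardised distance from the mean, the same percentages hold for any µ and σ.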

📏

Student's t (df)

Origin: Gosset (1908, pen name "Student"), working at the Guinness brewery, needed inference for small samples (n < 30). Replacing the unknown σ with the sample estimate s gives the standardised mean heavier tails than the normal; the result is the t distribution.

Shape: symmetric, zero-mean, heavier tails than the normal. As df → ∞, t → N(0,1); by df = 30 the two are close in practice (97.5% critical value about 2.04 versus 1.96).

Biology: every small-sample mean-difference test (tumour volumes in two mouse groups, paired comparisons) rests on t (Step 5).
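Gosset's observation can be reproduced by simulation. The sketch below is illustrative only: it draws many samples of size 5 from N(0, 1), computes the one-sample t statistic for each, and counts how often |t| exceeds 1.96. The seed, sample size and repetition count are arbitrary choices, and `t_statistic` is a helper defined here, not a library function.

```python
import random
from statistics import mean, stdev

def t_statistic(sample, mu0=0.0):
    """One-sample t statistic: (mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / n**0.5)

rng = random.Random(1908)  # fixed seed so the run is repeatable
n, reps = 5, 10_000

# If t followed N(0, 1), |t| > 1.96 would happen 5% of the time.
exceed = sum(
    abs(t_statistic([rng.gauss(0.0, 1.0) for _ in range(n)])) > 1.96
    for _ in range(reps)
)
tail_fraction = exceed / reps

print(f"P(|t| > 1.96) with n = {n}: {tail_fraction:.3f} (normal would give 0.050)")
```

With df = 4 the tail fraction comes out well above 5%, which is why small-sample tests use the t critical value instead of 1.96.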

⚖️

F (df₁, df₂)

Origin: R.A. Fisher 1924, as the ratio of two independent chi-square variables, each divided by its df. Snedecor (1934) named it "F" in honour of Fisher.

Use: variance ratios (ANOVA, Step 7); F = MSbetween / MSwithin. Always ≥ 0 and right-skewed; as df₁ and df₂ both → ∞ the distribution concentrates at 1.

Biology: multi-dose comparisons (5 treatment arms in mice), factorial designs (genotype × diet). A large F means between-group differences far exceed within-group variation.

📐

Chi-square χ²(df)

Definition: the sum of the squares of df independent N(0,1) variables. Highly right-skewed for df = 1, 2; approaches a normal shape as df → ∞.

Use: (1) categorical goodness-of-fit and contingency-table independence tests (Step 6); (2) CI for the variance of normal data: (n−1)s² / σ² ~ χ²(n−1); (3) the standard asymptotic distribution of likelihood-ratio tests (Wilks 1938).

Biology: Hardy-Weinberg equilibrium tests, 2×3 genotype × phenotype tables, LRTs in GWAS.
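The definition can be checked by simulation. The sketch below assumes df = 3 and an arbitrary seed; it builds each draw as a sum of squared standard normals and compares the simulated mean and variance with the theoretical values df and 2·df.

```python
import random
from statistics import fmean, variance

rng = random.Random(42)  # fixed seed so the run is repeatable
df, reps = 3, 20_000

# Each draw is the sum of df squared independent N(0, 1) values.
draws = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(reps)]

sim_mean = fmean(draws)
sim_var = variance(draws)

print(f"simulated mean     = {sim_mean:.2f} (theory: df = {df})")
print(f"simulated variance = {sim_var:.2f} (theory: 2*df = {2 * df})")
```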

🌋

Log-normal

Definition: X is log-normal ⟺ ln(X) ~ Normal. Strictly positive, right-skewed.

Why so common? When a measurement results from many multiplicative factors (each step a proportional change), taking logs turns the product into a sum, the CLT applies to that sum, and X is log-normal. Limpert et al. (2001, BioScience) argued that many biomedical measurements are better described as log-normal than normal.

Biology: blood cytokine levels, CRP, IgE, antibody titres, single-cell RNA-seq expression, bacterial CFU counts.

Handling: log-transform before analysis (or use a log link in a GLM); when reporting, back-transforming mean(log X) gives the geometric mean.
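The back-transformation step can be illustrated on made-up numbers. The concentrations below are hypothetical, chosen only to be right-skewed; the sketch shows that exp(mean(log X)) is the geometric mean and sits below the arithmetic mean, which the long tail pulls upward.

```python
from math import exp, log
from statistics import fmean, geometric_mean

# Hypothetical cytokine concentrations (pg/mL), right-skewed on the raw scale.
conc = [12.0, 15.0, 18.0, 22.0, 30.0, 45.0, 160.0]

arithmetic_mean = fmean(conc)
mean_log = fmean([log(x) for x in conc])  # mean on the log scale
back_transformed = exp(mean_log)          # back to the original units

print(f"arithmetic mean        = {arithmetic_mean:.2f}")
print(f"exp(mean(log x))       = {back_transformed:.2f}")
print(f"geometric_mean(x)      = {geometric_mean(conc):.2f}")
```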

3. Three discrete distributions

🪙

Binomial Bin(n, p)

Story: n independent Bernoulli trials, each with success probability p; X ∈ {0,...,n} counts the successes.

Mean / variance: E[X] = np, Var[X] = np(1−p).

Normal approximation: when np ≥ 5 and n(1−p) ≥ 5, Bin ≈ N(np, np(1−p)), so a Z-test can be used.

Biology: cancer-screening positive rate (X positives in n people), PCR success rate, allele frequency, CRISPR editing efficiency.
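To see how good the normal approximation is at the rule-of-thumb boundary, the sketch below compares the exact binomial CDF with N(np, np(1−p)) for n = 10, p = 0.5. The continuity correction of 0.5 is an addition not mentioned in the text above, and both function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k) with a continuity correction of 0.5."""
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return approx.cdf(k + 0.5)

n, p, k = 10, 0.5, 7  # np = n(1-p) = 5, right at the boundary of the rule
exact = binom_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)

print(f"exact  P(X <= {k}) = {exact:.4f}")
print(f"normal P(X <= {k}) = {approx:.4f}")
print(f"absolute error     = {abs(exact - approx):.4f}")
```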

Poisson Pois(λ)

Story: the number of events per unit time or space when events are independent and occur at a constant rate λ.

Key: E[X] = Var[X] = λ, the Poisson "ID badge".

History: Luria & Delbrück (1943) used the Poisson as their benchmark: mutant counts across parallel cultures were far more variable than a Poisson allows, which showed that bacterial mutations are pre-existing rather than induced (Nobel 1969). Bortkiewicz (1898) illustrated the distribution with deaths from horse kicks in the Prussian cavalry.

Biology: cell divisions per unit time, RNA-seq counts for low-expression genes (high-expression genes are typically overdispersed → NB), PCR template numbers, radioactive decay, photon counts.

📊

Negative Binomial

Story: the loosened Poisson, used when the variance exceeds the mean (overdispersion). It can be read as a Gamma-Poisson mixture: a Poisson whose rate λ is itself Gamma-distributed.

Key: Var[X] = µ + αµ²; α > 0 is the dispersion parameter; as α → 0 it reduces to the Poisson.

Why the workhorse of RNA-seq? Biological plus technical variation across samples gives Var ≫ Mean. edgeR (Robinson 2010) and DESeq2 (Love 2014, Genome Biology) are both built on NB GLMs; using a Poisson instead inflates false positives substantially.

Biology: bulk RNA-seq, ATAC-seq, ChIP-seq peak counts, single-cell UMI counts (often NB or ZINB).
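The Gamma-Poisson reading can be simulated directly. The sketch below is a rough illustration under stated assumptions: µ = 10 and α = 0.5 are arbitrary, the Poisson sampler is a hand-written helper (Knuth's multiplication method, since the standard library has none), and the Gamma rate uses shape 1/α and scale αµ so that its mean is µ and its variance is αµ².

```python
import math
import random
from statistics import fmean, variance

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(2014)  # fixed seed so the run is repeatable
mu, alpha, reps = 10.0, 0.5, 20_000

# Step 1: each sample gets its own rate from Gamma(shape = 1/alpha, scale = alpha*mu).
rates = [rng.gammavariate(1 / alpha, alpha * mu) for _ in range(reps)]
# Step 2: counts are Poisson given that rate, which makes them NB marginally.
nb_counts = [poisson_draw(lam, rng) for lam in rates]
# Reference: plain Poisson with the same mean.
pois_counts = [poisson_draw(mu, rng) for _ in range(reps)]

nb_ratio = variance(nb_counts) / fmean(nb_counts)
pois_ratio = variance(pois_counts) / fmean(pois_counts)

print(f"Poisson  Var/Mean = {pois_ratio:.2f} (theory: 1)")
print(f"NB       Var/Mean = {nb_ratio:.2f} (theory: 1 + alpha*mu = {1 + alpha * mu:.0f})")
```

Both sets of counts share the same mean, yet the mixture's variance-to-mean ratio lands near 1 + αµ, matching Var = µ + αµ².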

Bin: P(X=k) = C(n,k) pᵏ (1−p)ⁿ⁻ᵏ   ·   Pois: P(X=k) = λᵏ e^(−λ) / k!   ·   NB: Var = µ + αµ²
Family tree: Bin(n, p) → Pois(λ) as n→∞ with np = λ held fixed; Pois plus between-individual variation in λ → NB. That is the discrete-distribution lineage.
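The Binomial-to-Poisson limit can be verified from the two PMFs, using the standard Poisson form λᵏ e^(−λ) / k!. In the sketch below λ = 3 and the grid of n values are arbitrary choices; the largest pointwise gap between the PMFs shrinks as n grows with np held at λ.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0
max_gap = {}
for n in (10, 100, 1000):
    p = lam / n  # keep np = lam fixed while n grows
    max_gap[n] = max(
        abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11)
    )

for n, gap in max_gap.items():
    print(f"n = {n:>4}: largest |Bin - Pois| over k = 0..10 is {gap:.5f}")
```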

Overdispersion lab

For the same mean µ, the Poisson variance is fixed at µ, whereas the NB variance µ + αµ² can be far larger. Drag α and watch the NB "fatten" past the Poisson, which is why DESeq2 / edgeR use NB rather than Poisson for RNA-seq. The panel reports the sample variance/mean ratio; as a rough rule of thumb, a ratio above 1.5 suggests you need NB.

Green = Poisson · Orange = NB (same µ)

Trap: using Poisson for RNA-seq counts. Early tools (e.g. DEGseq) modelled RNA-seq counts as Poisson, with reported false-positive rates as high as 30–50%. Robinson & Smyth 2008, Anders & Huber 2010 and Love 2014 all show that variation between biological samples far exceeds what a Poisson allows. Rule: model raw counts with NB (DESeq2 / edgeR); TPM / FPKM values are already normalised and unsuitable for differential expression analysis, so feed raw counts in.

4. Which distribution?

🌳 Distribution decision tree

Q1: Continuous or discrete count? → continuous → Q2; → count → Q5.
Q2: Continuous and symmetric? → Yes → Normal (use t for two-mean tests with n < 30; F for variance ratios). → No → Q3.
Q3: Right-skewed and strictly positive (concentrations, titres, expression)? → Yes → Log-normal; log-transform and treat as normal.
Q4: Variance estimation / categorical goodness-of-fit? → Yes → Chi-square χ².
Q5: Count with an upper bound n (X positives out of n)? → Yes → Binomial. → No → Q6.
Q6: Unbounded event counts (per unit time/region)? Check Var/Mean: → ≈ 1 → Poisson; → > 1.5 → Negative Binomial.
Q7: Excess zeros (e.g. scRNA-seq dropout)? → Yes → Zero-Inflated NB (ZINB) or a hurdle model.

5. The eight distributions side by side

Distribution | Type | Parameters | Mean / variance | Typical use
Normal | Continuous | µ, σ | µ / σ² | Post-CLT summaries, Z-scores, population assumption for t/ANOVA
t (df) | Continuous | df | 0 / df/(df−2) (df > 2) | Small-sample mean differences, regression coefficient tests
F (df₁, df₂) | Continuous | df₁, df₂ | mean df₂/(df₂−2) (df₂ > 2) | ANOVA variance ratios, mixed models
Chi² (df) | Continuous | df | df / 2df | Categorical data, variance CI, LRT
Log-normal | Continuous | µ_log, σ_log | mean e^(µ_log + σ_log²/2) | Concentrations, titres, CFU, scRNA expression
Binomial | Discrete | n, p | np / np(1−p) | Positive rate, editing efficiency, allele frequency
Poisson | Discrete | λ | λ / λ | Rare event counts, mutations, photons, low-expression genes
Neg. Binomial | Discrete | µ, α | µ / µ+αµ² | RNA-seq / ATAC / ChIP (DESeq2, edgeR)
Four convergence facts to memorise: (1) Binomial(n, p) → Poisson(λ = np) for large n and small p; (2) Poisson(λ) → Normal(λ, λ) for large λ (> 30); (3) t(df) → N(0,1) for large df (> 30); (4) χ²(df) → Normal(df, 2df) for large df. Facts (2) and (4) are the CLT in action, since both are sums of many independent pieces; (3) follows because s converges to σ, and (1) is the law of rare events. Step 3 unpacks the CLT.

6. Worked examples

# R: d* = density/PMF, p* = CDF, q* = quantile, r* = random
# --- Normal ---
dnorm(1.96); pnorm(1.96)        # 0.0584 ; 0.975
qnorm(0.975); rnorm(5, 0, 1)     # 1.96 ; sample

# --- Binomial: P(X = 7) when n=10, p=0.5 ---
dbinom(7, size=10, prob=0.5)
pbinom(7, 10, 0.5)               # P(X ≤ 7)

# --- Poisson: lambda = 3 mutations / generation ---
dpois(2, lambda=3); ppois(2, 3)

# --- t (df = 10) ---
qt(0.975, df=10)                # critical value 2.228

# --- Chi-square & F critical values ---
qchisq(0.95, df=3); qf(0.95, df1=2, df2=20)

# --- Fit a negative binomial to RNA-seq counts ---
library(MASS)
counts <- rnbinom(n=200, mu=10, size=2)   # size = 1/alpha
fit <- MASS::fitdistr(counts, "negative binomial")
fit$estimate                                # estimated size (= 1/alpha) and mu

# --- Goodness-of-fit: chi-square ---
observed <- c(120, 230, 150); expected <- c(100, 250, 150)   # avoid reusing the name of base::exp
chisq.test(observed, p = expected / sum(expected))

# --- DESeq2 idiom: NB GLM behind the curtain ---
# dds <- DESeqDataSetFromMatrix(counts, coldata, ~ condition)
# dds <- DESeq(dds)   # fits NB with shrinkage (Love 2014)

# Python (scipy.stats): .pdf/.pmf, .cdf, .ppf, .rvs
import numpy as np
from scipy import stats

# --- Normal ---
stats.norm.pdf(1.96); stats.norm.cdf(1.96)   # 0.0584 ; 0.975
stats.norm.ppf(0.975); stats.norm.rvs(size=5)

# --- Binomial ---
stats.binom.pmf(7, n=10, p=0.5)
stats.binom.cdf(7, 10, 0.5)

# --- Poisson ---
stats.poisson.pmf(2, mu=3); stats.poisson.cdf(2, 3)

# --- t / Chi² / F critical values ---
stats.t.ppf(0.975, df=10)                     # 2.228
stats.chi2.ppf(0.95, df=3)
stats.f.ppf(0.95, dfn=2, dfd=20)

# --- Fit NB to counts (statsmodels) ---
import statsmodels.api as sm
# scipy parameterisation: n = size = 1/alpha, p = size / (size + mu); here mu = 10
counts = stats.nbinom.rvs(n=2, p=2/(2+10), size=200)
res = sm.NegativeBinomial(counts, np.ones(200)).fit(disp=0)
res.params  # intercept (log mu) + alpha

# --- Goodness-of-fit chi-square ---
stats.chisquare([120,230,150], f_exp=[100,250,150])

# --- pyDESeq2 idiom (NB GLM for RNA-seq DE) ---
# from pydeseq2.dds import DeseqDataSet
# dds = DeseqDataSet(counts=counts_df, metadata=meta, design_factors="condition")
# dds.deseq2()   # NB with empirical Bayes shrinkage
💡
Quick recipe: for any count data, compute the variance-to-mean ratio first: var(x)/mean(x) in R, x.var(ddof=1)/x.mean() in Python (NumPy's default ddof=0 divides by n, unlike R's var). Ratio > 1.5 → go NB; near 1 → Poisson; many zeros → also consider ZINB (pscl::zeroinfl()).
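The recipe can be wrapped in a small helper. This is a sketch of the rule of thumb only, not a formal overdispersion test: `dispersion_ratio` and `suggest_count_model` are hypothetical names, the 1.5 threshold is the heuristic quoted above, and the read counts are invented for illustration.

```python
from statistics import fmean, variance

def dispersion_ratio(counts):
    """Sample variance (n - 1 denominator, like R's var) divided by the mean."""
    return variance(counts) / fmean(counts)

def suggest_count_model(counts, threshold=1.5):
    """Rule-of-thumb label only; it is not a formal overdispersion test."""
    ratio = dispersion_ratio(counts)
    return "Negative Binomial" if ratio > threshold else "Poisson"

# Hypothetical read counts for two genes across eight samples.
gene_a = [4, 6, 5, 3, 7, 5, 4, 6]       # variance below the mean
gene_b = [2, 40, 11, 0, 75, 8, 3, 21]   # variance far above the mean

for name, counts in (("gene_a", gene_a), ("gene_b", gene_b)):
    ratio = dispersion_ratio(counts)
    print(f"{name}: Var/Mean = {ratio:.2f} -> {suggest_count_model(counts)}")
```

With only eight samples the ratio is itself noisy, so a borderline value is a prompt to look closer, not a verdict.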

7. Six distribution mistakes

t-test on count data

Treating cell-division counts, CFU or read counts as normal in a t-test breaks down when the mean is small (λ < 5): the normal approximation fails and the Type I error rate inflates. Use a Poisson or NB GLM, or apply a log or √ transform before the t-test.

Poisson for RNA-seq

Biological variation across samples gives Var ≫ Mean, so Poisson p-values come out far too small. For a gene with Var/Mean = 5 analysed as Poisson, the standard error is understated by a factor of √5 ≈ 2.2, and the false-positive rate can jump from the nominal 5% to roughly 40%. Use the NB GLM in DESeq2 / edgeR.
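The 5% to roughly 40% jump can be reproduced with a short calculation. The sketch assumes a large-sample two-sided z-test whose standard error is computed under a Poisson model while the true variance is a fixed multiple of the mean; `inflated_fpr` is an illustrative helper, not a library function.

```python
from math import sqrt
from statistics import NormalDist

def inflated_fpr(var_mean_ratio, alpha=0.05):
    """Actual two-sided false-positive rate of a Poisson-based z-test
    when the true variance is var_mean_ratio times the mean."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # nominal cut-off, about 1.96 for 5%
    # The statistic really has variance var_mean_ratio, so the cut-off
    # sits at only z_crit / sqrt(var_mean_ratio) true standard deviations.
    return 2 * (1 - std_normal.cdf(z_crit / sqrt(var_mean_ratio)))

for ratio in (1, 2, 5, 10):
    print(f"Var/Mean = {ratio:>2}: actual false-positive rate = {inflated_fpr(ratio):.3f}")
```

A ratio of 1 recovers the nominal 5%; a ratio of 5 gives a rate just under 40%.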

Binomial vs Poisson

"X positives in 100 people" is Binomial (bounded by n); "X mutations in a 1 kb region" is Poisson (unbounded). Confusing the two mixes up the denominator (sample size) with the exposure (time/length), and the resulting CIs are wrong.

Forcing proportions into a normal

Treating p̂ as N(p̂, p̂(1−p̂)/n) for a 95% CI fails when p̂ is near 0 or 1: the interval can extend outside [0, 1]. Use Wilson or Clopper-Pearson intervals instead (binom::binom.confint() in R, statsmodels.stats.proportion.proportion_confint() in Python).
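The failure is easy to see with 1 positive out of 20. The sketch below hand-codes the normal (Wald) and Wilson score intervals with the standard library so that the formulas are visible; it is not the implementation used by the R or statsmodels functions named above, and `wald_ci` and `wilson_ci` are illustrative names.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x, n, conf=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_ci(x, n, conf=0.95):
    """Wilson score interval for a proportion; always stays inside [0, 1]."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

x, n = 1, 20  # one positive among twenty samples
wald_low, wald_high = wald_ci(x, n)
wilson_low, wilson_high = wilson_ci(x, n)

print(f"Wald   95% CI: ({wald_low:.3f}, {wald_high:.3f})")
print(f"Wilson 95% CI: ({wilson_low:.3f}, {wilson_high:.3f})")
```

The Wald lower limit is negative, which is impossible for a proportion, while the Wilson interval stays within [0, 1].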

Log-normal data without the log

Cytokine levels and antibody titres are strongly right-skewed on the raw scale. A t-test on the raw values is dominated by the tail and loses power sharply. Log-transform first, then run the t-test, or use Mann-Whitney U (which does not assume normality).

Reporting mean ± SD for discrete distributions

Poisson SD = √λ and Binomial SD = √(np(1−p)), but for small counts "mean ± SD" often extends below zero, which is meaningless. Report a 95% CI instead (Wilson for Binomial, exact Poisson CI).

🚨
One-liner: there is no universal distribution, only the one that matches the generative mechanism. When data lands on your desk, ask three things: (1) continuous or count? (2) bounded or unbounded? (3) what is the variance-to-mean ratio? With those three answers the distribution is almost decided.

📝 Self-check

1. In an RNA-seq DE analysis, a gene's sample variance / mean ≈ 6. What's most appropriate?

A. Use a Poisson GLM, since it's count data
B. Treat counts as normal and run a t-test
C. Use a Negative Binomial GLM (DESeq2 / edgeR)
D. Compute TPM and look at correlation

2. For a continuous variable X, which is correct?

A. The PDF value equals P(X = x)
B. P(X = x) = 0; probabilities come from integrating the PDF over an interval
C. The PDF is always ≤ 1
D. The CDF can exceed 1

3. "X positives out of 200 screened" vs "X SNVs in a 1 kb region": which distributions?

A. Both Poisson
B. Both Binomial
C. First Poisson, second Binomial
D. First Binomial (bounded by n=200), second Poisson (unbounded event count)

4. You are comparing tumour volumes between two groups with n = 8 mice. Which sampling distribution is most appropriate for the p-value?

A. Standard normal N(0,1)
B. Chi-square
C. Student's t (small sample; df controls the tail thickness)
D. Binomial