Step 2: Probability Distributions — Biostatistics Tutorial

Overview

Why does the distribution "determine" your statistical test?

A probability distribution is the generative model behind your data. When you call t.test(), you assume each sample comes from a normal distribution (R's default is the Welch test, which does not require equal variances; var.equal=TRUE gives the classic pooled-variance Student test). When you run DESeq2 for RNA-seq, you are fitting a Negative Binomial GLM (Love 2014). Choose the wrong distribution and the p-value and CI are distorted: a normal approximation to count data can underestimate the variance, and ignoring RNA-seq overdispersion can push the Type I error rate from the nominal 5% into the 30–50% range.

This chapter walks through the eight workhorse distributions of biostatistics: five continuous (Normal, t, F, χ², log-normal) and three discrete (Binomial, Poisson, Negative Binomial). Each has a biological origin story and a "trap" you have to recognise.

💡

One slogan to remember: a distribution comes from the process that generated the data, not from what the data "looks like". Don't ask "which distribution does my data resemble?"; ask "what process produced it: a count? a proportion? a ratio of variances?" The mechanism dictates the distribution, and the distribution dictates the test.

Basic vocabulary

1. PMF, PDF, CDF

A distribution is described by three functions: a PMF (for discrete data) or a PDF (for continuous data), plus a cumulative version, the CDF, which exists for both. Mixing up the PMF and the PDF is among the most common rookie errors in print.

🎯

PMF (discrete)

Probability mass function P(X=k): for a discrete variable, it gives the probability of exactly the value k. Example: 3 heads in 10 fair coin flips, P(X=3) = C(10,3)·0.5¹⁰ ≈ 0.117.

Properties: Σ P(X=k) = 1; each P(X=k) ∈ [0, 1].
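The coin-flip example can be checked by hand. This is a minimal sketch using only the Python standard library; `binom_pmf` is a helper name introduced here for illustration, not a library function.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p), straight from the formula C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# 3 heads in 10 fair flips: C(10,3) * 0.5**10 = 120 / 1024
p3 = binom_pmf(3, 10, 0.5)

# A PMF must sum to 1 over every possible value k = 0..n
total = sum(binom_pmf(k, 10, 0.5) for k in range(11))

print(p3, total)
```

In R the same number comes from dbinom(3, 10, 0.5); in scipy, from stats.binom.pmf(3, 10, 0.5).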

📐

PDF (continuous)

Probability density function f(x): for a continuous variable, f(x) is a density, not a probability. The probability of any single point is 0; you have to integrate over an interval to get a probability: P(a<X<b) = ∫ₐᵇ f(x)dx.

Properties: ∫ f(x)dx = 1 over the whole range; f(x) ≥ 0 (but f(x) may exceed 1).
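To see that a density can exceed 1 while the total area stays at 1, here is a small sketch with a deliberately narrow normal (σ = 0.1, a value chosen only for illustration). It uses the standard-library `statistics.NormalDist` rather than scipy.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0.0, sigma=0.1)

# The density at the peak is about 3.99: a legal density, not a probability
peak_density = narrow.pdf(0.0)

# Crude Riemann sum of f(x) over [-1, 1], i.e. 10 sigma on either side
step = 0.001
area = sum(narrow.pdf(-1 + i * step) * step for i in range(2001))

# Probabilities come from intervals: P(-0.1 < X < 0.1) is the usual 1-sigma mass
p_interval = narrow.cdf(0.1) - narrow.cdf(-0.1)

print(peak_density, area, p_interval)
```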

📈

CDF (both)

Cumulative distribution function F(x) = P(X ≤ x): always non-decreasing, rising from 0 to 1. Its inverse is the quantile function Q(p), which is exactly what qnorm(0.975) ≈ 1.96 computes in R.

Uses: p-values (CDF tails), critical values (Q), QQ-plots.
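The CDF/quantile round trip and a tail-area p-value can be sketched as follows. `two_sided_p` is a hypothetical helper written for this example, and the standard normal is assumed as the reference distribution.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal N(0, 1)

# Quantile function Q(p): the analogue of R's qnorm(0.975)
crit = z.inv_cdf(0.975)

# The CDF undoes the quantile function: F(Q(p)) = p
back = z.cdf(crit)

def two_sided_p(z_obs: float) -> float:
    """Two-sided p-value: the probability mass in both tails beyond |z_obs|."""
    return 2 * (1 - z.cdf(abs(z_obs)))

print(crit, back, two_sided_p(2.5))
```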

⌜ Discrete PMF: P(X=k) · Continuous PDF: f(x) · CDF: F(x) = P(X ≤ x) · Quantile: Q(p) = F⁻¹(p) ⌝ In R: d* = density/PMF, p* = CDF, q* = quantile, r* = random sample (e.g. dnorm/pnorm/qnorm/rnorm). In Python (scipy.stats): .pdf/.pmf, .cdf, .ppf, .rvs.

⚠️

Common confusion: a PDF value is not a probability. dnorm(0) ≈ 0.399, but the probability that a normal variable equals exactly 0 is 0; 0.399 is a density, and only density × a small interval width dx approximates a probability. There is no useful "P(height = 170.000 cm)", only P(169.5 < height < 170.5).

Interactive simulation 1

Distribution shape playground

Pick a distribution, drag its key parameter, and watch the shape change. The panel recomputes the mean, variance and skewness live and shows P(X ≤ x) (the CDF), the same machinery your p-value calculations rely on.

[Interactive plot: distribution-type selector with sliders for µ (default 0) and σ (default 1). Green = PDF/PMF · orange dashed = CDF · red = mean.]

Tour of continuous distributions

2. Five continuous distributions

🔔

Normal N(µ, σ²)

Shape: symmetric bell; µ sets the centre, σ the width. The 68-95-99.7 rule (Galton 1886): ±1σ holds 68.27%, ±2σ 95.45%, ±3σ 99.73%.

Why is it everywhere? The central limit theorem (CLT, Step 3): the sum of many independent small effects, each with finite variance, tends toward a normal distribution. Human height, instrument noise and summed immunostain intensity are all approximately normal for this reason.

Biology: adult male height (µ≈175 cm, σ≈7 cm), HbA1c, the Z-score standardisation step.
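The 68-95-99.7 figures follow directly from the normal CDF. A minimal sketch, reusing the height example above (µ = 175, σ = 7) and the standard-library `NormalDist`:

```python
from statistics import NormalDist

height = NormalDist(mu=175, sigma=7)

# Mass within k standard deviations of the mean, for k = 1, 2, 3
coverage = {
    k: height.cdf(175 + k * 7) - height.cdf(175 - k * 7)
    for k in (1, 2, 3)
}

for k, mass in coverage.items():
    print(f"within ±{k}σ: {mass:.4f}")
```

The result does not depend on µ or σ: any normal distribution gives the same three masses.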

📏

Student's t (df)

Origin: Gosset (1908, pen name "Student") needed small-sample inference (n < 30) at the Guinness brewery. He found that once the unknown σ is replaced by the sample estimate s, the statistic is no longer normal: its tails are heavier. That is the t distribution.

Shape: symmetric, zero-mean, heavier tails than the normal. As df → ∞, t → N(0,1); by df = 30 the two are practically indistinguishable.

Biology: every small-sample mean-difference test (tumour volumes in two mouse groups, paired comparisons) rests on t (Step 5).
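Why the heavier tails matter can be seen in a quick simulation: build "95%" intervals for the mean of n = 5 normal observations using s but the normal critical value 1.96, and count how often they cover the true mean. This is an illustrative sketch; the sample size, seed and repetition count are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(1)
n, reps, z_crit = 5, 10_000, 1.96   # true mean is 0, true sd is 1

hits = 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    half_width = z_crit * stdev(sample) / n**0.5   # s replaces sigma here
    if abs(mean(sample)) <= half_width:
        hits += 1

coverage = hits / reps
print(f"coverage with z = 1.96 and n = {n}: {coverage:.3f}")
```

The coverage lands well below 0.95. Swapping 1.96 for the t critical value with n − 1 = 4 degrees of freedom (qt(0.975, 4) in R) restores the nominal level.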

⚖️

F (df₁, df₂)

Origin: R.A. Fisher 1924. It is the ratio of two independent chi-square variables, each divided by its df. Snedecor (1934) named it "F" in honour of Fisher.

Use: variance ratios (ANOVA, Step 7); F = MS_between / MS_within. Always ≥ 0 and right-skewed; as df₁ and df₂ both → ∞, the distribution concentrates at 1.

Biology: multi-dose comparisons (5 treatment arms in mice), factorial designs (genotype × diet). A large F means between-group differences are much larger than within-group variation.
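The ratio F = MS_between / MS_within can be computed by hand for a one-way layout. The three groups below are made-up numbers chosen so the arithmetic is easy to follow; this is a sketch of the definition, not a replacement for aov() or scipy.stats.f_oneway.

```python
from statistics import mean

# Hypothetical measurements for three treatment groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)
k, n_total = len(groups), len(all_values)

# Between-group sum of squares: group means around the grand mean
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)        # df1 = k - 1
ms_within = ss_within / (n_total - k)    # df2 = N - k
f_stat = ms_between / ms_within

print(f_stat)  # compare against the F(df1 = 2, df2 = 6) distribution
```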

📐

Chi-square χ²(df)

Definition: the sum of the squares of df independent N(0,1) variables. Highly right-skewed for df = 1, 2; approaches a normal shape as df → ∞.

Use: (1) categorical goodness-of-fit and contingency-table independence tests (Step 6); (2) CI for a variance: (n−1)s² / σ² ~ χ²(n−1); (3) the standard asymptotic distribution of likelihood-ratio tests (Wilks 1938).

Biology: Hardy-Weinberg equilibrium tests, 2×3 genotype × phenotype tables, LRTs across GWAS.
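As a worked instance of the goodness-of-fit use, here is a sketch of a Hardy-Weinberg test for one biallelic locus. `hwe_chisq` is a hypothetical helper and the genotype counts are invented. The test has df = 1 (3 classes − 1 − 1 estimated allele frequency), and because χ²(1) is the square of a single N(0,1) variable, the p-value can be read from the normal CDF.

```python
from statistics import NormalDist

def hwe_chisq(n_AA: int, n_Aa: int, n_aa: int) -> tuple[float, float]:
    """Chi-square statistic and p-value (df = 1) for Hardy-Weinberg equilibrium."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # estimated frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # chi-square(1) = Z^2, so P(chi2 > stat) = P(|Z| > sqrt(stat))
    p_value = 2 * (1 - NormalDist().cdf(stat**0.5))
    return stat, p_value

stat, p_value = hwe_chisq(298, 489, 213)     # hypothetical genotype counts
print(stat, p_value)
```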

🌋

Log-normal

Definition: X is log-normal ⟺ ln(X) ~ Normal. Strictly positive, right-skewed.

Why so common? When the data result from many multiplicative factors (each step a proportional change), the log turns multiplication into addition → CLT → log-normal. Limpert et al. (2001, BioScience) argued that most biomedical readouts are natively log-normal rather than normal.

Biology: blood cytokine levels, CRP, IgE, antibody titres, single-cell RNA-seq expression levels, bacterial CFU counts.

Handling: log-transform before analysis (or use a log link in a GLM). When reporting, back-transforming mean(log X) gives the geometric mean, not the arithmetic mean.
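The back-transformation step is easy to get wrong, so here is a small sketch with three invented titre values spanning two orders of magnitude.

```python
from math import exp, log
from statistics import geometric_mean, mean

titres = [1, 10, 100]   # hypothetical antibody titres

arith = mean(titres)                            # dominated by the largest value
geo = exp(mean([log(x) for x in titres]))       # back-transformed mean of logs
geo_builtin = geometric_mean(titres)            # same quantity from the stdlib

print(arith, geo, geo_builtin)
```

The arithmetic mean (37) sits far above the typical value, while the geometric mean (10) matches the centre of the data on the log scale.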

Tour of discrete distributions

3. Three discrete distributions

🪙

Binomial Bin(n, p)

Story: n independent Bernoulli trials, each with success probability p; X counts the successes, X ∈ {0,...,n}.

Mean / variance: E[X] = np, Var[X] = np(1−p).

Normal approximation: when np ≥ 5 and n(1−p) ≥ 5, Bin ≈ N(np, np(1−p)), so a Z-test becomes usable.

Biology: cancer-screening positive rate (X positives in n people), PCR success rate, genotype frequency, CRISPR editing efficiency.
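To get a feel for how good the normal approximation is right at the np = 5 boundary, the sketch below compares the exact P(X ≤ 7) for Bin(10, 0.5) with the normal value. The continuity correction (evaluating at k + 0.5) is a standard refinement that the text above does not mention, so treat it as an added assumption.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Bin(n, p), summing the PMF."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n, p, k = 10, 0.5, 7            # np = n(1-p) = 5, right at the rule of thumb

exact = binom_cdf(k, n, p)      # 968 / 1024
approx = NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(k + 0.5)

print(exact, approx, abs(exact - approx))
```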

⚡

Poisson Pois(λ)

Story: the number of events per unit time or space when events are independent and occur at a constant rate λ.

Key: E[X] = Var[X] = λ. This equality is the Poisson "ID badge".

History: Bortkiewicz 1898 validated it on Prussian cavalrymen kicked to death by horses. Luria & Delbrück 1943 used Poisson as the benchmark: resistant-colony counts varied far more across cultures than a Poisson model of induced mutation predicts, which showed that bacterial mutations are pre-existing, not induced (Nobel 1969).

Biology: cell divisions per unit time, low-expression RNA-seq counts (high-expression genes are typically overdispersed → NB), PCR template numbers, radioactive decay, photon counts.
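The "mean equals variance" badge can be verified numerically from the PMF. A minimal sketch for λ = 3 (the value used in the code section of this chapter); `pois_pmf` is a helper defined here, and the sum is truncated at k = 59, where the remaining mass is negligible.

```python
from math import exp, factorial

def pois_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Pois(lam): lam^k * e^(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0
ks = range(60)   # truncation point; the tail beyond this is negligible for lam = 3

pois_mean = sum(k * pois_pmf(k, lam) for k in ks)
pois_var = sum((k - pois_mean) ** 2 * pois_pmf(k, lam) for k in ks)

print(pois_pmf(2, lam), pois_mean, pois_var)
```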

📊

Negative Binomial

Story: the loosened Poisson, used when the variance exceeds the mean (overdispersion). It can be read as a Gamma-Poisson mixture: a Poisson whose rate λ is itself Gamma-distributed.

Key: Var[X] = µ + αµ²; α > 0 is the dispersion parameter, and as α → 0 the NB reduces to the Poisson.

Why the workhorse of RNA-seq? Biological plus technical variation across samples gives Var ≫ Mean. edgeR (Robinson 2010) and DESeq2 (Love 2014, Genome Biology) both rest on NB GLMs; using Poisson instead dramatically inflates false positives.

Biology: bulk RNA-seq, ATAC-seq, ChIP-seq peak counts, single-cell UMI counts (often modelled as NB or ZINB).
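The variance formula can be checked against the NB probability mass function itself. The sketch below uses the mean/dispersion parameterisation with size r = 1/α and success probability r/(r + µ), which is the convention behind R's rnbinom(mu=, size=); `nb_pmf` is a helper written for this example, and µ = 10, α = 0.5 are illustrative values.

```python
from math import exp, lgamma, log

def nb_pmf(k: int, mu: float, alpha: float) -> float:
    """NB probability of count k, parameterised by mean mu and dispersion alpha."""
    r = 1.0 / alpha                 # "size" in R's rnbinom
    p = r / (r + mu)
    log_pmf = (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
               + r * log(p) + k * log(1 - p))
    return exp(log_pmf)

mu, alpha = 10.0, 0.5
ks = range(2000)                    # truncation; the tail mass beyond is negligible

nb_mean = sum(k * nb_pmf(k, mu, alpha) for k in ks)
nb_var = sum((k - nb_mean) ** 2 * nb_pmf(k, mu, alpha) for k in ks)

print(nb_mean, nb_var, mu + alpha * mu**2)   # variance should match mu + alpha*mu^2
```

With these values the variance is six times the mean, whereas a Poisson with the same mean would have variance 10.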

⌜ Bin: P(X=k) = C(n,k) pᵏ (1−p)ⁿ⁻ᵏ · Pois: P(X=k) = λᵏ e^(−λ) / k! · NB: Var = µ + αµ² ⌝ Family tree: Bin(n, p) → Pois(λ) as n → ∞ with np = λ held fixed; Pois plus extra between-individual variation → NB. That is the discrete-distribution lineage.
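The Binomial → Poisson limit is easy to watch numerically: hold np = λ fixed and let n grow. A minimal sketch for λ = 3 and k = 2 (arbitrary illustrative values), using the Poisson PMF λᵏ e^(−λ) / k!:

```python
from math import comb, exp, factorial

lam, k = 3.0, 2
pois = lam**k * exp(-lam) / factorial(k)

gaps = []
for n in (10, 100, 10_000):
    p = lam / n                                   # keep n * p = lam fixed
    binom = comb(n, k) * p**k * (1 - p) ** (n - k)
    gaps.append(abs(binom - pois))
    print(f"n = {n:>6}: Bin = {binom:.5f}, Pois = {pois:.5f}")
```

The gap shrinks steadily as n increases, which is why rare-event counts out of a very large number of trials are usually modelled as Poisson.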

Interactive simulation 2

Overdispersion lab

For the same mean µ, the Poisson has its variance fixed at µ, but the NB has variance µ + αµ², which can be far larger. Drag α and watch the NB "fatten" past the Poisson; this is exactly why DESeq2 / edgeR use NB, not Poisson, for RNA-seq. The panel reports the sample variance/mean ratio: a value above 1.5 typically means you need NB.

[Interactive plot: sliders for the mean µ (default 10) and the overdispersion α (default 0.5). Green = Poisson · orange = NB (same µ).]

Trap: using Poisson for RNA-seq counts. Early tools (e.g. DEGseq) modelled RNA-seq with a Poisson distribution, with reported false-positive rates as high as 30–50%. Robinson & Smyth 2008, Anders & Huber 2010, and Love 2014 all show that biological variation between samples far exceeds what the Poisson allows. Rule of thumb: model raw counts with NB (DESeq2 / edgeR); TPM / FPKM values are already normalised and are unsuitable for differential expression, so feed raw counts in.

Decision guide

4. Which distribution?

🌳 Distribution decision tree

Q1: Continuous or discrete count? → continuous → Q2; → count → Q5.

Q2: Continuous and symmetric? → Yes → Normal (use t for two-mean tests with n < 30; F for variance ratios). → No → Q3.

Q3: Right-skewed and strictly positive? (concentrations, titres, expression) → Yes → Log-normal; log-transform and treat as normal.

Q4: Variance estimation / categorical goodness-of-fit? → Yes → Chi-square χ².

Q5: Count with an upper bound n (X positives out of n)? → Yes → Binomial. → No → Q6.

Q6: Unbounded event counts (per unit time/region)? Check Var/Mean: → ≈ 1 → Poisson; → > 1.5 → Negative Binomial.

Q7: Excess zeros (e.g. scRNA-seq dropout)? → Yes → Zero-Inflated NB (ZINB) or hurdle model.
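The tree above can be written down as a small function. This is a hypothetical helper, not part of any library: the argument names are invented here, Q4 is left out because χ² describes a test statistic rather than raw data, and a ratio of 1.5 or below is treated as "Poisson", which is a simplification of the ≈ 1 rule.

```python
def suggest_distribution(kind, symmetric=None, bounded=None,
                         var_mean_ratio=None, excess_zeros=False):
    """Rough encoding of Q1-Q7; a starting point, not a substitute for judgment."""
    if kind == "continuous":                       # Q1 -> Q2
        if symmetric:
            return "Normal (t for small-sample mean tests, F for variance ratios)"
        # Q3: assumes the asymmetric data are right-skewed and strictly positive
        return "Log-normal (log-transform, then treat as normal)"
    if kind == "count":                            # Q1 -> Q5
        if bounded:
            return "Binomial"                      # Q5: X successes out of n
        if excess_zeros:
            return "Zero-inflated NB or hurdle model"   # Q7
        if var_mean_ratio is not None and var_mean_ratio > 1.5:
            return "Negative Binomial"             # Q6: overdispersed
        return "Poisson"                           # Q6: variance close to mean
    raise ValueError("kind must be 'continuous' or 'count'")

print(suggest_distribution("count", bounded=False, var_mean_ratio=6.0))
print(suggest_distribution("continuous", symmetric=False))
```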

Quick-reference table

5. The eight distributions side by side

Distribution	Type	Parameters	Mean / Variance	Typical use
Normal	Continuous	µ, σ	µ / σ²	Post-CLT summaries, Z-score, population assumption for t/ANOVA
t (df)	Continuous	df	0 / df/(df−2) (df > 2)	Small-sample mean differences, regression coefficient tests
F (df₁, df₂)	Continuous	df₁, df₂	df₂/(df₂−2) (mean only; df₂ > 2)	ANOVA variance ratios, mixed models
Chi² (df)	Continuous	df	df / 2df	Categorical data, variance CI, LRT
Log-normal	Continuous	µ_log, σ_log	e^(µ+σ²/2) (mean only)	Concentrations, antibody titres, CFU, scRNA expression
Binomial	Discrete	n, p	np / np(1−p)	Positive rate, editing efficiency, genotype frequency
Poisson	Discrete	λ	λ / λ	Rare event counts, mutations, photons, low-expression genes
Neg. Binomial	Discrete	µ, α	µ / µ+αµ²	RNA-seq / ATAC / ChIP (DESeq2, edgeR)

Four convergence facts to memorise: (1) Binomial(n, p) → Poisson(λ = np) for large n and small p; (2) Poisson(λ) → Normal(λ, λ) for large λ (> 30); (3) t(df) → N(0,1) for large df (> 30); (4) χ²(df) → Normal for large df. Facts (2) and (4) are the CLT in action, since a Poisson(λ) or χ²(df) variable is a sum of many independent pieces; (1) is the law of rare events, and (3) reflects s converging to σ. Step 3 unpacks the CLT.
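Convergence fact (2) can be checked directly. The sketch below compares the exact Poisson CDF with its normal approximation for λ = 40; the cut-off k = 45 and the continuity correction (k + 0.5) are choices made for this illustration.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def pois_cdf(k: int, lam: float) -> float:
    """Exact P(X <= k) for X ~ Pois(lam), with each PMF term computed on the log scale."""
    return sum(exp(i * log(lam) - lam - lgamma(i + 1)) for i in range(k + 1))

lam, k = 40.0, 45

exact = pois_cdf(k, lam)
approx = NormalDist(mu=lam, sigma=sqrt(lam)).cdf(k + 0.5)

print(exact, approx, abs(exact - approx))
```

Repeating the comparison with a small rate such as λ = 2 shows a visibly larger gap, which is why the approximation is reserved for large λ.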

Code

6. Worked examples

# R: d* = density/PMF, p* = CDF, q* = quantile, r* = random
# --- Normal ---
dnorm(1.96); pnorm(1.96)        # 0.0584 ; 0.975
qnorm(0.975); rnorm(5, 0, 1)     # 1.96 ; sample

# --- Binomial: P(X = 7) when n=10, p=0.5 ---
dbinom(7, size=10, prob=0.5)
pbinom(7, 10, 0.5)               # P(X ≤ 7)

# --- Poisson: lambda = 3 mutations / generation ---
dpois(2, lambda=3); ppois(2, 3)

# --- t (df = 10) ---
qt(0.975, df=10)                # critical value 2.228

# --- Chi-square & F critical values ---
qchisq(0.95, df=3); qf(0.95, df1=2, df2=20)

# --- Fit a negative binomial to RNA-seq counts ---
library(MASS)
counts <- rnbinom(n=200, mu=10, size=2)   # size = 1/alpha
fit <- MASS::fitdistr(counts, "negative binomial")
fit$estimate                                # mu and size

# --- Goodness-of-fit: chi-square ---
obs <- c(120, 230, 150); expected <- c(100, 250, 150)   # "expected", so base::exp() is not masked
chisq.test(obs, p=expected/sum(expected))

# --- DESeq2 idiom: NB GLM behind the curtain ---
# dds <- DESeqDataSetFromMatrix(counts, coldata, ~ condition)
# dds <- DESeq(dds)   # fits NB with shrinkage (Love 2014)

# Python (scipy.stats): .pdf/.pmf = density/PMF, .cdf = CDF, .ppf = quantile, .rvs = random
import numpy as np
from scipy import stats

# --- Normal ---
stats.norm.pdf(1.96); stats.norm.cdf(1.96)   # 0.0584 ; 0.975
stats.norm.ppf(0.975); stats.norm.rvs(size=5)

# --- Binomial ---
stats.binom.pmf(7, n=10, p=0.5)
stats.binom.cdf(7, 10, 0.5)

# --- Poisson ---
stats.poisson.pmf(2, mu=3); stats.poisson.cdf(2, 3)

# --- t / Chi² / F critical values ---
stats.t.ppf(0.975, df=10)                     # 2.228
stats.chi2.ppf(0.95, df=3)
stats.f.ppf(0.95, dfn=2, dfd=20)

# --- Fit NB to counts (statsmodels) ---
import statsmodels.api as sm
counts = stats.nbinom.rvs(n=2, p=2/(2+10), size=200)
res = sm.NegativeBinomial(counts, np.ones(200)).fit(disp=0)
res.params  # intercept (log mu) + alpha

# --- Goodness-of-fit chi-square ---
stats.chisquare([120,230,150], f_exp=[100,250,150])

# --- pyDESeq2 idiom (NB GLM for RNA-seq DE) ---
# from pydeseq2.dds import DeseqDataSet
# dds = DeseqDataSet(counts=counts_df, metadata=meta, design_factors="condition")
# dds.deseq2()   # NB with empirical Bayes shrinkage

💡

Quick recipe: for any count data, compute the variance-to-mean ratio first: var(x)/mean(x) in R, x.var(ddof=1)/x.mean() in Python (NumPy's default ddof=0 gives the population variance, not the sample variance that R's var() returns). Above 1.5 → go NB; near 1 → Poisson; many zeros → also consider ZINB (pscl::zeroinfl()).
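A dependency-free version of this recipe might look like the sketch below. `dispersion_ratio` is a hypothetical helper and both count vectors are invented; `statistics.variance` uses the n − 1 denominator, so it matches R's var().

```python
from statistics import mean, variance

def dispersion_ratio(counts) -> float:
    """Sample variance (n - 1 denominator) divided by the sample mean."""
    return variance(counts) / mean(counts)

# Hypothetical count vectors for two genes across 8 samples
steady_gene = [1, 3, 5, 2, 4, 3, 1, 5]
bursty_gene = [0, 1, 0, 25, 3, 0, 40, 11]

for name, counts in (("steady", steady_gene), ("bursty", bursty_gene)):
    ratio = dispersion_ratio(counts)
    verdict = "NB" if ratio > 1.5 else "Poisson"
    print(f"{name}: Var/Mean = {ratio:.2f} -> {verdict}")
```

With only a handful of samples the ratio is itself noisy, so treat the 1.5 cut-off as a screening heuristic rather than a formal test.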

Common pitfalls

7. Six distribution mistakes

❌ t-test on count data

Treating cell-division counts, CFU, or read counts as normal in a t-test breaks down when the mean is small (λ < 5): the normal approximation fails and the Type I error rate can inflate noticeably. Use a Poisson or NB GLM, or apply a log/√ transform before the t-test.

❌ Poisson for RNA-seq

Cross-sample biological variation gives Var ≫ Mean, so Poisson p-values become far too small. A gene with Var/Mean = 5 analysed as Poisson can see its false-positive rate jump from 5% to roughly 40%. Use the NB GLM in DESeq2 / edgeR.
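The jump from 5% to about 40% can be reproduced with a back-of-the-envelope calculation. Assume a large-sample z-type test whose standard error is computed from the Poisson variance: if the true variance is ratio × mean, the standard error is too small by a factor of √ratio, so the nominal ±1.96 cut-off is really ±1.96/√ratio. `poisson_fpr` is a hypothetical helper built on that assumption.

```python
from math import sqrt
from statistics import NormalDist

def poisson_fpr(var_mean_ratio: float, alpha: float = 0.05) -> float:
    """Actual false-positive rate of a nominal-alpha z-test that assumes Var = Mean."""
    z = NormalDist()
    nominal_crit = z.inv_cdf(1 - alpha / 2)                # 1.96 for alpha = 0.05
    effective_crit = nominal_crit / sqrt(var_mean_ratio)   # threshold on the true scale
    return 2 * (1 - z.cdf(effective_crit))

for ratio in (1, 2, 5, 10):
    print(f"Var/Mean = {ratio:>2}: false-positive rate ≈ {poisson_fpr(ratio):.3f}")
```

For Var/Mean = 5 this gives a rate close to 0.38, in line with the figure quoted above.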

❌ Binomial vs Poisson

"X positives in 100 people" is Binomial (bounded by n); "X mutations in a 1 kb region" is Poisson (unbounded). Confusing the two mixes up the denominator (sample size) with the exposure (time/length), and every CI built on it is wrong.

❌ Forcing proportions into a normal

Treating the positive rate p̂ as N(p̂, p̂(1−p̂)/n) to get a 95% CI fails when p̂ is near 0 or 1: the CI can leak outside [0, 1]. Use Wilson or Clopper-Pearson intervals instead (binom::binom.confint() in R, statsmodels.stats.proportion.proportion_confint() in Python).
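The leak is easy to demonstrate with 1 positive out of 20 (invented numbers). The sketch below implements the Wald and Wilson score intervals from their textbook formulas; `wald_ci` and `wilson_ci` are helper names introduced here, and in practice the library functions named above are the safer choice.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)

def wald_ci(x: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Normal-approximation (Wald) interval: p_hat ± z * sqrt(p_hat(1-p_hat)/n)."""
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(x: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Wilson score interval, which always stays inside [0, 1]."""
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

print("Wald:  ", wald_ci(1, 20))     # lower limit falls below 0
print("Wilson:", wilson_ci(1, 20))   # stays within [0, 1]
```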

❌ Log-normal data without the log

Cytokine levels and antibody titres are strongly right-skewed on the raw scale. Running a t-test straight away lets the tail dominate, and power drops sharply. Log first, then t; or use the Mann-Whitney U test, which does not assume normality.

❌ Reporting mean ± SD for discrete data

Poisson SD = √λ and Binomial SD = √(np(1−p)), but for small counts "mean ± SD" often extends below zero, which is meaningless for a count. Report a 95% CI instead (Wilson for Binomial, exact Poisson CI).

🚨

One-liner: there is no universal distribution, only the one that matches the generative mechanism. When data lands on your desk, ask three things: (1) continuous or count? (2) bounded or unbounded? (3) what is the variance-to-mean ratio? With those three answers the distribution is almost settled.

📝 Self-check

1. In an RNA-seq DE analysis, a gene's sample variance / mean ≈ 6. What is most appropriate?

A. Use a Poisson GLM; it is count data, after all

B. Treat the counts as normal and run a t-test

C. Use a Negative Binomial GLM (DESeq2 / edgeR)

D. Compute TPM and look at the correlation coefficient

2. For a continuous variable X, which statement is correct?

A. The PDF value equals P(X = x)

B. P(X = x) = 0; probabilities come from integrating the PDF over an interval

C. The PDF is always ≤ 1

D. The CDF can exceed 1

3. "X positives out of 200 people screened" vs "X SNVs in a 1 kb chromosomal region": which distributions?

A. Both Poisson

B. Both Binomial

C. First Poisson, second Binomial

D. First Binomial (bounded by n=200), second Poisson (unbounded event count)

4. You compare tumour volumes between two groups with n=8 mice. Which sampling distribution is most appropriate for the p-value?

A. Standard normal N(0,1)

B. Chi-square

C. Student's t (small sample; df controls the tail thickness)

D. Binomial