Why start with probability distributions?
Every statistical inference, whether a t-test, a linear regression, or limma's moderated t, rests on the assumption that the data come from some probability distribution. Choosing the wrong distribution is one of the most common foundational mistakes in RNA-seq analysis: treat NB counts as Gaussian and the t-test p-values can be badly distorted; fit a Poisson to overdispersed counts and a flood of false positives follows.
This chapter establishes four core concepts: (1) random variables (RVs) and probability spaces, (2) common discrete and continuous distributions, (3) expectation, variance, and moments, and (4) the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). The CLT explains why the sampling distribution of the mean becomes approximately normal even when the raw data are not.
1. Random variables and distribution families
A random variable is a function that maps the outcomes of a random experiment to real numbers, e.g. "the number of reads detected for a given gene" or "the difference between two sample means". Each RV has its own distribution, which describes the probability of the values it can take.
Common discrete distributions
- Bernoulli(p): a single success/failure trial, e.g. whether an allele is mutated.
- Binomial(n, p): n independent Bernoulli trials, e.g. the number of reads supporting a SNP.
- Poisson(λ): counts of rare events; an idealized model for RNA-seq counts that usually underestimates the variance of real data.
- Negative Binomial (NB): Var = μ + αμ², which exceeds the Poisson Var = μ whenever α > 0; the core of DESeq2 / edgeR.
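The variance formulas in the list above can be checked numerically. The following is a minimal sketch using only the Python standard library; it assumes the (μ, α) parameterization with size = 1/α and p = size/(size + μ), and the values μ = 5, α = 0.5 are purely illustrative. The helper names are hypothetical.

```python
import math

def poisson_pmf(k: int, mu: float) -> float:
    """P(X = k) for Poisson(mu), computed on the log scale for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def nb_pmf(k: int, mu: float, alpha: float) -> float:
    """P(X = k) for an NB with mean mu and dispersion alpha (size = 1/alpha)."""
    size = 1.0 / alpha
    p = size / (size + mu)  # success probability in the (size, p) form
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

def mean_and_variance(pmf, k_max: int = 2000):
    """Mean and variance by direct summation over k = 0..k_max (truncated sum)."""
    probs = [pmf(k) for k in range(k_max + 1)]
    mean = sum(k * pk for k, pk in enumerate(probs))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
    return mean, var

mu, alpha = 5.0, 0.5  # illustrative values, not estimated from real data
pois_mean, pois_var = mean_and_variance(lambda k: poisson_pmf(k, mu))
nb_mean, nb_var = mean_and_variance(lambda k: nb_pmf(k, mu, alpha))

print(f"Poisson: mean={pois_mean:.3f}, var={pois_var:.3f}")
print(f"NB:      mean={nb_mean:.3f}, var={nb_var:.3f}, "
      f"mu + alpha*mu^2 = {mu + alpha * mu**2:.3f}")
```

Both distributions share the same mean, but the NB variance picks up the extra αμ² term.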
Common continuous distributions
- Normal(μ, σ²): the limiting distribution in the CLT; the asymptotic basis of most large-sample tests.
- Student's t(ν): replaces the Normal at small n; limma's moderated t is an extension of this distribution.
- χ²(ν), F(ν₁, ν₂): the working distributions for variance ratios, chi-square tests, and ANOVA.
- Beta(α, β), Gamma(α, β): Beta is the conjugate prior for the Binomial probability; Gamma is the conjugate prior for the Poisson rate.
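Conjugacy means the posterior stays in the same family as the prior, so updating reduces to simple arithmetic on the parameters. The sketch below illustrates this for the two conjugate pairs in the list; it assumes Gamma is written in shape/rate form, and the function names, priors, and data are hypothetical.

```python
def beta_posterior(a: float, b: float, successes: int, trials: int):
    """Beta(a, b) prior + Binomial data -> Beta(a + k, b + n - k) posterior."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return a + successes, b + (trials - successes)

def gamma_posterior(shape: float, rate: float, counts):
    """Gamma(shape, rate) prior + Poisson counts -> Gamma(shape + sum, rate + n)."""
    return shape + sum(counts), rate + len(counts)

# Hypothetical example: 7 of 20 reads support the variant, uniform Beta(1, 1) prior.
a_post, b_post = beta_posterior(1.0, 1.0, successes=7, trials=20)
beta_mean = a_post / (a_post + b_post)          # posterior mean of p

# Hypothetical example: four Poisson counts, Gamma(2, 1) prior on the rate.
shape_post, rate_post = gamma_posterior(2.0, 1.0, counts=[3, 5, 4, 6])
gamma_mean = shape_post / rate_post             # posterior mean of the rate

print(f"Beta posterior:  ({a_post}, {b_post}), mean = {beta_mean:.4f}")
print(f"Gamma posterior: ({shape_post}, {rate_post}), mean = {gamma_mean:.4f}")
```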
Poisson vs NB overdispersion simulator
The Var/Mean ratio of RNA-seq counts is usually well above 1. In the simulator, as the dispersion α grows, the right tail of the NB stretches out while the Poisson stays a narrow peak. This is why fitting a Poisson to RNA-seq counts badly underestimates standard errors and produces many false positives.
Figure axes: X = count k; Y = P(X = k); solid line: NB; dotted line: Poisson.
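The tail behaviour described above can also be reproduced without the interactive plot. This is a rough sketch, assuming the same (μ, α) parameterization with size = 1/α; the mean of 5, the threshold of 15, and the α grid are arbitrary illustrative choices.

```python
import math

def poisson_pmf(k: int, mu: float) -> float:
    """P(X = k) for Poisson(mu)."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def nb_pmf(k: int, mu: float, alpha: float) -> float:
    """P(X = k) for an NB with mean mu and dispersion alpha (size = 1/alpha)."""
    size = 1.0 / alpha
    p = size / (size + mu)
    return math.exp(math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
                    + size * math.log(p) + k * math.log1p(-p))

def upper_tail(pmf, threshold: int) -> float:
    """P(X >= threshold), computed as 1 minus the PMF summed below the threshold."""
    return 1.0 - sum(pmf(k) for k in range(threshold))

mu, threshold = 5.0, 15  # illustrative mean and cut-off
tail_poisson = upper_tail(lambda k: poisson_pmf(k, mu), threshold)
tail_nb = {
    alpha: upper_tail(lambda k, a=alpha: nb_pmf(k, mu, a), threshold)
    for alpha in (0.1, 0.5, 1.0)
}

print(f"Poisson        P(X >= {threshold}) = {tail_poisson:.5f}")
for alpha, tail in tail_nb.items():
    print(f"NB alpha={alpha:<4} P(X >= {threshold}) = {tail:.5f}")
```

With the mean held fixed, the probability of a large count grows with α, which is the stretched right tail that a Poisson model would treat as far rarer than it is.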
2. Expectation, variance, and the Central Limit Theorem
The expectation E[X] and the variance Var[X] = E[(X−μ)²] are the two most widely used summaries of a distribution. For the NB, Var = μ + αμ² > μ whenever α > 0; variance exceeding the mean in this way is what overdispersion means relative to the Poisson.
Law of Large Numbers (LLN): the sample mean X̄ converges in probability to the true mean μ as n → ∞. Central Limit Theorem (CLT): for independent, identically distributed X₁,...,Xₙ with finite variance, (X̄ − μ)·√n / σ → N(0, 1) in distribution. Note that the CLT does not require X itself to be normal. That is why, even though RNA-seq counts are not normal, their sample mean can still be approximated by a normal distribution once n is large enough.
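A small simulation makes the CLT statement concrete. The sketch below is only an illustration: it draws from a geometric distribution (an NB with α = 1, chosen here as a convenient skewed count distribution with mean 5), standardizes the sample mean as in the formula above, and reports how often it falls beyond ±1.96. All function names and settings are hypothetical.

```python
import math
import random

def sample_geometric(rng: random.Random, p: float) -> int:
    """Failures before the first success; an NB with size = 1 (alpha = 1)."""
    u = 1.0 - rng.random()                      # uniform on (0, 1]
    return int(math.log(u) / math.log(1.0 - p))

def tail_rates(n: int, p: float = 1 / 6, reps: int = 4000, seed: int = 0):
    """Fraction of standardized sample means below -1.96 and above +1.96."""
    rng = random.Random(seed)
    mu = (1 - p) / p                            # mean of the geometric
    sigma = math.sqrt(1 - p) / p                # its standard deviation
    lower = upper = 0
    for _ in range(reps):
        xbar = sum(sample_geometric(rng, p) for _ in range(n)) / n
        z = (xbar - mu) * math.sqrt(n) / sigma
        lower += z < -1.96
        upper += z > 1.96
    return lower / reps, upper / reps

results = {n: tail_rates(n) for n in (2, 10, 100)}
for n, (lo, hi) in results.items():
    print(f"n={n:>3}: P(z < -1.96) ~ {lo:.3f}, P(z > 1.96) ~ {hi:.3f} "
          "(normal: 0.025 each)")
```

At n = 2 the lower tail is empty because a mean of non-negative counts cannot fall that far below μ; as n grows, both tail rates move toward the normal value of 0.025.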
3. A decision tree for choosing a distribution
🌳 Distribution selection flow
χ²/df: ≈ 1 → Poisson suffices; clearly > 1 (typical of RNA-seq) → NB.

| Scenario | Distribution | Diagnostic |
|---|---|---|
| Bulk RNA-seq counts | NB | χ²/df > 1 |
| scRNA-seq counts (high zero fraction) | ZINB / NB | % zeros vs NB-expected |
| Protein quantification (log intensity) | Normal | QQ plot |
| Cell response proportions | Beta / Binomial | support [0, 1] |
| Patient survival times | Weibull / Exponential | is the hazard constant? |
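The χ²/df diagnostic used in the flow and the table can be sketched for the simplest case, an intercept-only Poisson model, where the Pearson statistic divided by its degrees of freedom reduces to the sample variance over the sample mean. The cut-off of 1.5 and both example vectors below are hypothetical; in practice "clearly > 1" is a judgment call.

```python
import statistics

def pearson_dispersion(counts) -> float:
    """Pearson chi-square / df for an intercept-only Poisson fit."""
    n = len(counts)
    mean = statistics.fmean(counts)
    chi2 = sum((y - mean) ** 2 / mean for y in counts)
    return chi2 / (n - 1)

def suggest_count_model(counts, threshold: float = 1.5):
    """Return ('NB' or 'Poisson', ratio); the threshold is an assumed cut-off."""
    ratio = pearson_dispersion(counts)
    return ("NB" if ratio > threshold else "Poisson"), ratio

# Hypothetical replicate counts for two genes.
noisy_gene = [12, 30, 7, 45, 19, 3, 28, 60]
quiet_gene = [4, 6, 5, 3, 7, 5, 4, 6]

noisy_choice, noisy_ratio = suggest_count_model(noisy_gene)
quiet_choice, quiet_ratio = suggest_count_model(quiet_gene)

print(f"noisy gene: chi2/df = {noisy_ratio:.2f} -> {noisy_choice}")
print(f"quiet gene: chi2/df = {quiet_ratio:.2f} -> {quiet_choice}")
```

With covariates in the model, the same ratio would be computed from the fitted means of the GLM rather than from a single overall mean.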
Hands-on: distribution fitting and QQ plots
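```r
# --- R --- From PMF / PDF to distribution fitting
library(MASS)

# Simulated stand-in data (an assumption; replace with your own count vector)
set.seed(1)
counts <- rnbinom(200, mu = 5, size = 2)

# 1) Built-in distribution families: d/p/q/r prefixes
dnorm(0); pnorm(1.96); qnorm(0.975); rnorm(10, mean = 0, sd = 1)
dnbinom(0:20, mu = 5, size = 2)   # DESeq2-style: mu, size = 1/alpha

# 2) Fit an NB to the data by maximum likelihood
fit <- fitdistr(counts, "negative binomial")
fit$estimate                      # size and mu

# 3) QQ plot and Kolmogorov-Smirnov test
resid_log <- log1p(counts) - mean(log1p(counts))  # illustrative residuals
qqnorm(resid_log); qqline(resid_log)
# KS assumes a continuous distribution; with counts (ties) the p-value is approximate
ks.test(counts, "pnbinom", mu = fit$estimate["mu"], size = fit$estimate["size"])

# 4) Overdispersion check: Var/Mean ratio
var(counts) / mean(counts)        # > 1 -> overdispersed
```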
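```python
# --- Python ---
import numpy as np
import scipy.stats as st
import statsmodels.api as sm

# Simulated stand-in data (an assumption; replace with your own counts)
rng = np.random.default_rng(1)
counts = rng.negative_binomial(n=2, p=2 / 7, size=200)   # mu = 5, size = 2
y, X = counts, np.ones((counts.size, 1))                 # intercept-only design

# 1) scipy.stats distributions
st.norm.pdf(0); st.norm.ppf(0.975); st.norm.rvs(size=10)
st.nbinom.pmf(np.arange(21), n=2, p=2 / (2 + 5))   # n = size, p = size / (size + mu)

# 2) NB GLM via statsmodels (alpha is held fixed here, not estimated)
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.3)).fit()
print(nb.summary())

# 3) QQ plot of the deviance residuals
residuals = nb.resid_deviance
sm.qqplot(residuals, line="45")

# 4) Overdispersion diagnostics
np.var(counts, ddof=1) / np.mean(counts)   # > 1 -> overdispersed
# KS on discrete data is only approximate
st.kstest(counts, lambda x: st.nbinom.cdf(x, n=2, p=2 / 7))
```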
Common mistake: running a t-test on log(x+1)-transformed RNA-seq counts amounts to assuming that the transformed values are normal. For low-count genes this approximation is very poor, which is why DESeq2 and edgeR use an NB GLM rather than a t-test.
📝 Self-check
1. Why do RNA-seq counts typically violate the Poisson assumption?
2. Which statement about the CLT is correct?
3. Which distribution is the conjugate prior for the Binomial probability parameter p?