STEP 1 / 17

Probability and Distributions

Random variables and distributions: the shared mathematical foundation for RNA-seq counts, t/F tests, and Bayesian inference.

Why start with probability distributions?

Every statistical inference, whether a t-test, a linear regression, or limma's moderated t, rests on the assumption that the data come from some probability distribution. Choosing the wrong distribution is one of the most common foundational mistakes in RNA-seq analysis: treat NB counts as Gaussian and the t-test p-values can be badly distorted; fit a Poisson to overdispersed counts and a flood of false positives follows.

This chapter establishes four pillars: (1) random variables (RVs) and probability spaces, (2) common discrete and continuous distributions, (3) expectation, variance, and moments, and (4) the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). The CLT explains why the sampling distribution of the sample mean becomes approximately normal even when the raw data are not.

💡
Core principle: "Is the data normal?" and "Does this test require normality?" are different questions. Most tests (e.g. the t-test) need the sampling distribution of the statistic to be approximately normal, which the CLT delivers for large n provided its conditions (independence, finite variance) hold. What you cannot fudge is the choice of distribution family (NB vs Poisson vs Gaussian).

1. Random variables and distribution families

A random variable is a function that maps the outcomes of a random experiment to real numbers, e.g. "the number of reads detected for a given gene" or "the difference between two sample means". Each RV has a distribution that describes how probability is spread over its values.

🔢

Common discrete distributions

  • Bernoulli(p): a single success/failure trial, e.g. whether an allele is mutated.
  • Binomial(n, p): the number of successes in n independent Bernoulli trials, e.g. supporting reads at a SNP.
  • Poisson(λ): rare-event counts with Var = λ; an idealized model for RNA-seq counts that usually underestimates the real variance.
  • Negative Binomial (NB): Var = μ + αμ², which exceeds the Poisson variance μ whenever α > 0; the core of DESeq2 / edgeR.
〰️

Common continuous distributions

  • Normal(μ, σ²): the asymptotic destination via the CLT; underlies most large-sample tests.
  • Student's t(ν): replaces the Normal when the variance is estimated from a small sample; limma's moderated t builds on it.
  • χ²(ν), F(ν₁, ν₂): the working distributions for variance ratios, chi-square tests, and ANOVA.
  • Beta(α, β), Gamma(α, β): Beta is the conjugate prior for the Binomial probability; Gamma is the conjugate prior for the Poisson rate.
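Conjugacy means the posterior stays in the same family as the prior, so a Bayesian update reduces to arithmetic on the parameters. The sketch below illustrates this for both pairs; the priors and the observed numbers are hypothetical values chosen for illustration, not taken from any dataset.

```python
from scipy import stats

# Beta-Binomial: prior Beta(a, b), observe k successes in n trials
# -> posterior Beta(a + k, b + n - k)
a_prior, b_prior = 1.0, 1.0          # uniform prior (illustrative choice)
n_reads, k_alt = 40, 12              # hypothetical: 12 of 40 reads support the variant
beta_post = stats.beta(a_prior + k_alt, b_prior + n_reads - k_alt)
beta_mean = beta_post.mean()
beta_ci = beta_post.ppf([0.025, 0.975])   # 95% equal-tailed credible interval

# Gamma-Poisson: prior Gamma(shape, rate), observe counts y_1..y_m
# -> posterior Gamma(shape + sum(y), rate + m)
shape_prior, rate_prior = 2.0, 1.0   # illustrative prior
counts = [3, 5, 4, 6]                # hypothetical Poisson counts
shape_post = shape_prior + sum(counts)
rate_post = rate_prior + len(counts)
gamma_post = stats.gamma(shape_post, scale=1.0 / rate_post)  # scipy uses scale = 1/rate
gamma_mean = gamma_post.mean()

print(f"Beta posterior mean for p: {beta_mean:.4f}, 95% interval: {beta_ci}")
print(f"Gamma posterior mean for the rate: {gamma_mean:.4f}")
```

Note that scipy parameterizes the Gamma by scale rather than rate, which is the usual source of off-by-inverse mistakes in this update.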

Poisson vs NB overdispersion simulator

RNA-seq counts almost always show Var/Mean well above 1. In the interactive plot, as the dispersion α grows the NB right tail stretches while the Poisson with the same mean stays narrow. This is why fitting a Poisson to RNA-seq counts badly underestimates standard errors and produces many false positives.

X: count k | Y: P(X=k) | solid line: NB; dotted line: Poisson
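As a static stand-in for the comparison described above, the following sketch evaluates a Poisson and an NB with the same mean and compares their variances and right-tail probabilities. The values of `mu`, `alpha`, and the tail cutoff `k` are illustrative assumptions.

```python
from scipy import stats

mu = 5.0       # shared mean (illustrative value)
alpha = 0.5    # NB dispersion (illustrative value)

# scipy parameterizes the NB by (n, p): n = 1/alpha, p = n / (n + mu)
size = 1.0 / alpha
p = size / (size + mu)

poisson = stats.poisson(mu)
nb = stats.nbinom(size, p)

# Variances: Poisson gives mu, NB gives mu + alpha * mu**2
pois_var = poisson.var()
nb_var = nb.var()

# Right-tail probability P(X >= k) under each model
k = 15
pois_tail = poisson.sf(k - 1)
nb_tail = nb.sf(k - 1)

print(f"Poisson: var={pois_var:.2f}, P(X>={k})={pois_tail:.2e}")
print(f"NB:      var={nb_var:.2f}, P(X>={k})={nb_tail:.2e}")
```

A count that the Poisson treats as extremely unlikely can be unremarkable under the NB, which is the mechanism behind the false positives.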

2. Expectation, variance, and the Central Limit Theorem

The expectation E[X] and the variance Var[X] = E[(X−μ)²], where μ = E[X], are the two most-used summaries of a distribution. For the NB, Var = μ + αμ² > μ whenever α > 0; variance exceeding the mean is what overdispersion means relative to the Poisson.

LLN: the sample mean X̄ converges to the true mean μ in probability as n → ∞. CLT: for iid X₁,...,Xₙ with mean μ and finite variance σ², (X̄ − μ)·√n / σ → N(0, 1) in distribution. The CLT does NOT require X itself to be normal. That is why the sample mean of non-normal RNA-seq counts still admits a Gaussian approximation once n is large.
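The CLT statement can be checked by simulation. The sketch below draws NB samples, standardizes their means as (X̄ − μ)·√n / σ, and reports how the skewness of that statistic shrinks as n grows. The NB parameters, the sample sizes, and the number of replicates are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

mu, alpha = 5.0, 0.3            # illustrative NB mean and dispersion
size = 1.0 / alpha
p = size / (size + mu)
sigma = np.sqrt(mu + alpha * mu**2)   # NB standard deviation

def standardized_means(n, reps=20000):
    """Draw `reps` samples of size n and return sqrt(n) * (mean - mu) / sigma."""
    draws = rng.negative_binomial(size, p, size=(reps, n))
    return np.sqrt(n) * (draws.mean(axis=1) - mu) / sigma

skew_by_n = {}
for n in (2, 10, 50):
    z = standardized_means(n)
    skew_by_n[n] = stats.skew(z)
    print(f"n={n:3d}  mean={z.mean():+.3f}  sd={z.std(ddof=1):.3f}  skew={skew_by_n[n]:+.3f}")
```

The mean and standard deviation of the standardized statistic sit near 0 and 1 for every n; what improves with n is the shape, visible here as skewness decaying toward 0.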

⚠️
CLT's hidden conditions: ① Independence: inter-cell correlation, batch effects, and paired samples all break it. ② Finite variance: the Cauchy distribution has no finite variance, so its sample mean never converges to a normal; strongly skewed data such as low-count single-cell measurements with dropout do have finite variance but converge very slowly. ③ "How large is large enough?" depends on skewness: as a rough guide, for NB with α ≈ 0.3, n ≈ 30 usually suffices; for extreme skew even n > 100 may not.
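Condition ② can be made concrete by comparing how the spread of the sample mean behaves for a Normal and for a Cauchy. In this sketch the interquartile range (IQR) is used as the spread measure because the Cauchy has no variance; the sample sizes and replicate count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def iqr_of_sample_means(sampler, n, reps=2000):
    """Interquartile range of `reps` sample means, each computed from n draws."""
    means = sampler((reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

results = {}
for n in (1, 100, 1000):
    results[("normal", n)] = iqr_of_sample_means(rng.standard_normal, n)
    results[("cauchy", n)] = iqr_of_sample_means(rng.standard_cauchy, n)
    print(f"n={n:5d}  normal IQR={results[('normal', n)]:.3f}  "
          f"cauchy IQR={results[('cauchy', n)]:.3f}")
```

For the Normal, the IQR of the mean shrinks roughly as 1/√n. For the Cauchy it does not shrink at all, because the mean of n standard Cauchy draws is itself standard Cauchy.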

3. A decision tree for choosing a distribution

🌳 Distribution selection flow

Q1: Are the data counts (non-negative integers)? → Yes → Q2; otherwise → Q4.
Q2: Fit a Poisson model and compute the Pearson dispersion χ²/df. ≈ 1 → Poisson is adequate; clearly > 1 (typical of RNA-seq) → NB.
Q3: Is the proportion of zeros unusually high (single-cell, ecology)? → Yes → zero-inflated NB (ZINB) or a hurdle model.
Q4: Continuous, roughly symmetric, no extreme outliers? → Yes → Normal; right-skewed (gene expression, drug concentrations) → log-Normal or Gamma.
Q5: Is the variable a proportion / probability (0–1)? → Yes → Beta; an event rate? → Gamma or Exponential.
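The Pearson dispersion in Q2 can be sketched for the simplest case, an intercept-only Poisson model, where the fitted mean is just the sample mean. The helper name `pearson_dispersion` and the simulated data are assumptions for illustration; with covariates the fitted means would come from a GLM instead.

```python
import numpy as np

def pearson_dispersion(counts):
    """Pearson chi-square / df for an intercept-only Poisson model."""
    y = np.asarray(counts, dtype=float)
    mu_hat = y.mean()                          # Poisson MLE of the mean
    chi2 = np.sum((y - mu_hat) ** 2 / mu_hat)  # Pearson chi-square statistic
    df = y.size - 1                            # one parameter estimated
    return chi2 / df

# Simulated counts with the same mean: Poisson vs NB (illustrative parameters)
rng = np.random.default_rng(42)
mu, alpha = 20.0, 0.3
size = 1.0 / alpha
pois_counts = rng.poisson(mu, size=500)
nb_counts = rng.negative_binomial(size, size / (size + mu), size=500)

disp_pois = pearson_dispersion(pois_counts)
disp_nb = pearson_dispersion(nb_counts)
print(f"Poisson data: chi2/df = {disp_pois:.2f}")
print(f"NB data:      chi2/df = {disp_nb:.2f}")
```

In this intercept-only case the statistic equals the sample variance divided by the sample mean, so it ties directly to the Var/Mean ratio.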
Scenario | Distribution | Diagnostic
Bulk RNA-seq counts | NB | χ²/df > 1
scRNA-seq counts (high zero fraction) | ZINB / NB | % zeros vs NB-expected
Protein intensities (log) | Normal | QQ plot
Cell response proportions | Beta / Binomial | support [0, 1]
Patient survival times | Weibull / Exponential | constant hazard?
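The "% zeros vs NB-expected" diagnostic can be sketched as follows. For simplicity this version fits the NB by the method of moments rather than maximum likelihood, which is an assumption of the sketch; the helper name `zero_check`, the simulated data, and the 30% dropout rate are likewise hypothetical.

```python
import numpy as np
from scipy import stats

def zero_check(counts):
    """Return (observed zero fraction, zero fraction expected under a moment-matched NB)."""
    y = np.asarray(counts, dtype=float)
    m, v = y.mean(), y.var(ddof=1)
    alpha = max((v - m) / m**2, 1e-8)   # method-of-moments dispersion, floored near Poisson
    size = 1.0 / alpha
    expected = stats.nbinom.pmf(0, size, size / (size + m))
    return np.mean(y == 0), expected

rng = np.random.default_rng(7)
mu, alpha, n = 10.0, 0.2, 2000          # illustrative NB parameters
size = 1.0 / alpha
nb_counts = rng.negative_binomial(size, size / (size + mu), size=n)

# Zero-inflate: each observation is forced to 0 with probability 0.3
dropout = rng.random(n) < 0.3
zi_counts = np.where(dropout, 0, nb_counts)

obs_nb, exp_nb = zero_check(nb_counts)
obs_zi, exp_zi = zero_check(zi_counts)
print(f"plain NB:      observed zeros={obs_nb:.3f}, NB-expected={exp_nb:.3f}")
print(f"zero-inflated: observed zeros={obs_zi:.3f}, NB-expected={exp_zi:.3f}")
```

An observed zero fraction far above what the fitted NB predicts points toward a ZINB or hurdle model; close agreement suggests a plain NB is enough.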

Hands-on: distribution fitting and QQ plots

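# --- R --- From PMF / PDF to distribution fitting
# Assumes `counts` is a vector of non-negative integer counts and `res` is a
# numeric vector of model residuals; both must be defined beforehand.

# 1) Built-in distribution families use the d/p/q/r prefixes
dnorm(0); pnorm(1.96); qnorm(0.975); rnorm(10, mean=0, sd=1)
dnbinom(0:20, mu=5, size=2)  # DESeq2-style parameters: mu, size = 1/alpha

# 2) Fit an NB to observed data by maximum likelihood
library(MASS)
fit <- fitdistr(counts, "negative binomial")
fit$estimate                            # named vector: size and mu

# 3) QQ plot and Kolmogorov-Smirnov test
qqnorm(res); qqline(res)                # `residuals` alone is a function name in R
# Caution: the KS test assumes a continuous distribution with known parameters,
# so on counts with fitted parameters treat its p-value as a rough guide only.
ks.test(counts, "pnbinom", mu=fit$estimate["mu"], size=fit$estimate["size"])

# 4) Quantify overdispersion with the Var/Mean ratio
var(counts) / mean(counts)         # > 1 → overdispersed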
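# --- Python ---
# Assumes `counts`, `y`, `X`, and `residuals` are already-defined arrays.
import numpy as np, scipy.stats as st
import statsmodels.api as sm

# 1) scipy.stats distributions
st.norm.pdf(0); st.norm.ppf(0.975); st.norm.rvs(size=10)
st.nbinom.pmf(np.arange(21), n=2, p=2/(2+5))  # n=size, p=size/(size+mu)

# 2) NB GLM via statsmodels; alpha is held fixed here, not estimated
#    (sm.NegativeBinomial(y, X).fit() estimates alpha by maximum likelihood)
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.3)).fit()
print(nb.summary())

# 3) QQ plot; line="s" scales the reference line to the sample,
#    whereas line="45" is only meaningful for standardized residuals
sm.qqplot(residuals, line="s")

# 4) Overdispersion diagnostics
np.var(counts, ddof=1) / np.mean(counts)  # > 1 → overdispersed (ddof=1 matches R's var)
# Same caveat as in R: a KS test on discrete counts is only a rough guide
st.kstest(counts, lambda x: st.nbinom.cdf(x, n=2, p=2/7))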
🚫
Common mistake: running a t-test directly on log(x+1)-transformed RNA-seq counts amounts to assuming the transformed values are normal. For low-count genes this approximation is very poor, which is why DESeq2 / edgeR fit an NB GLM instead of running a t-test.

📝 Self-check

1. Why do RNA-seq counts typically violate the Poisson assumption?

A. Counts are not integers
B. Too many samples
C. Biological variation makes Var > Mean (overdispersion)
D. The Poisson has infinite expectation

2. Which statement about the CLT is correct?

A. The CLT requires X itself to be normal
B. Under iid sampling with finite variance, the sampling distribution of the sample mean is asymptotically normal
C. The CLT still holds for correlated samples
D. The CLT holds for the Cauchy distribution

3. Which distribution is the conjugate prior for the Binomial probability parameter p?

A. Normal
B. Gamma
C. Beta
D. Poisson