Step 1: Probability & Distributions — Statistical Inference Tutorial

Overview

Why start with probability distributions?

Every statistical inference procedure, whether a t-test, linear regression, or limma's moderated t, rests on the assumption that the data come from some probability distribution. Choosing the wrong distribution is among the most common foundational mistakes in RNA-seq analysis: treat negative binomial (NB) counts as Gaussian and the t-test p-values become badly distorted; fit a Poisson model to overdispersed counts and a flood of false positives follows.

This chapter establishes four core concepts: (1) random variables (RVs) and probability spaces, (2) common discrete and continuous distributions, (3) expectation, variance, and moments, and (4) the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). The CLT explains why the sampling distribution of the mean becomes approximately normal even when the raw data are not.

💡

Core principle: "Is the data normal?" and "Does this test require normality?" are different questions. Most tests (e.g. the t-test) need the sampling distribution of the test statistic to be approximately normal, which the CLT provides once n is large enough. What you cannot fudge is the choice of distribution family (NB vs Poisson vs Gaussian).

Core concepts

1. Random variables and distribution families

A random variable is a function that maps the outcomes of a random experiment to real numbers, for example "the number of reads detected for a given gene" or "the difference between two sample means". Each RV has a distribution that describes how probability is spread over its possible values.

🔢

Common discrete distributions

Bernoulli(p): a single success/failure trial, e.g. whether an allele is mutated.
Binomial(n, p): the number of successes in n independent Bernoulli trials, e.g. the number of reads supporting a SNP.
Poisson(λ): counts of rare events; an idealized model for RNA-seq counts that usually underestimates their variance.
Negative Binomial (NB): Var = μ + αμ², which exceeds the Poisson variance Var = μ whenever α > 0; the core of DESeq2 / edgeR.
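The NB variance formula above uses the (μ, α) parameterization, while SciPy's `nbinom` uses (n, p). The following is a minimal sketch of the conversion, assuming SciPy is available; the helper name `nb_from_mean_dispersion` is made up for this illustration and is not part of any library.

```python
from scipy import stats

def nb_from_mean_dispersion(mu, alpha):
    """Frozen SciPy NB distribution with mean mu and Var = mu + alpha * mu**2."""
    size = 1.0 / alpha          # "size" in R's dnbinom, n in scipy.stats.nbinom
    p = size / (size + mu)      # SciPy's success probability
    return stats.nbinom(n=size, p=p)

mu, alpha = 10.0, 0.3           # illustrative values
nb = nb_from_mean_dispersion(mu, alpha)
pois = stats.poisson(mu)

# Same mean, but the NB variance carries the extra alpha * mu**2 term
print("NB:      mean =", nb.mean(), " var =", nb.var())
print("Poisson: mean =", pois.mean(), " var =", pois.var())
print("mu + alpha * mu^2 =", mu + alpha * mu**2)
```

With these values the Poisson variance equals the mean, while the NB variance should agree with μ + αμ² up to floating-point error.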

〰️

Common continuous distributions

Normal(μ, σ²): the limiting distribution in the CLT; the asymptotic basis of many tests.
Student's t(ν): replaces the Normal when the variance is estimated from a small sample; limma's moderated t extends this idea.
χ²(ν), F(ν₁, ν₂): the working distributions for variance ratios, chi-square tests, and ANOVA.
Beta(α, β), Gamma(α, β): Beta is the conjugate prior for the Binomial probability; Gamma is the conjugate prior for the Poisson rate.
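To make the Beta/Binomial conjugacy concrete, here is a small worked sketch. The prior and the read counts are hypothetical numbers chosen only for illustration; the update rule itself (a Beta(a, b) prior with k successes in n trials gives a Beta(a + k, b + n − k) posterior) is the standard conjugate result.

```python
from scipy import stats

# Hypothetical prior and data, for illustration only
prior_a, prior_b = 2.0, 2.0      # Beta(2, 2) prior on the variant allele fraction p
n_reads, k_alt = 40, 12          # 12 of 40 reads support the variant

# Conjugate update: Beta(a, b) prior + Binomial(n, p) data -> Beta(a + k, b + n - k)
post = stats.beta(prior_a + k_alt, prior_b + n_reads - k_alt)

post_mean = post.mean()                       # equals (a + k) / (a + b + n)
ci_low, ci_high = post.ppf([0.025, 0.975])    # 95% equal-tailed credible interval
print(f"posterior mean = {post_mean:.3f}, 95% CrI = ({ci_low:.3f}, {ci_high:.3f})")
```

The posterior mean sits between the prior mean (0.5) and the observed fraction (0.3), pulled toward the data because 40 reads outweigh a prior worth 4 pseudo-observations.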

Interactive simulation

Poisson vs NB overdispersion simulator

RNA-seq counts usually show a Var/Mean ratio well above 1. In the simulator, as the dispersion α grows, the right tail of the NB stretches while the Poisson keeps its narrow peak. This is why fitting a Poisson model to RNA-seq counts badly underestimates standard errors and produces many false positives.

[Interactive plot. Sliders: mean μ (shown at 10), NB dispersion α (shown at 0.30). X: count k | Y: P(X = k) | solid line: NB; dotted line: Poisson]
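For readers without access to the interactive plot, the comparison can be approximated numerically. This is a sketch under the slider values shown in the text (μ = 10, α = 0.30); the cutoff of 20 for the tail probability is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import stats

mu, alpha = 10.0, 0.30                  # slider values shown in the text
size = 1.0 / alpha
nb = stats.nbinom(n=size, p=size / (size + mu))
pois = stats.poisson(mu)

k = np.arange(0, 41)
pmf_nb, pmf_pois = nb.pmf(k), pois.pmf(k)   # the two curves the plot would draw

# Probability of a count of at least 20 (twice the mean); sf(19) = P(X > 19)
tail_nb = nb.sf(19)
tail_pois = pois.sf(19)

print(f"Var/Mean: NB = {nb.var() / nb.mean():.1f}, Poisson = {pois.var() / pois.mean():.1f}")
print(f"Peak height: NB = {pmf_nb.max():.3f}, Poisson = {pmf_pois.max():.3f}")
print(f"P(X >= 20): NB = {tail_nb:.4f}, Poisson = {tail_pois:.4f}")
```

Changing `alpha` and rerunning plays the role of the dispersion slider: larger values flatten the NB peak and fatten its right tail, while the Poisson curve does not move.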

Moments and limit theorems

2. Expectation, variance, and the Central Limit Theorem

The expectation E[X] and the variance Var[X] = E[(X − μ)²], where μ = E[X], are the two most widely used summaries of a distribution. For the NB, Var = μ + αμ² > μ whenever α > 0; variance exceeding the mean is what overdispersion means relative to the Poisson.

Law of Large Numbers (LLN): the sample mean X̄ converges in probability to the true mean μ as n → ∞. Central Limit Theorem (CLT): for independent and identically distributed (iid) X₁, ..., Xₙ with finite variance σ², √n·(X̄ − μ) / σ → N(0, 1) in distribution. The CLT does not require X itself to be normal. That is why the sample mean of non-normal RNA-seq counts can still be approximated by a normal distribution once n is large enough.
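The CLT statement can be checked by simulation. The sketch below draws NB counts (illustrative μ = 10, α = 0.3), standardizes the sample mean exactly as in the formula, and uses skewness as a rough measure of distance from normality; the sample sizes and the number of replicates are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, alpha = 10.0, 0.3                   # illustrative NB parameters
size = 1.0 / alpha
p = size / (size + mu)
sigma = np.sqrt(mu + alpha * mu**2)     # NB standard deviation

def standardized_means(n, reps=20000):
    """Draw `reps` samples of size n and return sqrt(n) * (mean - mu) / sigma for each."""
    x = rng.negative_binomial(size, p, size=(reps, n))
    return np.sqrt(n) * (x.mean(axis=1) - mu) / sigma

# A normal distribution has skewness 0, so the skewness should shrink as n grows
skew_by_n = {n: stats.skew(standardized_means(n)) for n in (2, 10, 50)}
for n, s in skew_by_n.items():
    print(f"n = {n:>2}: skewness of standardized mean = {s:.3f}")
```

Skewness is only one summary; a QQ plot of `standardized_means(n)` against N(0, 1) would give a fuller picture of how quickly the approximation improves.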

⚠️

The CLT's hidden conditions: ① Independence: correlation between cells, batch effects, and paired samples all break it. ② Finite variance: the Cauchy distribution has no finite variance, so the CLT does not apply to it at all; strongly skewed data with finite variance, such as low-count single-cell data with dropout, do converge, but slowly. ③ "How large is large enough?" depends on skewness: as a rule of thumb, n ≈ 30 is usually enough for an NB with α ≈ 0.3, while for extremely skewed data even n > 100 may not be.
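The finite-variance condition can be illustrated by comparing how the spread of the sample mean behaves for normal and Cauchy data. This is a sketch only; the interquartile range (IQR) is used as the spread measure because the Cauchy distribution has no variance to report, and the replicate count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 2000
samplers = {"normal": rng.standard_normal, "cauchy": rng.standard_cauchy}

def iqr_of_means(sampler, n):
    """Interquartile range of `reps` sample means, each computed from n draws."""
    means = sampler((reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

iqr = {(name, n): iqr_of_means(f, n) for name, f in samplers.items() for n in (10, 1000)}
for (name, n), value in iqr.items():
    print(f"{name:>6}, n = {n:>4}: IQR of sample mean = {value:.3f}")
```

For normal data the spread of the mean shrinks roughly like 1/√n. For Cauchy data it stays near 2 regardless of n, because the mean of n standard Cauchy draws is itself standard Cauchy, so averaging gains nothing.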

Decision guide

3. A decision tree for choosing a distribution

🌳 Distribution selection flow

Q1: Are the data counts (non-negative integers)? → Yes → Q2; No → Q4.

Q2: Compute the Pearson dispersion χ²/df from a Poisson fit. ≈ 1 → Poisson is adequate; clearly > 1 (typical of RNA-seq) → NB. Then check Q3.

Q3: Is the proportion of zeros unusually high (single-cell, ecology)? → Yes → zero-inflated NB (ZINB) or a hurdle model.

Q4: Are the data continuous, roughly symmetric, and free of extreme values? → Yes → Normal; right-skewed (gene expression, drug concentrations) → log-Normal or Gamma.

Q5: Is the variable a proportion or probability (0–1)? → Yes → Beta; an event rate? → Gamma or Exponential.
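The Pearson dispersion in Q2 can be sketched for the simplest case, an intercept-only Poisson model whose fitted mean is just the sample mean. The function name `pearson_dispersion` and the simulated data are illustrative assumptions; with covariates one would instead fit a Poisson GLM and divide its Pearson χ² by the residual degrees of freedom.

```python
import numpy as np

def pearson_dispersion(counts):
    """Pearson chi-square / df for an intercept-only Poisson model."""
    counts = np.asarray(counts, dtype=float)
    mu_hat = counts.mean()                            # fitted mean for every observation
    chi2 = np.sum((counts - mu_hat) ** 2 / mu_hat)    # Poisson variance function V(mu) = mu
    df = counts.size - 1                              # one parameter (the mean) estimated
    return chi2 / df

# Simulated data for illustration: same mean, different dispersion
rng = np.random.default_rng(42)
pois_counts = rng.poisson(10, size=2000)
size = 1 / 0.3                                        # NB with alpha = 0.3
nb_counts = rng.negative_binomial(size, size / (size + 10), size=2000)

disp_pois = pearson_dispersion(pois_counts)
disp_nb = pearson_dispersion(nb_counts)
print(f"Poisson data: chi2/df = {disp_pois:.2f}")
print(f"NB data:      chi2/df = {disp_nb:.2f}")
```

In this intercept-only case the statistic equals the sample variance (with n − 1 in the denominator) divided by the sample mean, which is why the Var/Mean ratio is often used as a quick first check.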

Scenario	Distribution	Diagnostic
Bulk RNA-seq counts	NB	χ²/df > 1
scRNA-seq counts (high zero fraction)	ZINB / NB	% zeros vs NB-expected
Protein quantification (log intensity)	Normal	QQ plot
Cell response proportions	Beta / Binomial	support [0, 1]
Patient survival times	Weibull / Exponential	is the hazard constant?
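The "% zeros vs NB-expected" diagnostic in the table can be sketched as follows. Under an NB with mean μ and dispersion α, P(X = 0) = (1 + αμ)^(−1/α). The parameter values and the 30% extra-zero rate below are hypothetical, and the true μ and α are plugged in directly; with real data both would have to be estimated first.

```python
import numpy as np

def nb_expected_zero_fraction(mu, alpha):
    """P(X = 0) under an NB with mean mu and Var = mu + alpha * mu**2."""
    return (1.0 + alpha * mu) ** (-1.0 / alpha)

rng = np.random.default_rng(7)
mu, alpha, extra_zero_prob = 2.0, 0.5, 0.3     # hypothetical values for illustration
size = 1.0 / alpha
nb_counts = rng.negative_binomial(size, size / (size + mu), size=5000)
# Zero-inflated version: each observation is forced to zero with probability 0.3
zinb_counts = np.where(rng.random(5000) < extra_zero_prob, 0, nb_counts)

expected = nb_expected_zero_fraction(mu, alpha)
obs_nb = np.mean(nb_counts == 0)
obs_zinb = np.mean(zinb_counts == 0)
print(f"NB-expected zero fraction:  {expected:.3f}")
print(f"observed, plain NB:         {obs_nb:.3f}")
print(f"observed, zero-inflated NB: {obs_zinb:.3f}")
```

An observed zero fraction well above the NB expectation is the signal that points toward ZINB or a hurdle model; a fraction close to it suggests the plain NB already accounts for the zeros.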

Code

Implementation: distribution fitting and QQ plots

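# --- R --- from PMF / PDF to distribution fitting
# Assumes `counts` is a numeric vector of non-negative integer counts
# and `res` is a numeric vector of model residuals.

# 1) Built-in distribution functions: d/p/q/r prefixes
dnorm(0); pnorm(1.96); qnorm(0.975); rnorm(10, mean = 0, sd = 1)
dnbinom(0:20, mu = 5, size = 2)  # DESeq2-style parameterization: mu, size = 1/alpha

# 2) Fit an NB to observed data by maximum likelihood
library(MASS)
fit <- fitdistr(counts, "negative binomial")
fit$estimate                     # estimated size and mu

# 3) QQ plot and Kolmogorov-Smirnov test
qqnorm(res); qqline(res)
# KS assumes a continuous distribution with known parameters; on counts with
# fitted parameters it warns about ties and its p-value is only a rough guide.
ks.test(counts, "pnbinom", mu = fit$estimate[["mu"]], size = fit$estimate[["size"]])

# 4) Overdispersion check: Var/Mean ratio
var(counts) / mean(counts)       # > 1 suggests overdispersion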

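# --- Python ---
# Assumes `counts` (1-D array of counts), `y` and `X` (response and design
# matrix), and `residuals` (1-D array of model residuals) are already defined.
import numpy as np, scipy.stats as st
import statsmodels.api as sm

# 1) scipy.stats distributions
st.norm.pdf(0); st.norm.ppf(0.975); st.norm.rvs(size=10)
st.nbinom.pmf(np.arange(21), n=2, p=2/(2+5))  # n=size, p=size/(size+mu), here mu=5

# 2) NB GLM via statsmodels; alpha is held fixed at 0.3, not estimated
#    (sm.NegativeBinomial(y, X).fit() estimates alpha by maximum likelihood)
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.3)).fit()
print(nb.summary())

# 3) QQ plot against the normal; the 45-degree line assumes standardized residuals
sm.qqplot(residuals, line="45")

# 4) Overdispersion diagnostic
np.var(counts, ddof=1) / np.mean(counts)  # > 1 suggests overdispersion (ddof=1 matches R's var)

# 5) KS test against NB(size=2, mu=5); approximate only, because counts are discrete
st.kstest(counts, lambda x: st.nbinom.cdf(x, n=2, p=2/7))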

🚫

Common mistake: running a t-test directly on log(x+1)-transformed RNA-seq counts amounts to assuming the transformed values are normal. For low-count genes this approximation is very poor, which is why DESeq2 / edgeR use an NB GLM rather than a t-test.

📝 Self-check

1. Why do RNA-seq counts typically violate the Poisson assumption?

A. Counts are not integers

B. There are too many samples

C. Biological variation makes Var > Mean (overdispersion)

D. The Poisson distribution has infinite expectation

2. Which statement about the CLT is correct?

A. The CLT requires X itself to be normal

B. Under iid sampling with finite variance, the sampling distribution of the sample mean is asymptotically normal

C. The CLT still holds for correlated samples

D. The CLT also holds for the Cauchy distribution

3. Which distribution is the conjugate prior for the Binomial probability parameter p?

A. Normal

B. Gamma

C. Beta

D. Poisson