Why is the CLT the "nuclear reactor" of biostatistics?
Nearly every inferential procedure you have used relies on the Central Limit Theorem (CLT). Why does the t-test work? Why does the ANOVA F-ratio mean anything? Why is a regression coefficient's 95% CI valid? The answer in each case is the CLT: when n is large enough, the sampling distribution of the mean (or of regression coefficients, proportions, or differences) is approximately normal whether or not the raw data are normal, provided the observations are independent and have finite variance.
A deeper idea: SE (standard error) and SD (standard deviation) are entirely different things. SD describes how spread out the data are; SE = SD/√n describes how precisely the mean has been estimated. Curran-Everett 2008 (Adv Physiol Educ) flagged this as the most common confusion in papers: authors report SEM as if it described variability, which makes error bars look tidy while hiding the actual spread of the data.
1. Sampling distribution, SE, and the CLT
Sampling distribution
This is not the distribution of the data. It is the distribution of a statistic (such as X̄) over hypothetical repeated samples. Each experiment yields a single X̄, but the CLT tells us that this X̄ is a draw from an approximately normal distribution centered at μ with standard deviation σ/√n.
Every p-value, CI, and t-statistic is computed from this distribution.
SE vs SD
SD: how spread out the data are (a larger SD means more variation between individuals).
SE = SD/√n: the precision of the estimate (a smaller SE means the mean is estimated more precisely).
At n = 100, SE = SD/10; at n = 10 000, SE = SD/100. SD itself does not shrink as n grows.
In a paper, report mean ± SD (spread) or mean (95% CI) (precision); do not report mean ± SEM.
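To make the SD/SE distinction concrete, here is a minimal sketch using only the Python standard library. The population (Normal with mean 120 and SD 15, loosely resembling systolic blood pressure), the seed, and the helper name `sd_and_se` are illustrative assumptions, not values from a real study.

```python
import math
import random
import statistics

def sd_and_se(sample):
    """Return (SD, SE) for a sample, with SE = SD / sqrt(n)."""
    sd = statistics.stdev(sample)        # sample SD (n - 1 denominator)
    se = sd / math.sqrt(len(sample))     # standard error of the mean
    return sd, se

rng = random.Random(42)                  # arbitrary fixed seed for reproducibility
results = {}
for n in (100, 10_000):
    # Hypothetical population: Normal(mean=120, sd=15)
    sample = [rng.gauss(120, 15) for _ in range(n)]
    results[n] = sd_and_se(sample)
    print(f"n={n:>6}  SD={results[n][0]:.2f}  SE={results[n][1]:.3f}")
```

The SD should stay near 15 at both sample sizes, while the SE drops by roughly a factor of 10 when n grows by a factor of 100.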
CLT statement
Let X₁, …, Xₙ be i.i.d. with finite mean μ and finite variance σ² > 0. Then
(X̄ − μ) / (σ/√n) → N(0, 1)
in distribution as n → ∞ (Lindeberg-Lévy CLT, 1922). In practice n ≥ 30 is often enough, but strongly skewed or heavy-tailed data need a larger n.
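The statement above can be checked numerically. The sketch below is an illustration under assumed settings (an Exponential(1) population, for which μ = σ = 1, 4000 replicates, and an arbitrary seed). It standardizes each sample mean exactly as in the formula and reports how much probability falls in each tail beyond ±1.96; a standard normal puts 0.025 in each.

```python
import math
import random

def standardized_means(n, reps, rng):
    """Return z = (xbar - mu) / (sigma / sqrt(n)) for `reps` samples of
    size n drawn from Exponential(1), where mu = sigma = 1."""
    mu, sigma = 1.0, 1.0
    zs = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append((xbar - mu) / (sigma / math.sqrt(n)))
    return zs

def tail_shares(zs, bound=1.96):
    """Share of z-values below -bound and above +bound."""
    lower = sum(z < -bound for z in zs) / len(zs)
    upper = sum(z > bound for z in zs) / len(zs)
    return lower, upper

rng = random.Random(0)                   # arbitrary fixed seed
tails = {n: tail_shares(standardized_means(n, 4000, rng)) for n in (2, 30, 200)}
for n, (lower, upper) in tails.items():
    print(f"n={n:>3}: lower tail={lower:.3f}  upper tail={upper:.3f}  (normal: 0.025 each)")
```

At n = 2 the lower tail is empty because X̄ cannot be negative, so z can never fall below −√2; the two tails become more balanced as n grows, which is the convergence the theorem describes.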
CLT demo
Pick a population distribution (uniform / exponential / bimodal / Cauchy) and drag the n slider. Each replicate draws n observations from the population and computes their mean; this is repeated 2000 times and the distribution of the sample means is plotted. Things to watch for:
· The more skewed the population, the larger the n needed for convergence.
· The Cauchy distribution has no defined mean or variance, so the CLT does not apply: the sample mean keeps jumping around no matter how large n is. This is exactly where the finite-variance assumption breaks.
↑ Population distribution (one large sample, N = 5000)
↑ Sampling distribution of the mean (2000 replicates, n observations per mean) · Red dashed line = theoretical Normal(μ, σ/√n)
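For readers without the interactive widget, the Cauchy behaviour can be reproduced with a short sketch. It assumes a standard Cauchy generated by the inverse-CDF method and compares it with Exponential(1); spread is measured with the interquartile range (IQR) because the standard deviation of Cauchy sample means is itself unstable. The replicate count, sample sizes, and seed are arbitrary choices.

```python
import math
import random
import statistics

def cauchy(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 0.5))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def exponential(rng):
    """Exponential(1) draw."""
    return rng.expovariate(1.0)

def iqr_of_means(draw, n, reps, rng):
    """Interquartile range of `reps` sample means, each over n draws."""
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

rng = random.Random(1)                   # arbitrary fixed seed
spread = {}
for name, draw in (("exponential", exponential), ("cauchy", cauchy)):
    spread[name] = {n: iqr_of_means(draw, n, 1000, rng) for n in (10, 400)}
    print(name, {n: round(v, 3) for n, v in spread[name].items()})
```

For the exponential population the IQR of the sample mean shrinks roughly like 1/√n. For the Cauchy it stays near 2 at every n, because the mean of n standard Cauchy draws is again standard Cauchy.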
2. History and assumptions of the CLT
A brief history of the CLT
1733 De Moivre: showed that the proportion of heads in many coin tosses is approximately normal, a result he later included in the second edition of The Doctrine of Chances (1738). This is the earliest version of the CLT (binomial → normal).
1810 Laplace: extended the result to sums of more general i.i.d. variables, giving an early form of what is now called the Central Limit Theorem.
1901 Lyapunov: proved a version for variables that are independent but not identically distributed (Lyapunov CLT).
1922 Lindeberg-Lévy: gave the now-standard statement, which assumes only i.i.d. observations with finite variance.
Two key assumptions
(1) i.i.d. (independent and identically distributed): the observations are mutually independent and come from the same distribution. Panel data, time series, and cluster sampling all violate this; use a mixed model or GEE instead.
(2) Finite variance, σ² < ∞: the Cauchy distribution (t₁) has no finite variance, so the CLT fails. The same holds for a Pareto distribution with α ≤ 2.
Beyond these two, normality of the raw data is not required. This is the most commonly misunderstood point.
Bootstrap demo
When n is too small or the data are too skewed, the CLT-based t-CI can be unreliable. Efron (1979) proposed the bootstrap: treat the n observations in hand as a mini-population, resample n of them with replacement B times, and compute the statistic each time. The resulting distribution is the empirical sampling distribution.
· Percentile CI: the 2.5th and 97.5th percentiles of the bootstrap statistics.
· t-CI: mean ± t·SE, where SE = s/√n; it assumes the sampling distribution of the mean is approximately normal.
For skewed data, the percentile CI is usually the more accurate of the two.
Green = bootstrap distribution of the mean · Blue dashed = 95% percentile CI · Red dashed = t-CI · Black = observed mean
BCa (bias-corrected and accelerated) CI: DiCiccio & Efron (1996) describe a refinement of the percentile CI that corrects for bias and for skewness through an "acceleration" constant a. R: boot::boot.ci(b, type="bca"); Python: scipy.stats.bootstrap(method="BCa"). For moderately skewed data, BCa is usually more accurate than the plain percentile interval.
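The resampling recipe behind the percentile CI is short enough to write out by hand. The following is a minimal standard-library sketch, not a replacement for boot::boot.ci or scipy.stats.bootstrap: it implements only the plain 95% percentile interval (no BCa correction), and the function name, the lognormal example data, and the seeds are assumptions made for illustration.

```python
import random
import statistics

def percentile_bootstrap_ci95(data, stat=statistics.fmean, n_boot=2000, seed=0):
    """95% percentile bootstrap CI: resample len(data) values with
    replacement n_boot times, compute `stat` on each resample, and take
    the 2.5th and 97.5th percentiles of the results."""
    rng = random.Random(seed)
    n = len(data)
    boot_stats = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    # quantiles(..., n=40) returns cut points at 2.5%, 5%, ..., 97.5%
    cuts = statistics.quantiles(boot_stats, n=40)
    return cuts[0], cuts[-1]

# Hypothetical right-skewed sample (lognormal), n = 40
data_rng = random.Random(7)
sample = [data_rng.lognormvariate(0.0, 1.0) for _ in range(40)]

low, high = percentile_bootstrap_ci95(sample)
print(f"mean = {statistics.fmean(sample):.3f}, 95% percentile CI = ({low:.3f}, {high:.3f})")
```

Passing a different `stat` (for example `statistics.median`) gives an interval for that statistic with no new formula, which is the main practical appeal of the method.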
3. Which CI should you use?
🌳 Confidence interval decision tree
4. How to use each kind of error bar
| Quantity | Meaning | Formula | When to use | Pitfall |
|---|---|---|---|---|
| SD | Data spread | s = √(Σ(xᵢ−x̄)²/(n−1)) | Describe individual variation | Does not shrink with n |
| SE (SEM) | Precision of the mean estimate | SE = s/√n | Rarely the right thing to report; usually use a CI | Do not report it as if it were SD |
| 95% t-CI | 95% CI for the mean (CLT-based) | x̄ ± t_{n−1, 0.975}·s/√n | Approximately normal data / n ≥ 20 | Poor coverage when n is small and the data are skewed |
| Bootstrap CI | Quantiles of the empirical sampling distribution | 2.5th / 97.5th percentile, or BCa | Skewed data, small n, complex statistics | Unstable when n < 10 |
| Wilson CI (proportion) | 95% CI for a proportion | (p̂ + z²/2n ± z√(p̂q̂/n + z²/4n²)) / (1+z²/n) | Binary outcome, small n or extreme p | Avoid p̂ ± 1.96·SE (Wald) |
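The Wilson formula in the last row is easy to mistype, so a small sketch may help. It implements the table's expression term by term next to the Wald interval it replaces; the function names and the example of 1 success in 20 trials are illustrative assumptions.

```python
import math

Z_975 = 1.959963984540054    # 97.5th percentile of the standard normal

def wald_ci(successes, n, z=Z_975):
    """Wald interval: p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(successes, n, z=Z_975):
    """Wilson score interval:
    (p_hat + z^2/2n +/- z*sqrt(p_hat*q_hat/n + z^2/4n^2)) / (1 + z^2/n)."""
    p = successes / n
    q = 1 - p
    centre = p + z**2 / (2 * n)
    half = z * math.sqrt(p * q / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (centre - half) / denom, (centre + half) / denom

# Hypothetical example: 1 event among 20 patients
print("Wald:  ", wald_ci(1, 20))
print("Wilson:", wilson_ci(1, 20))
```

With an extreme proportion and small n, the Wald lower limit drops below zero, which is impossible for a proportion, while the Wilson limits stay inside [0, 1].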
5. Worked examples
```r
# R: CLT demo + bootstrap CI
library(tidyverse); library(boot)

# --- 1) Simulate CLT: exponential population ---
n <- 30
sims <- replicate(5000, mean(rexp(n, rate = 1)))
hist(sims, breaks = 40, main = "Sampling distribution of X̄")
qqnorm(sims); qqline(sims, col = "red")

# --- 2) SE vs SD ---
x <- rnorm(100, mean = 120, sd = 15)
sd_x <- sd(x)                      # data spread (~15)
se_x <- sd_x / sqrt(length(x))     # mean precision (~1.5)

# --- 3) Classic t-CI (CLT-based) ---
ci_t <- t.test(x)$conf.int         # 95% t-CI

# --- 4) Bootstrap percentile + BCa CI ---
b <- boot(x, statistic = function(d, i) mean(d[i]), R = 2000)
boot.ci(b, type = c("perc", "bca"))

# --- 5) Finite-population correction ---
N <- 500; n <- 60
fpc <- sqrt((N - n) / (N - 1))
se_fpc <- (sd_x / sqrt(n)) * fpc
```
```python
# Python: CLT demo + bootstrap CI
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# --- 1) Simulate CLT: exponential population ---
n = 30
sims = np.array([rng.exponential(1, n).mean() for _ in range(5000)])
plt.hist(sims, bins=40); plt.title("Sampling distribution of X̄"); plt.show()
stats.probplot(sims, plot=plt); plt.show()

# --- 2) SE vs SD ---
x = rng.normal(120, 15, 100)
sd_x = x.std(ddof=1)               # data spread (~15)
se_x = sd_x / np.sqrt(len(x))      # mean precision (~1.5)

# --- 3) Classic t-CI (CLT-based) ---
ci_t = stats.t.interval(0.95, df=len(x)-1, loc=x.mean(), scale=se_x)

# --- 4) Bootstrap BCa CI (scipy ≥ 1.7); method="percentile" gives the percentile CI ---
res = stats.bootstrap((x,), np.mean, n_resamples=2000, method="BCa",
                      confidence_level=0.95, random_state=rng)
print(res.confidence_interval)     # (low, high)

# --- 5) Finite-population correction ---
N, n = 500, 60
fpc = np.sqrt((N - n) / (N - 1))
se_fpc = (sd_x / np.sqrt(n)) * fpc
```
Exercise: Before running your next t-test, also compute a BCa CI on the same data with boot::boot.ci() or scipy.stats.bootstrap() and compare the widths of the two intervals. If they differ substantially, your data are too skewed for the CLT approximation at this n; prefer the bootstrap result and consider a log transform.
6. Six common mistakes
❌ SE is not SD
A larger n shrinks the SEM and makes error bars look tidy, but that reflects the precision of the estimate, not the spread of the data. Curran-Everett 2008: report SD or IQR for variability and a 95% CI for precision. SEM has almost no standalone use.
❌ The CLT is about the mean, not the raw data
Common mistake: "My data aren't normal, so I can't use a t-test." The CLT is a statement about the sample mean. As long as n is large enough and the variance is finite, X̄ is approximately normal and the t-test remains usable. Testing the raw data for normality (Shapiro-Wilk) is of limited value in itself.
❌ The bootstrap is not a cure-all
With n < 10, the bootstrap only recombines the same handful of values, so the resulting sampling distribution carries very little information. Chernick 2008: bootstrap CIs are unstable when n < 10; n ≥ 30 is ideal. Extreme small-sample problems call for Bayesian or exact methods.
❌ Ignoring dependence
Three slices from the same mouse are not independent. Reporting n as "mice × slices" inflates the effective sample size, so the SE is too small and the p-value is too small. Lazic 2010 (BMC Neurosci): use a mixed model or a cluster bootstrap, and handle biological and technical replicates separately.
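The size of the problem can be seen in a small simulation. The sketch below is an illustration under assumed numbers (8 mice, 3 slices each, between-mouse SD 2.0, within-mouse SD 0.5). It is not the mixed model or cluster bootstrap recommended above; as a simpler stand-in, it averages the slices within each mouse and then treats the mouse as the unit of analysis.

```python
import math
import random
import statistics

def simulate(n_mice=8, slices_per_mouse=3, sd_between=2.0, sd_within=0.5, seed=3):
    """Hypothetical nested data: each mouse has its own true level, and
    its slices scatter tightly around that level."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_mice):
        mouse_level = rng.gauss(10.0, sd_between)
        data.append([rng.gauss(mouse_level, sd_within) for _ in range(slices_per_mouse)])
    return data

def naive_se(data):
    """Wrong: pools every slice as if all were independent (n = mice * slices)."""
    flat = [value for mouse in data for value in mouse]
    return statistics.stdev(flat) / math.sqrt(len(flat))

def mouse_level_se(data):
    """Averages slices within each mouse first, so n = number of mice."""
    mouse_means = [statistics.fmean(mouse) for mouse in data]
    return statistics.stdev(mouse_means) / math.sqrt(len(mouse_means))

data = simulate()
print(f"naive SE (n = 24 slices): {naive_se(data):.3f}")
print(f"mouse-level SE (n = 8):   {mouse_level_se(data):.3f}")
```

When most of the variation sits between animals, as assumed here, the naive SE comes out clearly smaller than the mouse-level SE, which is how pseudoreplication produces p-values that are too small.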
❌ The CLT does not apply to the Cauchy distribution
Single-cell expression data, income, and insurance payouts can all be heavy-tailed. For a Pareto distribution with α ≤ 2, or for the Cauchy, the variance does not exist, so the CLT fails. Use medians and quantile-based inference instead, or model the data with a stable distribution.
❌ Forgetting the FPC
Sampling 100 of a country's 500 hospitals (20%) without applying the finite-population correction (FPC) overestimates the SE. FPC = √((N−n)/(N−1)); as n/N → 1, SE → 0. See Cochran 1977, Sampling Techniques.
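As a worked instance of the formula, the sketch below computes the correction for the 100-of-500 case. The helper names and the sample SD of 2000 (in arbitrary cost units) are hypothetical, chosen only to show the effect on the SE.

```python
import math

def fpc(N, n):
    """Finite-population correction sqrt((N - n) / (N - 1)),
    for a sample of n units drawn without replacement from N."""
    if N < 2 or not 0 < n <= N:
        raise ValueError("need N >= 2 and 0 < n <= N")
    return math.sqrt((N - n) / (N - 1))

def se_with_fpc(sd, N, n):
    """Standard error of the mean with the correction applied."""
    return sd / math.sqrt(n) * fpc(N, n)

sd = 2000.0                                  # hypothetical sample SD
uncorrected = sd / math.sqrt(100)
corrected = se_with_fpc(sd, N=500, n=100)
print(f"FPC = {fpc(500, 100):.4f}")
print(f"SE without FPC = {uncorrected:.1f}, with FPC = {corrected:.1f}")
```

For n = 100 and N = 500 the factor is √(400/499), about 0.895, so the corrected SE is roughly 10% smaller. The factor is 1 when n = 1 and 0 when the whole population is sampled.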
📝 Self-check
1. A paper reports systolic blood pressure (SBP) for 100 patients as "mean ± SEM = 132 ± 1.4 mmHg." What is the most appropriate correction?
2. A colleague sees that the QQ plot of a gene's expression in single-cell RNA-seq data deviates badly from the 45° line and advises "don't run a t-test." What is the most appropriate response?
3. You have only n = 8 samples and the distribution looks right-skewed. Which of the following CI methods is least reliable?
4. You randomly sample 100 of 500 hospitals to survey mean hospitalization cost. How should you compute the 95% CI for the mean?