Why are descriptive statistics the first step you cannot skip?
Many researchers treat descriptive statistics as Table 1 boilerplate: fill in mean ± SD and move on. That is a serious mistake. Every inferential procedure (t-tests, ANOVA, regression) rests on descriptive observations: whether you choose Welch's or Student's t, the mean or the median, the log scale or the raw scale depends on the shape of the data you saw.
Worse, the same summary statistics can describe wildly different datasets. That is the warning at the heart of Anscombe's Quartet (Anscombe 1973) and the Datasaurus Dozen (Matejka & Fitzmaurice 2017). Anscombe's four datasets share nearly identical means, SDs, and correlations, yet their scatter plots look nothing alike; the Datasaurus datasets do the same while tracing a dinosaur, a cross, and a star. Look at the figure. Always look at the figure first.
The EDA philosophy (Tukey 1977): exploratory data analysis is not about "proving" something; it is about letting the data speak first. In Exploratory Data Analysis (1977), Tukey's advice was to draw the picture before running the test. Today, ggplot2 and seaborn are descendants of the same philosophy.
1. Location, Spread, and Shape
Every first description of a dataset lives on three axes: location (where is the centre?), spread (how scattered are the values?), and shape (is it symmetric? heavy-tailed?). Each axis offers a "sensitive" choice and a "robust" choice; they differ in how much they tolerate outliers.
Location
Mean: x̄ = Σxᵢ/n; sensitive to outliers.
Median: the 50th percentile; robust to outliers.
Trimmed mean: the mean after dropping the top and bottom 5–10% of values; a middle ground between the two.
Geometric mean: take the mean of the logs, then exponentiate; suited to ratio or fold-change data (e.g. gene expression) and defined only for positive values.
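As a minimal sketch of the four location measures above, the snippet below uses a small made-up biomarker sample with one high value. The data are illustrative assumptions, and the 20% trim is chosen only so that one value is actually dropped from each end of a 7-point sample.

```python
import numpy as np
from scipy import stats

# Hypothetical biomarker values; the last one is an outlier.
x = np.array([110, 118, 122, 125, 130, 135, 220], dtype=float)

mean = x.mean()
median = np.median(x)
# proportiontocut=0.2 drops int(0.2 * 7) = 1 value from each end.
trimmed = stats.trim_mean(x, 0.2)
# Geometric mean: average on the log scale, then back-transform.
geometric = np.exp(np.log(x).mean())

print(f"mean={mean:.1f}  median={median:.1f}  "
      f"trimmed={trimmed:.1f}  geometric={geometric:.1f}")
```

The single outlier pulls the mean well above the median, while the trimmed and geometric means land in between.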
Spread
SD: s = √(Σ(xᵢ − x̄)²/(n − 1)); pairs with the mean.
IQR: Q₃ − Q₁; pairs with the median and spans the middle 50% of the data.
MAD: median(|xᵢ − median(x)|); the most robust of the four.
Range: max − min; driven entirely by the extremes, so do not report it on its own.
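To see the difference in robustness, the sketch below computes the four spread measures on a made-up sample before and after its largest value is replaced by an outlier. The data and the helper name `spread` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

clean = np.array([110, 118, 122, 125, 130, 135, 140], dtype=float)
dirty = clean.copy()
dirty[-1] = 220  # replace the largest value with an outlier


def spread(x):
    """Return the four spread measures for a 1-D sample."""
    return {
        "sd": np.std(x, ddof=1),                 # sample SD (n - 1 denominator)
        "iqr": stats.iqr(x),                     # Q3 - Q1
        "mad": stats.median_abs_deviation(x),    # raw MAD (scale = 1.0)
        "range": x.max() - x.min(),
    }


before, after = spread(clean), spread(dirty)
for key in before:
    print(f"{key:>5}: {before[key]:7.2f} -> {after[key]:7.2f}")
```

In this example the IQR and MAD are unchanged by the outlier, while the SD and the range grow several-fold.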
Shape
Skewness γ₁: > 0 right-skewed (long tail on the right); < 0 left-skewed.
Kurtosis γ₂ (excess kurtosis): > 0 leptokurtic (heavier tails than the normal); = 0 matches the normal distribution; < 0 platykurtic (lighter tails).
A histogram plus a QQ plot tells you more than the raw numbers: γ₁ = 0 does not guarantee symmetry, and it cannot reveal bimodality (a bimodal distribution can also have γ₁ = 0).
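The point about γ₁ = 0 can be checked with a small constructed example. The sketch below builds a deliberately artificial two-cluster sample (an assumption for illustration, not real data) and shows that its skewness is essentially zero even though a histogram would show two separate peaks.

```python
import numpy as np
from scipy import stats

# Deterministic, perfectly mirrored two-cluster sample (illustrative only).
cluster = np.linspace(-0.5, 0.5, 50)
x = np.concatenate([cluster - 3, cluster + 3])

skewness = stats.skew(x)
excess_kurtosis = stats.kurtosis(x)  # Fisher definition: normal = 0

print(f"skewness = {skewness:.3f}, excess kurtosis = {excess_kurtosis:.3f}")
```

The strongly negative excess kurtosis hints that something is unusual, but only a plot shows that the sample is bimodal.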
How skew pulls the mean and median apart
Drag the skew slider. With symmetric data (skew = 0), mean ≈ median; under right skew (skew > 0), the high values in the tail pull the mean upward while the median stays near the bulk of the data. Try skew = 3 or 5: that is the typical shape of blood biomarkers, income, and length of hospital stay.
Red line = mean · Blue line = median
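Without the interactive slider, the same effect can be sketched by simulation. The snippet below draws lognormal samples and uses the log-scale `sigma` as a stand-in for the skew control; that mapping, the seed, and the sample size are assumptions, not the page's own parameterisation.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

gaps = {}
for sigma in (0.5, 1.0, 1.5):
    # Larger sigma gives a longer right tail.
    sample = rng.lognormal(mean=0.0, sigma=sigma, size=10_000)
    gaps[sigma] = sample.mean() - np.median(sample)
    print(f"sigma={sigma}: mean={sample.mean():.2f}  "
          f"median={np.median(sample):.2f}  gap={gaps[sigma]:.2f}")
```

As the tail gets heavier, the mean drifts further above the median, which stays close to 1 (the true median of this lognormal).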
2. Box, Violin, or Dot: Which Plot to Choose?
One of the most common visualisation errors in the literature is a bar chart + SEM for continuous data. Weissgerber 2015 PLOS Biology ("Beyond Bar and Line Graphs") showed that the same bar + SEM can correspond to four entirely different distributions (bimodal, outlier-driven, skewed, reasonable). The fix: always show the individual points, which is also why SuperPlots (Lord 2020 JCB) and ggbeeswarm caught on.
| Plot | What it shows | Strength | Risk |
|---|---|---|---|
| Bar + SEM | Only the mean and one error bar | Clean | Hides n and the distribution; avoid |
| Boxplot | 5-number summary (min, Q1, median, Q3, max) + outliers | Robust, space-efficient | Hides bimodality and fine structure |
| Violin | Density + median/IQR core | Shows bimodality and shape | KDE smoothing misleads when n is small (< 20) |
| Dot / Strip / Bee | Every data point | Most transparent; no summary distortion | Overplots when n > 1000 |
| SuperPlot | Technical replicates (small dots) overlaid with biological replicate means (large dots) | Separates biological from technical replication | Requires a hierarchical design from the start |
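The Bar + SEM risk in the first row can be made concrete. The sketch below is a constructed illustration (not the data from Weissgerber 2015): it rescales a unimodal and a bimodal simulated sample to the same mean and SD, so a bar with an SEM error bar would look identical for both. The helper names, seed, and target values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40


def rescale(x, target_mean=10.0, target_sd=2.0):
    """Shift and scale a sample to an exact mean and sample SD."""
    z = (x - x.mean()) / x.std(ddof=1)
    return target_mean + target_sd * z


def mean_sem(x):
    return x.mean(), x.std(ddof=1) / np.sqrt(x.size)


unimodal = rescale(rng.normal(size=n))
bimodal = rescale(np.concatenate([
    rng.normal(-2, 0.3, n // 2),
    rng.normal(2, 0.3, n // 2),
]))

print("unimodal mean, SEM:", mean_sem(unimodal))
print("bimodal  mean, SEM:", mean_sem(bimodal))
# Count points within 0.5 of the shared mean: the bimodal sample has none.
print("points near the mean:",
      int(np.sum(np.abs(unimodal - 10) < 0.5)),
      int(np.sum(np.abs(bimodal - 10) < 0.5)))
```

A strip or bee-swarm plot of the two samples would expose the difference immediately; the bar chart cannot.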
Boxplot anatomy lab
Tune the outlier strength and the skew. Watch the 5-number summary (min/Q1/median/Q3/max) and the Tukey fences that bound the whiskers (Q1 − 1.5·IQR, Q3 + 1.5·IQR). As the outliers grow, the median and IQR barely move: that is the strength of robust statistics.
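The fence arithmetic can be sketched in a few lines. The sample is made up, and the quartiles use NumPy's default linear interpolation; other software (for example R's boxplot hinges) may give slightly different quartiles, which is an assumption to keep in mind when comparing outputs.

```python
import numpy as np

x = np.array([110, 118, 122, 125, 130, 135, 220], dtype=float)

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as outliers.
outliers = x[(x < lower_fence) | (x > upper_fence)]
# Whiskers end at the most extreme points still inside the fences.
inside = x[(x >= lower_fence) & (x <= upper_fence)]
whiskers = (inside.min(), inside.max())

print(f"Q1={q1}, median={med}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), whiskers={whiskers}, outliers={outliers}")
```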
3. Anscombe's Quartet & Datasaurus
Anscombe's Quartet
Four (x, y) datasets that all share:
· x̄ = 9.0, ȳ ≈ 7.5
· SD(x) ≈ 3.32, SD(y) ≈ 2.03
· r ≈ 0.816
· OLS line: y ≈ 0.5x + 3
But the scatter plots differ: (I) genuinely linear; (II) clearly curved; (III) linear with one extreme outlier; (IV) all x values equal except one outlying point. The lesson: summary statistics are no substitute for the plot.
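The shared statistics listed above can be verified directly. The sketch below hard-codes the quartet values as published by Anscombe (1973) and recomputes the means, SDs, correlation, and least-squares line for each set; the dictionary layout is just one convenient arrangement.

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # ordinary least-squares line
    summary[name] = {
        "mean_x": x.mean(), "mean_y": y.mean(),
        "sd_x": x.std(ddof=1), "sd_y": y.std(ddof=1),
        "r": np.corrcoef(x, y)[0, 1],
        "slope": slope, "intercept": intercept,
    }
    print(name, {k: round(float(v), 3) for k, v in summary[name].items()})
```

Plotting the four sets side by side (for example with matplotlib subplots) is the natural next step, since the numbers alone cannot distinguish them.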
Datasaurus Dozen
Matejka & Fitzmaurice (2017 CHI) used simulated annealing to morph the original dinosaur scatter into 12 further datasets that all share:
· mean(x) = 54.26, mean(y) = 47.83
· SD and r that match to two decimal places
...but the scatters are: the dinosaur, a cross, a star, parallel lines, a circle… The R package datasauRus reproduces them. The demo turns "always plot first" from a slogan into visual proof.
4. Transformations and Robust Statistics
🧮 Log transformation
Right-skewed multiplicative data (blood concentrations, gene expression, bacterial counts) often become symmetric after a log transform. When reporting, a mean on the log scale corresponds to the geometric mean on the original scale.
Pitfall: if the data contain zeros, use log(x + c), or better, log1p / asinh. Bland-Altman (2007 BMJ) showed that the choice of c affects the CI.
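A short sketch of both points: back-transforming a log-scale mean gives the geometric mean, and `log1p` / `arcsinh` stay finite at zero where a plain log does not. Both samples are made up for illustration.

```python
import numpy as np
from scipy import stats

# Strictly positive data: log-scale mean <-> geometric mean.
x = np.array([110, 118, 122, 125, 130, 135, 220], dtype=float)
log_mean = np.log(x).mean()
geometric_mean = np.exp(log_mean)
print(f"exp(mean(log x)) = {geometric_mean:.2f}, gmean = {stats.gmean(x):.2f}")

# Data with zeros: np.log would give -inf, so use a zero-safe transform.
counts = np.array([0, 0, 3, 12, 150, 2400], dtype=float)
log1p_counts = np.log1p(counts)      # log(1 + x); exactly 0 at x = 0
asinh_counts = np.arcsinh(counts)    # close to log(2x) for large x; 0 at x = 0
print(np.round(log1p_counts, 3))
print(np.round(asinh_counts, 3))
```

Whichever transform is used, it should be stated in the Methods, because summaries and intervals computed on the transformed scale depend on it.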
🛡️ Robust statistics
When outliers are too numerous to ignore: (1) trimmed mean (drop the top and bottom 10%, then average; recommended by Wilcox 2017); (2) Huber's M-estimator (automatically down-weights extreme values); (3) median + MAD.
R: mean(x, trim=0.1), MASS::huber(). Python: scipy.stats.trim_mean(x, 0.1).
5. How to Choose?
🌳 Decision tree for choosing descriptive statistics
6. Worked Examples
    # R: base + tidyverse
    library(tidyverse)
    library(e1071)   # skewness / kurtosis

    # Biomarker example (right-skewed)
    x <- c(110, 118, 122, 125, 130, 135, 220)

    # --- Three legs of descriptive stats ---
    summary(x)
    mean(x); median(x)
    sd(x); IQR(x)
    mad(x)                  # MAD, scaled by 1.4826 by default
    mad(x, constant = 1)    # raw MAD: median(|x - median(x)|)
    e1071::skewness(x); e1071::kurtosis(x)

    # --- Robust: trimmed mean & Huber ---
    # With n = 7, trim = 0.1 drops floor(0.7) = 0 values, so use 0.2 here.
    mean(x, trim = 0.2)     # 20% trimmed: one value off each end
    MASS::huber(x)$mu       # Huber M-estimate of location

    # --- Geometric mean for fold changes ---
    exp(mean(log(x)))

    # --- Visualise: ALWAYS first ---
    tibble(x = x) %>%
      ggplot(aes(y = x, x = "")) +
      geom_boxplot(width = 0.3, outlier.shape = NA) +
      geom_jitter(width = 0.05, alpha = 0.6) +
      labs(title = "Always show the points")
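    # Python: numpy + scipy + seaborn
    import numpy as np
    import pandas as pd
    from scipy import stats
    from statsmodels.robust.scale import Huber
    import seaborn as sns
    import matplotlib.pyplot as plt

    x = np.array([110, 118, 122, 125, 130, 135, 220], dtype=float)

    # --- Three legs of descriptive stats ---
    print(pd.Series(x).describe())
    print(np.mean(x), np.median(x))
    # median_abs_deviation returns the raw MAD; pass scale="normal"
    # to match R's default mad(), which multiplies by 1.4826.
    print(np.std(x, ddof=1), stats.iqr(x), stats.median_abs_deviation(x))
    print(stats.skew(x), stats.kurtosis(x))   # kurtosis is excess (normal = 0)

    # --- Robust: trimmed mean & Huber ---
    # With n = 7, a 0.1 cut drops int(0.7) = 0 values, so use 0.2 here.
    print(stats.trim_mean(x, 0.2))            # 20% trimmed: one value off each end
    print(Huber()(x)[0])                      # Huber M-estimate of location

    # --- Geometric mean ---
    print(stats.gmean(x))

    # --- Visualise ---
    sns.stripplot(y=x, jitter=0.1, alpha=0.7)
    plt.title("Always show the points")
    plt.show()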
Exercise: run e1071::skewness() or scipy.stats.skew() on your latest dataset. If |γ₁| > 1, swap mean ± SD for median (IQR), and state in Methods: "Due to the skewed distribution (skewness = X), data are presented as median (IQR)."
7. Six Descriptive-Statistics Errors Most Common in Papers
❌ SD vs SEM
SD describes the spread of the data; SEM = SD/√n describes the precision of the estimated mean. SEM is always smaller (for n > 1), so many papers report it to make the error bars look "tidy". Curran-Everett 2008 Adv Physiol Educ: report SD for variability, and a 95% CI (not SEM) for the precision of the estimate.
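The three quantities are easy to confuse, so here is a minimal sketch computing each from the same made-up sample. The 95% CI uses the t distribution, which assumes the sample mean is approximately normally distributed; that assumption and the data are illustrative.

```python
import numpy as np
from scipy import stats

x = np.array([110, 118, 122, 125, 130, 135, 140], dtype=float)
n = x.size

sd = np.std(x, ddof=1)                  # spread of the data
sem = sd / np.sqrt(n)                   # precision of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
ci_low = x.mean() - t_crit * sem
ci_high = x.mean() + t_crit * sem

print(f"mean = {x.mean():.1f}, SD = {sd:.1f}, SEM = {sem:.1f}")
print(f"95% CI for the mean: {ci_low:.1f} to {ci_high:.1f}")
```

With small n the critical value is well above 2, so the 95% CI is noticeably wider than mean ± SEM, which is one more reason SEM bars flatter the data.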
❌ Ambiguous n
"n = 60": sixty mice, or twenty mice × three slices? The two differ greatly in statistical independence. Lazic 2010 BMC Neurosci insists that you spell out the design, e.g. "3 mice × 4 wells × 2 technical replicates", and analyse it with a mixed model (see Step 13).
❌ SEM is not a measure of spread
With n = 1000, SEM ≈ SD/31.6, which looks like tiny variability, but it is the precision of the estimate, not the spread of the data. To show how scattered the data are, always use SD or IQR.
❌ Reporting a proportion as ± SD is wrong
For proportions and percentages, SD is not an appropriate measure of spread; use a Wilson or Clopper-Pearson 95% CI instead (see Step 6). Report "60% (95% CI 52–67%)" rather than "60% ± 8%".
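As a sketch of the Wilson interval mentioned above, the function below implements the standard Wilson score formula using only the standard library. The function name `wilson_ci` and the counts (96 successes out of 160) are assumptions chosen so that the observed proportion is 60%.

```python
import math
from statistics import NormalDist


def wilson_ci(successes, n, level=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


low, high = wilson_ci(96, 160)   # hypothetical counts: 96 / 160 = 60%
print(f"60% (95% CI {low:.1%} to {high:.1%})")
```

For these counts the interval is roughly 52% to 67%. Unlike "± SD", it is asymmetric around 60% and can never fall below 0% or above 100%.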
❌ γ₁/γ₂ are unreliable at small n
With n < 30, the standard error of skewness and kurtosis is large (SE(γ₁) ≈ √(6/n)). Cain, Zhang, Yuan 2017 Behav Res Methods recommend n ≥ 50 for trustworthy values. At small n, looking at a dot plot is more useful than computing γ₁.
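To get a feel for the approximation SE(γ₁) ≈ √(6/n), the sketch below evaluates it at a few sample sizes. The helper name is an assumption, and the formula is the large-sample approximation quoted above, not an exact small-sample result.

```python
import math


def approx_se_skewness(n):
    """Large-sample approximation to the standard error of skewness."""
    return math.sqrt(6 / n)


for n in (10, 30, 50, 200):
    se = approx_se_skewness(n)
    # A rough +/- 2 SE band shows how uncertain an estimate of 0 would be.
    print(f"n = {n:>3}: SE ~ {se:.2f}, rough 95% band ~ +/- {2 * se:.2f}")
```

At n = 10 the rough band already spans more than ±1, so a cut-off such as |γ₁| > 1 cannot be applied reliably to a sample that small.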
❌ Bar + SEM should be retired
Weissgerber 2015 PLOS Biol ("Beyond Bar and Line Graphs") is unambiguous: for continuous variables use a dot plot, boxplot, or violin plot, not a bar chart. Bars suit counts, not continuous distributions.
📝 Self-check
1. You find that blood CRP concentration has skewness = 2.8. Which summary is most appropriate?
2. A colleague says "I removed all data points > 2 SD." What is the best response?
3. What is the main lesson of Anscombe's Quartet?
4. To report the central tendency of a gene's RNA-seq fold change, which measure is most appropriate?