Step 1: Descriptive Statistics — Biostatistics Tutorial

Overview

Why descriptive statistics are the first step you cannot skip

Many researchers treat descriptive statistics as Table 1 boilerplate: fill in mean ± SD and move on. That is a serious mistake. Every inferential method (t-tests, ANOVA, regression) rests on what the descriptive step reveals. Whether you choose Welch's or Student's t, report the mean or the median, or work on the log or the raw scale all follows from the shape of the data you saw.

Worse, the same summary statistics can correspond to wildly different datasets. That is the warning at the heart of Anscombe's Quartet (1973) and Matejka & Fitzmaurice's Datasaurus Dozen (2017). Anscombe's four datasets share nearly identical means, SDs, and correlation, yet their scatter plots look nothing alike; the Datasaurus datasets agree to two decimals while tracing a dinosaur, a cross, and a star. Look at the figure. Always look at the figure first.

💡

EDA philosophy (Tukey 1977): Exploratory Data Analysis is not about "proving" something; it is about letting the data speak first. In Exploratory Data Analysis (1977), Tukey's advice was to draw the picture before running the test. Today, ggplot2 and seaborn are descendants of the same philosophy.

The three pillars

1. Location, spread, and shape

Every first description of a dataset lives on three axes: location (where is the centre?), spread (how scattered are the values?), and shape (is the distribution symmetric? heavy-tailed?). Each axis offers a "sensitive" choice and a "robust" choice; they differ in how far a few outliers can move them.

📍

Location

Mean: x̄ = Σxᵢ/n; sensitive to outliers.
Median: the 50th percentile; robust to outliers.
Trimmed mean: the mean after dropping the top and bottom 5–10% of values; a middle ground between the two.
Geometric mean: exp of the mean of the logs; suited to ratio or fold-change data (e.g. gene expression) and defined only for positive values.
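The four location measures can be compared side by side on a small sample. The sketch below is illustrative only: it reuses the seven biomarker values from the worked example later in this tutorial, and the helper names `trimmed_mean` and `geometric_mean` are hypothetical, written with the Python standard library rather than taken from any package.

```python
import math
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping int(n * proportion) values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # how many to cut per tail
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)


def geometric_mean(values):
    """exp(mean(log x)); only defined for strictly positive values."""
    logs = [math.log(v) for v in values]
    return math.exp(statistics.fmean(logs))


# Biomarker example with one high value (same numbers as the worked example)
biomarker = [110, 118, 122, 125, 130, 135, 220]

location = {
    "mean": statistics.fmean(biomarker),
    "median": statistics.median(biomarker),
    "trimmed_20": trimmed_mean(biomarker, 0.20),  # drops one value per tail
    "geometric": geometric_mean(biomarker),
}

for name, value in location.items():
    print(f"{name:>10}: {value:.1f}")
```

With only seven values, a 10% trim would remove nothing (int(7 × 0.1) = 0), which is why the sketch uses 20%. The single value of 220 drags the mean well above the median, while the trimmed mean stays close to the bulk of the data.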

↔️

Spread

SD: s = √(Σ(xᵢ−x̄)²/(n−1)); pairs with the mean.
IQR: Q₃ − Q₁; pairs with the median and spans the middle 50% of the data.
MAD: median(|xᵢ − median(x)|); the most robust of the four.
Range: max − min; driven entirely by the extremes, so do not report it on its own.
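To see how differently these four measures react to one extreme value, the sketch below computes them on two samples that differ in a single observation. The `clean` sample is a made-up comparison set (an assumption, not data from this tutorial), and quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`; other quartile conventions give slightly different IQRs.

```python
import statistics


def mad(values):
    """Raw median absolute deviation (no normal-consistency scaling)."""
    centre = statistics.median(values)
    return statistics.median([abs(v - centre) for v in values])


def iqr(values):
    """Q3 - Q1 using linear interpolation between order statistics."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def spread(values):
    return {
        "sd": statistics.stdev(values),        # (n - 1) denominator
        "iqr": iqr(values),
        "mad": mad(values),
        "range": max(values) - min(values),
    }


clean = [110, 118, 122, 125, 130, 135, 140]         # hypothetical, no outlier
contaminated = [110, 118, 122, 125, 130, 135, 220]  # last value replaced

for label, sample in (("clean", clean), ("contaminated", contaminated)):
    summary = spread(sample)
    print(label, {k: round(v, 2) for k, v in summary.items()})
```

Replacing 140 with 220 more than triples the SD and the range, while the IQR and MAD do not change at all.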

🎨

Shape

Skewness γ₁: > 0 right-skewed (tail on the right); < 0 left-skewed.
Excess kurtosis γ₂: > 0 leptokurtic (heavier tails than the normal); = 0 normal; < 0 platykurtic (lighter tails).
A histogram plus a QQ plot tells you more than either number. γ₁ = 0 guarantees neither symmetry nor a single peak (a bimodal distribution can also have γ₁ = 0).

⌜ x̄ = (1/n)Σxᵢ · s² = Σ(xᵢ−x̄)²/(n−1) · γ₁ = (1/n)Σ((xᵢ−x̄)/s)³ ⌝ Dividing by (n−1) rather than n is Bessel's correction: it makes the sample variance s² an unbiased estimator of the population variance. Fisher 1925.
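The skewness formula above translates directly into code. The sketch below is a plain-Python reading of it, with excess kurtosis defined analogously (fourth power, minus 3); that kurtosis definition and the `bimodal` sample are assumptions added for illustration, and library functions may use slightly different normalisations.

```python
import statistics


def skewness(values):
    """gamma_1 = (1/n) * sum(((x - mean) / s)^3), s with (n - 1) denominator."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return sum(((v - mean) / s) ** 3 for v in values) / n


def excess_kurtosis(values):
    """gamma_2 = (1/n) * sum(((x - mean) / s)^4) - 3 (0 for a normal)."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return sum(((v - mean) / s) ** 4 for v in values) / n - 3.0


biomarker = [110, 118, 122, 125, 130, 135, 220]  # right tail
bimodal = [1, 1, 2, 2, 8, 8, 9, 9]               # two clusters, mirror-symmetric

print(f"biomarker: skew = {skewness(biomarker):.2f}")
print(f"bimodal:   skew = {skewness(bimodal):.2f}, "
      f"excess kurtosis = {excess_kurtosis(bimodal):.2f}")
```

The bimodal sample has zero skewness even though it looks nothing like a bell curve, which is the reason to check the histogram rather than trust γ₁ alone.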

Interactive simulation ①

How skew pulls the mean away from the median

Drag the skew slider. With symmetric data (skew = 0), mean ≈ median; under right skew (skew > 0), the mean is pulled toward the tail while the median stays near the bulk of the data. Try skew = 3 or 5: that is the typical shape of blood biomarkers, income, and length of hospital stay.

[Interactive controls: skew (default 0), sample size n (default 300). Red line = mean; blue line = median.]
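The same experiment can be approximated offline. The sketch below is not the page's simulation: it assumes log-normal samples, uses the log-scale SD `sigma` as a stand-in for the skew control, and fixes the random seed so the run is repeatable.

```python
import random
import statistics


def mean_and_median(sigma, n=300, seed=1):
    """Draw n log-normal values (log-mean 0, log-SD sigma); return (mean, median)."""
    rng = random.Random(seed)
    sample = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
    return statistics.fmean(sample), statistics.median(sample)


# Larger sigma means a longer right tail
for sigma in (0.1, 0.5, 1.0, 1.5):
    mean, median = mean_and_median(sigma)
    print(f"sigma = {sigma:<4} mean = {mean:6.2f}  median = {median:5.2f}  "
          f"mean/median = {mean / median:4.2f}")
```

As `sigma` grows, the mean/median ratio climbs away from 1 because the mean follows the tail while the median stays near the bulk.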

Choosing a plot

2. Box, violin, or dot plot: which one?

One of the most common visualisation errors in the literature is a bar chart + SEM for continuous data. Weissgerber 2015 PLOS Biology ("Beyond Bar and Line Graphs") showed that the same bar + SEM can correspond to four entirely different distributions (bimodal, outlier-driven, skewed, well-behaved). The fix is to always show the individual points, which is also why SuperPlots (Lord 2020 JCB) and ggbeeswarm have become popular.

Plot type	Shows	Strength	Risk
Bar + SEM	Mean plus one error bar only	Compact	Hides n and the distribution; avoid
Boxplot	5-number summary (min, Q1, median, Q3, max) plus outliers	Robust, space-efficient	Hides bimodality and fine structure
Violin	Density with a median/IQR box at the centre	Shows bimodality and shape	KDE smoothing misleads when n < 20
Dot / strip / beeswarm	Every data point	Most transparent; no summary distortion	Heavy overplotting when n > 1000
SuperPlot	Technical replicates (small dots) overlaid with biological replicate means (large dots)	Separates biological from technical replication	Needs a hierarchical design from the start

Practical rule: n ≤ 30 → dot plot with a median bar; 30 < n ≤ 200 → box + jitter; n > 200 → violin with the median and quartiles marked. In biological experiments, always state the number of biological replicates (e.g. "n = 5 mice, 3 wells each").

Interactive simulation ②

Boxplot anatomy lab

Adjust the outlier strength and the skew, then watch the 5-number summary (min/Q1/median/Q3/max) and the Tukey fences (Q1 − 1.5·IQR, Q3 + 1.5·IQR). As the outliers grow, the median and IQR barely move. That is the strength of robust statistics.

[Interactive controls: outlier proportion (default 5%), outlier strength (default 3σ).]
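The Tukey fences are simple enough to compute by hand. The following sketch applies the Q1 − 1.5·IQR and Q3 + 1.5·IQR rule to the biomarker values used elsewhere in this tutorial; the function names are hypothetical, and the quartiles again assume the "inclusive" interpolation method.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Values falling outside the Tukey fences (flagged, not deleted)."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]


biomarker = [110, 118, 122, 125, 130, 135, 220]

low, high = tukey_fences(biomarker)
print(f"fences: [{low}, {high}]")
print(f"flagged: {flag_outliers(biomarker)}")
```

Flagging a point is a prompt to inspect it, not permission to remove it.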

In depth

3. Anscombe's Quartet and the Datasaurus

Anscombe's Quartet

Four (x, y) datasets all share:
· x̄ = 9.0, ȳ ≈ 7.5
· SD(x) ≈ 3.32, SD(y) ≈ 2.03
· r ≈ 0.816
· OLS line: y ≈ 0.5x + 3

But the scatter plots differ: (I) genuinely linear; (II) clearly curved; (III) linear with one extreme outlier; (IV) all x equal except for a single high-leverage point. The lesson: summary statistics are no substitute for the plot.
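The shared statistics are easy to verify. The sketch below hard-codes the quartet's published values and computes the means, SDs, correlation, and least-squares line from first principles; the `summarise` helper is a hypothetical name, and only the standard library is used.

```python
import math
import statistics

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x is shared by sets I-III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Means, SDs, Pearson r, and OLS slope/intercept for y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": statistics.stdev(x),
        "sd_y": statistics.stdev(y),
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (x, y) in QUARTET.items():
    s = summarise(x, y)
    print(f"{name:>3}: mean_y = {s['mean_y']:.2f}  sd_y = {s['sd_y']:.2f}  "
          f"r = {s['r']:.3f}  y = {s['slope']:.2f}x + {s['intercept']:.2f}")
```

All four rows print essentially the same numbers; only a scatter plot of each pair reveals how different the datasets are.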

Datasaurus Dozen

Matejka & Fitzmaurice (2017 CHI) used simulated annealing to turn the original Datasaurus scatter into 12 further datasets, all sharing:
· mean(x) = 54.26, mean(y) = 47.83
· SD and r matching to two decimals

Yet the scatter plots are the dinosaur itself, a cross, a star, parallel lines, a circle, and more. The R package datasauRus reproduces them. The demonstration turns "always plot first" from a slogan into visual proof.

⚠️

One question to ask the reviewer or supervisor: "Have you seen the scatter plot of the raw data?" That single sentence often reveals more problems than any statistical debate. Anscombe's "Graphs in Statistical Analysis" appeared in The American Statistician in 1973; half a century later, the warning still holds.

Practical techniques

4. Transformations and robust statistics

🧮 Log transformation

Right-skewed multiplicative data (blood concentrations, gene expression, bacterial counts) often become roughly symmetric after a log transform. When reporting, note that a mean on the log scale corresponds to the geometric mean on the original scale.

Pitfall: if the data contain zeros, use log(x + c); log1p (the c = 1 case) and asinh are common choices. Bland-Altman (2007 BMJ) showed that the choice of c affects the CIs.
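A short sketch of both points: back-transforming a log-scale mean gives the geometric mean, and log1p or asinh keep zeros usable. The `counts` values are invented for illustration, and `log_scale_summary` is a hypothetical helper, not a library function.

```python
import math
import statistics


def log_scale_summary(values):
    """Mean and SD on the log scale, back-transformed to the original scale."""
    logs = [math.log(v) for v in values]  # raises ValueError if any value <= 0
    return {
        "geometric_mean": math.exp(statistics.fmean(logs)),
        "geometric_sd": math.exp(statistics.stdev(logs)),  # multiplicative factor
    }


fold_changes = [1, 10, 100]
print(log_scale_summary(fold_changes))  # geometric mean is 10, arithmetic mean is 37

# Hypothetical bacterial counts that include a zero
counts = [0, 3, 12, 40, 150, 900]
log1p_values = [math.log1p(v) for v in counts]  # log(1 + x), defined at 0
asinh_values = [math.asinh(v) for v in counts]  # behaves like log(2x) for large x

for raw, a, b in zip(counts, log1p_values, asinh_values):
    print(f"{raw:>4}  log1p = {a:5.2f}  asinh = {b:5.2f}")
```

Both transforms map 0 to 0; asinh has no tuning constant, while log(x + c) results depend on the c you pick.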

🛡️ Robust statistics

When outliers are too numerous to ignore: (1) trimmed mean (the mean after dropping 10% from each tail; recommended by Wilcox 2017); (2) Huber's M-estimator (automatically down-weights extreme values); (3) median + MAD.

R: mean(x, trim = 0.1), MASS::huber(). Python: scipy.stats.trim_mean(x, 0.1).
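To make "automatically down-weights extreme values" concrete, here is a minimal sketch of a Huber-type location estimate. It assumes the scale is held fixed at the normal-scaled MAD and uses a tuning constant k = 1.5; library implementations may estimate location and scale jointly or use other defaults, so treat the function below as an illustration rather than a replacement for them.

```python
import statistics


def huber_location(values, k=1.5, tol=1e-6, max_iter=100):
    """Huber-type location estimate via repeated winsorising.

    Scale is fixed at 1.4826 * MAD; observations farther than k * scale
    from the current centre are pulled back to that boundary.
    """
    centre = statistics.median(values)
    scale = 1.4826 * statistics.median([abs(v - centre) for v in values])
    if scale == 0:
        return centre  # more than half the values are identical
    mu = centre
    for _ in range(max_iter):
        low, high = mu - k * scale, mu + k * scale
        clipped = [min(max(v, low), high) for v in values]
        new_mu = statistics.fmean(clipped)
        if abs(new_mu - mu) < tol * scale:
            return new_mu
        mu = new_mu
    return mu


biomarker = [110, 118, 122, 125, 130, 135, 220]

print(f"mean   = {statistics.fmean(biomarker):.2f}")
print(f"median = {statistics.median(biomarker):.2f}")
print(f"huber  = {huber_location(biomarker):.2f}")
```

The estimate lands between the median and the mean: the value 220 still counts, but only as if it sat at the clipping boundary.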

Pitfall: do not delete outliers without a reason. "I removed everything beyond 3σ" is a frequent red flag in papers. Unless there is a specific biological or technical reason (instrument failure, sample contamination), keep the outliers and use robust statistics or a log transform instead. Tabachnick & Fidell (2019) recommend: (1) verify data entry; (2) report results both with and without the outliers; (3) do not pick the favourable version after the fact.

Decision guide

5. How to choose?

🌳 Decision tree for descriptive statistics

Q1:

Are the data symmetric? (check the histogram; γ₁ ≈ 0?) → Yes → mean ± SD is fine.

Q2:

Clearly skewed? → Yes → median + IQR; or log-transform and then report mean ± SD (state that it is on the log scale).

Q3:

Suspicious outliers? → Yes → median + IQR + MAD; consider a trimmed mean or Huber's M-estimator.

Q4:

Ratio or fold-change data? → Yes → geometric mean and CV%; consider a log transform.

Q5:

n < 10? → Yes → Do not summarise; plot every point. A dot plot suffices.
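The five questions can be written as a small function. This is only one possible encoding: the tree does not say which question wins when several apply, so the order below (sample size first, then data type, outliers, skew) and the |γ₁| > 1 cut-off for "clearly skewed" are assumptions, and `choose_summary` is a hypothetical name.

```python
def choose_summary(n, skewness, has_outliers=False, is_ratio=False):
    """Suggest a descriptive summary, checking Q5, Q4, Q3, Q2, Q1 in that order."""
    if n < 10:                      # Q5: too few points to summarise
        return "plot every point (dot plot), no summary"
    if is_ratio:                    # Q4: ratio / fold-change data
        return "geometric mean + CV%, consider a log transform"
    if has_outliers:                # Q3: suspicious outliers
        return "median + IQR + MAD, consider trimmed mean or Huber"
    if abs(skewness) > 1:           # Q2: clearly skewed (assumed cut-off)
        return "median + IQR, or mean ± SD on the log scale"
    return "mean ± SD"              # Q1: roughly symmetric


examples = [
    dict(n=7, skewness=1.5),
    dict(n=120, skewness=2.8),
    dict(n=120, skewness=0.1),
    dict(n=40, skewness=0.3, is_ratio=True),
    dict(n=60, skewness=0.4, has_outliers=True),
]
for kwargs in examples:
    print(kwargs, "->", choose_summary(**kwargs))
```

A function like this is a checklist, not a substitute for looking at the histogram before deciding.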

Code

6. Worked examples

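# R: base + tidyverse
library(tidyverse)
library(e1071)        # skewness() / kurtosis()

# Example biomarker values (right-skewed: one high value)
x <- c(110, 118, 122, 125, 130, 135, 220)

# --- Three pillars of descriptive stats ---
summary(x)
mean(x); median(x)
sd(x); IQR(x)
mad(x, constant = 1)           # raw MAD; the default constant (1.4826) rescales it to estimate SD
e1071::skewness(x); e1071::kurtosis(x)   # kurtosis() returns excess kurtosis

# --- Robust: trimmed mean & Huber ---
mean(x, trim = 0.1)            # 10% trimmed; floor(7 * 0.1) = 0, so nothing is dropped here
mean(x, trim = 0.2)            # 20% trimmed: drops one value from each tail
MASS::huber(x)$mu              # Huber M-estimate of location (scale fixed at MAD)

# --- Geometric mean for fold changes ---
exp(mean(log(x)))

# --- Visualise: ALWAYS first ---
tibble(x = x) %>%
  ggplot(aes(y = x, x = "")) +
  geom_boxplot(width = 0.3, outlier.shape = NA) +      # outliers are drawn by the jitter layer
  geom_jitter(width = 0.05, height = 0, alpha = 0.6) + # height = 0: never jitter the measured values
  labs(title = "Always show the points")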

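# Python: numpy + scipy + seaborn
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.robust.scale import Huber
import seaborn as sns
import matplotlib.pyplot as plt

x = np.array([110, 118, 122, 125, 130, 135, 220], dtype=float)

# --- Three pillars of descriptive stats ---
print(pd.Series(x).describe())
print(np.mean(x), np.median(x))
print(np.std(x, ddof=1), stats.iqr(x), stats.median_abs_deviation(x))  # raw MAD (scale=1.0)
print(stats.skew(x), stats.kurtosis(x))   # moment-based; kurtosis is excess (Fisher)

# --- Robust: trimmed mean & Huber ---
print(stats.trim_mean(x, 0.1))            # 10% trimmed; int(7 * 0.1) = 0, so nothing is dropped here
print(stats.trim_mean(x, 0.2))            # 20% trimmed: drops one value from each tail
print(Huber()(x)[0])                      # Huber M location; scale is estimated jointly

# --- Geometric mean ---
print(stats.gmean(x))

# --- Visualise ---
sns.stripplot(y=x, jitter=0.1, alpha=0.7)
plt.title("Always show the points")
plt.show()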

💡

Exercise: run e1071::skewness() or scipy.stats.skew() on your latest dataset. If |γ₁| > 1, replace mean ± SD with median + IQR and say so in the Methods. For γ₁ > 1, for example: "Due to the right-skewed distribution (skewness = X), data are presented as median (IQR)."

Common pitfalls

7. Six descriptive-statistics errors most often seen in papers

❌ SD vs SEM

SD describes the spread of the data; SEM = SD/√n describes the precision of the estimated mean. For any n > 1 the SEM is smaller than the SD, so papers often report SEM to make error bars look tidy. Curran-Everett 2008 Adv Physiol Educ: report SD for variability and a 95% CI (not SEM) for the precision of the estimate.

❌ Ambiguous n

"n = 60": sixty mice, or twenty mice × three slices each? The two differ greatly in statistical independence. Lazic 2010 BMC Neurosci insists that the design be spelled out (e.g. "3 mice × 4 wells × 2 technical replicates") and analysed with a mixed model (see Step 13).

❌ SEM is not a measure of spread

With n = 1000, SEM ≈ SD/31.6. It looks tiny, but that is the precision of the estimate, not the spread of the data. To describe how scattered the data are, always use SD or IQR.
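The contrast is easy to reproduce with simulated data. The sketch below draws normal samples of increasing size (the population mean 100 and SD 15 are arbitrary assumptions) and reports SD, SEM, and a 95% CI; the CI uses a normal quantile, which is an approximation that is too narrow for small n, where a t quantile would be the usual choice.

```python
import math
import random
import statistics


def describe_precision(values, confidence=0.95):
    """SD (spread), SEM (precision), and a normal-approximation CI for the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    sem = sd / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96
    return {"n": n, "mean": mean, "sd": sd, "sem": sem,
            "ci": (mean - z * sem, mean + z * sem)}


rng = random.Random(0)
results = {}
for n in (10, 100, 1000):
    sample = [rng.gauss(100, 15) for _ in range(n)]  # hypothetical measurements
    results[n] = describe_precision(sample)
    r = results[n]
    print(f"n = {n:>4}  SD = {r['sd']:5.2f}  SEM = {r['sem']:5.2f}  "
          f"95% CI = ({r['ci'][0]:.1f}, {r['ci'][1]:.1f})")
```

The SD hovers around 15 at every sample size because it describes the data; the SEM shrinks with √n because it describes the estimate.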

❌ "± SD" on a proportion is wrong

For proportions and percentages, SD is not the right measure of uncertainty; use a Wilson or Clopper-Pearson 95% CI (see Step 6). Report "60% (95% CI 52–67%)" rather than "60% ± 8%".
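For reference, the Wilson interval needs only a few lines. The sketch below implements the standard Wilson score formula; the counts (90 successes out of 150) are an invented example chosen to give an observed proportion of 60%, not the data behind the figures quoted above.

```python
import math
import statistics


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(90, 150)  # hypothetical: 90 of 150 responders
print(f"60% (95% CI {100 * low:.1f}% to {100 * high:.1f}%)")

# Unlike "p ± something", the interval never leaves [0, 1]
print(wilson_interval(0, 20))
```

The interval is asymmetric around the observed proportion and stays inside 0 to 1 even when no events are observed.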

❌ γ₁ and γ₂ are unreliable at small n

With n < 30, the standard errors of skewness and kurtosis are large (SE(γ₁) ≈ √(6/n), about 0.45 at n = 30). Cain, Zhang, Yuan 2017 Behav Res Methods recommend n ≥ 50 for trustworthy values. With small n, a dot plot is more useful than computing γ₁.
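A quick calculation shows how slowly this standard error shrinks. The sketch below simply evaluates the √(6/n) approximation quoted above for a few sample sizes; the function name is hypothetical.

```python
import math


def skewness_se(n):
    """Approximate standard error of sample skewness: sqrt(6 / n)."""
    return math.sqrt(6 / n)


for n in (10, 24, 30, 50, 200):
    se = skewness_se(n)
    # A rough +/- 2 SE band around an observed skewness of 0
    print(f"n = {n:>3}  SE = {se:.2f}  +/- 2 SE band = +/- {2 * se:.2f}")
```

At n = 10 the rough ±2 SE band is wider than ±1.5, so an observed skewness of 1 is hard to distinguish from zero.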

❌ Retire bar + SEM

Weissgerber 2015 PLOS Biol ("Beyond Bar and Line Graphs") is unambiguous: for continuous data use dot, box, or violin plots, not bar charts. Bars suit counts, not continuous distributions.

📝 Self-check

1. You find that blood CRP concentration has skewness = 2.8. Which is most appropriate?

A. Still use mean ± SD, because that is what papers do

B. Report median + IQR and consider analysing after a log transform

C. Delete all points beyond 3σ, then use mean ± SD

D. Report only the mode

2. A colleague says, "I removed all data points beyond 2 SD." What is the best response?

A. Great, the data are clean now

B. Cut at ±1 SD instead; stricter is better

C. Without a biological or technical reason, do not delete; use robust statistics or a log transform

D. It does not matter, there are plenty of data anyway

3. What is the main lesson of Anscombe's Quartet?

A. The correlation coefficient is the most important statistic

B. The mean is better than the median

C. The SD is always sufficient

D. The same summary statistics can describe completely different data, so plot first

4. To report the central tendency of a gene's fold change in RNA-seq, which is most appropriate?

A. Arithmetic mean

B. Geometric mean, the first choice for ratio and fold-change data

C. Mode

D. SD