STEP 7 / 16

生物統計

假說檢定、迴歸、多重檢定校正——生資分析的統計骨幹。

Hypothesis tests, regression, multiple testing — the statistical backbone of bioinformatics.

一、選對檢定法

🔍 該用哪個檢定?

Q1.比較兩組連續變數平均、資料常態? t.test()
Q2.兩組但非常態或樣本小? wilcox.test() (Mann-Whitney)
Q3.三組以上連續? aov()kruskal.test()
Q4.類別關聯? chisq.test() / fisher.test()
Q5.兩連續變數? cor.test() / lm()
Q6.二元預測? glm(..., family = binomial)
Q7.計次? glm(...,family=poisson) / DESeq2

二、t-test 與 Wilcoxon

set.seed(42)
ctrl <- rnorm(30, mean = 5,  sd = 1)
trt  <- rnorm(30, mean = 6.5, sd = 1)

# Welch's two-sample t-test (預設不假設等變異數,最穩健)
t.test(ctrl, trt)

# 公式介面:response ~ group / Formula interface
df <- data.frame(value = c(ctrl, trt),
                 group = rep(c("ctrl","trt"), each = 30))
t.test(value ~ group, data = df)

# 配對 t-test(同一受試者前後測)/ Paired
# t.test(before, after, paired = TRUE)

# 單尾 / One-sided
t.test(value ~ group, data = df, alternative = "less")

# 假設等變異數(傳統 Student's t)/ Equal variance
t.test(value ~ group, data = df, var.equal = TRUE)

# 取出特定欄位 / Pull out specific stats
res <- t.test(value ~ group, data = df)
res$p.value
res$estimate
res$conf.int

# 非常態 → Wilcoxon rank-sum (Mann-Whitney)
wilcox.test(value ~ group, data = df)
⚠️
常態性檢驗:shapiro.test() 看 p < 0.05 表示「拒絕常態」,但樣本越大越容易拒絕。實務上 n > 30 時 t-test 已對非常態頗穩健(中央極限定理);極小樣本或極端偏態才需要 Wilcoxon。 Normality test: shapiro.test() with p < 0.05 rejects normality, but large n almost always rejects. In practice n > 30 makes t-test robust (CLT); use Wilcoxon only for tiny n or extreme skew.

三、ANOVA 與 Kruskal-Wallis

# 單因子 ANOVA / One-way ANOVA
fit <- aov(Sepal.Length ~ Species, data = iris)
summary(fit)

# 假設檢驗:殘差常態?變異數齊一?
# Diagnostic: residual normality? equal variance?
plot(fit, which = 1:2)
shapiro.test(residuals(fit))     # 殘差常態
bartlett.test(Sepal.Length ~ Species, data = iris)  # 變異數齊一

# Post-hoc:哪些組間有差?/ Which pairs differ?
TukeyHSD(fit)
pairwise.t.test(iris$Sepal.Length, iris$Species,
                p.adjust.method = "bonferroni")

# 變異數不齊用 Welch's ANOVA
oneway.test(Sepal.Length ~ Species, data = iris, var.equal = FALSE)

# 非常態 → Kruskal-Wallis
kruskal.test(Sepal.Length ~ Species, data = iris)
# Post-hoc: pairwise.wilcox.test or dunn.test::dunn.test

# 二因子 ANOVA:包含交互作用
# fit2 <- aov(yield ~ fertilizer * variety, data = farm)
# summary(fit2)

四、類別變數:chi-square 與 Fisher

# 列聯表 / Contingency table
#           Disease  Healthy
# Variant      30        10
# WT           50        90
tab <- matrix(c(30, 50, 10, 90), nrow = 2,
              dimnames = list(genotype = c("Variant","WT"),
                              status   = c("Disease","Healthy")))
tab

# Chi-square test (大樣本:每格期望次數 >= 5)
chisq.test(tab)

# Fisher's exact test (小樣本,期望次數 < 5)
fisher.test(tab)
fisher.test(tab)$estimate    # odds ratio
fisher.test(tab)$conf.int    # 95% CI for OR

# 從 data.frame 直接做表 / Build from data.frame
df <- data.frame(genotype = sample(c("V","WT"), 200, TRUE),
                 status   = sample(c("D","H"),  200, TRUE))
table(df$genotype, df$status) |> chisq.test()

# 趨勢檢定(順序類別)/ Trend test
# prop.trend.test()  for ordered categorical

五、相關係數與線性迴歸

## ----- 相關 / Correlation -----
cor(iris$Sepal.Length, iris$Sepal.Width)                         # Pearson
cor(iris$Sepal.Length, iris$Sepal.Width, method = "spearman")    # 排名相關
cor.test(iris$Sepal.Length, iris$Petal.Length)                   # + p-value & CI

# 相關矩陣 / Correlation matrix
M <- iris[, 1:4]
cor(M)
cor(M, method = "spearman")
# 視覺化:corrplot::corrplot() 或 pheatmap()

## ----- 線性迴歸 / Linear regression -----
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
summary(fit)        # 看 Estimate / Std.Error / Pr(>|t|) / R-squared

# 公式語法 / Formula syntax
# y ~ x          單變數 / single predictor
# y ~ x1 + x2    多變數 (加性) / additive multivariate
# y ~ x1 * x2    含交互作用 / with interaction (== x1 + x2 + x1:x2)
# y ~ x1 + x2 + x1:x2     等同 x1 * x2
# y ~ . - x      全部當預測變數但排除 x

# 多變數 / Multivariate
fit2 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
summary(fit2)

# 取出資訊 / Extract pieces
coef(fit2)              # 係數
confint(fit2)           # 95% CI
fitted(fit2)            # 預測值
residuals(fit2)         # 殘差
broom::tidy(fit2)       # 整潔的 data.frame 結果

# 預測新資料 / Predict
new_dat <- data.frame(Petal.Length = c(2, 4, 6),
                      Species = factor("setosa", levels = levels(iris$Species)))
predict(fit2, new_dat, interval = "confidence")

# 模型診斷 / Diagnostics — 4 張圖
par(mfrow = c(2, 2)); plot(fit2); par(mfrow = c(1, 1))

六、廣義線性模型 (GLM)

情境分布連結範例
連續,常態gaussianidentity正常 lm()
0/1 二元binomiallogit疾病 vs 健康預測
計次poisson / quasipoissonlog平均突變次數
過度離散計次negative binomial (MASS::glm.nb)logRNA-seq counts
# 邏輯迴歸:以連續/類別變數預測二元結果
# Logistic regression
df <- data.frame(
  age      = c(35, 42, 51, 28, 60, 45, 55, 33),
  smoking  = c(1,  0,  1,  0,  1,  1,  0,  1),
  disease  = c(0,  0,  1,  0,  1,  1,  0,  0)
)
fit <- glm(disease ~ age + smoking, data = df, family = binomial)
summary(fit)

# 解讀:把 coefficient 取 exp 變 odds ratio
exp(coef(fit))           # OR per unit increase
exp(confint(fit))        # OR 95% CI

# 預測機率 / Predicted probabilities
predict(fit, type = "response")

# Poisson regression 範例 / Poisson example
# fit_p <- glm(n_mutations ~ exposure + age, data = df, family = poisson)

# Negative binomial(過度離散)/ Negative binomial (overdispersion)
# library(MASS); fit_nb <- glm.nb(n_mutations ~ exposure + age, data = df)

七、多重檢定校正──生資的核心

同時對 20,000 個基因做 t-test,即使全部都是 H₀ 為真,也會有 20000 × 0.05 = 1000 個「假性顯著」。多重檢定校正就是處理這個問題。

  • Bonferroni:把 α 除以檢定數。嚴格保證 FWER,但極度保守,生資少用。
  • Benjamini-Hochberg (BH/FDR):控制「偽陽性比例 (False Discovery Rate)」。生資標準做法
  • Storey's q-value:類似 BH 但更敏感(qvalue::qvalue())。

Run a t-test on 20,000 genes simultaneously: even if every H₀ is true, you'll get 20000 × 0.05 = 1000 "significant" hits by chance. Multiple testing correction fixes this.

  • Bonferroni: divide α by # of tests. Guarantees FWER, but very conservative — rare in bio.
  • Benjamini-Hochberg (BH/FDR): controls false discovery rate. The standard in bio.
  • Storey's q-value: BH-like but more powerful (qvalue::qvalue()).
set.seed(7)
# 模擬 1000 個基因的 p-value,其中 50 個真實 DEG
n <- 1000
n_true <- 50
pvals <- c(runif(n - n_true), runif(n_true, 0, 0.01))  # 假 H1 的 p-value 偏小
hist(pvals, breaks = 50)

# Bonferroni
pBonf <- p.adjust(pvals, method = "bonferroni")
sum(pBonf < 0.05)

# BH (FDR) — 生資標準
pBH   <- p.adjust(pvals, method = "BH")
sum(pBH   < 0.05)

# Holm(FWER 但比 Bonferroni 強)
pHolm <- p.adjust(pvals, method = "holm")
sum(pHolm < 0.05)

# 比較:可看出 BH 在 FDR 控制下保留更多 hit
cat("Method          | Significant\n",
    "Bonferroni      |", sum(pBonf  < 0.05), "\n",
    "Holm            |", sum(pHolm  < 0.05), "\n",
    "BH (FDR)        |", sum(pBH    < 0.05), "\n", sep = "")
🚨
論文必看:p-value 報告必須同時報告校正方法。寫「FDR < 0.05」就要說明用 BH;寫「Bonferroni-corrected p < 0.05」就要說檢定總數。未校正的多重檢定 p-value 在生資領域沒有討論價值。 Paper alert: always report the correction method alongside p-values. "FDR < 0.05" implies BH; "Bonferroni-corrected p < 0.05" requires stating the test count. Uncorrected multiple-test p-values are meaningless in bioinformatics.

八、Bonferroni vs FDR 視覺比較

調整真實 DEG 比例與 p-value 嚴格度,看看不同校正法剩下幾個顯著結果。

Adjust the true-positive fraction and threshold; see how each method behaves.

模擬:真實 H1 的 p ~ Beta(0.5,5);H0 為 Uniform(0,1)。

Simulated: true H1 p ~ Beta(0.5,5); H0 ~ Uniform(0,1).

📝 自我檢測

1. 你對 25,000 個基因做 t-test,原始 p < 0.05 有 1,800 個。下列敘述何者正確?

1. You ran t-tests on 25,000 genes; 1,800 have raw p < 0.05. Which is correct?

A. 全部 1,800 個都是真實 DEGA. All 1,800 are real DEGs
B. 用 Bonferroni 校正後就 100% 可信B. Bonferroni makes them 100% trustworthy
C. 約 25,000 × 0.05 ≈ 1,250 個可能是純粹隨機;必須校正多重檢定C. ~1,250 may be pure chance; must correct for multiple testing
D. 不需要校正D. No correction needed

2. 想做配對 t-test(同人前後測),正確寫法?

2. Paired t-test (same subject before/after) — correct call?

A. t.test(value ~ group, data = df)A. t.test(value ~ group, data = df)
B. wilcox.test(before, after)B. wilcox.test(before, after)
C. t.test(before, after, paired = TRUE)C. t.test(before, after, paired = TRUE)
D. aov(value ~ time, data = df)D. aov(value ~ time, data = df)

3. 邏輯迴歸 glm(y ~ x, family=binomial) 的係數要如何解讀?

3. How to interpret coefficients in glm(y ~ x, family=binomial)?

A. 直接看 estimate 就是 odds ratioA. Estimate is the odds ratio directly
B. 取 exp(coef(fit)) 才是 odds ratioB. Take exp(coef(fit)) for the odds ratio
C. 取 log(coef(fit))C. Take log(coef(fit))
D. 邏輯迴歸沒有 odds ratio 概念D. Logistic regression has no OR concept