Step 7: Statistics — R Bioinformatics Tutorial

決策樹

一、選對檢定法

🔍 該用哪個檢定？

Q1.→比較兩組連續變數平均、資料常態？ t.test()

Q2.→兩組但非常態或樣本小？ wilcox.test() (Mann-Whitney)

Q3.→三組以上連續？ aov() 或 kruskal.test()

Q4.→類別關聯？ chisq.test() / fisher.test()

Q5.→兩連續變數？ cor.test() / lm()

Q6.→二元預測？ glm(..., family = binomial)

Q7.→計次？ glm(...,family=poisson) / DESeq2

兩組比較

二、t-test 與 Wilcoxon

set.seed(42)
ctrl <- rnorm(30, mean = 5,  sd = 1)
trt  <- rnorm(30, mean = 6.5, sd = 1)

# Welch's two-sample t-test (預設不假設等變異數，最穩健)
t.test(ctrl, trt)

# 公式介面：response ~ group / Formula interface
df <- data.frame(value = c(ctrl, trt),
                 group = rep(c("ctrl","trt"), each = 30))
t.test(value ~ group, data = df)

# 配對 t-test（同一受試者前後測）/ Paired
# t.test(before, after, paired = TRUE)

# 單尾 / One-sided
t.test(value ~ group, data = df, alternative = "less")

# 假設等變異數（傳統 Student's t）/ Equal variance
t.test(value ~ group, data = df, var.equal = TRUE)

# 取出特定欄位 / Pull out specific stats
res <- t.test(value ~ group, data = df)
res$p.value
res$estimate
res$conf.int

# 非常態 → Wilcoxon rank-sum (Mann-Whitney)
wilcox.test(value ~ group, data = df)

⚠️

常態性檢驗：用 shapiro.test() 看 p < 0.05 表示「拒絕常態」，但樣本越大越容易拒絕。實務上 n > 30 時 t-test 已對非常態頗穩健（中央極限定理）；極小樣本或極端偏態才需要 Wilcoxon。 Normality test: shapiro.test() with p < 0.05 rejects normality, but large n almost always rejects. In practice n > 30 makes t-test robust (CLT); use Wilcoxon only for tiny n or extreme skew.

三組以上

三、ANOVA 與 Kruskal-Wallis

# 單因子 ANOVA / One-way ANOVA
fit <- aov(Sepal.Length ~ Species, data = iris)
summary(fit)

# 假設檢驗：殘差常態？變異數齊一？
# Diagnostic: residual normality? equal variance?
plot(fit, which = 1:2)
shapiro.test(residuals(fit))     # 殘差常態
bartlett.test(Sepal.Length ~ Species, data = iris)  # 變異數齊一

# Post-hoc：哪些組間有差？/ Which pairs differ?
TukeyHSD(fit)
pairwise.t.test(iris$Sepal.Length, iris$Species,
                p.adjust.method = "bonferroni")

# 變異數不齊用 Welch's ANOVA
oneway.test(Sepal.Length ~ Species, data = iris, var.equal = FALSE)

# 非常態 → Kruskal-Wallis
kruskal.test(Sepal.Length ~ Species, data = iris)
# Post-hoc: pairwise.wilcox.test or dunn.test::dunn.test

# 二因子 ANOVA：包含交互作用
# fit2 <- aov(yield ~ fertilizer * variety, data = farm)
# summary(fit2)

類別資料

四、類別變數：chi-square 與 Fisher

# 列聯表 / Contingency table
#           Disease  Healthy
# Variant      30        10
# WT           50        90
tab <- matrix(c(30, 50, 10, 90), nrow = 2,
              dimnames = list(genotype = c("Variant","WT"),
                              status   = c("Disease","Healthy")))
tab

# Chi-square test (大樣本：每格期望次數 >= 5)
chisq.test(tab)

# Fisher's exact test (小樣本，期望次數 < 5)
fisher.test(tab)
fisher.test(tab)$estimate    # odds ratio
fisher.test(tab)$conf.int    # 95% CI for OR

# 從 data.frame 直接做表 / Build from data.frame
df <- data.frame(genotype = sample(c("V","WT"), 200, TRUE),
                 status   = sample(c("D","H"),  200, TRUE))
table(df$genotype, df$status) |> chisq.test()

# 趨勢檢定（順序類別）/ Trend test
# prop.trend.test()  for ordered categorical

相關與回歸

五、相關係數與線性迴歸

## ----- 相關 / Correlation -----
cor(iris$Sepal.Length, iris$Sepal.Width)                         # Pearson
cor(iris$Sepal.Length, iris$Sepal.Width, method = "spearman")    # 排名相關
cor.test(iris$Sepal.Length, iris$Petal.Length)                   # + p-value & CI

# 相關矩陣 / Correlation matrix
M <- iris[, 1:4]
cor(M)
cor(M, method = "spearman")
# 視覺化：corrplot::corrplot() 或 pheatmap()

## ----- 線性迴歸 / Linear regression -----
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
summary(fit)        # 看 Estimate / Std.Error / Pr(>|t|) / R-squared

# 公式語法 / Formula syntax
# y ~ x          單變數 / single predictor
# y ~ x1 + x2    多變數 (加性) / additive multivariate
# y ~ x1 * x2    含交互作用 / with interaction (== x1 + x2 + x1:x2)
# y ~ x1 + x2 + x1:x2     等同 x1 * x2
# y ~ . - x      全部當預測變數但排除 x

# 多變數 / Multivariate
fit2 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
summary(fit2)

# 取出資訊 / Extract pieces
coef(fit2)              # 係數
confint(fit2)           # 95% CI
fitted(fit2)            # 預測值
residuals(fit2)         # 殘差
broom::tidy(fit2)       # 整潔的 data.frame 結果

# 預測新資料 / Predict
new_dat <- data.frame(Petal.Length = c(2, 4, 6),
                      Species = factor("setosa", levels = levels(iris$Species)))
predict(fit2, new_dat, interval = "confidence")

# 模型診斷 / Diagnostics — 4 張圖
par(mfrow = c(2, 2)); plot(fit2); par(mfrow = c(1, 1))

廣義線性模型

六、廣義線性模型 (GLM)

情境	分布	連結	範例
連續，常態	gaussian	identity	正常 lm()
0/1 二元	binomial	logit	疾病 vs 健康預測
計次	poisson / quasipoisson	log	平均突變次數
過度離散計次	negative binomial (MASS::glm.nb)	log	RNA-seq counts

# 邏輯迴歸：以連續/類別變數預測二元結果
# Logistic regression
df <- data.frame(
  age      = c(35, 42, 51, 28, 60, 45, 55, 33),
  smoking  = c(1,  0,  1,  0,  1,  1,  0,  1),
  disease  = c(0,  0,  1,  0,  1,  1,  0,  0)
)
fit <- glm(disease ~ age + smoking, data = df, family = binomial)
summary(fit)

# 解讀：把 coefficient 取 exp 變 odds ratio
exp(coef(fit))           # OR per unit increase
exp(confint(fit))        # OR 95% CI

# 預測機率 / Predicted probabilities
predict(fit, type = "response")

# Poisson regression 範例 / Poisson example
# fit_p <- glm(n_mutations ~ exposure + age, data = df, family = poisson)

# Negative binomial（過度離散）/ Negative binomial (overdispersion)
# library(MASS); fit_nb <- glm.nb(n_mutations ~ exposure + age, data = df)

多重檢定

七、多重檢定校正──生資的核心

同時對 20,000 個基因做 t-test，即使全部都是 H₀ 為真，也會有 20000 × 0.05 = 1000 個「假性顯著」。多重檢定校正就是處理這個問題。

Bonferroni：把 α 除以檢定數。嚴格保證 FWER，但極度保守，生資少用。
Benjamini-Hochberg (BH/FDR)：控制「偽陽性比例 (False Discovery Rate)」。生資標準做法。
Storey's q-value：類似 BH 但更敏感（qvalue::qvalue()）。

Run a t-test on 20,000 genes simultaneously: even if every H₀ is true, you'll get 20000 × 0.05 = 1000 "significant" hits by chance. Multiple testing correction fixes this.

Bonferroni: divide α by # of tests. Guarantees FWER, but very conservative — rare in bio.
Benjamini-Hochberg (BH/FDR): controls false discovery rate. The standard in bio.
Storey's q-value: BH-like but more powerful (qvalue::qvalue()).

set.seed(7)
# 模擬 1000 個基因的 p-value，其中 50 個真實 DEG
n <- 1000
n_true <- 50
pvals <- c(runif(n - n_true), runif(n_true, 0, 0.01))  # 假 H1 的 p-value 偏小
hist(pvals, breaks = 50)

# Bonferroni
pBonf <- p.adjust(pvals, method = "bonferroni")
sum(pBonf < 0.05)

# BH (FDR) — 生資標準
pBH   <- p.adjust(pvals, method = "BH")
sum(pBH   < 0.05)

# Holm（FWER 但比 Bonferroni 強）
pHolm <- p.adjust(pvals, method = "holm")
sum(pHolm < 0.05)

# 比較：可看出 BH 在 FDR 控制下保留更多 hit
cat("Method          | Significant\n",
    "Bonferroni      |", sum(pBonf  < 0.05), "\n",
    "Holm            |", sum(pHolm  < 0.05), "\n",
    "BH (FDR)        |", sum(pBH    < 0.05), "\n", sep = "")

🚨

論文必看：p-value 報告必須同時報告校正方法。寫「FDR < 0.05」就要說明用 BH；寫「Bonferroni-corrected p < 0.05」就要說檢定總數。未校正的多重檢定 p-value 在生資領域沒有討論價值。 Paper alert: always report the correction method alongside p-values. "FDR < 0.05" implies BH; "Bonferroni-corrected p < 0.05" requires stating the test count. Uncorrected multiple-test p-values are meaningless in bioinformatics.

互動

八、Bonferroni vs FDR 視覺比較

調整真實 DEG 比例與 p-value 嚴格度，看看不同校正法剩下幾個顯著結果。

Adjust the true-positive fraction and threshold; see how each method behaves.

總基因數 5000

真實 DEG 比例 (%) 5

校正後顯著閾值 (α) 0.05

— — —

模擬：真實 H1 的 p ~ Beta(0.5,5)；H0 為 Uniform(0,1)。

Simulated: true H1 p ~ Beta(0.5,5); H0 ~ Uniform(0,1).

📝 自我檢測

1. 你對 25,000 個基因做 t-test，原始 p < 0.05 有 1,800 個。下列敘述何者正確？

1. You ran t-tests on 25,000 genes; 1,800 have raw p < 0.05. Which is correct?

A. 全部 1,800 個都是真實 DEGA. All 1,800 are real DEGs

B. 用 Bonferroni 校正後就 100% 可信B. Bonferroni makes them 100% trustworthy

C. 約 25,000 × 0.05 ≈ 1,250 個可能是純粹隨機；必須校正多重檢定C. ~1,250 may be pure chance; must correct for multiple testing

D. 不需要校正D. No correction needed

2. 想做配對 t-test（同人前後測），正確寫法？

2. Paired t-test (same subject before/after) — correct call?

A. t.test(value ~ group, data = df)A. t.test(value ~ group, data = df)

B. wilcox.test(before, after)B. wilcox.test(before, after)

C. t.test(before, after, paired = TRUE)C. t.test(before, after, paired = TRUE)

D. aov(value ~ time, data = df)D. aov(value ~ time, data = df)

3. 邏輯迴歸 glm(y ~ x, family=binomial) 的係數要如何解讀？

3. How to interpret coefficients in glm(y ~ x, family=binomial)?

A. 直接看 estimate 就是 odds ratioA. Estimate is the odds ratio directly

B. 取 exp(coef(fit)) 才是 odds ratioB. Take exp(coef(fit)) for the odds ratio

C. 取 log(coef(fit))C. Take log(coef(fit))

D. 邏輯迴歸沒有 odds ratio 概念D. Logistic regression has no OR concept