STEP 5 / 16

Data wrangling with the tidyverse

dplyr's five verbs, tidyr reshaping, and purrr functional programming: the core toolchain of modern R analysis.

1. The pipe (|>): the backbone of the tidyverse

The pipe passes the left-hand result as the first argument of the right-hand function. R 4.1+ ships the native |>; the magrittr package provides %>%, which adds extras such as the . placeholder. Prefer |> for new projects.

library(dplyr)   # provides filter() and arrange()

# Without the pipe: nested calls are hard to read
result <- head(arrange(filter(iris, Species == "setosa"), Sepal.Length), 5)

# With the pipe: reads top to bottom ("first ..., then ..., then ...")
result <- iris |>
  filter(Species == "setosa") |>
  arrange(Sepal.Length) |>
  head(5)

# magrittr's %>% also provides the . placeholder, usable in any argument position
library(magrittr)
mtcars %>% lm(mpg ~ wt, data = .)    # . stands for the upstream result

Shortcut: Ctrl/Cmd + Shift + M inserts a pipe in RStudio (default %>%; switch to |> via Tools → Global Options → Code → Use native pipe operator).
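
The magrittr example above relies on the . placeholder. As a hedged sketch, assuming R 4.2 or later, the native pipe has a similar _ placeholder, with the restriction that it must be passed to a named argument. On R 4.1, wrapping the call in an anonymous function is one workaround. The object names fit_native and fit_lambda are made up for illustration.

```r
# Native-pipe placeholder: needs R >= 4.2, and `_` must go to a named argument
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

# Workaround that also runs on R 4.1: call an anonymous function
fit_lambda <- mtcars |> (\(d) lm(mpg ~ wt, data = d))()

# Both routes fit the same model, so the coefficients agree
all.equal(coef(fit_native), coef(fit_lambda))
```

Neither form needs magrittr, which keeps scripts free of an extra dependency when the placeholder is the only feature required.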

2. The five dplyr verbs

filter(): pick rows by condition.
select(): pick and reorder columns.
mutate(): add or modify columns.
summarise(): collapse many rows into a one-row summary.
arrange(): sort rows.
group_by(): group rows (pairs with summarise()).

library(dplyr)

# Built-in iris dataset
iris |> head()

# filter: pick rows by condition
iris |>
  filter(Species == "setosa", Sepal.Length > 5) |>
  head()

# select: pick columns
iris |>
  select(Species, Sepal.Length, Sepal.Width) |>
  head()

iris |> select(-Species) |> head()              # drop a column
iris |> select(starts_with("Sepal")) |> head()  # match by prefix
iris |> select(ends_with("Width")) |> head()    # match by suffix
iris |> select(contains("Length")) |> head()    # match by substring

# mutate: add columns
iris |>
  mutate(
    Sepal.Area = Sepal.Length * Sepal.Width,
    log_petal  = log(Petal.Length)
  ) |>
  head()

# summarise + group_by: per-group summaries
iris |>
  group_by(Species) |>
  summarise(
    n        = n(),
    mean_SL  = mean(Sepal.Length),
    sd_SL    = sd(Sepal.Length),
    median_PL = median(Petal.Length)
  )

# arrange: sort rows
iris |> arrange(Sepal.Length) |> head()
iris |> arrange(desc(Sepal.Length)) |> head()
iris |> arrange(Species, desc(Sepal.Length)) |> head()
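
The verbs above are shown one at a time; in practice they are usually chained. The following is a minimal sketch of such a chain on iris. The derived column Petal.Ratio and the cutoff 1.5 are arbitrary choices for illustration, and the last call assumes dplyr 1.1.0 or later, where the .by argument is available.

```r
library(dplyr)

# Chain the verbs: filter rows, derive a column, then summarise per group
petal_summary <- iris |>
  filter(Petal.Length > 1.5) |>                       # keep rows by condition
  mutate(Petal.Ratio = Petal.Length / Petal.Width) |> # add a derived column
  group_by(Species) |>
  summarise(
    n          = n(),
    mean_ratio = mean(Petal.Ratio)
  ) |>
  arrange(desc(mean_ratio))                           # largest ratio first
petal_summary

# Assumes dplyr >= 1.1.0: .by groups for this one call only
by_species <- iris |>
  summarise(mean_SL = mean(Sepal.Length), .by = Species)
by_species
```

With .by the result comes back ungrouped, so there is no need for a later ungroup().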

3. More dplyr power tools

# case_when: a multi-branch ifelse
dge <- tibble(gene = c("A","B","C","D"),
              padj = c(0.001, 0.04, 0.20, 0.7),
              log2FC = c(2.5, -1.2, 0.3, 3.8))

dge |>
  mutate(direction = case_when(
    padj < 0.05 & log2FC >  1 ~ "Up",
    padj < 0.05 & log2FC < -1 ~ "Down",
    .default = "NS"    # dplyr >= 1.1.0; on older versions use TRUE ~ "NS"
  ))

# across(): apply the same function to multiple columns
iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), mean))

iris |>
  mutate(across(starts_with("Sepal"), ~ .x / 10, .names = "{.col}_cm"))

# Apply several functions at once
iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))

# Joins: combine two tables
samples <- tibble(id = c("S1","S2","S3"), tissue = c("liver","liver","kidney"))
exprs   <- tibble(id = c("S1","S2","S4"), value = c(5.1,7.3,2.8))

samples |> left_join(exprs, by = "id")        # keep every row of the left table
samples |> right_join(exprs, by = "id")       # keep every row of the right table
samples |> inner_join(exprs, by = "id")       # keep only ids present in both
samples |> full_join(exprs, by = "id")        # keep ids present in either
samples |> anti_join(exprs, by = "id")        # left rows with no match in right
samples |> semi_join(exprs, by = "id")        # left rows with a match in right; left columns only

# Multi-key join (df1 and df2 stand for any two tables sharing these columns)
df1 |> left_join(df2, by = c("sample_id", "timepoint"))

# Key columns with different names
df1 |> left_join(df2, by = c("sample_id" = "sid"))

# distinct: unique rows
iris |> distinct(Species)
iris |> distinct(Species, .keep_all = TRUE)   # keep all columns

# count: quick tallies
iris |> count(Species)
iris |> count(Species, sort = TRUE)    # most frequent first

# slice family
iris |> slice_max(Sepal.Length, n = 3)        # top 3 (ties are kept, so more rows may return)
iris |> slice_min(Sepal.Length, n = 3)        # bottom 3 (use with_ties = FALSE for exactly 3)
iris |> slice_sample(n = 5)                  # 5 random rows
iris |> slice_head(n = 3)                    # first 3 rows
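
The multi-key and renamed-key joins above refer to df1 and df2 without defining them. Here is a runnable sketch under the assumption of two small hypothetical tables; every value, and the names df2_sid, multi and renamed, are invented for illustration.

```r
library(dplyr)

# Hypothetical tables, invented for illustration only
df1 <- tibble(
  sample_id = c("S1", "S1", "S2"),
  timepoint = c("0h", "24h", "0h"),
  value     = c(5.1, 6.4, 7.3)
)
df2 <- tibble(
  sample_id = c("S1", "S1", "S2"),
  timepoint = c("0h", "24h", "24h"),
  batch     = c("A", "A", "B")
)

# Multi-key join: rows match only when both keys agree
multi <- df1 |> left_join(df2, by = c("sample_id", "timepoint"))
multi   # S2 at 0h has no partner in df2, so its batch is NA

# Differently named key: rename it in a copy of df2 to mimic that situation
df2_sid <- df2 |> rename(sid = sample_id)
renamed <- df1 |>
  left_join(df2_sid, by = c("sample_id" = "sid", "timepoint"))
renamed # same rows and columns as `multi`
```

The joined result keeps the left table's key names, so sid does not appear in the output.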

4. tidyr: wide ↔ long reshaping

Three rules of tidy data:

  1. Each variable is a column
  2. Each observation is a row
  3. Each value is a cell

Raw bioinformatics data is often "wide" (gene × sample); plotting with ggplot usually needs the "long" form.

library(tidyr)

# Wide table: gene × sample
wide <- tibble(
  gene = c("TP53", "BRCA1", "EGFR"),
  S1   = c(5.2, 3.4, 8.1),
  S2   = c(6.1, 3.0, 7.8),
  S3   = c(4.9, 4.1, 9.2)
)
wide

# pivot_longer: wide → long
long <- wide |>
  pivot_longer(
    cols = -gene,             # which columns to stack
    names_to = "sample",
    values_to = "expression"
  )
long

# pivot_wider: long → wide
long |>
  pivot_wider(names_from = sample, values_from = expression)

# Split / unite columns
df <- tibble(label = c("S1_ctrl", "S2_ctrl", "S3_trt"))
df |> separate(label, into = c("sample", "condition"), sep = "_")

df2 <- tibble(sample = c("S1","S2"), condition = c("ctrl","trt"))
df2 |> unite("label", sample, condition, sep = "_")

# Handle NAs
df3 <- tibble(gene = c("A","B","C"), expr = c(5, NA, 8))
df3 |> drop_na()                  # drop rows containing NA
df3 |> replace_na(list(expr = 0)) # replace NA with 0
df3 |> fill(expr)                 # carry the last non-NA value downward

5. Wide ↔ long conversion demo
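
A minimal sketch of a wide ↔ long round trip, assuming a hypothetical table (wide2) whose column names encode both the sample and the condition. It uses the names_sep argument of pivot_longer() and pivot_wider() to split and rebuild those names; the table contents are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: column names encode sample and condition
wide2 <- tibble(
  gene    = c("TP53", "BRCA1"),
  S1_ctrl = c(5.2, 3.4),
  S2_trt  = c(6.1, 3.0)
)

# Wide -> long, splitting each column name at "_" into two variables
long2 <- wide2 |>
  pivot_longer(
    cols      = -gene,
    names_to  = c("sample", "condition"),
    names_sep = "_",
    values_to = "expression"
  )
long2

# Long -> wide, gluing the two variables back into one column name
back <- long2 |>
  pivot_wider(
    names_from  = c(sample, condition),
    names_sep   = "_",
    values_from = expression
  )

# Compare the round-trip result with the original table
all.equal(wide2, back)
```

The long form is the one ggplot typically expects, since sample and condition become ordinary columns that can be mapped to aesthetics.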

6. purrr: functional programming

purrr provides a consistent, predictable interface for applying functions over lists and vectors, a modern alternative to base R's *apply family.

library(purrr)

# map_* family: the suffix names the output type
map(1:5, sqrt)              # returns a list
map_dbl(1:5, sqrt)          # returns a numeric vector; each result must be a length-1 double
map_chr(1:5, ~ paste0("S", .x))   # ~ is anonymous-function shorthand; .x is the current element
map_lgl(1:5, ~ .x > 3)
map_int(1:5, ~ as.integer(.x * 2))

# Compute over each column of a data.frame
mtcars |> map_dbl(mean)

# Two inputs: map2
samples <- c("S1","S2","S3")
files   <- c("a.csv","b.csv","c.csv")
map2_chr(samples, files, ~ paste(.x, "->", .y))

# Multiple inputs: pmap
pmap_chr(list(a = 1:3, b = 4:6, c = 7:9),
         function(a, b, c) sprintf("%d + %d + %d = %d", a, b, c, a + b + c))

# Run without stopping on errors: safely
safe_log <- safely(log)
result <- map(list(1, 2, "abc"), safe_log)
# Every element has $result and $error, so the run is never interrupted
result[[3]]$error

map mnemonic: "the type you want out is the suffix." map_dbl → numeric, map_chr → character, map_lgl → logical, map_dfr/map_dfc → data.frame (row-bind/column-bind; superseded since purrr 1.0.0 by map() followed by list_rbind()/list_cbind()).
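
The safely() example above stops at inspecting one error. As a hedged sketch of what usually comes next, the following separates successes from failures, shows possibly() as a lighter alternative, and row-binds per-element data frames with list_rbind(), which assumes purrr 1.0.0 or later. The inputs and the names ok, values, with_na and tbl are made up for illustration.

```r
library(purrr)

inputs <- list(1, 10, "abc")

# safely() wraps every call in a list holding $result and $error
out <- map(inputs, safely(log))

# Keep the calls that succeeded, then pull out their results
ok     <- map_lgl(out, \(x) is.null(x$error))
values <- map_dbl(out[ok], "result")

# possibly() substitutes a default value instead of storing the error
with_na <- map_dbl(inputs, possibly(log, otherwise = NA_real_))

# Assumes purrr >= 1.0.0: one data frame per element, then row-bind them
tbl <- map(c("S1", "S2", "S3"),
           \(s) data.frame(sample = s, n_char = nchar(s))) |>
  list_rbind()
tbl
```

safely() suits cases where the error messages matter; possibly() is enough when a placeholder value will do.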

7. A typical bioinformatics workflow

library(dplyr); library(tidyr)

# Simulated DESeq2 result table
dge <- tibble(
  gene_id  = paste0("ENSG", sprintf("%011d", 1:8)),
  symbol   = c("TP53","BRCA1","EGFR","MYC","KRAS","CDKN2A","ALK","XIST"),
  baseMean = c(120, 80, 500, 250, 90, 30, 60, 1500),
  log2FoldChange = c(2.1, -1.8, 3.2, -0.4, 1.0, -2.5, 0.8, -3.1),
  pvalue   = c(1e-12, 3e-8, 1e-15, 0.04, 0.001, 5e-9, 0.06, 1e-20),
  padj     = c(2e-11, 4e-7, 3e-14, 0.12, 0.005, 6e-8, 0.18, 5e-19)
)

# Significant genes: FDR < 0.05 and |log2FC| > 1
sig <- dge |>
  filter(padj < 0.05, abs(log2FoldChange) > 1) |>
  mutate(direction = if_else(log2FoldChange > 0, "Up", "Down"),
         neg_log10_p = -log10(padj)) |>
  arrange(padj)
sig

# Top 5 in each direction (up and down)
top10 <- sig |>
  group_by(direction) |>
  slice_max(abs(log2FoldChange), n = 5) |>
  ungroup()
top10

# Count genes per direction
sig |> count(direction)

# Reshape gene expression from long to wide
expr_long <- tibble(
  sample = rep(c("S1","S2","S3"), each = 3),
  gene   = rep(c("TP53","BRCA1","EGFR"), times = 3),
  value  = c(5.2,3.4,8.1, 6.1,3.0,7.8, 4.9,4.1,9.2)
)
expr_long |>
  pivot_wider(names_from = sample, values_from = value)
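
The workflow above filters down to significant genes only. Plots such as a volcano plot usually need every gene labelled, including the non-significant ones, so here is a sketch that labels the full table with case_when() instead of filtering. dge_small is a reduced copy of the simulated table (same values, fewer columns), and the names labelled and neg_log10_padj are assumptions made for illustration.

```r
library(dplyr)

# Reduced copy of the simulated DESeq2 table above (same values)
dge_small <- tibble(
  symbol         = c("TP53", "BRCA1", "EGFR", "MYC", "KRAS", "CDKN2A", "ALK", "XIST"),
  log2FoldChange = c(2.1, -1.8, 3.2, -0.4, 1.0, -2.5, 0.8, -3.1),
  padj           = c(2e-11, 4e-7, 3e-14, 0.12, 0.005, 6e-8, 0.18, 5e-19)
)

# Label every gene instead of dropping the non-significant ones
labelled <- dge_small |>
  mutate(
    direction = case_when(
      padj < 0.05 & log2FoldChange >  1 ~ "Up",
      padj < 0.05 & log2FoldChange < -1 ~ "Down",
      TRUE                              ~ "NS"   # fallback that works on any dplyr version
    ),
    neg_log10_padj = -log10(padj)
  )

dir_counts <- labelled |> count(direction)
dir_counts
```

KRAS lands in "NS" despite a small padj because its log2FoldChange of exactly 1.0 does not exceed the strict > 1 cutoff.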

Self-check

1. What is the relation between iris |> filter(Sepal.Length > 5) |> nrow() and nrow(filter(iris, Sepal.Length > 5))?

A. Different results
B. The first one errors
C. Equivalent: the pipe passes the left side as the first argument of the right-hand function
D. The second is faster

2. Which call correctly computes the mean Sepal.Length per Species?

A. iris |> mean(Sepal.Length)
B. iris |> summarise(mean(Sepal.Length)) (no grouping)
C. iris |> group_by(Species) |> summarise(m = mean(Sepal.Length))
D. tapply(iris, Species, mean)

3. Which function pivots a wide gene × sample table into long form?

A. pivot_wider()
B. pivot_longer()
C. melt() (from reshape2, which is no longer maintained)
D. spread()