1. The pipe (|>): the backbone of the tidyverse
The pipe passes the result on its left as the first argument of the function on its right. R 4.1+ ships the native |>; the magrittr package provides %>%, which adds extras such as the . placeholder. Prefer the native |> in new projects.
library(dplyr)

# Without the pipe: nested calls read inside-out
result <- head(arrange(filter(iris, Species == "setosa"), Sepal.Length), 5)

# With the pipe: reads top to bottom ("first ..., then ..., then ...")
result <- iris |>
  filter(Species == "setosa") |>
  arrange(Sepal.Length) |>
  head(5)

# magrittr's %>% also provides the . placeholder, usable in any argument position
library(magrittr)
mtcars %>% lm(mpg ~ wt, data = .)   # . stands for the upstream result
Shortcut: in RStudio, Ctrl/Cmd + Shift + M inserts a pipe (%>% by default; switch to |> via Tools → Global Options → Code → Use native pipe operator).
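Since R 4.2 the native pipe also accepts a _ placeholder, which has to be passed to a named argument. A minimal sketch, assuming R 4.2 or later, that fits the same model as the magrittr example using |> (the names fit_native and fit_plain are made up for illustration):

```r
# Assumes R >= 4.2; earlier versions reject the `_` placeholder.
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

# The same model written as an ordinary call, for comparison.
fit_plain <- lm(mpg ~ wt, data = mtcars)

# Both routes should give the same coefficients.
all.equal(coef(fit_native), coef(fit_plain))
```

This covers the common "data is not the first argument" case without loading magrittr.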
2. The five core dplyr verbs (plus group_by)
filter()
Pick rows by condition.
select()
Pick or reorder columns.
mutate()
Add or modify columns.
summarise()
Collapse many rows into one summary row (one per group when grouped).
arrange()
Sort rows.
group_by()
Group rows (usually paired with summarise()).
library(dplyr)
# Built-in iris dataset
iris |> head()

# filter: pick rows by condition
iris |>
  filter(Species == "setosa", Sepal.Length > 5) |>
  head()

# select: pick columns
iris |>
  select(Species, Sepal.Length, Sepal.Width) |>
  head()
iris |> select(-Species) |> head()               # drop a column
iris |> select(starts_with("Sepal")) |> head()   # by prefix
iris |> select(ends_with("Width")) |> head()     # by suffix
iris |> select(contains("Length")) |> head()     # by substring

# mutate: add columns
iris |>
  mutate(
    Sepal.Area = Sepal.Length * Sepal.Width,
    log_petal  = log(Petal.Length)
  ) |>
  head()

# summarise + group_by: per-group summaries
iris |>
  group_by(Species) |>
  summarise(
    n         = n(),
    mean_SL   = mean(Sepal.Length),
    sd_SL     = sd(Sepal.Length),
    median_PL = median(Petal.Length)
  )

# arrange: sort rows
iris |> arrange(Sepal.Length) |> head()                  # ascending
iris |> arrange(desc(Sepal.Length)) |> head()            # descending
iris |> arrange(Species, desc(Sepal.Length)) |> head()   # by Species, then descending length
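In practice the verbs are chained into one pipeline. A minimal sketch on iris; the column names Petal.Ratio and mean_ratio are made up for illustration:

```r
library(dplyr)

petal_summary <- iris |>
  filter(Sepal.Length > 5) |>                            # keep rows
  select(Species, Petal.Length, Petal.Width) |>          # keep columns
  mutate(Petal.Ratio = Petal.Length / Petal.Width) |>    # derive a column
  group_by(Species) |>
  summarise(n = n(), mean_ratio = mean(Petal.Ratio)) |>  # one row per species
  arrange(desc(mean_ratio))                              # largest ratio first

petal_summary
```

Each step receives the data frame produced by the previous one, so the pipeline reads in the order the operations happen.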
3. Advanced dplyr tools
# case_when: a multi-branch if_else
dge <- tibble(
  gene   = c("A", "B", "C", "D"),
  padj   = c(0.001, 0.04, 0.20, 0.7),
  log2FC = c(2.5, -1.2, 0.3, 3.8)
)
dge |>
  mutate(direction = case_when(
    padj < 0.05 & log2FC > 1  ~ "Up",
    padj < 0.05 & log2FC < -1 ~ "Down",
    .default = "NS"   # an argument, not a formula (dplyr >= 1.1.0); older versions use TRUE ~ "NS"
  ))
# across(): apply the same function to several columns
iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), mean))

# iris measurements are in cm, so dividing by 10 gives decimetres
iris |>
  mutate(across(starts_with("Sepal"), ~ .x / 10, .names = "{.col}_dm"))

# Several functions at once (output columns are named {.col}_{.fn})
iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
# Joins: combining two tables
samples <- tibble(id = c("S1", "S2", "S3"), tissue = c("liver", "liver", "kidney"))
exprs   <- tibble(id = c("S1", "S2", "S4"), value = c(5.1, 7.3, 2.8))

samples |> left_join(exprs, by = "id")    # keep every row of the left table
samples |> right_join(exprs, by = "id")   # keep every row of the right table
samples |> inner_join(exprs, by = "id")   # keys present in both
samples |> full_join(exprs, by = "id")    # union of both
samples |> anti_join(exprs, by = "id")    # in left, not in right
samples |> semi_join(exprs, by = "id")    # in left and in right, but only left columns are returned

# df1 and df2 below are placeholders for your own tables
# Multi-key join
df1 |> left_join(df2, by = c("sample_id", "timepoint"))
# Key columns with different names
df1 |> left_join(df2, by = c("sample_id" = "sid"))
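The join types differ mainly in which keys survive. A minimal sketch that counts rows for each join on the same two tables (join_rows is a name made up for illustration):

```r
library(dplyr)

samples <- tibble(id = c("S1", "S2", "S3"),
                  tissue = c("liver", "liver", "kidney"))
exprs   <- tibble(id = c("S1", "S2", "S4"),
                  value = c(5.1, 7.3, 2.8))

# S1 and S2 match, S3 exists only on the left, S4 only on the right.
join_rows <- c(
  left  = nrow(left_join(samples, exprs, by = "id")),
  inner = nrow(inner_join(samples, exprs, by = "id")),
  full  = nrow(full_join(samples, exprs, by = "id")),
  anti  = nrow(anti_join(samples, exprs, by = "id"))
)
join_rows

# left_join keeps unmatched keys and fills the right-hand columns with NA.
unmatched <- left_join(samples, exprs, by = "id") |> filter(is.na(value))
unmatched
```

Checking row counts before and after a join is a cheap way to catch unexpected key mismatches.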
# distinct: unique rows
iris |> distinct(Species)
iris |> distinct(Species, .keep_all = TRUE)   # keep all columns

# count: quick tallies
iris |> count(Species)
iris |> count(Species, sort = TRUE)           # most frequent first

# slice family
iris |> slice_max(Sepal.Length, n = 3)        # top 3
iris |> slice_min(Sepal.Length, n = 3)        # bottom 3
iris |> slice_sample(n = 5)                   # 5 random rows
iris |> slice_head(n = 3)                     # first 3 rows
4. tidyr: wide ↔ long reshaping
The three rules of tidy data:
- Each variable is a column
- Each observation is a row
- Each value is a cell
Raw bioinformatics data often arrives as a wide table (gene × sample); plotting with ggplot usually needs the long form.
library(tidyr)
# Wide table: gene × sample
wide <- tibble(
  gene = c("TP53", "BRCA1", "EGFR"),
  S1 = c(5.2, 3.4, 8.1),
  S2 = c(6.1, 3.0, 7.8),
  S3 = c(4.9, 4.1, 9.2)
)
wide

# pivot_longer: wide → long
long <- wide |>
  pivot_longer(
    cols = -gene,            # which columns to stack
    names_to = "sample",
    values_to = "expression"
  )
long

# pivot_wider: long → wide
long |>
  pivot_wider(names_from = sample, values_from = expression)

# Split / unite columns
# (separate() still works; tidyr >= 1.3.0 marks it superseded by separate_wider_delim())
df <- tibble(label = c("S1_ctrl", "S2_ctrl", "S3_trt"))
df |> separate(label, into = c("sample", "condition"), sep = "_")
df2 <- tibble(sample = c("S1", "S2"), condition = c("ctrl", "trt"))
df2 |> unite("label", sample, condition, sep = "_")

# Handle NAs
df3 <- tibble(gene = c("A", "B", "C"), expr = c(5, NA, 8))
df3 |> drop_na()                     # drop rows containing NA
df3 |> replace_na(list(expr = 0))    # replace NA with 0
df3 |> fill(expr)                    # carry the last non-NA value downward
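Pivoting and splitting can be combined in one step when the column names themselves encode several variables. A minimal sketch, assuming a hypothetical table wide_cond whose columns are named sample_condition:

```r
library(tidyr)

# Hypothetical wide table: each column name encodes sample and condition.
wide_cond <- tibble(
  gene    = c("TP53", "BRCA1"),
  S1_ctrl = c(5.2, 3.4),
  S2_trt  = c(6.1, 3.0)
)

long_cond <- wide_cond |>
  pivot_longer(
    cols      = -gene,
    names_to  = c("sample", "condition"),  # one new column per piece of the name
    names_sep = "_",                       # split the old column names here
    values_to = "expression"
  )
long_cond
```

This avoids a separate splitting step after pivot_longer().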
5. Interactive wide ↔ long demo
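As a static stand-in for the demo, a minimal sketch checking that pivot_longer() followed by pivot_wider() returns the original table (the name back is made up for illustration):

```r
library(tidyr)

wide <- tibble(
  gene = c("TP53", "BRCA1", "EGFR"),
  S1 = c(5.2, 3.4, 8.1),
  S2 = c(6.1, 3.0, 7.8),
  S3 = c(4.9, 4.1, 9.2)
)

# Wide -> long: 3 genes x 3 samples become 9 rows of 3 columns.
long <- pivot_longer(wide, cols = -gene,
                     names_to = "sample", values_to = "expression")
dim(long)

# Long -> wide restores the original layout.
back <- pivot_wider(long, names_from = sample, values_from = expression)
all.equal(as.data.frame(wide), as.data.frame(back))
```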
6. purrr: functional programming
purrr provides a consistent, predictable interface for applying functions over lists and vectors, a modern alternative to base R's *apply family.
library(purrr)
# map_* family: the suffix names the output type
map(1:5, sqrt)                     # returns a list
map_dbl(1:5, sqrt)                 # returns a numeric vector; each result must be a length-1 double
map_chr(1:5, ~ paste0("S", .x))    # ~ is anonymous-function shorthand; .x is the current element
map_lgl(1:5, ~ .x > 3)
map_int(1:5, ~ as.integer(.x * 2))

# Per-column computation on a data frame
mtcars |> map_dbl(mean)

# Two inputs: map2
samples <- c("S1", "S2", "S3")
files <- c("a.csv", "b.csv", "c.csv")
map2_chr(samples, files, ~ paste(.x, "->", .y))

# Many inputs: pmap (list names are matched to the function's arguments)
pmap_chr(list(a = 1:3, b = 4:6, c = 7:9),
         function(a, b, c) sprintf("%d + %d + %d = %d", a, b, c, a + b + c))

# Safe execution: safely
safe_log <- safely(log)
result <- map(list(1, 2, "abc"), safe_log)
# Every element has $result and $error, so one failure does not stop the run
result[[3]]$error
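When the error object itself is not needed, purrr's possibly() is a lighter alternative: it replaces a failure with a default value. A minimal sketch (log_or_na, logs and safe_results are names made up for illustration):

```r
library(purrr)

# possibly() swaps an error for a default value instead of capturing it.
log_or_na <- possibly(log, otherwise = NA_real_)
logs <- map_dbl(list(1, 2, "abc"), log_or_na)
logs

# With safely(), the successful results can be pulled out by name.
safe_results <- map(list(1, 2, "abc"), safely(log))
map(safe_results, "result")   # NULL where the call failed
```

Because every element now yields a length-1 double, the typed map_dbl() can be used directly.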
map mnemonic: the output type you want is the suffix, map_<type>. map_dbl returns numeric, map_chr character, map_lgl logical, and map_dfr/map_dfc a data frame (row-bound/column-bound). As of purrr 1.0.0, map_dfr/map_dfc are superseded by map() followed by list_rbind()/list_cbind().
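For data-frame output, a minimal sketch of the map() plus list_rbind() pattern, assuming purrr 1.0.0 or later and R 4.1+ for the \(d) lambda syntax (per_species is a name made up for illustration):

```r
library(purrr)
library(tibble)

# Assumes purrr >= 1.0.0, where list_rbind() was introduced.
per_species <- split(iris, iris$Species) |>
  map(\(d) tibble(n = nrow(d), mean_SL = mean(d$Sepal.Length))) |>
  list_rbind(names_to = "Species")   # list names become the Species column

per_species
```

map() returns a named list of one-row tibbles, and list_rbind() stacks them into a single table.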
7. A typical bioinformatics workflow
library(dplyr); library(tidyr)
# Simulated DESeq2 result table
dge <- tibble(
  gene_id = paste0("ENSG", sprintf("%011d", 1:8)),
  symbol = c("TP53", "BRCA1", "EGFR", "MYC", "KRAS", "CDKN2A", "ALK", "XIST"),
  baseMean = c(120, 80, 500, 250, 90, 30, 60, 1500),
  log2FoldChange = c(2.1, -1.8, 3.2, -0.4, 1.0, -2.5, 0.8, -3.1),
  pvalue = c(1e-12, 3e-8, 1e-15, 0.04, 0.001, 5e-9, 0.06, 1e-20),
  padj = c(2e-11, 4e-7, 3e-14, 0.12, 0.005, 6e-8, 0.18, 5e-19)
)

# Significant genes: FDR < 0.05 and |log2FC| > 1
sig <- dge |>
  filter(padj < 0.05, abs(log2FoldChange) > 1) |>
  mutate(direction = if_else(log2FoldChange > 0, "Up", "Down"),
         neg_log10_padj = -log10(padj)) |>
  arrange(padj)
sig

# Top 5 genes in each direction, ranked by |log2FC|
top_by_direction <- sig |>
  group_by(direction) |>
  slice_max(abs(log2FoldChange), n = 5) |>
  ungroup()
top_by_direction

# Count genes per direction
sig |> count(direction)

# Gene expression: long table → wide table
expr_long <- tibble(
  sample = rep(c("S1", "S2", "S3"), each = 3),
  gene = rep(c("TP53", "BRCA1", "EGFR"), times = 3),
  value = c(5.2, 3.4, 8.1, 6.1, 3.0, 7.8, 4.9, 4.1, 9.2)
)
expr_long |>
  pivot_wider(names_from = sample, values_from = value)
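A step that often follows is attaching the statistics to per-gene expression summaries. A minimal sketch, assuming a trimmed copy of the result table (dge_small and gene_summary are names made up for illustration; the values are copied from the examples in this section):

```r
library(dplyr)

# Trimmed stand-in for the DESeq2-style table.
dge_small <- tibble(
  symbol = c("TP53", "BRCA1", "EGFR"),
  log2FoldChange = c(2.1, -1.8, 3.2)
)
expr_long <- tibble(
  sample = rep(c("S1", "S2", "S3"), each = 3),
  gene   = rep(c("TP53", "BRCA1", "EGFR"), times = 3),
  value  = c(5.2, 3.4, 8.1, 6.1, 3.0, 7.8, 4.9, 4.1, 9.2)
)

# Mean expression per gene, then attach the fold change by gene symbol.
gene_summary <- expr_long |>
  group_by(gene) |>
  summarise(mean_expr = mean(value)) |>
  left_join(dge_small, by = c("gene" = "symbol"))  # key columns have different names

gene_summary
```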
📝 Self-check
1. How do iris |> filter(Sepal.Length > 5) |> nrow() and nrow(filter(iris, Sepal.Length > 5)) relate?
2. Which call computes the mean Sepal.Length for each Species?
3. Which function pivots a wide gene × sample table into long form?