I. R's Six Core Data Structures
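| Structure | Dimensions | Homogeneous? | Typical bioinformatics use |
|---|---|---|---|
| vector | 1D | ✓ | gene expression vectors, sample labels |
| matrix | 2D | ✓ | counts matrix, distance matrix |
| array | n-D | ✓ | 3D images, tensors |
| list | 1D | ✗ | arbitrary mixed containers, model output |
| data.frame | 2D | ✗ (✓ within each column) | sample metadata, analysis result tables |
| factor | 1D | ✓ + levels | cell types, treatment conditions, disease status |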
Mnemonic: same type, flat → vector/matrix; mixed types, tabular → data.frame/tibble; anything nested → list; categorical → factor. Bioconductor builds S4 classes (e.g. SummarizedExperiment) on top of these; they are covered in chapter 9.
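As a quick check of the mnemonic, the sketch below builds one toy object of each kind and inspects it with base R helpers. All object names and values are illustrative, not part of the lesson's data.

```r
# One toy object per structure (names and values are illustrative)
v  <- c(5.2, 8.1, 3.4)                       # vector: 1D, one type
m  <- matrix(1:6, nrow = 2)                  # matrix: 2D, one type
a  <- array(1:24, dim = c(2, 3, 4))          # array: n-D, one type
l  <- list(id = "S1", values = v)            # list: mixed contents
df <- data.frame(sample = c("S1", "S2"), age = c(45, 52))
f  <- factor(c("ctrl", "trt", "ctrl"))       # categorical

class(v); class(l); class(df); class(f)
dim(a)                                       # 2 3 4

# Homogeneous structures coerce mixed input to one common type
mixed <- c(1, "a", TRUE)
class(mixed)                                 # "character"
```

The last two lines show why "same type" matters: a vector cannot hold a number next to a string, so everything becomes a string.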
II. Vector
# Create
expr <- c(5.2, 8.1, 3.4, 9.7, 2.1)
genes <- c("TP53", "BRCA1", "EGFR", "MYC", "KRAS")
names(expr) <- genes # name the elements
expr
expr["TP53"] # index by name
# Sequences
1:10
seq(0, 1, by = 0.1) # 0, 0.1, ..., 1
seq(0, 1, length.out = 5) # 5 evenly spaced values
rep("ctrl", 3) # "ctrl" "ctrl" "ctrl"
rep(c("ctrl","trt"), each = 3) # "ctrl" "ctrl" "ctrl" "trt" "trt" "trt"
# Five ways to index
expr[2] # by position
expr[c(1,3,5)] # several positions
expr[-1] # exclude a position (negative index)
expr[expr > 5] # logical
expr["TP53"] # by name
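The indexing styles above combine naturally with vectorised functions. A small sketch reusing the same toy expression values; the threshold of 5 is arbitrary.

```r
expr <- c(TP53 = 5.2, BRCA1 = 8.1, EGFR = 3.4, MYC = 9.7, KRAS = 2.1)

high <- expr[expr > 5]          # logical index keeps the names
names(high)                     # "TP53" "BRCA1" "MYC"
which(expr > 5)                 # positions of the TRUE entries
names(which.max(expr))          # "MYC"
sort(expr, decreasing = TRUE)   # highest to lowest
log2(expr + 1)                  # arithmetic applies element-wise
```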
III. Matrix
# Create: 4 genes × 3 samples (rows are genes, columns are samples)
counts <- matrix(c(120, 80, 95, 10, 22, 18, 500, 480, 510, 3, 5, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("TP53","ACTB","GAPDH","XIST"),
                                 c("S1","S2","S3")))
counts
dim(counts) # 4 3
nrow(counts); ncol(counts)
rownames(counts); colnames(counts)
# Indexing: [row, col]
counts[1, ] # first row (gene TP53 across samples)
counts[, "S2"] # the S2 column
counts["GAPDH", "S1"] # a single cell
# Margin operations
rowSums(counts) # total count per gene
colMeans(counts) # mean count per sample
apply(counts, 1, var) # per-gene variance across samples
# Transpose
t(counts)
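Memory trap: R matrices are stored column-major, so the values of one column sit next to each other in memory. Column-wise operations (colSums) can therefore be somewhat faster than row-wise ones (rowSums), although both are compiled and the gap is usually small. In the usual "genes × samples" layout each sample is one contiguous column; if per-gene (row-wise) access dominates and profiling shows it matters, working on the transpose (samples × genes) can make downstream code faster.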
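Column-major storage can be seen directly by flattening a small matrix. This is a minimal sketch with made-up values; it shows the layout, not a benchmark.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("g1", "g2"), c("S1", "S2", "S3")))
m

as.vector(m)    # 1 2 3 4 5 6: stored column by column
m[3]            # 3: a single index follows storage order, i.e. m["g1", "S2"]

# Row sums of m are the column sums of its transpose
all.equal(rowSums(m), colSums(t(m)))   # TRUE
```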
IV. Matrix Indexing: Interactive Simulation
Choose which rows (genes) and columns (samples) to keep, and see how subsetting slices the matrix.
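The same row and column selection can be reproduced in code. A minimal sketch using the counts matrix from the previous section; keep_genes and keep_samples are hypothetical selections.

```r
counts <- matrix(c(120, 80, 95, 10, 22, 18, 500, 480, 510, 3, 5, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("TP53", "ACTB", "GAPDH", "XIST"),
                                 c("S1", "S2", "S3")))

keep_genes   <- c("TP53", "GAPDH")   # hypothetical selection of rows
keep_samples <- c("S1", "S3")        # hypothetical selection of columns

sub <- counts[keep_genes, keep_samples]   # 2 x 2 sub-matrix
sub

counts[keep_genes, "S1"]                  # one column collapses to a named vector
counts[keep_genes, "S1", drop = FALSE]    # drop = FALSE keeps a 2 x 1 matrix
counts[rowSums(counts) > 100, ]           # logical row filter: TP53 and GAPDH
```

The drop = FALSE form is worth remembering when downstream code expects a matrix even if only one row or column survives the filter.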
V. List
The list is R's most flexible container: each element can be anything (a vector, a matrix, another list, a model object...). Many complex results are lists under the hood, for example the object returned by lm(). DESeq2 and other Bioconductor results are S4 objects instead, introduced in the S4 section below.
# Create
study <- list(
name = "TCGA-BRCA",
n_samples = 1097,
conditions = c("Tumor", "Normal"),
counts = matrix(rpois(20, lambda = 50), nrow = 5),
metadata = data.frame(sample = paste0("S", 1:4),
age = c(45, 67, 52, 39))
)
# Three ways to access
study$name # by name with $
study[["n_samples"]] # by name with [[ ]]
study[[1]] # by position with [[ ]]
# The crucial difference between [[ and [
study["counts"] # returns a SUB-LIST containing only the counts element
study[["counts"]] # returns the counts matrix itself
# Nested lists
results <- list()
results$dge <- list(up = c("MYC","TP53"), down = c("CDKN2A"))
results$dge$up # access nested
results[["dge"]][["up"]] # equivalent
# Apply a function to each element (lapply always returns a list)
lapply(study$conditions, toupper)
See $, think list: obj$something usually means obj is a list or a list-like type such as a data.frame. Environments and R6 objects also use $, while S4 objects expose their slots with @ and support $ only when the class defines a method for it.
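To see that a model object really is a list, the sketch below fits a toy regression on mtcars (a dataset that ships with R; the formula is arbitrary) and pokes at the result with the same access operators.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # toy model, chosen only for illustration

is.list(fit)                     # TRUE: an "lm" object is a list with a class attribute
class(fit)                       # "lm"
names(fit)                       # the named elements you can reach with $

fit$coefficients                 # same element as fit[["coefficients"]]
is.list(fit["coefficients"])     # TRUE: [ keeps the list container
is.numeric(fit[["coefficients"]])  # TRUE: [[ extracts the element itself

str(fit, max.level = 1, give.attr = FALSE)   # compact overview of the list
```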
VI. data.frame: the most commonly used!
# Create
sample_info <- data.frame(
sample = c("S1", "S2", "S3", "S4"),
condition = c("ctrl", "ctrl", "trt", "trt"),
batch = c(1, 1, 2, 2),
age = c(45, 52, 38, 61),
  stringsAsFactors = FALSE # required before R 4.0; FALSE is the default from R 4.0 on
)
sample_info
str(sample_info) # inspect column types
summary(sample_info)
nrow(sample_info)
colnames(sample_info)
# Indexing: [row, col] or $col
sample_info$age # the age column (as a vector)
sample_info[ , "age"] # same as above
sample_info[ , c("sample","age")] # several columns
sample_info[1:2, ] # first two rows
sample_info[sample_info$age > 50, ] # filter by condition
# Add a column
sample_info$age_group <- ifelse(sample_info$age > 50, "old", "young")
# Sort
sample_info[order(sample_info$age), ] # ascending
sample_info[order(-sample_info$age), ] # descending
sample_info[order(sample_info$condition, sample_info$age), ] # multi-key
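Because a data.frame is a list of equal-length columns, list tools work on it too. A short sketch on a trimmed copy of the sample table above, also showing what happens when a single column is selected.

```r
sample_info <- data.frame(
  sample    = c("S1", "S2", "S3", "S4"),
  condition = c("ctrl", "ctrl", "trt", "trt"),
  age       = c(45, 52, 38, 61)
)

is.list(sample_info)            # TRUE
sapply(sample_info, class)      # one type per column
sample_info[["age"]]            # list-style access to a column

class(sample_info[, "age"])                 # "numeric": one column drops to a vector
class(sample_info[, "age", drop = FALSE])   # "data.frame"

# Descending sort also works via the decreasing argument
sample_info[order(sample_info$age, decreasing = TRUE), "sample"]  # "S4" "S2" "S1" "S3"
```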
VII. tibble: an upgraded data.frame
tibble is the tidyverse's alternative to data.frame. The differences are small but friendlier:
- Smarter printing: large data shows only the first 10 rows, along with column types.
- No silent string-to-factor conversion (a historical footgun in R before 4.0, where data.frame() defaulted to stringsAsFactors = TRUE).
- Stricter subsetting: no partial matching of column names, `$` with an unknown column warns instead of silently returning NULL, and `[` always returns a tibble.
- Any column name is allowed (spaces, special characters; wrap it in backticks, e.g. `column name`).
library(tibble)
tb <- tibble(
sample = c("S1", "S2", "S3", "S4"),
condition = c("ctrl", "ctrl", "trt", "trt"),
age = c(45, 52, 38, 61)
)
tb # prettier printing
class(tb) # "tbl_df" "tbl" "data.frame": still a data.frame
as_tibble(iris) # upgrade a data.frame to a tibble
as.data.frame(tb) # the reverse: tibble → data.frame
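The subsetting differences are easiest to see side by side. A hedged sketch: the tibble part runs only if the tibble package is installed, and the misspelled column name cond is deliberate.

```r
df <- data.frame(sample = c("S1", "S2"),
                 condition = c("ctrl", "trt"),
                 age = c(45, 52))

df$cond               # partial matching: silently returns the condition column
class(df[, "age"])    # "numeric": a single column drops to a vector

if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(tb$cond)               # NULL, plus an "Unknown or uninitialised column" warning
  print(class(tb[, "age"]))    # still a tibble: "tbl_df" "tbl" "data.frame"
  print(tb[["age"]])           # use [[ or $ to get the plain vector
}
```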
VIII. Factor: categorical variables
A factor is a categorical type with a fixed set of levels, stored internally as integer codes (memory-efficient) and displayed as string labels. It is crucial for statistical modeling, ggplot ordering, and survival grouping.
cond <- factor(c("ctrl", "trt", "ctrl", "trt", "trt"))
cond
levels(cond) # "ctrl" "trt"
table(cond) # counts per level
# Custom level order (ggplot follows this order)
cond2 <- factor(c("ctrl","trt","ctrl"), levels = c("trt", "ctrl"))
levels(cond2) # "trt" "ctrl" ← reversed
# Set the reference level: the baseline for linear models
cond3 <- relevel(cond, ref = "trt")
levels(cond3) # "trt" comes first
# Ordered factor: levels have an order (e.g. stage I < II < III)
stage <- factor(c("II","I","III","II"),
                levels = c("I","II","III"), ordered = TRUE)
stage
stage[1] > stage[2] # TRUE ← II > I
# Converting between strings and factors
as.character(cond) # "ctrl" "trt" ...
factor(c("low","high","mid")) # levels default to alphabetical order
Classic disaster: when converting a factor of numbers back to numeric, as.numeric() returns the level codes, not the original values! Use as.numeric(as.character(x)).
f <- factor(c(10, 20, 30))
as.numeric(f) # 1 2 3 ← WRONG (level codes)
as.numeric(as.character(f)) # 10 20 30 ← right
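Why the reference level matters for modeling can be sketched with model.matrix(), which builds the design matrix a linear model would use. This assumes R's default treatment contrasts; the variable name cond_trt_ref is illustrative.

```r
cond <- factor(c("ctrl", "trt", "ctrl", "trt", "trt"))

typeof(cond)        # "integer": codes are stored, labels live in levels()
as.integer(cond)    # 1 2 1 2 2

# The first level is the baseline, so the design matrix gets a "trt" column
colnames(model.matrix(~ cond))            # "(Intercept)" "condtrt"

# After relevel(), "trt" is the baseline and "ctrl" becomes the contrast
cond_trt_ref <- relevel(cond, ref = "trt")
colnames(model.matrix(~ cond_trt_ref))    # "(Intercept)" "cond_trt_refctrl"
```

The coefficient a model reports is always "this level versus the baseline", so changing the reference level flips the sign and the label of the effect.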
IX. A First Look at S4 Objects (the Core of Bioconductor)
R has three widely used object-oriented systems: S3 (simple, most common), S4 (strict, the Bioconductor standard), and R6 (Python-like classes provided by the R6 package, less common). Bioconductor uses S4 extensively, so you will often meet "slots", which are accessed with @ rather than $.
# Pretend SummarizedExperiment is loaded, to illustrate the concept
# (Pseudo-code; real package: BiocManager::install('SummarizedExperiment'))
# se is an S4 object
# se@assays # slot access with @
# se@colData # sample metadata
# se@rowRanges # gene info as GRanges (RangedSummarizedExperiment)
# slotNames(se) # list all slots
# Recommended (official advice): use accessor functions
# assay(se) # extract counts matrix
# colData(se) # extract sample metadata
# rowRanges(se) # extract gene ranges
# Why? @ depends on the internal layout, which can change between versions,
# and assigning through @ skips validity checks; accessors are the stable interface
# Inspect the object's structure
# str(se, max.level = 2)
# isVirtualClass('SummarizedExperiment')
cat("S4 demo — install SummarizedExperiment to try real objects.\n")
Tip: beginners can treat S4 as a black box. You only need to call accessor functions (assay(), colData(), ...); the internals are covered in chapter 9 (Bioconductor).
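The slot-versus-accessor idea can be tried without installing anything, using base R's methods package. The class ToyExperiment and the accessor toy_counts() below are hypothetical names made up for this sketch; they are not part of Bioconductor.

```r
library(methods)

# Hypothetical toy class for illustration only (not a Bioconductor class)
setClass("ToyExperiment",
  slots = c(counts = "matrix", samples = "character"),
  validity = function(object) {
    if (ncol(object@counts) != length(object@samples))
      return("ncol(counts) must equal length(samples)")
    TRUE
  }
)

# Accessor: a generic plus a method, so callers never touch the slot
setGeneric("toy_counts", function(x) standardGeneric("toy_counts"))
setMethod("toy_counts", "ToyExperiment", function(x) x@counts)

te <- new("ToyExperiment",
          counts  = matrix(1:6, nrow = 2),
          samples = c("S1", "S2", "S3"))

isVirtualClass("ToyExperiment")   # FALSE
slotNames(te)                     # "counts" "samples"
toy_counts(te)                    # accessor
te@counts                         # direct slot access returns the same matrix

# Assigning through @ does not run the validity function...
broken <- te
broken@samples <- c("S1", "S2")
# ...so the inconsistency is only caught by an explicit check
res <- try(validObject(broken), silent = TRUE)
inherits(res, "try-error")        # TRUE
```

Real Bioconductor classes follow the same pattern at a larger scale, which is why their documentation points users to accessors rather than slots.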
📝 Self-check
1. For L <- list(a=1:3, b="hi"), what is the difference between L["a"] and L[["a"]]?
2. To convert f <- factor(c("10","20","30")) back to numbers, what is the correct call?
3. To sort a data.frame by age in ascending order, what is the correct call?