STEP 3 / 16

Data Structures

vector, matrix, list, data.frame, tibble, factor, and S4: pick the right structure and your analysis flies.

1. R's six core data structures

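Structure | Dimensions | Homogeneous? | Typical bioinformatics use
vector | 1D | ✓ | gene expression vector, sample labels
matrix | 2D | ✓ | counts matrix, distance matrix
array | n-D | ✓ | 3D images, tensors
list | 1D | ✗ | arbitrary mixed container, model output
data.frame | 2D | ✗ (each column is homogeneous) | sample metadata, analysis result tables
factor | 1D | ✓ + levels | cell type, treatment condition, disease status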
💡 Mnemonic: "Same type, flat → vector/matrix; mixed type, flat → data.frame/tibble; anything nested → list; categorical → factor." Bioconductor stacks S4 classes (e.g. SummarizedExperiment) on top of these; chapter 9 covers them.
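The "same type vs mixed type" rule is easiest to see through coercion. The sketch below uses only base R; the variable names (mixed_vec, mixed_list, df, m) are made up for illustration.

```r
# Atomic vectors are homogeneous: mixing types coerces everything
# to the most general type (here: character)
mixed_vec <- c(1.5, "TP53", TRUE)
class(mixed_vec)                  # "character"

# A list keeps each element's own type
mixed_list <- list(1.5, "TP53", TRUE)
sapply(mixed_list, class)         # "numeric" "character" "logical"

# A data.frame is a list of equal-length columns, so types differ per column
df <- data.frame(gene = c("TP53", "MYC"), expr = c(5.2, 9.7),
                 stringsAsFactors = FALSE)
is.list(df)                       # TRUE
sapply(df, class)                 # gene: "character", expr: "numeric"

# A matrix forces a single type on every cell
m <- cbind(gene = c("TP53", "MYC"), expr = c(5.2, 9.7))
typeof(m)                         # "character"
```

If a numeric column unexpectedly turns into text, a stray string in a vector or matrix is a likely cause.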

2. Vector

# Create
expr <- c(5.2, 8.1, 3.4, 9.7, 2.1)
genes <- c("TP53", "BRCA1", "EGFR", "MYC", "KRAS")
names(expr) <- genes        # name the elements
expr
expr["TP53"]                # index by name

# Sequences
1:10
seq(0, 1, by = 0.1)          # 0, 0.1, ..., 1
seq(0, 1, length.out = 5)    # 5 evenly spaced values
rep("ctrl", 3)               # "ctrl" "ctrl" "ctrl"
rep(c("ctrl","trt"), each = 3)   # "ctrl" "ctrl" "ctrl" "trt" "trt" "trt"

# Five ways to index
expr[2]                      # by position
expr[c(1,3,5)]               # several positions
expr[-1]                     # exclude a position (negative index)
expr[expr > 5]               # logical condition
expr["TP53"]                 # by name
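The indexing styles above combine naturally with vectorised operations. A minimal sketch, reusing the same expression values; the gene panel is a hypothetical example, not part of the original data.

```r
expr <- c(TP53 = 5.2, BRCA1 = 8.1, EGFR = 3.4, MYC = 9.7, KRAS = 2.1)

# Arithmetic is vectorised and element names are kept
centered <- expr - mean(expr)

# Name lookup returns elements in the order requested
expr[c("MYC", "TP53")]

# Combine a logical index with names() to get gene names only
names(expr)[expr > 5]             # "TP53" "BRCA1" "MYC"

# Name of the most highly expressed gene
names(which.max(expr))            # "MYC"

# %in% turns a gene set into a logical index (result keeps the original order)
panel <- c("KRAS", "EGFR", "ALK")   # hypothetical gene panel
in_panel <- expr[names(expr) %in% panel]
names(in_panel)                   # "EGFR" "KRAS"
```

Note that "ALK" is simply ignored by %in%, whereas expr["ALK"] would return NA.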

3. Matrix

# Create: 4 genes × 3 samples (rows = genes, columns = samples)
counts <- matrix(c(120, 80, 95,  10, 22, 18,  500, 480, 510,  3, 5, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("TP53","ACTB","GAPDH","XIST"),
                                 c("S1","S2","S3")))
counts
dim(counts)            # 4 3
nrow(counts); ncol(counts)
rownames(counts); colnames(counts)

# Indexing: [row, col]
counts[1, ]            # first row (gene TP53 across samples)
counts[, "S2"]         # the S2 column
counts["GAPDH", "S1"]  # a single cell

# Margin operations
rowSums(counts)        # total count per gene
colMeans(counts)       # mean count per sample
apply(counts, 1, var)  # across-sample variance per gene

# Transpose
t(counts)
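⚠️ Memory layout trap: R stores matrices in column-major order, so the elements of one column sit next to each other in memory. Column-wise access (counts[, j], colSums) is therefore more cache-friendly than row-wise access (counts[i, ], rowSums), and the gap grows with matrix size. With the bioinformatics convention of "genes × samples", the samples are already the columns; if your hot loop walks over genes instead, transposing once with t() so that genes become columns can make downstream code faster.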
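Column-major order is visible from base R without any timing. The small matrix below is an illustrative assumption, not data from this lesson.

```r
# matrix() fills column by column unless byrow = TRUE
m <- matrix(1:6, nrow = 2)
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

as.vector(m)      # 1 2 3 4 5 6 : columns laid end to end in memory
m[3]              # 3 : a single (linear) index walks down the columns
m[1, 2]           # 3 : the same cell addressed as [row, col]
as.vector(t(m))   # 1 3 5 2 4 6 : row order requires a transposed copy
```

Because t() makes a full copy, it is usually worth transposing once up front rather than inside a loop.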

4. Matrix indexing: interactive demo

[Interactive demo: toggle which rows (genes) and columns (samples) to keep, and see how subsetting carves the matrix.]
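The same row and column selection can be reproduced in code. This is a sketch using the counts matrix from the Matrix section; keep_genes and keep_samples are hypothetical choices standing in for the toggles.

```r
counts <- matrix(c(120, 80, 95,  10, 22, 18,  500, 480, 510,  3, 5, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("TP53","ACTB","GAPDH","XIST"),
                                 c("S1","S2","S3")))

keep_genes   <- c("TP53", "GAPDH")   # rows to keep (hypothetical selection)
keep_samples <- c("S1", "S3")        # columns to keep (hypothetical selection)

sub <- counts[keep_genes, keep_samples]
dim(sub)                             # 2 2

# A logical vector works too: keep genes with more than 100 total counts
expressed <- rowSums(counts) > 100   # TRUE for TP53 (295) and GAPDH (1490)
counts[expressed, ]

# Selecting a single row drops to a plain vector unless drop = FALSE
is.null(dim(counts["TP53", ]))       # TRUE: dimensions were dropped
one_row <- counts["TP53", , drop = FALSE]
dim(one_row)                         # 1 3
```

Passing drop = FALSE is a common safeguard in functions that must keep working when a filter happens to leave only one gene or sample.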

5. List

The list is R's most flexible container: each element can be anything (a vector, a matrix, another list, a model object, ...). Many complex results, such as the object returned by lm(), are lists under the hood. DESeq2 and other Bioconductor results are S4 objects instead (see the S4 section below).

# Create
study <- list(
  name      = "TCGA-BRCA",
  n_samples = 1097,
  conditions = c("Tumor", "Normal"),
  counts    = matrix(rpois(20, lambda = 50), nrow = 5),
  metadata  = data.frame(sample = paste0("S", 1:4),
                         age    = c(45, 67, 52, 39))
)

# Three ways to access
study$name              # by name with $
study[["n_samples"]]    # by name with [[ ]]
study[[1]]              # by position with [[ ]]

# The crucial difference between [[ and [
study["counts"]         # returns a SUB-LIST containing only the counts element
study[["counts"]]       # returns the counts matrix itself

# Nested lists
results <- list()
results$dge <- list(up = c("MYC","TP53"), down = c("CDKN2A"))
results$dge$up          # access a nested element
results[["dge"]][["up"]]  # equivalent

# Apply a function over a list
lapply(study, class)               # class of every element, returned as a list
lapply(study$conditions, toupper)  # also works on a plain vector; the result is still a list
💡 See $, think list: obj$something usually means obj is a list or list-like (data.frame, tibble). Environments and R6 objects also use $, and some S4 classes define their own $ method; atomic vectors do not support $ at all.
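A quick way to check the "$ means list-like" rule is to ask R directly. The sketch below uses lm() on the built-in mtcars dataset purely as a convenient model object; fit, df and v are illustrative names.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # mtcars ships with base R

is.list(fit)          # TRUE: an lm object is a list with a class attribute
names(fit)[1:3]       # "coefficients" "residuals" "effects"
identical(fit$coefficients, fit[["coefficients"]])   # TRUE

df <- data.frame(x = 1:3)
is.list(df)           # TRUE: a data.frame is a list of columns

# $ is not defined for atomic vectors, even named ones
v <- c(a = 1, b = 2)
res <- tryCatch(v$a, error = function(e) conditionMessage(e))
res                   # the error message, captured as a string
v[["a"]]              # 1 : use [[ ]] or [ ] on atomic vectors instead
```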

6. data.frame (the most used!)

# Create
sample_info <- data.frame(
  sample    = c("S1", "S2", "S3", "S4"),
  condition = c("ctrl", "ctrl", "trt", "trt"),
  batch     = c(1, 1, 2, 2),
  age       = c(45, 52, 38, 61),
  stringsAsFactors = FALSE   # needed in R < 4.0; FALSE is the default since R 4.0
)
sample_info
str(sample_info)        # inspect each column's type
summary(sample_info)
nrow(sample_info)
colnames(sample_info)

# Indexing: [row, col] or $col
sample_info$age                      # the age column (as a vector)
sample_info[ , "age"]                # same result
sample_info[ , c("sample","age")]    # several columns
sample_info[1:2, ]                   # first two rows
sample_info[sample_info$age > 50, ]  # filter rows by condition

# Add a column
sample_info$age_group <- ifelse(sample_info$age > 50, "old", "young")

# Sort
sample_info[order(sample_info$age), ]                    # ascending
sample_info[order(-sample_info$age), ]                   # descending
sample_info[order(sample_info$condition, sample_info$age), ]  # multi-key
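In practice a metadata data.frame is mostly used to drive a counts matrix. The sketch below shows one common pattern, assuming a small simulated matrix whose columns are deliberately out of order; the counts are random placeholders, not real data.

```r
sample_info <- data.frame(
  sample    = c("S1", "S2", "S3", "S4"),
  condition = c("ctrl", "ctrl", "trt", "trt"),
  stringsAsFactors = FALSE
)

set.seed(1)                                   # reproducible simulated counts
counts <- matrix(rpois(12, lambda = 50), nrow = 3,
                 dimnames = list(c("TP53", "ACTB", "GAPDH"),
                                 c("S3", "S1", "S4", "S2")))   # shuffled columns

# Reorder matrix columns so they follow the metadata row order
counts <- counts[, sample_info$sample]
stopifnot(identical(colnames(counts), sample_info$sample))

# Use metadata to pick columns: treated samples only
trt_samples <- sample_info$sample[sample_info$condition == "trt"]
counts[, trt_samples]

# Per-condition mean count for each gene (genes × conditions matrix)
group_means <- sapply(split(sample_info$sample, sample_info$condition),
                      function(s) rowMeans(counts[, s, drop = FALSE]))
group_means
```

Indexing by sample name rather than by position is what makes this safe when the two objects arrive in different orders.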

7. tibble: an upgraded data.frame

tibble is the tidyverse's near drop-in replacement for data.frame. The differences are small but friendlier:

  • Smarter printing: large tables show only the first 10 rows, along with each column's type.
  • No silent string-to-factor conversion (a historical footgun in R < 4.0).
  • Stricter subsetting: $ never partial-matches column names, a misspelled column name triggers a warning rather than a silent NULL, and [ always returns a tibble.
  • Any column name is allowed (spaces, special characters); wrap it in backticks, e.g. `column name`.

library(tibble)
tb <- tibble(
  sample    = c("S1", "S2", "S3", "S4"),
  condition = c("ctrl", "ctrl", "trt", "trt"),
  age       = c(45, 52, 38, 61)
)
tb                       # nicer printed output
class(tb)                # "tbl_df" "tbl" "data.frame" (still a data.frame)
as_tibble(iris)          # convert a data.frame to a tibble
as.data.frame(tb)        # the reverse: tibble → data.frame
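The subsetting differences are easiest to see side by side. This is a sketch, assuming the tibble package is installed (the tibble part is skipped otherwise); df and tb are illustrative objects.

```r
df <- data.frame(sample = c("S1", "S2"), condition = c("ctrl", "trt"),
                 stringsAsFactors = FALSE)

# data.frame: $ partial-matches, [ drops to a vector, typos are silent
df_partial <- df$cond              # silently matches the 'condition' column
class(df[, "sample"])              # "character": a single column becomes a vector
is.null(df$typo)                   # TRUE, and no warning

if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  tb_col  <- tb[, "sample"]                 # still a one-column tibble
  tb_cond <- suppressWarnings(tb$cond)      # NULL plus a warning: no partial matching
  print(class(tb_col))
  print(is.null(tb_cond))
}
```

When a plain vector is needed from a tibble, tb$sample or tb[["sample"]] is the usual route.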

8. Factor: categorical variables

A factor is a categorical type with a fixed set of levels. It is stored internally as integers (memory-efficient) and displayed as string labels. Factors are crucial for statistical modeling, ggplot ordering, and survival grouping.

cond <- factor(c("ctrl", "trt", "ctrl", "trt", "trt"))
cond
levels(cond)                  # "ctrl" "trt"
table(cond)                   # counts per level

# Custom level order (ggplot follows this order)
cond2 <- factor(c("ctrl","trt","ctrl"), levels = c("trt", "ctrl"))
levels(cond2)                 # "trt" "ctrl"  ← reversed

# Set the reference level: the baseline for linear models
cond3 <- relevel(cond, ref = "trt")
levels(cond3)                 # "trt" now comes first

# Ordered factor: levels have a rank (e.g. stage I < II < III)
stage <- factor(c("II","I","III","II"),
                levels = c("I","II","III"), ordered = TRUE)
stage
stage[1] > stage[2]           # TRUE  ← II > I

# Converting between strings and factors
as.character(cond)             # "ctrl" "trt" ...
factor(c("low","high","mid"))  # levels are sorted alphabetically by default
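Why the reference level matters becomes concrete in a design matrix. The sketch below uses base R's model.matrix() with the default treatment contrasts; cond_trt_ref is an illustrative name.

```r
cond <- factor(c("ctrl", "trt", "ctrl", "trt", "trt"))

# Integer codes underneath, labels on top
typeof(cond)                # "integer"
as.integer(cond)            # 1 2 1 2 2
levels(cond)                # "ctrl" "trt"

# The first level is the baseline: the model estimates trt relative to ctrl
colnames(model.matrix(~ cond))            # "(Intercept)" "condtrt"

# After relevel(), the comparison flips: ctrl relative to trt
cond_trt_ref <- relevel(cond, ref = "trt")
colnames(model.matrix(~ cond_trt_ref))    # "(Intercept)" "cond_trt_refctrl"
```

The coefficient name tells you which level is being compared against the baseline, so checking colnames() of the design matrix is a cheap sanity test before fitting.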
🚨 Classic disaster: converting a factor of numbers back to numeric with as.numeric() returns the level codes, not the original values! Use as.numeric(as.character(x)).
f <- factor(c(10, 20, 30))
as.numeric(f)                # 1 2 3    ← WRONG (level codes)
as.numeric(as.character(f))  # 10 20 30 ← right

9. A first look at S4 objects (the core of Bioconductor)

R has three commonly used object-oriented systems: S3 (simple, most common), S4 (strict, the Bioconductor standard), and R6 (a CRAN package with Python-like classes, less common). Bioconductor uses S4 extensively, so you will often meet "slots", which are accessed with @ rather than $.

# Conceptual demo: pretend SummarizedExperiment is loaded and `se` exists
# (Pseudo-code; to get the real package: BiocManager::install('SummarizedExperiment'))

# se is an S4 object
# se@assays            # slot access with @
# se@colData           # sample metadata
# se@rowRanges         # gene info as GRanges (RangedSummarizedExperiment only)
# slotNames(se)        # list all slots

# Recommended (the officially preferred way): use accessor functions
# assay(se)            # extract the counts matrix
# colData(se)          # extract sample metadata
# rowRanges(se)        # extract gene ranges

# Why? Slots are internal details that may change between versions, and
# assigning with @<- skips validity checks; accessors are the stable interface

# Inspect the object's structure
# str(se, max.level = 2)
# isVirtualClass('SummarizedExperiment')
cat("S4 demo: install SummarizedExperiment to try real objects.\n")
💡 Tip: beginners can treat S4 as a black box. Knowing the accessor functions (assay(), colData(), ...) is enough; the internals are covered in chapter 9 (Bioconductor).
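Slots, accessors and validity can be tried without installing anything from Bioconductor. The sketch below defines a toy class with base R's methods package; MiniExperiment and sampleInfo() are invented for illustration and are not part of SummarizedExperiment.

```r
library(methods)

# A toy S4 class with two slots and a validity rule (hypothetical, for illustration)
setClass("MiniExperiment",
         slots = c(counts = "matrix", sample_info = "data.frame"),
         validity = function(object) {
           if (ncol(object@counts) != nrow(object@sample_info))
             return("ncol(counts) must equal nrow(sample_info)")
           TRUE
         })

# An accessor: the public way to read the sample_info slot
setGeneric("sampleInfo", function(x) standardGeneric("sampleInfo"))
setMethod("sampleInfo", "MiniExperiment", function(x) x@sample_info)

me <- new("MiniExperiment",
          counts = matrix(1:6, nrow = 2,
                          dimnames = list(c("TP53", "ACTB"), c("S1", "S2", "S3"))),
          sample_info = data.frame(sample = c("S1", "S2", "S3")))

isVirtualClass("MiniExperiment")   # FALSE: it can be instantiated
slotNames(me)                      # "counts" "sample_info"
sampleInfo(me)                     # accessor call, no @ needed

# Direct slot assignment is not validated: the object is now inconsistent
me@sample_info <- data.frame(sample = "S1")
bad <- tryCatch(validObject(me), error = function(e) conditionMessage(e))
bad                                # validity message caught only by an explicit check
```

new() runs the validity function, while the @<- assignment does not, which is one reason packages steer users toward accessors and setter functions.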

📝 Self-check

1. For L <- list(a=1:3, b="hi"), what is the difference between L["a"] and L[["a"]]?

A. Identical
B. L["a"] errors
C. L["a"] returns a sub-list; L[["a"]] returns the vector itself
D. L[["a"]] isn't valid R

2. To convert f <- factor(c("10","20","30")) back to numbers, which call is correct?

A. as.numeric(f)
B. as.integer(f)
C. as.numeric(as.character(f))
D. numeric(f)

3. To sort a data.frame by age in ascending order, which is correct?

A. df[sort(df$age), ]
B. df[order(df$age), ]
C. sort(df, "age")
D. df.sort(by="age")