Step 3: Data Structures — R Bioinformatics Tutorial

Overview

1. R's six core data structures

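Structure	Dimensions	Homogeneous?	Typical bioinformatics use
vector	1D	✓	gene expression vectors, sample labels
matrix	2D	✓	counts matrix, distance matrix
array	n-D	✓	3D images, tensors
list	1D	✗	arbitrary mixed containers, model output
data.frame	2D	✗	sample metadata, analysis result tables
factor	1D	✓ + levels	cell type, treatment condition, disease status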

💡

Mnemonic: "Same type, flat → vector / matrix; mixed type, flat → data.frame / tibble; anything nested → list; categorical → factor." Bioconductor layers S4 classes (e.g. SummarizedExperiment) on top of these; chapter 9 covers them.

One-dimensional

2. Vector

# Create
expr <- c(5.2, 8.1, 3.4, 9.7, 2.1)
genes <- c("TP53", "BRCA1", "EGFR", "MYC", "KRAS")
names(expr) <- genes        # name the elements
expr
expr["TP53"]                # index by name

# Sequences
1:10
seq(0, 1, by = 0.1)          # 0, 0.1, ..., 1
seq(0, 1, length.out = 5)    # 5 evenly spaced values
rep("ctrl", 3)               # "ctrl" "ctrl" "ctrl"
rep(c("ctrl","trt"), each = 3)   # "ctrl" "ctrl" "ctrl" "trt" "trt" "trt"

# Five ways to index
expr[2]                      # by position
expr[c(1,3,5)]               # by several positions
expr[-1]                     # by exclusion (negative index)
expr[expr > 5]               # by logical condition
expr["TP53"]                 # by name
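The overview table marks vectors as homogeneous. A minimal sketch of what that means in practice: mixing types in c() coerces everything to the most general type, and operations apply to every element at once. The variable names mixed, flags, log2_expr, and high are illustrative only.

```r
# Mixing types in c() silently coerces to the most general type
mixed <- c(1.5, "TP53", TRUE)
class(mixed)                 # "character"

# Logical values become numbers when combined with numbers
flags <- c(TRUE, FALSE, 2)
class(flags)                 # "numeric"

# Arithmetic and comparisons are vectorized: no loop needed
expr <- c(TP53 = 5.2, BRCA1 = 8.1, EGFR = 3.4, MYC = 9.7, KRAS = 2.1)
log2_expr <- log2(expr + 1)  # applied to every element, names are kept
high <- names(expr)[expr > 5]
high                         # "TP53" "BRCA1" "MYC"
```

If a numeric column unexpectedly shows up as character after import, one stray non-numeric entry is a common cause.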

Two-dimensional

3. Matrix

# Create: 4 genes (rows) × 3 samples (columns)
counts <- matrix(c(120, 80, 95,  10, 22, 18,  500, 480, 510,  3, 5, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("TP53","ACTB","GAPDH","XIST"),
                                 c("S1","S2","S3")))
counts
dim(counts)            # 4 3
nrow(counts); ncol(counts)
rownames(counts); colnames(counts)

# Indexing: [row, col]
counts[1, ]            # first row (gene TP53 across samples)
counts[, "S2"]         # column S2
counts["GAPDH", "S1"]  # a single cell

# Margin operations
rowSums(counts)        # total count per gene
colMeans(counts)       # mean count per sample
apply(counts, 1, var)  # across-sample variance per gene

# Transpose
t(counts)

⚠️

Memory trap: R stores matrices in column-major order, so the elements of one column sit next to each other in memory. On large matrices, column-wise operations (colSums) therefore tend to be somewhat faster than row-wise ones (rowSums). In the usual "genes × samples" layout, per-sample operations already run along columns; if a workflow is dominated by per-gene (row-wise) work, transposing once up front can make downstream code faster.
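Column-major storage is easy to see on a tiny matrix. The sketch below shows the fill order and linear indexing, then runs a rough timing comparison on a hypothetical random matrix (big); the timings depend on machine and matrix size, so no expected values are given.

```r
m <- matrix(1:6, nrow = 2)   # filled column by column by default
m
as.vector(m)                 # 1 2 3 4 5 6: the underlying storage order
m[3]                         # 3: a linear index walks down columns first
m[1, 2]                      # 3: the same element addressed by [row, col]

# Rough timing comparison on an illustrative random matrix
big <- matrix(runif(2000 * 500), nrow = 2000)
system.time(for (i in 1:20) colSums(big))
system.time(for (i in 1:20) rowSums(big))
```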

Interactive

4. Matrix indexing: interactive simulation

Toggle which rows / columns to keep and see how subsetting carves the matrix.
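The same row and column toggling can be reproduced in code. This is a minimal sketch using the counts matrix from the Matrix section; keep_genes and keep_samples are hypothetical selections, not values from the original page.

```r
counts <- matrix(c(120, 80, 95,  10, 22, 18,  500, 480, 510,  3, 5, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("TP53","ACTB","GAPDH","XIST"),
                                 c("S1","S2","S3")))

keep_genes   <- c("TP53", "GAPDH")      # rows to keep, selected by name
keep_samples <- c(TRUE, FALSE, TRUE)    # logical mask over the columns

sub <- counts[keep_genes, keep_samples]
sub
dim(sub)                                # 2 2

# Keeping a single row drops to a named vector unless drop = FALSE
one_row <- counts["TP53", ]
is.matrix(one_row)                      # FALSE
one_row_m <- counts["TP53", , drop = FALSE]
dim(one_row_m)                          # 1 3
```

Passing drop = FALSE is worth the extra typing in functions that must keep working when a filter happens to leave a single gene or sample.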

Mixed containers

5. List

The list is R's most flexible container: each element can be anything (a vector, a matrix, another list, a model object, ...). Many complex results are lists under the hood; the fit returned by lm(), for example, is a list with a class attribute. Bioconductor outputs such as those from DESeq2 are S4 objects that wrap list-like components.

# Create
study <- list(
  name      = "TCGA-BRCA",
  n_samples = 1097,
  conditions = c("Tumor", "Normal"),
  counts    = matrix(rpois(20, lambda = 50), nrow = 5),
  metadata  = data.frame(sample = paste0("S", 1:4),
                         age    = c(45, 67, 52, 39))
)

# Three ways to access
study$name              # by name with $
study[["n_samples"]]    # by name with [[ ]]
study[[1]]              # by position with [[ ]]

# The crucial difference between [[ and [
study["counts"]         # returns a SUB-LIST containing only the counts element
study[["counts"]]       # returns the counts matrix itself

# Nested lists
results <- list()
results$dge <- list(up = c("MYC","TP53"), down = c("CDKN2A"))
results$dge$up          # access nested
results[["dge"]][["up"]]  # equivalent

# Apply a function over a list (or vector); always returns a list
lapply(study$conditions, toupper)

💡

See $, think list: obj$something usually means obj is a list or a list-like type such as a data.frame. Environments and S4 classes that define their own $ method also support it, but S4 slots themselves are accessed with @.
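The link between $ and lists can be checked directly on a model object. A small sketch, assuming the built-in mtcars dataset as example data (it is not used elsewhere in this tutorial):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # mtcars ships with base R
class(fit)                           # "lm"
is.list(fit)                         # TRUE
head(names(fit), 3)                  # "coefficients" "residuals" "effects"
fit$coefficients                     # same as fit[["coefficients"]]

# $ does not work on atomic vectors: it raises an error
v <- c(a = 1, b = 2)
res <- tryCatch(v$a, error = function(e) "error")
res                                  # "error"
```

For a named atomic vector, use v["a"] or v[["a"]] instead.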

Tables

6. data.frame: the most commonly used!

# Create
sample_info <- data.frame(
  sample    = c("S1", "S2", "S3", "S4"),
  condition = c("ctrl", "ctrl", "trt", "trt"),
  batch     = c(1, 1, 2, 2),
  age       = c(45, 52, 38, 61),
  stringsAsFactors = FALSE   # needed in R < 4.0; the default is FALSE from R 4.0
)
sample_info
str(sample_info)        # inspect column types
summary(sample_info)
nrow(sample_info)
colnames(sample_info)

# Indexing: [row, col] or $col
sample_info$age                      # the age column (as a vector)
sample_info[ , "age"]                # same as above
sample_info[ , c("sample","age")]    # several columns
sample_info[1:2, ]                   # first two rows
sample_info[sample_info$age > 50, ]  # filter rows by condition

# Add a column
sample_info$age_group <- ifelse(sample_info$age > 50, "old", "young")

# Sort
sample_info[order(sample_info$age), ]                    # ascending
sample_info[order(-sample_info$age), ]                   # descending
sample_info[order(sample_info$condition, sample_info$age), ]  # multi-key
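Two details behind the indexing and sorting code above are worth seeing on their own: a data.frame is a list of equal-length columns, and order() returns row positions whereas sort() returns the sorted values. A minimal sketch on a reduced copy of sample_info:

```r
sample_info <- data.frame(
  sample    = c("S1", "S2", "S3", "S4"),
  condition = c("ctrl", "ctrl", "trt", "trt"),
  age       = c(45, 52, 38, 61)
)

is.list(sample_info)           # TRUE: a list of equal-length columns
length(sample_info)            # 3: the number of columns, not rows
sapply(sample_info, class)     # each column has its own type

# Selecting one column with [ , ] drops to a vector by default
class(sample_info[, "age"])                 # "numeric"
class(sample_info[, "age", drop = FALSE])   # "data.frame"

# sort() returns sorted values; order() returns row positions for indexing
sort(sample_info$age)          # 38 45 52 61
order(sample_info$age)         # 3 1 2 4
```

This is why df[order(df$age), ] sorts the rows, while df[sort(df$age), ] would try to use the ages themselves as row numbers.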

Modern version

7. tibble: an upgraded data.frame

tibble is the tidyverse's near drop-in replacement for data.frame. The differences are small but make it friendlier:

Smart printing: shows only the first 10 rows, plus each column's type.
No silent string-to-factor coercion (a historical footgun; data.frame also stopped doing this by default in R 4.0).
Strict subsetting: $ never partially matches column names, and a misspelled column triggers a warning instead of silently returning NULL.
Any column name is allowed (spaces, special characters; wrap the name in backticks, e.g. `column name`).

library(tibble)
tb <- tibble(
  sample    = c("S1", "S2", "S3", "S4"),
  condition = c("ctrl", "ctrl", "trt", "trt"),
  age       = c(45, 52, 38, 61)
)
tb                       # prints more nicely
class(tb)                # "tbl_df" "tbl" "data.frame": still a data.frame
as_tibble(iris)          # upgrade a data.frame to a tibble
as.data.frame(tb)        # the reverse: tibble → data.frame
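The subsetting differences are easiest to see side by side. The sketch below assumes R 4.0 or later (so strings stay character) and guards the tibble part with requireNamespace() in case the package is not installed; df and tb_cond are illustrative names.

```r
df <- data.frame(sample = c("S1", "S2"), condition = c("ctrl", "trt"))

# data.frame: $ partially matches column names, [ , j] drops to a vector
df$cond                              # "ctrl" "trt" (matched 'condition')
class(df[, "condition"])             # "character"

if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  # tibble: no partial matching; an unknown name gives NULL plus a warning
  tb_cond <- suppressWarnings(tb$cond)
  print(is.null(tb_cond))            # TRUE
  # [ , j] always returns a tibble, never a bare vector
  print(class(tb[, "condition"]))    # "tbl_df" "tbl" "data.frame"
}
```

To pull a single column out of a tibble as a vector, use tb$condition or tb[["condition"]].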

Categorical

8. Factor: categorical variables

A factor is a categorical type with a fixed set of levels. It is stored internally as integer codes and displayed as string labels. Factors are crucial for statistical modeling, ggplot ordering, and survival grouping.

cond <- factor(c("ctrl", "trt", "ctrl", "trt", "trt"))
cond
levels(cond)                  # "ctrl" "trt"
table(cond)                   # count per level

# Custom level order (ggplot arranges plots in this order)
cond2 <- factor(c("ctrl","trt","ctrl"), levels = c("trt", "ctrl"))
levels(cond2)                 # "trt" "ctrl"  ← reversed

# Set the reference level: the baseline for linear models
cond3 <- relevel(cond, ref = "trt")
levels(cond3)                 # "trt" now comes first

# Ordered factor: levels have a rank (e.g. stage I < II < III)
stage <- factor(c("II","I","III","II"),
                levels = c("I","II","III"), ordered = TRUE)
stage
stage[1] > stage[2]           # TRUE  ← II > I

# Converting between character and factor
as.character(cond)             # "ctrl" "trt" ...
factor(c("low","high","mid"))  # levels are sorted alphabetically by default
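To see why the reference level matters for modeling, it helps to look at the design matrix R builds from a factor. A minimal sketch using base R's model.matrix(); cond_trt_ref is an illustrative name for the releveled factor.

```r
cond <- factor(c("ctrl", "trt", "ctrl", "trt", "trt"))
as.integer(cond)                    # 1 2 1 2 2: the stored integer codes

# Default baseline is the first level ("ctrl"); the design has a 'trt' column
colnames(model.matrix(~ cond))      # "(Intercept)" "condtrt"

# After relevel, "trt" is the baseline and the column describes 'ctrl'
cond_trt_ref <- relevel(cond, ref = "trt")
colnames(model.matrix(~ cond_trt_ref))   # "(Intercept)" "cond_trt_refctrl"
```

The coefficient for the non-baseline column is the estimated difference from the baseline, so swapping the reference level flips the sign of that effect.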

🚨

Classic disaster: when converting a factor of numbers back to numeric, calling as.numeric() directly returns the level codes (positions), not the original values! Use as.numeric(as.character(x)).

f <- factor(c(10, 20, 30))
as.numeric(f)               # 1 2 3   ← WRONG (level codes)
as.numeric(as.character(f)) # 10 20 30 ← right

Advanced

9. A first look at S4 objects (the core of Bioconductor)

R has three main object-oriented systems: S3 (simple, most common), S4 (strict, the Bioconductor standard), and R6 (Python-like classes, less common). Bioconductor uses S4 extensively, so you will often meet "slots", which are accessed with @ rather than $.

# Conceptual demo: pretend SummarizedExperiment is loaded
# (Pseudo-code; to try it for real: BiocManager::install('SummarizedExperiment'))

# se is an S4 object
# se@assays            # slot access with @
# se@colData           # sample metadata
# se@rowRanges         # gene info as GRanges (slot of RangedSummarizedExperiment)
# slotNames(se)        # list all slots

# Recommended: use accessor functions
# assay(se)            # extract the first assay (e.g. the counts matrix)
# colData(se)          # extract sample metadata
# rowRanges(se)        # extract gene ranges

# Why? @ ties your code to the internal layout, and assigning with @
# skips validity checks; accessors are the stable, safer interface

# Inspect the object's structure
# str(se, max.level = 2)
# isVirtualClass('SummarizedExperiment')
cat("S4 demo — install SummarizedExperiment to try real objects.\n")
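The slot and accessor ideas can be tried without installing Bioconductor, using only the methods package that ships with R. The sketch below defines a hypothetical toy class, MiniExperiment, with a hypothetical accessor sampleNames2(); neither exists in any real package, and the real SummarizedExperiment class is considerably richer.

```r
library(methods)

# Hypothetical toy class; not part of any Bioconductor package
setClass("MiniExperiment",
         representation(counts = "matrix", samples = "character"),
         validity = function(object) {
           if (ncol(object@counts) != length(object@samples))
             return("number of samples must match columns of counts")
           TRUE
         })

# Accessor generic and method, in the style of assay() / colData()
setGeneric("sampleNames2", function(x) standardGeneric("sampleNames2"))
setMethod("sampleNames2", "MiniExperiment", function(x) x@samples)

me <- new("MiniExperiment",
          counts  = matrix(1:6, nrow = 2),
          samples = c("S1", "S2", "S3"))

isVirtualClass("MiniExperiment")    # FALSE: it can be instantiated
slotNames(me)                       # "counts" "samples"
sampleNames2(me)                    # "S1" "S2" "S3"

# Direct slot assignment with @ does not run the validity function...
me@samples <- c("S1", "S2")
# ...so the object stays inconsistent until validObject() is called
check <- tryCatch(validObject(me), error = function(e) "invalid")
check                               # "invalid"
```

Package authors typically call validObject() inside their setter functions, which is one reason accessors are the safer route.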

💡

Tip: beginners can treat S4 as a black box. Knowing the accessor functions (assay(), colData(), ...) is enough; the internals are covered in detail in chapter 9 on Bioconductor.

📝 Self-check

1. For L <- list(a=1:3, b="hi"), what differs between L["a"] and L[["a"]]?

A. Identical

B. L["a"] errors

C. L["a"] returns a sub-list; L[["a"]] returns the vector itself

D. L[["a"]] isn't valid R

2. To convert f <- factor(c("10","20","30")) back to numbers, the correct call is?

A. as.numeric(f)

B. as.integer(f)

C. as.numeric(as.character(f))

D. numeric(f)

3. Sort a data.frame by age ascending: which is correct?

A. df[sort(df$age), ]

B. df[order(df$age), ]

C. sort(df, "age")

D. df.sort(by="age")