Step 2: R Basics — R Bioinformatics Tutorial

Operations

1. Operators

Category	Operators	Example
Arithmetic	`+ - * /` `^` `%%` `%/%`	`5 %% 3` → 2 (remainder); `5 %/% 3` → 1 (integer division)
Comparison	`== != > < >= <=`	`3 == 3` → `TRUE`
Logical	`& | !` `&& ||`	`TRUE & FALSE` → `FALSE`
Assignment	`<- = ->`	`x <- 5` (`<-` is recommended)
Pipe	`|>` (R 4.1+) / `%>%` (magrittr)	`iris |> head()`
Sequence	`:`	`1:5` → `1 2 3 4 5`
Indexing	`[ ] [[ ]] $`	`x[1]`, `list[[1]]`, `df$col`

⚠️

Single vs double: & / | are vectorized (element-wise); && / || expect length-1 operands, short-circuit, and are meant for if conditions. Through R 4.1 they silently used only the first element of a longer vector; R 4.2 added a warning and R 4.3 turned it into an error. Misusing && on a vector is one of R's most common bugs.
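A minimal sketch of the difference, using made-up logical vectors (the names `passes_fdr` and `passes_lfc` are illustrative and not part of the tutorial data):

```r
# Two hypothetical per-gene filters (illustrative values)
passes_fdr <- c(TRUE, FALSE, TRUE)
passes_lfc <- c(TRUE, TRUE, FALSE)

# & works element-wise and returns one result per gene
keep <- passes_fdr & passes_lfc
print(keep)

# For an if() condition, collapse the vector to a single value first
if (any(keep)) {
  cat("At least one gene passes both filters\n")
}

# && short-circuits: the right-hand side is skipped when the left is FALSE
short_circuit <- FALSE && stop("never evaluated")
print(short_circuit)
```

Collapsing with any() or all() before if() avoids the length-1 requirement of && and || altogether.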

Types

2. Atomic Types

🔢

numeric / double

Real numbers; the default numeric type. 3.14, 1e-5, Inf, -Inf, and NaN are all doubles.

🔟

integer

Add the suffix L to a number: 5L. The distinction rarely matters; it is mainly useful when memory is tight or when interfacing with C.

🔤

character

Strings, written in double quotes "abc" or single quotes 'abc'. A string in R is always a character vector (a single string has length 1).

✅

logical

TRUE / FALSE. The shorthands T and F work but are not recommended, because they are ordinary variables that can be reassigned. NA marks a missing value.

🌀

complex

Complex numbers: 2 + 3i. Rarely used in bioinformatics.

📦

raw

Raw bytes, used for handling binary data.

Type checking and conversion

x <- 3.14
typeof(x)        # "double"
class(x)         # "numeric"
is.numeric(x)    # TRUE

# Coerce explicitly
as.integer(3.9)        # 3 (truncates toward zero, doesn't round!)
as.character(42)       # "42"
as.numeric("3.14")     # 3.14
as.numeric("abc")      # NA + warning

# Implicit coercion picks the most general type present:
# logical < integer < double < character
c(1, 2, "3")          # all become character: "1" "2" "3"
c(TRUE, 5L, 3.14)     # all become double:    1.00 5.00 3.14

💡

typeof() vs class(): typeof() reports the underlying storage type (double, integer, character, ...); class() reports the R object class (numeric, data.frame, lm, ...). In practice, class() is used more often.
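The two functions disagree most visibly on structured objects. The sketch below uses a small made-up matrix and data frame (the names `m` and `df` are illustrative) to show one storage type sitting underneath a different class:

```r
# Illustrative objects, not part of the tutorial data
m  <- matrix(1:6, nrow = 2)
df <- data.frame(gene = c("TP53", "EGFR"), count = c(10L, 20L))

m_type   <- typeof(m)    # storage: "integer"
m_class  <- class(m)     # class:   "matrix" "array" (R 4.0+)
df_type  <- typeof(df)   # storage: "list"
df_class <- class(df)    # class:   "data.frame"

print(list(m_type, m_class, df_type, df_class))

# inherits() is a robust way to test class membership
is_df <- inherits(df, "data.frame")
print(is_df)
```

Because class() can return more than one string, a test such as inherits(x, "matrix") tends to be safer than comparing class(x) with ==.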

Core concepts

3. Vectors and Vectorization (the Soul of R)

R has no scalars: even 3.14 is a vector of length 1. Arithmetic, comparisons, and most math functions operate element-wise by default, which is what "vectorization" means. Done right, a computation over 100,000 gene expression values takes one line and typically runs far faster than an explicit for loop, often by an order of magnitude or more depending on the operation and how the loop is written.

# Create a vector
gene_expr <- c(5.2, 8.1, 3.4, 9.7, 2.1)
length(gene_expr)            # 5

# Vectorized operations (no for loop needed)
log2(gene_expr + 1)          # log2 transform of all 5 at once
gene_expr * 2                # multiply each by 2
gene_expr > 5                # logical vector: TRUE TRUE FALSE TRUE FALSE

# Element-wise operations on two vectors
treated <- c(6.0, 9.5, 4.0, 11.2, 2.8)
log2_fc <- log2(treated / gene_expr)   # log2 fold-change per gene
log2_fc

# Indexing (1-based, not 0-based!)
gene_expr[1]                 # first element
gene_expr[c(1, 3, 5)]        # 1st, 3rd, 5th
gene_expr[-1]                # all EXCEPT first (negative indices drop)
gene_expr[gene_expr > 5]     # logical filtering: the most powerful idiom

⚠️

Recycling trap: when two vectors of different lengths are combined, the shorter one is reused from its start until it matches the longer one. If the longer length is not a multiple of the shorter, R still returns a result but emits a warning.

c(1,2,3,4) + c(10,20)   # → c(11,22,13,24)  ← shorter vector is recycled
c(1,2,3) + c(10,20)     # → c(11,22,13) + warning
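One way to guard against silent recycling is to check lengths before combining. The helper below is a hypothetical sketch (`add_strict` is not a base R function) that allows only equal lengths or a length-1 operand, which is the one recycling case that is usually intended:

```r
# Hypothetical helper: refuse to recycle unless one operand has length 1
add_strict <- function(a, b) {
  ok <- length(a) == length(b) || length(a) == 1 || length(b) == 1
  if (!ok) {
    stop("length mismatch: ", length(a), " vs ", length(b))
  }
  a + b
}

scaled <- add_strict(c(1, 2, 3, 4), 10)   # length-1 operand is allowed
print(scaled)

# A 4-vs-2 mismatch now fails loudly instead of recycling silently
result <- tryCatch(
  add_strict(c(1, 2, 3, 4), c(10, 20)),
  error = function(e) conditionMessage(e)
)
print(result)
```

The same check can be written inline with stopifnot(length(a) == length(b)) when a dedicated helper is not worth it.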

Interactive simulation

4. Vectorization Performance Comparison

Adjust the vector length to compare the relative speed of the vectorized and for-loop computations of log2(x+1).


Simulated data. In real R sessions, vectorized code is typically around 10–500× faster than a naive for loop; the exact ratio depends on the operation, the vector length, and the R version.
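The comparison can be reproduced locally with system.time(). This is a rough sketch, not a rigorous benchmark: the vector length, the random input, and the function names `vectorized` and `looped` are assumptions made for illustration, and timings vary by machine.

```r
set.seed(1)
x <- runif(1e5, min = 0, max = 1000)   # illustrative input values

# Vectorized version: one call over the whole vector
vectorized <- function(x) log2(x + 1)

# Loop version: preallocate the result, then fill element by element
looped <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- log2(x[i] + 1)
  }
  out
}

t_vec  <- system.time(res_vec  <- vectorized(x))[["elapsed"]]
t_loop <- system.time(res_loop <- looped(x))[["elapsed"]]

# Both approaches must give the same numbers
same <- isTRUE(all.equal(res_vec, res_loop))
print(same)

# The vectorized time may round to 0 on fast machines, so report raw timings
cat("vectorized:", t_vec, "s | loop:", t_loop, "s\n")
```

For more reliable numbers, repeat each measurement several times or use a dedicated benchmarking package.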

Details

5. Naming Rules and Missing Values

Variable naming

Allowed characters: letters, digits, ., _
Must start with a letter, or with . not followed by a digit; cannot start with a digit or _
Reserved words cannot be used (if, for, TRUE, NULL, NA, ...)
Case-sensitive: Gene ≠ gene
Convention: snake_case (e.g. gene_count); avoid . in names (it can clash with S3 method dispatch)
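These rules can be checked programmatically. The sketch below relies on base R's make.names(), which rewrites a string into a syntactically valid name, so a candidate that comes back unchanged was already valid. The candidate strings are made up for illustration.

```r
# Illustrative candidate names
candidates <- c("gene_count", "2genes", ".5x", "if", "Gene", "_tmp")

# A name is syntactically valid if make.names() leaves it untouched
is_valid <- make.names(candidates) == candidates
print(setNames(is_valid, candidates))

# Invalid names are still usable when wrapped in backticks, though rarely wise
`2genes` <- 5
print(`2genes`)
```

Backtick-quoted names mostly turn up as column names read from files; renaming them early tends to keep later code simpler.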

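Differences between NA, NULL, NaN, and Inf

Symbol	Meaning	Test function
`NA`	Missing (unknown) value; keeps its position	`is.na(x)`
`NULL`	Nonexistent; occupies no position, length 0	`is.null(x)`
`NaN`	Not a Number, e.g. 0/0	`is.nan(x)`
`Inf` / `-Inf`	Infinity, e.g. 1/0	`is.infinite(x)`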

x <- c(1, 2, NA, 4, NaN, Inf)
mean(x)                       # NA  ← any NA poisons the result
mean(x, na.rm = TRUE)         # Inf  ← NA and NaN are dropped, but Inf remains
mean(x[is.finite(x)])         # 2.333333 (drops NA, NaN, and Inf)
sum(is.na(x))                 # 2 (NA + NaN both count as NA)
x[!is.na(x)]                  # remove NAs
na.omit(x)                    # same idea, returns no-NA copy

# is.na vs is.nan
is.na(NaN)                    # TRUE  ← NaN is also NA
is.nan(NA)                    # FALSE ← but NA is NOT NaN
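The "keeps its position" versus "occupies no position" distinction between NA and NULL is easiest to see through vector lengths. A small sketch with made-up values:

```r
with_na   <- c(1, NA, 3)     # NA holds its slot
with_null <- c(1, NULL, 3)   # NULL vanishes when combined

len_na   <- length(with_na)
len_null <- length(with_null)
print(c(len_na, len_null))

# Comparing with NA yields NA, so test with is.na() rather than ==
na_cmp <- NA == NA
print(na_cmp)

count_missing <- sum(is.na(with_na))
print(count_missing)
```

Because `x == NA` never returns TRUE, a filter such as `x[x == NA]` does not select missing values; `x[is.na(x)]` does.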

Flow

6. Control Flow

# Standard if / else if / else
fdr <- 0.03
if (fdr < 0.05) {
  cat("Significant", "\n")
} else if (fdr < 0.10) {
  cat("Borderline", "\n")
} else {
  cat("Not significant", "\n")
}

# Inline if (if/else is an expression, so its value can be assigned)
status <- if (fdr < 0.05) "sig" else "ns"

# for loop
genes <- c("TP53", "BRCA1", "EGFR")
for (g in genes) {
  cat("Processing:", g, "\n")
}

# seq_along() is safer than 1:length(): it is empty for a zero-length vector
for (i in seq_along(genes)) {
  cat(i, ":", genes[i], "\n")
}

# while / repeat with break / next
i <- 1
while (i <= 5) {
  if (i == 3) { i <- i + 1; next }   # skip 3
  if (i == 5) break                       # exit at 5
  print(i)
  i <- i + 1
}

# apply family: apply a function over a data structure
m <- matrix(1:12, nrow = 3)

apply(m, 1, sum)   # MARGIN = 1: row sums
apply(m, 2, mean)  # MARGIN = 2: column means

# sapply: apply a function over a list/vector and simplify the result
sapply(1:5, function(x) x^2)        # numeric vector

# lapply: always returns list
lapply(1:5, function(x) x^2)

# mapply: multi-argument apply
mapply(function(a, b) a+b, 1:3, 10:12)  # 11 13 15

# tapply: split by factor, apply
tapply(iris$Sepal.Length, iris$Species, mean)

# ifelse() is the vectorized form of if
fdr <- c(0.001, 0.04, 0.10, 0.50)
ifelse(fdr < 0.05, "sig", "ns")
#> "sig" "sig" "ns" "ns"

# dplyr::case_when is the multi-branch version (requires the dplyr package)
dplyr::case_when(
  fdr < 0.001 ~ "***",
  fdr < 0.01  ~ "**",
  fdr < 0.05  ~ "*",
  TRUE         ~ "ns"
)

💡

Mantra: "Before writing a for loop in R, ask: can I vectorize it? Can I use the apply family? Can I use purrr::map?" For loops are not forbidden, but they often signal that a more idiomatic R alternative exists.
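To make the mantra concrete, the sketch below computes per-gene means three ways on a small simulated matrix. The data, the dimensions, and the name `expr` are assumptions for illustration only.

```r
set.seed(1)
# Hypothetical expression matrix: 4 genes (rows) x 3 samples (columns)
expr <- matrix(rpois(12, lambda = 20), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# 1) Explicit for loop with a preallocated result
loop_means <- numeric(nrow(expr))
for (i in seq_len(nrow(expr))) {
  loop_means[i] <- mean(expr[i, ])
}
names(loop_means) <- rownames(expr)

# 2) apply family: one call, no index bookkeeping
apply_means <- apply(expr, 1, mean)

# 3) Dedicated vectorized function
vec_means <- rowMeans(expr)

# All three approaches agree
agree <- isTRUE(all.equal(loop_means, apply_means)) &&
         isTRUE(all.equal(apply_means, vec_means))
print(agree)
```

When a dedicated function such as rowMeans() exists it is usually the shortest and fastest option; the loop remains reasonable when each iteration depends on the previous one.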

Functions

7. Writing Your Own Functions

# Basic syntax
log2_fc <- function(treated, control) {
  log2(treated / control)
}
log2_fc(20, 10)              # 1
log2_fc(c(20, 40), c(10, 10))   # 1 2  (vectorized!)

# Default arguments
filter_dge <- function(df, fdr_cutoff = 0.05, lfc_cutoff = 1) {
  df[df$padj < fdr_cutoff & abs(df$log2FC) > lfc_cutoff, ]
}

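# Variable arguments with ... (dots)
# Putting ... first means sep can only be matched by name,
# so positional arguments are never mistaken for the separator
my_paste <- function(..., sep = "_") {
  args <- list(...)
  paste(unlist(args), collapse = sep)
}
my_paste("TP53", "high", "tumor", sep = "-")    # "TP53-high-tumor"

# Anonymous function (R 4.1+ shorthand)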
sapply(1:5, \(x) x^2)        # equivalent to function(x) x^2

# Explicit return() vs implicit return of the last value
f1 <- function(x) { return(x * 2) }
f2 <- function(x) { x * 2 }     # same: the last value is returned

# Local vs global scope
x <- 10
f <- function() {
  x <- 99    # local — doesn't change outer x
  x
}
f()         # 99
x           # still 10

⚠️

<- vs <<-: <- assigns in the current environment; <<- ("super-assignment") searches the enclosing environments for an existing variable of that name and modifies it, creating one in the global environment if none is found. Beginners should avoid <<- because it makes debugging very hard.
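A closure is one of the few places where <<- is commonly considered acceptable. The counter below is a minimal sketch (the names `make_counter`, `counter`, and `add_local` are illustrative) contrasting it with plain <-:

```r
# <<- updates `count` in the enclosing function's environment
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()
counter()
third <- counter()   # state persists between calls
print(third)

# Plain <- only creates a local copy; the outer variable is untouched
total <- 0
add_local <- function() {
  total <- total + 1
  total
}
local_result <- add_local()
print(c(local_result, total))
```

Here the modified variable lives inside make_counter() rather than in the global environment, which keeps the side effect contained.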

Toolbox

8. Built-in Functions You Must Know

Task	Functions
Summary statistics	`mean()` `median()` `sd()` `var()` `range()` `quantile()` `summary()`
Sorting	`sort(x)` returns the sorted values; `order(x)` returns the positions that would sort x (used to sort a data.frame); `rank(x)` returns each element's rank
Sets	`union()` `intersect()` `setdiff()` `%in%` `unique()` `duplicated()`
Strings	`paste()` `paste0()` `sprintf()` `nchar()` `substr()` `gsub()` `strsplit()` `toupper()`
Vector construction	`c()` `seq()` `seq_len()` `seq_along()` `rep()` `rev()`
Random numbers	`set.seed()` `sample()` `rnorm()` `runif()` `rpois()`
Inspection	`str()` `head()` `tail()` `dim()` `nrow()` `ncol()` `names()`

set.seed(42)             # reproducible random
x <- rnorm(1000, mean = 5, sd = 2)
summary(x)
quantile(x, probs = c(0.025, 0.975))   # 2.5th and 97.5th percentiles (central 95% of the sample, not a confidence interval)

# String pasting
sprintf("Gene %s | log2FC = %.2f | FDR = %.3e", "TP53", 1.85, 3.2e-10)
paste("chr", 1:3, sep = "")           # "chr1" "chr2" "chr3"
paste0("sample_", 1:5)                # "sample_1" ... "sample_5"

# Set operations
up_genes   <- c("TP53", "BRCA1", "EGFR")
down_genes <- c("MYC",  "BRCA1", "ALK")
intersect(up_genes, down_genes)        # "BRCA1" — common
setdiff(up_genes,  down_genes)         # genes only in up
union(up_genes,    down_genes)         # combined
"TP53" %in% up_genes                   # TRUE

# Sort a data.frame
df <- data.frame(gene = c("A","B","C"), pval = c(0.1, 0.001, 0.05))
df[order(df$pval), ]                   # sorted by pval ascending
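The difference between sort(), order(), and rank() is easiest to see side by side. The sketch below reuses the same three p-values; the variable names are illustrative.

```r
pval <- c(0.1, 0.001, 0.05)

sorted_vals <- sort(pval)    # the values themselves, ascending
positions   <- order(pval)   # indices that would sort pval
ranks       <- rank(pval)    # rank of each element in its original position

print(sorted_vals)
print(positions)
print(ranks)

# order() is the one to use for row indexing, including descending sorts
df <- data.frame(gene = c("A", "B", "C"), pval = pval)
asc  <- df[order(df$pval), ]
desc <- df[order(df$pval, decreasing = TRUE), ]
print(asc$gene)
print(desc$gene)
```

Indexing with sort(df$pval) would pass the p-values themselves as row numbers, which is why order() is needed here.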

📝 Self-check

1. What does c(1, 2, 3) + c(10, 20) return?

A. Error: vectors of different lengths cannot be added

B. c(11, 22, 3)

C. c(11, 22, 13) with a warning (recycling)

D. c(11, 22, 13, 23)

2. Given x <- c(TRUE, FALSE, TRUE), what is x && FALSE?

A. c(FALSE, FALSE, FALSE)

B. FALSE (only the first element is used); this applies up to R 4.2, with a warning in R 4.2

C. TRUE

D. Error (R 4.3+, where && requires length-1 operands)

3. To sort a data.frame by one column, which is correct?

A. sort(df, by = "pval")

B. df[sort(df$pval), ]

C. df[order(df$pval), ]

D. rank(df$pval)