STEP 2 / 16

R Syntax Basics

Operators, types, variables, vectorization, control flow, functions: master these and you will be able to read the great majority of R code.

1. Operators

Category     Operators                       Example
Arithmetic   + - * / ^ %% %/%                5 %% 3 → 2 (remainder); 5 %/% 3 → 1 (integer division)
Comparison   == != > < >= <=                 3 == 3 → TRUE
Logical      & | ! && ||                     TRUE & FALSE → FALSE
Assignment   <- = ->                         x <- 5 (prefer <-)
Pipe         |> (R 4.1+) / %>% (magrittr)    iris |> head()
Sequence     :                               1:5 → 1 2 3 4 5
Indexing     [ ] [[ ]] $                     x[1], list[[1]], df$col
⚠️
Single vs double: & / | are vectorized (element-wise); && / || expect a single TRUE/FALSE on each side and are meant for if conditions. Given a longer vector, && and || silently used only the first element in older R, emit a warning in R 4.2, and raise an error from R 4.3 onward. Misusing && on a vector is a common beginner mistake.
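
A minimal sketch of the difference, using made-up adjusted p-values and fold-changes (the variable names are illustrative only). It avoids the version-dependent case entirely by collapsing a logical vector with any() before handing it to if:

```r
# Hypothetical values for four genes
padj   <- c(0.001, 0.20, 0.03, 0.80)
log2fc <- c(2.5,   3.0,  0.4, -1.8)

# Element-wise AND: one TRUE/FALSE per gene
hits <- padj < 0.05 & abs(log2fc) > 1
hits                      # TRUE FALSE FALSE FALSE

# if() needs a single TRUE/FALSE, so collapse the vector first
if (any(hits)) {
  cat("At least one gene passes both filters\n")
}

# && is for single values; it short-circuits, so the right-hand
# side is skipped when the left-hand side is already FALSE
n_hits <- sum(hits)
has_several <- n_hits > 0 && n_hits >= 2
has_several               # FALSE
```

Use all() instead of any() when every element must satisfy the condition.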

2. Atomic Types

🔢

numeric / double

Real numbers; the default numeric type. 3.14, 1e-5, Inf, -Inf and NaN all belong here.

🔟

integer

Written with an L suffix, e.g. 5L. You rarely need to distinguish it from double; it matters mainly when memory is tight or when interfacing with C.

🔤

character

Strings, in double quotes "abc" or single quotes 'abc'. A string in R is always a vector: a single string is a character vector of length 1.

logical

TRUE / FALSE. The shorthands T and F exist but are not recommended, because they are ordinary variables that can be reassigned. NA marks a missing value.

🌀

complex

Complex numbers, e.g. 2 + 3i. Rarely used in bioinformatics.

📦

raw

Raw bytes, used for handling binary data.

Type checking and coercion

x <- 3.14
typeof(x)        # "double"
class(x)         # "numeric"
is.numeric(x)    # TRUE

# Explicit coercion
as.integer(3.9)        # 3 (truncates, doesn't round!)
as.character(42)       # "42"
as.numeric("3.14")     # 3.14
as.numeric("abc")      # NA + warning

# Implicit coercion hierarchy (least to most general): logical < integer < numeric < character
c(1, 2, "3")          # all become character: "1" "2" "3"
c(TRUE, 5L, 3.14)     # all become numeric:  1.00 5.00 3.14
💡
typeof() vs class(): typeof() reports the internal storage type (double, integer, character, ...); class() reports the object's class as R uses it for method dispatch (numeric, data.frame, lm, ...). In day-to-day work class() is the one you will use more often.
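
The two functions agree on simple vectors but diverge on richer objects. A short illustration, using a small made-up data frame and factor:

```r
# Hypothetical objects where storage type and class differ
df  <- data.frame(gene = c("TP53", "EGFR"), count = c(10L, 25L))
grp <- factor(c("ctrl", "treated", "ctrl"))

typeof(df)          # "list"       (a data.frame is stored as a list of columns)
class(df)           # "data.frame"

typeof(grp)         # "integer"    (factors are integer codes plus labels)
class(grp)          # "factor"

typeof(df$count)    # "integer"
class(df$count)     # "integer"

typeof(mean)        # "closure"
class(mean)         # "function"
```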

3. Vectors and vectorization (the soul of R)

R has no scalars: even 3.14 is a vector of length 1. Operations are element-wise by default, which is what "vectorization" means. Done right, a computation over the expression values of 100,000 genes fits on one line and typically runs far faster than the equivalent for loop, often by a factor of ten to several hundred.

# Create a vector
gene_expr <- c(5.2, 8.1, 3.4, 9.7, 2.1)
length(gene_expr)            # 5

# Vectorized operations (no for loop!)
log2(gene_expr + 1)          # log2 transform of all 5 at once
gene_expr * 2                # multiply each by 2
gene_expr > 5                # logical vector: FALSE TRUE FALSE TRUE FALSE

# Element-wise operations on two vectors
treated <- c(6.0, 9.5, 4.0, 11.2, 2.8)
log2_fc <- log2(treated / gene_expr)   # log2 fold-change per gene
log2_fc

# Indexing (1-based, not 0-based!)
gene_expr[1]                 # first element
gene_expr[c(1, 3, 5)]        # 1st, 3rd, 5th
gene_expr[-1]                # all EXCEPT first (negative drops)
gene_expr[gene_expr > 5]     # logical filtering — most powerful idiom!
⚠️
Recycling trap: when two vectors of different lengths are combined, the shorter one is reused from its start until it matches the longer one. If the longer length is not a multiple of the shorter, R still returns a result but emits a warning.
c(1,2,3,4) + c(10,20)   # → c(11,22,13,24)  (short vector recycled)
c(1,2,3) + c(10,20)     # → c(11,22,13) + warning
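
One possible way to guard against accidental recycling is to check lengths before computing. The helper below, safe_ratio(), is a hypothetical name introduced for this sketch and is not part of base R:

```r
# Hypothetical helper: refuse to recycle unless one side has length 1
safe_ratio <- function(a, b) {
  stopifnot(length(a) == length(b) || length(a) == 1 || length(b) == 1)
  a / b
}

safe_ratio(c(6, 9, 4), c(3, 3, 2))   # 2 3 2
safe_ratio(c(6, 9, 4), 3)            # length-1 recycling is still allowed

# A length mismatch now stops instead of silently recycling
res <- tryCatch(safe_ratio(c(1, 2, 3, 4), c(10, 20)),
                error = function(e) "length mismatch")
res                                  # "length mismatch"
```

Allowing length 1 keeps everyday idioms such as x / 2 working while rejecting the ambiguous cases.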

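4. Vectorization performance

This section compares the relative speed of a vectorized call and a for loop when computing log2(x + 1) at different vector lengths.

The figures shown are simulated. In a real R session the vectorized version is typically around 10–500× faster than a naive for loop; the exact ratio depends on vector length, R version and hardware.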
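
To measure the gap on your own machine, something like the following rough sketch could be used. It relies only on base R's system.time(); the vector length and the simulated values are arbitrary choices, and timings will differ between runs and systems, so no expected numbers are given:

```r
set.seed(1)
x <- runif(1e5, min = 0, max = 1000)   # 100,000 simulated expression values

# Vectorized: one call over the whole vector
t_vec <- system.time(y_vec <- log2(x + 1))

# Naive loop: one element at a time into a preallocated result
t_loop <- system.time({
  y_loop <- numeric(length(x))
  for (i in seq_along(x)) {
    y_loop[i] <- log2(x[i] + 1)
  }
})

all.equal(y_vec, y_loop)               # TRUE: same numbers either way
t_vec["elapsed"]                       # seconds for the vectorized call
t_loop["elapsed"]                      # seconds for the loop
```

For very short runs system.time() is coarse; repeating each version many times, or using a dedicated benchmarking package, gives steadier numbers.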

5. Naming rules and handling missing values

Variable naming

  • Allowed characters: letters, digits, . and _
  • Must start with a letter or a . (a leading . may not be followed by a digit); never a digit
  • Reserved words (if, for, TRUE, NULL, NA, ...) cannot be used as names
  • Case-sensitive: Gene and gene are different names
  • Convention: snake_case (e.g. gene_count); avoid . in names, since it can be confused with S3 method naming (generic.class)
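
These rules can be checked programmatically. The sketch below leans on base R's make.names(), which rewrites any string that is not a syntactically valid name; the candidate strings are made up for illustration:

```r
candidates <- c("gene_count", ".hidden", "2nd_sample", "my var", "if")

# A candidate that comes back unchanged was already a valid name
is_valid <- make.names(candidates) == candidates
is_valid                     # TRUE TRUE FALSE FALSE FALSE

make.names("2nd_sample")     # "X2nd_sample"
make.names("my var")         # "my.var"
make.names("if")             # "if."
```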

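NA, NULL, NaN and Inf: how they differ

Symbol       Meaning                                         Test function
NA           Missing (unknown) value; keeps its position     is.na(x)
NULL         Does not exist; takes no position, length 0     is.null(x)
NaN          Not a Number, e.g. 0/0                          is.nan(x)
Inf / -Inf   Infinity, e.g. 1/0                              is.infinite(x)
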
x <- c(1, 2, NA, 4, NaN, Inf)
mean(x)                       # NA  ← any NA poisons the result
mean(x, na.rm = TRUE)         # Inf (NA and NaN are dropped, but Inf remains)
mean(x[is.finite(x)])         # 2.333333 (finite values only: 1, 2, 4)
sum(is.na(x))                 # 2 (NA + NaN both count as NA)
x[!is.na(x)]                  # remove NAs
na.omit(x)                    # same idea, returns no-NA copy

# is.na vs is.nan
is.na(NaN)                    # TRUE  ← NaN is also NA
is.nan(NA)                    # FALSE ← but NA is NOT NaN
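
The "keeps its position" versus "takes no position" distinction is easiest to see by combining values. A short illustration with made-up vectors:

```r
with_na   <- c(1, NA, 3)
with_null <- c(1, NULL, 3)

length(with_na)      # 3: NA holds a position
length(with_null)    # 2: NULL vanishes when combined
length(NULL)         # 0

# is.finite() is FALSE for NA, NaN, Inf and -Inf alike
x <- c(1, 2, NA, 4, NaN, Inf)
is.finite(x)         # TRUE TRUE FALSE TRUE FALSE FALSE
x[is.finite(x)]      # 1 2 4
```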

6. Control flow

# Standard if / else
fdr <- 0.03
if (fdr < 0.05) {
  cat("Significant", "\n")
} else if (fdr < 0.10) {
  cat("Borderline", "\n")
} else {
  cat("Not significant", "\n")
}

# Inline if (R's counterpart to a ternary expression)
status <- if (fdr < 0.05) "sig" else "ns"

# for loop
genes <- c("TP53", "BRCA1", "EGFR")
for (g in genes) {
  cat("Processing:", g, "\n")
}

# seq_along() is safer than 1:length(): it yields an empty sequence for an empty vector
for (i in seq_along(genes)) {
  cat(i, ":", genes[i], "\n")
}

# while / repeat with break / next
i <- 1
while (i <= 5) {
  if (i == 3) { i <- i + 1; next }   # skip 3
  if (i == 5) break                       # exit at 5
  print(i)
  i <- i + 1
}
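
The while example above uses break and next but does not show repeat. A minimal sketch of a repeat loop, with an arbitrary stopping rule chosen for illustration:

```r
# repeat has no condition of its own, so it must exit through break
i <- 0
seen <- numeric(0)
repeat {
  i <- i + 1
  if (i %% 2 == 0) next    # skip even numbers
  if (i > 7) break         # leave the loop once past 7
  seen <- c(seen, i)
}
seen                       # 1 3 5 7
```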

# apply family: apply a function over a data structure
m <- matrix(1:12, nrow = 3)

apply(m, 1, sum)   # MARGIN = 1: row sums
apply(m, 2, mean)  # MARGIN = 2: column means

# sapply: apply a function over a list/vector and simplify the result
sapply(1:5, function(x) x^2)        # numeric vector

# lapply: always returns list
lapply(1:5, function(x) x^2)

# mapply: multi-argument apply
mapply(function(a, b) a+b, 1:3, 10:12)  # 11 13 15

# tapply: split by factor, apply
tapply(iris$Sepal.Length, iris$Species, mean)

# ifelse() is the vectorized form of if
fdr <- c(0.001, 0.04, 0.10, 0.50)
ifelse(fdr < 0.05, "sig", "ns")
#> "sig" "sig" "ns" "ns"

# dplyr::case_when is the multi-branch version (requires the dplyr package)
dplyr::case_when(
  fdr < 0.001 ~ "***",
  fdr < 0.01  ~ "**",
  fdr < 0.05  ~ "*",
  TRUE         ~ "ns"
)
💡
Mantra: "Before writing a for loop in R, stop and ask: can I vectorize this? Use the apply family? Use purrr::map?" For loops are not forbidden, but they often signal that a more idiomatic R alternative exists.
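
As an illustration of that progression, here is the same log2(x + 1) transform written three ways on a small named vector of made-up read counts. vapply() is used in place of sapply() because it declares the expected result type up front:

```r
counts <- c(TP53 = 120, BRCA1 = 0, EGFR = 35)   # made-up read counts

# 1. Explicit loop
out_loop <- numeric(length(counts))
for (i in seq_along(counts)) {
  out_loop[i] <- log2(counts[i] + 1)
}

# 2. Vectorized: log2() already works element-wise
out_vec <- log2(counts + 1)

# 3. vapply(): like sapply(), but the result type is declared
out_vapply <- vapply(counts, function(v) log2(v + 1), numeric(1))

all.equal(out_loop, unname(out_vec))      # TRUE
all.equal(out_vec, out_vapply)            # TRUE
```

When a built-in function is already vectorized, as log2() is here, option 2 is usually the simplest; the apply family earns its place when the per-element function is not.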

7. Writing your own functions

# Basic syntax
log2_fc <- function(treated, control) {
  log2(treated / control)
}
log2_fc(20, 10)              # 1
log2_fc(c(20, 40), c(10, 10))   # 1 2  (vectorized!)

# Default arguments
filter_dge <- function(df, fdr_cutoff = 0.05, lfc_cutoff = 1) {
  df[df$padj < fdr_cutoff & abs(df$log2FC) > lfc_cutoff, ]
}

# Variable arguments with ... (dots); putting ... first means sep must be passed by name
my_paste <- function(..., sep = "_") {
  args <- list(...)
  paste(unlist(args), collapse = sep)
}
my_paste("TP53", "high", "tumor", sep = "-")    # "TP53-high-tumor"

# Anonymous functions (R 4.1+ shorthand)
sapply(1:5, \(x) x^2)        # equivalent to function(x) x^2

# Explicit return vs implicit return of the last expression
f1 <- function(x) { return(x * 2) }
f2 <- function(x) { x * 2 }     # same — last value is returned

# Local vs global scope
x <- 10
f <- function() {
  x <- 99    # local — doesn't change outer x
  x
}
f()         # 99
x           # still 10
⚠️
<- vs <<-: <- assigns in the current environment; <<- walks up through the parent environments and modifies an existing variable of the same name ("super-assignment"). Beginners should avoid <<-, because it makes debugging very hard.
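
To see what "walks up through the parent environments" means in practice, here is a small sketch of a counter closure, one of the few places where <<- is commonly used on purpose. The names make_counter, add_local and total are made up for this example:

```r
make_counter <- function() {
  n <- 0                 # lives in make_counter()'s environment
  function() {
    n <<- n + 1          # updates the enclosing n, not a new local one
    n
  }
}

counter <- make_counter()
counter()                # 1
counter()                # 2

# Contrast: plain <- inside a function only creates a local copy
total <- 0
add_local <- function() { total <- total + 1; total }
add_local()              # 1
total                    # still 0
```

If <<- finds no existing variable in any parent environment, it creates one in the global environment, which is a large part of why it is easy to misuse.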

8. Built-in functions you should know

Task                  Functions
Summary statistics    mean() median() sd() var() range() quantile() summary()
Sorting               sort(x) returns the sorted values; order(x) returns the sorting positions (used to sort a data.frame); rank(x) returns ranks
Sets                  union() intersect() setdiff() %in% unique() duplicated()
Strings               paste() paste0() sprintf() nchar() substr() gsub() strsplit() toupper()
Vector construction   c() seq() seq_len() seq_along() rep() rev()
Random numbers        set.seed() sample() rnorm() runif() rpois()
Inspection            str() head() tail() dim() nrow() ncol() names()

set.seed(42)             # reproducible random
x <- rnorm(1000, mean = 5, sd = 2)
summary(x)
quantile(x, probs = c(0.025, 0.975))   # 2.5th and 97.5th percentiles (central 95% of the sample, not a confidence interval)

# String formatting and pasting
sprintf("Gene %s | log2FC = %.2f | FDR = %.3e", "TP53", 1.85, 3.2e-10)
paste("chr", 1:3, sep = "")           # "chr1" "chr2" "chr3"
paste0("sample_", 1:5)                # "sample_1" ... "sample_5"

# Set operations
up_genes   <- c("TP53", "BRCA1", "EGFR")
down_genes <- c("MYC",  "BRCA1", "ALK")
intersect(up_genes, down_genes)        # "BRCA1" — common
setdiff(up_genes,  down_genes)         # genes only in up
union(up_genes,    down_genes)         # combined
"TP53" %in% up_genes                   # TRUE

# Sort a data.frame
df <- data.frame(gene = c("A","B","C"), pval = c(0.1, 0.001, 0.05))
df[order(df$pval), ]                   # sorted by pval ascending
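
Because sort(), order() and rank() are easy to mix up, a side-by-side comparison may help. The second half is a sketch of sorting by two keys; the data frame res and its padj and lfc columns are made-up values for illustration:

```r
pval <- c(0.1, 0.001, 0.05)

sort(pval)     # 0.001 0.050 0.100  (the values, sorted)
order(pval)    # 2 3 1              (positions that would sort them)
rank(pval)     # 3 1 2              (rank of each element, in place)

# order() accepts several keys: ascending padj, ties broken by larger |lfc|
res <- data.frame(gene = c("A", "B", "C", "D"),
                  padj = c(0.05, 0.001, 0.05, 0.2),
                  lfc  = c(1.2, -2.0, 3.1, 0.4),
                  stringsAsFactors = FALSE)
res[order(res$padj, -abs(res$lfc)), "gene"]   # "B" "C" "A" "D"
```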

📝 Self-check

1. What does c(1, 2, 3) + c(10, 20) return?

A. Error: vectors of different lengths can't be added
B. c(11, 22, 3)
C. c(11, 22, 13) with a warning (recycling)
D. c(11, 22, 13, 23)

2. Given x <- c(TRUE, FALSE, TRUE), what is x && FALSE?

A. c(FALSE, FALSE, FALSE)
B. FALSE in R 4.2 and earlier (only the first element is used; R 4.2 adds a warning)
C. TRUE
D. Error, which is what R 4.3 and later do

3. To sort a data.frame by one column, which is correct?

A. sort(df, by = "pval")
B. df[sort(df$pval), ]
C. df[order(df$pval), ]
D. rank(df$pval)