1. Operators
| Category | Operators | Example |
|---|---|---|
| Arithmetic | + - * / ^ %% %/% | 5 %% 3 → 2 (remainder); 5 %/% 3 → 1 (integer division) |
| Comparison | == != > < >= <= | 3 == 3 → TRUE |
| Logical | & \| ! && \|\| | TRUE & FALSE → FALSE |
| Assignment | <- = -> | x <- 5 (prefer <-) |
| Pipe | \|> (R 4.1+) / %>% (magrittr) | iris \|> head() |
| Sequence | : | 1:5 → 1 2 3 4 5 |
| Indexing | [ ] [[ ]] $ | x[1], list[[1]], df$col |
Single vs double: & and | are vectorized (element-wise); && and || expect length-1 operands and are meant for if conditions. Before R 4.2.0, && silently used only the first element of a longer vector; R 4.2.0 turned this into a warning and R 4.3.0 into an error. Misusing && on a vector is one of R's most common bugs.
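To make the distinction concrete, here is a minimal sketch with two made-up logical vectors. It shows the element-wise operator and one common way to reduce a vector to a single value (any() / all()) before handing it to if.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Element-wise AND: one result per position.
elementwise <- x & y          # TRUE FALSE FALSE

# Collapse a logical vector to a single value before using it in if ().
any_true <- any(x)            # TRUE
all_true <- all(x)            # FALSE

# && is appropriate here because both operands have length 1.
if (any_true && !all_true) {
  message("some, but not all, elements are TRUE")
}
```

Writing if (x && y) with these vectors would fail in R 4.3.0 and later; any() or all() states the intent explicitly.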
2. Atomic Types
numeric / double
Real numbers; the default numeric type. 3.14, 1e-5, Inf, -Inf, and NaN are all doubles.
integer
Add the suffix L: 5L. The distinction rarely matters; it is mainly useful when memory is tight or when interfacing with C.
character
Strings, in double quotes "abc" or single quotes 'abc'. A string in R is always a vector (a single string is a character vector of length 1).
logical
TRUE / FALSE (T and F also work but are not recommended, because they are ordinary variables that can be reassigned). NA denotes a missing value.
complex
Complex numbers: 2 + 3i. Rare in bioinformatics.
raw
Raw bytes, for handling binary data.
Type checking and coercion
x <- 3.14
typeof(x) # "double"
class(x) # "numeric"
is.numeric(x) # TRUE
# Explicit coercion
as.integer(3.9) # 3 (truncates, doesn't round!)
as.character(42) # "42"
as.numeric("3.14") # 3.14
as.numeric("abc") # NA + warning
# Implicit coercion hierarchy (least to most general): logical < integer < numeric < character
c(1, 2, "3") # all become character: "1" "2" "3"
c(TRUE, 5L, 3.14) # all become numeric: 1.00 5.00 3.14
typeof() vs class(): typeof() reveals the underlying storage mode (double, integer, character, ...); class() reveals the R object class (numeric, data.frame, lm, ...). In practice class() is the one you will use more often.
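The difference is easiest to see on objects where the two answers disagree. The following sketch uses three small made-up objects (an integer, a data.frame, and a factor); the variable names are illustrative only.

```r
n  <- 5L
df <- data.frame(a = 1:3)
f  <- factor(c("ctrl", "treated"))

typeof(n);  class(n)    # "integer", "integer"
typeof(df); class(df)   # "list",    "data.frame"
typeof(f);  class(f)    # "integer", "factor"
```

A data.frame is stored as a list of columns and a factor is stored as integer codes, which is why typeof() and class() differ for them.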
3. Vectors and vectorization (the soul of R)
In R there are no scalars: even 3.14 is a vector of length 1. Operations are element-wise by default, and that is what "vectorization" means. Done right, a computation over the expression values of 100,000 genes takes one line and is typically far faster than a for loop (often by one to two orders of magnitude).
# Create a vector
gene_expr <- c(5.2, 8.1, 3.4, 9.7, 2.1)
length(gene_expr)          # 5
# Vectorized ops (no for loops!)
log2(gene_expr + 1)        # log2 transform of all 5 at once
gene_expr * 2              # multiply each by 2
gene_expr > 5              # logical vector: TRUE TRUE FALSE TRUE FALSE
# Two vectors, element-wise
treated <- c(6.0, 9.5, 4.0, 11.2, 2.8)
log2_fc <- log2(treated / gene_expr)   # log2 fold-change per gene
log2_fc
# Indexing (1-based, not 0-based!)
gene_expr[1]               # first element
gene_expr[c(1, 3, 5)]      # 1st, 3rd, 5th
gene_expr[-1]              # all EXCEPT first (a negative index drops)
gene_expr[gene_expr > 5]   # logical filtering, the most powerful idiom
Recycling trap: when vectors of different lengths are combined, the shorter one is recycled. If the longer length is not a multiple of the shorter one, R also emits a warning.
c(1,2,3,4) + c(10,20)   # → c(11,22,13,24)  ← the shorter vector is recycled
c(1,2,3) + c(10,20)     # → c(11,22,13) + warning
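Because recycling to an exact multiple is silent, one defensive option is to check lengths before combining. The sketch below uses a hypothetical helper, safe_add() (not a base R function), that only allows equal lengths or a length-1 operand.

```r
# Hypothetical helper: fail fast unless lengths match or one side has length 1.
safe_add <- function(a, b) {
  stopifnot(length(a) == length(b) || length(a) == 1 || length(b) == 1)
  a + b
}

ok <- safe_add(c(1, 2, 3, 4), 10)          # 11 12 13 14

# Silent recycling (length 4 + length 2) is now caught as an error.
res <- tryCatch(
  safe_add(c(1, 2, 3, 4), c(10, 20)),
  error = function(e) "length mismatch"
)
res                                         # "length mismatch"
```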
4. Vectorization performance comparison
This section compares the relative speed of a vectorized call and a for loop when computing log2(x+1) at different vector lengths.
The timings shown on the original page are simulated; in a real R session, vectorized code typically runs 10–500× faster than a naive for loop.
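To measure the gap on your own machine rather than rely on simulated numbers, a rough sketch with base R's system.time() is shown below. The vector length (1e5) and the random input are arbitrary choices, and the actual ratio will vary with hardware and R version.

```r
set.seed(1)
x <- runif(1e5, min = 0, max = 100)

# Vectorized: one call over the whole vector.
t_vec <- system.time(v1 <- log2(x + 1))

# Naive loop: preallocate the result, then fill it element by element.
t_loop <- system.time({
  v2 <- numeric(length(x))
  for (i in seq_along(x)) v2[i] <- log2(x[i] + 1)
})

all.equal(v1, v2)      # both approaches give the same values
t_vec["elapsed"]
t_loop["elapsed"]
```

For more reliable timings, repeat each measurement several times; a single system.time() call on a short task is noisy.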
5. Naming rules and missing values
Variable naming
- Allowed characters: letters, digits, . and _
- Must start with a letter or . (not a digit or _; a leading . cannot be followed by a digit)
- Reserved words cannot be used (if, for, TRUE, NULL, NA, ...)
- Case-sensitive: Gene ≠ gene
- Convention: snake_case (e.g. gene_count); avoid . in names (it clashes with S3 method dispatch)
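Base R's make.names() applies these rules mechanically, which makes it a convenient way to see what counts as a valid name. The input strings below are made-up examples.

```r
candidates <- c("gene_count", "2nd_sample", "if", "log2 FC")

# make.names() repairs invalid names: it prepends "X" when the name starts
# with a digit, replaces invalid characters with ".", and appends "." to
# reserved words.
valid <- make.names(candidates)
valid   # "gene_count" "X2nd_sample" "if." "log2.FC"
```

This is also why imported column headers sometimes change: data.frame() and read.csv() run names through make.names() unless check.names = FALSE is set.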
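Differences between NA, NULL, NaN, and Inf
| Symbol | Meaning | Test function |
|---|---|---|
| NA | Missing (unknown) value; keeps its position | is.na(x) |
| NULL | Does not exist; takes no position, length 0 | is.null(x) |
| NaN | Not a Number, e.g. 0/0 | is.nan(x) |
| Inf / -Inf | Infinity, e.g. 1/0 | is.infinite(x) |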
x <- c(1, 2, NA, 4, NaN, Inf)
mean(x)                  # NA  ← any NA poisons the result
mean(x, na.rm = TRUE)    # Inf (NA and NaN are dropped, but Inf remains)
mean(x[is.finite(x)])    # 2.333333 (finite values only: 1, 2, 4)
sum(is.na(x))            # 2 (NA + NaN both count as NA)
x[!is.na(x)]             # remove NAs
na.omit(x)               # same idea, returns a copy without NAs
# is.na vs is.nan
is.na(NaN)               # TRUE  ← NaN is also NA
is.nan(NA)               # FALSE ← but NA is NOT NaN
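The "keeps its position" versus "takes no position" distinction between NA and NULL can be checked directly. This short sketch uses made-up vectors and also shows is.finite(), which rejects NA, NaN, and Inf in one call.

```r
with_na   <- c(1, NA, 3)
with_null <- c(1, NULL, 3)

length(with_na)     # 3: NA keeps its slot
length(with_null)   # 2: NULL vanishes when combined

# is.finite() is FALSE for NA, NaN, and Inf alike.
finite_mask <- is.finite(c(1, NA, NaN, Inf))
finite_mask         # TRUE FALSE FALSE FALSE
```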
6. Control flow
# Standard if / else
fdr <- 0.03
if (fdr < 0.05) {
  cat("Significant", "\n")
} else if (fdr < 0.10) {
  cat("Borderline", "\n")
} else {
  cat("Not significant", "\n")
}
# Inline if (R's equivalent of a ternary expression)
status <- if (fdr < 0.05) "sig" else "ns"
# for loop
genes <- c("TP53", "BRCA1", "EGFR")
for (g in genes) {
  cat("Processing:", g, "\n")
}
# seq_along() is safer than 1:length() (it handles empty vectors correctly)
for (i in seq_along(genes)) {
  cat(i, ":", genes[i], "\n")
}
# while with break / next
i <- 1
while (i <= 5) {
  if (i == 3) { i <- i + 1; next }  # skip 3
  if (i == 5) break                 # exit at 5
  print(i)
  i <- i + 1
}
# Apply family: apply a function over a structure
m <- matrix(1:12, nrow = 3)
apply(m, 1, sum)    # MARGIN = 1: row sums
apply(m, 2, mean)   # MARGIN = 2: column means
# sapply: apply over a list/vector and simplify the result
sapply(1:5, function(x) x^2)   # numeric vector
# lapply: always returns a list
lapply(1:5, function(x) x^2)
# mapply: multi-argument apply
mapply(function(a, b) a+b, 1:3, 10:12)   # 11 13 15
# tapply: split by factor, then apply
tapply(iris$Sepal.Length, iris$Species, mean)
# ifelse() is the vectorized if
fdr <- c(0.001, 0.04, 0.10, 0.50)
ifelse(fdr < 0.05, "sig", "ns")
#> "sig" "sig" "ns" "ns"
# dplyr::case_when is the multi-branch version
dplyr::case_when(
  fdr < 0.001 ~ "***",
  fdr < 0.01  ~ "**",
  fdr < 0.05  ~ "*",
  TRUE        ~ "ns"
)
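One caveat with sapply() is that its return type depends on the input. Base R's vapply() is a stricter alternative that requires you to declare the expected result type. The sketch below contrasts the two on an empty input.

```r
# vapply() checks that every result is a single numeric value.
squares <- vapply(1:5, function(x) x^2, numeric(1))   # 1 4 9 16 25

# On empty input, sapply() falls back to an empty list ...
empty_s <- sapply(integer(0), function(x) x^2)
# ... while vapply() still returns the declared type.
empty_v <- vapply(integer(0), function(x) x^2, numeric(1))

is.list(empty_s)      # TRUE
is.numeric(empty_v)   # TRUE
```

Inside functions or pipelines where the input may be empty, vapply() avoids a surprise change of type downstream.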
7. Writing your own functions
# Basic syntax
log2_fc <- function(treated, control) {
  log2(treated / control)
}
log2_fc(20, 10) # 1
log2_fc(c(20, 40), c(10, 10)) # 1 2 (vectorized!)
# Default arguments
filter_dge <- function(df, fdr_cutoff = 0.05, lfc_cutoff = 1) {
  # which() drops rows where padj or log2FC is NA instead of returning all-NA rows
  keep <- which(df$padj < fdr_cutoff & abs(df$log2FC) > lfc_cutoff)
  df[keep, ]
}
# Variable arguments (...)
my_paste <- function(sep = "_", ...) {
  args <- list(...)
  paste(unlist(args), collapse = sep)
}
my_paste(sep = "-", "TP53", "high", "tumor") # "TP53-high-tumor"
# Anonymous functions (R 4.1+ shorthand)
sapply(1:5, \(x) x^2) # equivalent to function(x) x^2
# Explicit return vs implicit return of the last value
f1 <- function(x) { return(x * 2) }
f2 <- function(x) { x * 2 } # same: the last value is returned
# Local vs global scope
x <- 10
f <- function() {
  x <- 99 # local: doesn't change the outer x
  x
}
f() # 99
x # still 10
<- vs <<-: <- assigns in the current environment; <<- walks up through the enclosing environments and modifies the first existing variable of that name ("super-assignment"), creating one in the global environment if none is found. Beginners should avoid <<-: it makes debugging very hard.
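One of the few places where <<- is conventional is a closure that keeps private state. The sketch below uses a hypothetical make_counter() function to show <<- updating a variable in the enclosing function's environment rather than creating a local one.

```r
# Hypothetical example: a counter whose state lives in the enclosing environment.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates count in make_counter()'s environment
    count
  }
}

counter <- make_counter()
counter()        # 1
counter()        # 2
n <- counter()   # 3
```

With a plain <- inside the inner function, count would be a fresh local variable on every call and the counter would always return 1.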
8. Built-in functions you should know
| Task | Functions |
|---|---|
| Summary statistics | mean() median() sd() var() range() quantile() summary() |
| Sorting | sort(x) returns the sorted values; order(x) returns the sorting positions (used to sort a data.frame); rank(x) returns ranks |
| Sets | union() intersect() setdiff() %in% unique() duplicated() |
| Strings | paste() paste0() sprintf() nchar() substr() gsub() strsplit() toupper() |
| Vector construction | c() seq() seq_len() seq_along() rep() rev() |
| Random numbers | set.seed() sample() rnorm() runif() rpois() |
| Inspection | str() head() tail() dim() nrow() ncol() names() |
set.seed(42) # reproducible random
x <- rnorm(1000, mean = 5, sd = 2)
summary(x)
quantile(x, probs = c(0.025, 0.975)) # 2.5th and 97.5th percentiles (central 95% of the sample)
# String pasting
sprintf("Gene %s | log2FC = %.2f | FDR = %.3e", "TP53", 1.85, 3.2e-10)
paste("chr", 1:3, sep = "") # "chr1" "chr2" "chr3"
paste0("sample_", 1:5) # "sample_1" ... "sample_5"
# Set operations
up_genes <- c("TP53", "BRCA1", "EGFR")
down_genes <- c("MYC", "BRCA1", "ALK")
intersect(up_genes, down_genes) # "BRCA1" — common
setdiff(up_genes, down_genes) # genes only in up
union(up_genes, down_genes) # combined
"TP53" %in% up_genes # TRUE
# Sort a data.frame
df <- data.frame(gene = c("A","B","C"), pval = c(0.1, 0.001, 0.05))
df[order(df$pval), ] # sorted by pval ascending
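Since sort(), order(), and rank() are easy to mix up, the sketch below runs all three on the same made-up vector, then uses order() with two keys to break ties when sorting a data.frame. The column names and values are illustrative only.

```r
v <- c(30, 10, 20)
sort(v)    # 10 20 30  (the values, sorted)
order(v)   # 2 3 1     (positions that would sort v)
rank(v)    # 3 1 2     (rank of each element in place)

# Sort by pval ascending; break ties by largest absolute fold change first.
res <- data.frame(
  gene = c("A", "B", "C", "D"),
  pval = c(0.05, 0.001, 0.05, 0.2),
  lfc  = c(1.2, -2.0, 3.1, 0.4)
)
sorted <- res[order(res$pval, -abs(res$lfc)), ]
sorted$gene   # "B" "C" "A" "D"
```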
📝 Self-check
1. What does c(1, 2, 3) + c(10, 20) return?
2. Given x <- c(TRUE, FALSE, TRUE), what is x && FALSE?
3. To sort a data.frame by one column, which is correct?