1. Choosing the right file format
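| Format | Read / write | Cross-tool? | Speed | When to use |
|---|---|---|---|---|
| .csv / .tsv | read.csv / readr::read_csv | ✅ | Slow | Exchanging with Excel, Python, or collaborators |
| .xlsx | readxl::read_xlsx / writexl::write_xlsx | ✅ | Slow | Your boss or the wet lab uses Excel |
| .rds | readRDS / saveRDS | ❌ R only | Fast | Preserves R object types; a single object |
| .RData / .rda | load / save | ❌ R only | Fast | Saving several objects at once |
| .fst | fst::read_fst / fst::write_fst | ❌ R only | Very fast | Large data.frames that need random access |
| .parquet | arrow::read_parquet / arrow::write_parquet | ✅ | Very fast | Exchanging with Python pandas or Spark |
| .h5 (HDF5) | rhdf5, HDF5Array | ✅ | Very fast | Very large sparse matrices (scRNA, imaging) |
| .feather | arrow::read_feather / arrow::write_feather | ✅ | Very fast | Memory sharing, fast cross-language exchange |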
3-second rule: "Sharing? → csv/xlsx. R-only, single object? → rds. Over 1 GB and need speed? → fst/parquet. Large bioinformatics matrix? → h5."
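To make the "preserves R object types" point concrete, here is a minimal base-R sketch that round-trips the same table through CSV and RDS. The `samples` table and its column names are invented for illustration; only base functions are used.

```r
# Made-up sample table with column types that plain text cannot carry
samples <- data.frame(
  id = c("s1", "s2", "s3"),
  group = factor(c("ctrl", "treat", "treat"), levels = c("ctrl", "treat")),
  collected = as.Date(c("2024-01-05", "2024-01-06", "2024-01-07")),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(samples, csv_path, row.names = FALSE)
saveRDS(samples, rds_path)

from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
from_rds <- readRDS(rds_path)

sapply(from_csv, class)   # group and collected come back as character
sapply(from_rds, class)   # factor and Date survive
identical(from_rds, samples)
```

If the table has to leave R, the CSV copy is still the right choice; the receiving side just needs to be told which columns are categories or dates.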
2. CSV / TSV: the most common format
# Base R: read
df <- read.csv("data/counts.csv",
               header = TRUE,
               row.names = 1,             # use the first column as row names (gene id)
               check.names = FALSE,       # do not rename "1A" to "X1A"
               stringsAsFactors = FALSE)  # required for R < 4.0
# TSV: use read.delim (default sep = "\t")
df <- read.delim("data/counts.tsv")
# Write
write.csv(df, "results/dge.csv", row.names = FALSE)
write.table(df, "results/dge.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
# readr: roughly 5–10x faster than base R, smarter type guessing, returns a tibble
library(readr)
df <- read_csv("data/counts.csv")
df <- read_tsv("data/counts.tsv")
# For large files, pass col_types= to speed up parsing and avoid wrong type guesses
df <- read_csv("big.csv", col_types = cols(
  gene_id = col_character(),
  count = col_integer(),
  expression = col_double(),
  .default = col_skip()             # skip every other column
))
# Write
write_csv(df, "results/dge.csv")    # UTF-8, no row names
write_tsv(df, "results/dge.tsv")
# data.table: the fastest option, suited to GB-scale files
library(data.table)
dt <- fread("data/counts.csv")               # auto-detects sep / header / types
dt <- fread("data/counts.tsv", sep = "\t")
dt <- fread("big.csv.gz")                    # reads .gz directly (needs the R.utils package)
# Pick only some columns
dt <- fread("big.csv", select = c("gene_id", "padj", "log2FC"))
# Write
fwrite(dt, "results/dge.csv")
fwrite(dt, "results/dge.tsv", sep = "\t")
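Two problems that often show up with CSV files exported from Excel, semicolon delimiters and a UTF-8 byte-order mark, are easy to reproduce with base R alone. The sketch below writes two small temporary files (their contents are invented) and reads each with the matching option. It assumes only base R.

```r
# Self-contained demo: file contents are made up for illustration
tmp_semicolon <- tempfile(fileext = ".csv")
tmp_bom <- tempfile(fileext = ".csv")

# 1) Semicolon-delimited file with decimal commas
writeLines(c("gene_id;log2FC", "BRCA1;1,5", "TP53;-2,25"), tmp_semicolon)
semi <- read.csv2(tmp_semicolon, stringsAsFactors = FALSE)  # sep = ";", dec = ","

# 2) UTF-8 file that starts with a byte-order mark (bytes EF BB BF)
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("gene_id,count\nBRCA1,10\nTP53,20\n")), tmp_bom)
with_bom <- read.csv(tmp_bom, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)

str(semi)         # log2FC is parsed as numeric
names(with_bom)   # first column name is clean
```

Reading the second file without `fileEncoding = "UTF-8-BOM"` usually leaves stray characters glued to the first column name, which then breaks any code that refers to `gene_id`.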
- Wrong delimiter: Excel in locales that use a comma as the decimal mark (much of Europe, for example) separates fields with semicolons (;), not commas. Use read.csv2() or read_csv2(), which also expect decimal commas.
- Wrong encoding: CSVs containing Chinese text on Windows may be Big5 or GBK and come out garbled. Try fileEncoding = "UTF-8" or "BIG5".
- BOM: UTF-8 CSVs saved from Excel start with a byte-order mark (U+FEFF) in front of the first column name. read_csv handles it automatically; read.csv does not unless you pass fileEncoding = "UTF-8-BOM".
3. Excel files
# Read .xlsx
library(readxl)
df <- read_xlsx("data/clinical.xlsx",
                sheet = "samples",          # worksheet name
                range = "A1:E50",           # cell range
                col_names = TRUE,           # first row holds the column names
                na = c("", "NA", "N/A"))    # strings treated as NA
# List all sheet names
excel_sheets("data/clinical.xlsx")
# Read every sheet into a named list
sheets <- excel_sheets("data/clinical.xlsx")
all_sheets <- lapply(sheets, read_xlsx, path = "data/clinical.xlsx")
names(all_sheets) <- sheets
# Write .xlsx: writexl is the simplest option
library(writexl)
write_xlsx(list(
  "DEG_results" = dge_table,
  "Sample_info" = sample_info
), "results/output.xlsx")
# Advanced: openxlsx can set formatting and add formulas
# library(openxlsx); write.xlsx(...)
Collaborator lives in Excel? Pack multiple results into separate sheets (pass a named list to writexl::write_xlsx). This is far better than emailing 5 csv files.
4. RDS vs RData: don't mix them up!
📌 .rds (saveRDS / readRDS)
Saves exactly one object; you must assign the result when reading it back. The original object name is not stored.
saveRDS(dds, "dds.rds")
new_name <- readRDS("dds.rds")
Pros: explicit, controllable, and does not pollute the environment. Prefer this.
📌 .RData (save / load)
Saves multiple objects together with their names; load() silently overwrites any existing variables that share those names!
save(dds, sample_info, file = "all.RData")
load("all.RData")   # ⚠ overwrites existing objects
ls()                # "dds" "sample_info"
Cons: unpredictable and hard to debug. Avoid where possible.
# Best practice: bundle multiple objects into a list, then saveRDS
results <- list(
  dge = dge_table,
  sample_info = metadata,
  parameters = list(fdr = 0.05, lfc = 1)
)
saveRDS(results, "results/analysis.rds")
# Read back
analysis <- readRDS("results/analysis.rds")
analysis$dge
analysis$parameters$fdr
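When someone hands you an .RData file anyway, one way to sidestep the overwrite problem is to load it into its own environment rather than the global one. The sketch below uses stand-in objects (a string for `dds` and a tiny `sample_info` table, both invented) and shows how the loaded copies can then be re-saved as a single .rds.

```r
# Hypothetical objects standing in for real analysis results
dds <- "saved version"
sample_info <- data.frame(id = c("s1", "s2"), stringsAsFactors = FALSE)
rdata_path <- tempfile(fileext = ".RData")
save(dds, sample_info, file = rdata_path)

dds <- "current work in progress"   # something we do not want to lose

# Load into a fresh environment instead of the global one
loaded <- new.env()
restored_names <- load(rdata_path, envir = loaded)  # load() returns the object names

restored_names
loaded$dds   # the saved copy
dds          # the in-progress value is untouched

# Convert to a plain named list and re-save as a single .rds
bundle <- mget(restored_names, envir = loaded)
rds_path <- tempfile(fileext = ".rds")
saveRDS(bundle, rds_path)
```

Because `load()` returns the names it restored, you can also inspect them first and decide what to keep before anything touches your workspace.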
5. Path management: file.path() and the here package
file.path() joins path components with "/" on every platform. Windows itself uses \, but R accepts / there too, so the result works everywhere. Always build paths with it rather than paste0.
# ✅ Preferred
file.path("data", "raw", "counts.csv")
# "data/raw/counts.csv"
# ❌ Not recommended: manual string concatenation makes it easy to drop a "/"
paste0("data/raw/", "counts.csv")        # works, but error-prone
# Extract parts
basename("data/raw/counts.csv")          # "counts.csv"
dirname("data/raw/counts.csv")           # "data/raw"
tools::file_ext("counts.csv")            # "csv"
tools::file_path_sans_ext("counts.csv")  # "counts"
# Check existence
file.exists("data/counts.csv")
dir.exists("data/raw")
# List files matching a pattern
list.files("data/", pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
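These helpers combine naturally when a folder holds one CSV per sample. The following sketch creates two small files in a temporary folder (the folder name and data are invented) and reads them into a list named after the files. It uses base R only.

```r
# Set up a throwaway folder with two small CSV files (made-up data)
data_dir <- file.path(tempdir(), "batch_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(gene_id = c("A", "B"), count = c(1, 2)),
          file.path(data_dir, "sample1.csv"), row.names = FALSE)
write.csv(data.frame(gene_id = c("A", "B"), count = c(3, 4)),
          file.path(data_dir, "sample2.csv"), row.names = FALSE)

# Find the files, then name each list element after its file (minus extension)
csv_paths <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
sample_ids <- tools::file_path_sans_ext(basename(csv_paths))
tables <- setNames(lapply(csv_paths, read.csv), sample_ids)

names(tables)
tables$sample2$count
```

Keeping `full.names = TRUE` matters here: without it, list.files() returns bare file names and read.csv() would look for them in the working directory.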
The here package: the most robust path solution
# Install: install.packages("here")
library(here)
# here() finds the project root (where .Rproj or .here lives), then joins relative parts
here()                               # "E:/Charlene/.../my_rnaseq_project"
here("data", "raw", "counts.csv")    # ".../my_rnaseq_project/data/raw/counts.csv"
# Your scripts now resolve paths correctly whichever project subfolder (R/, reports/, ...) they run from
# Typical usage:
read_csv(here("data", "raw", "counts.csv"))
saveRDS(dds, here("results", "dds.rds"))
ggsave(here("results", "figures", "volcano.png"), p, width = 7, height = 6)
# Without an .Rproj file, put an empty .here file in the project root:
file.create(".here")
- Don't hard-code setwd("C:/Users/Charlene/...") in scripts: that path does not exist on a collaborator's machine, or on future-you's.
- Don't write ~/Desktop/...: ~ means the home directory on Mac/Linux, but on Windows R typically expands it to the Documents folder, so the path points somewhere else.
- Don't write Windows paths with single backslashes: \ starts an escape sequence inside an R string. Use "/", file.path(), or here().
6. Common bioinformatics file formats
| File | Contents | R package / function |
|---|---|---|
| .fasta / .fa | Sequences (DNA/RNA/protein) | Biostrings::readDNAStringSet() (readRNAStringSet() / readAAStringSet() for RNA / protein) |
| .fastq / .fq.gz | Raw sequencing reads + quality scores | ShortRead::readFastq() |
| .bam / .sam | Aligned reads | Rsamtools::scanBam() |
| .bed / .gff / .gtf | Genomic coordinates, annotation | rtracklayer::import() |
| .vcf / .vcf.gz | Variants (SNP/indel) | VariantAnnotation::readVcf() / vcfR::read.vcfR() |
| .bigWig / .bw | Signal along genomic coordinates (ChIP/ATAC) | rtracklayer::import.bw() |
| .h5 (HDF5) | Large matrices (scRNA, imaging) | HDF5Array, rhdf5 |
| .mtx (Matrix Market) | Sparse matrices (10X scRNA) | Matrix::readMM() |
| .idat | Illumina array raw data | minfi::read.metharray() |
# Example: read a GTF annotation file (install rtracklayer first)
# library(rtracklayer)
# gtf <- rtracklayer::import("Homo_sapiens.GRCh38.110.gtf.gz")
# head(gtf)      # a GRanges object (see the Bioconductor chapter)
# Example: read FASTA
# library(Biostrings)
# seqs <- readDNAStringSet("genome.fa")
# names(seqs); width(seqs)
# Example: read the 10X scRNA file triplet
# library(Matrix); library(Seurat)
# counts <- ReadMtx(mtx = "matrix.mtx.gz",
#                   features = "features.tsv.gz",
#                   cells = "barcodes.tsv.gz")
cat("Demo only — install Bioconductor pkgs to actually run.\n")
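For readers without Bioconductor or Seurat installed, the 10X-style triplet can also be assembled by hand with the Matrix package, which normally ships with R as a recommended package. The sketch below builds a tiny fake triplet in a temporary folder (all IDs, barcodes, and counts are invented, and the files are left uncompressed for simplicity) and reads it back. It illustrates the file layout rather than replacing ReadMtx().

```r
library(Matrix)  # normally installed alongside R as a recommended package

# Build a tiny fake 10X-style triplet in a temp folder (all values invented)
tenx_dir <- file.path(tempdir(), "tenx_demo")
dir.create(tenx_dir, showWarnings = FALSE)
toy <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2), x = c(5, 2, 7), dims = c(3, 2))
writeMM(toy, file.path(tenx_dir, "matrix.mtx"))
writeLines(c("ENSG01\tGeneA\tGene Expression",
             "ENSG02\tGeneB\tGene Expression",
             "ENSG03\tGeneC\tGene Expression"),
           file.path(tenx_dir, "features.tsv"))
writeLines(c("AAAC-1", "AAAG-1"), file.path(tenx_dir, "barcodes.tsv"))

# Read the three pieces back and attach gene / cell names
counts <- readMM(file.path(tenx_dir, "matrix.mtx"))
features <- read.delim(file.path(tenx_dir, "features.tsv"), header = FALSE,
                       stringsAsFactors = FALSE)
barcodes <- readLines(file.path(tenx_dir, "barcodes.tsv"))
counts <- as(counts, "CsparseMatrix")   # column-compressed storage
dimnames(counts) <- list(features$V1, barcodes)

counts["ENSG03", "AAAC-1"]
dim(counts)
```

Rows are features (genes) and columns are cells, which is why the feature IDs go first in `dimnames()`.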
7. Compressed files and remote URLs
# Most read_* functions handle .gz / .bz2 / .zip natively
df <- readr::read_csv("data/big.csv.gz")       # gzip
dt <- data.table::fread("data/big.csv.bz2")    # bzip2 (fread needs the R.utils package for this)
# Read directly from a URL
url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSEXXXX&file=matrix.txt.gz"
# df <- readr::read_tsv(url)    # works for many sites
# For big files, download first with download.file
# download.file(url, destfile = "data/raw/matrix.tsv.gz", mode = "wb")
# (mode = "wb" is required on Windows, otherwise binary files get corrupted)
# Write a compressed file
gz <- gzfile("results/big.csv.gz", "w")
write.csv(df, gz, row.names = FALSE)
close(gz)
# Or simpler: data.table::fwrite(df, "results/big.csv.gz") compresses automatically based on the .gz extension
Windows alert: download.file() defaults to mode = "w" (text mode), which rewrites \n as \r\n and corrupts binary files. R switches to binary on its own only when mode is omitted and the URL ends in one of a few known extensions such as .gz or .zip; .bam is not among them. Always pass mode = "wb".
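One way to make the mode = "wb" habit automatic is to wrap download.file() in a small helper. `download_binary()` below is a hypothetical convenience function, not part of any package; it creates the destination folder, forces binary mode, and stops if the result looks wrong. The URL in the usage comment is a placeholder.

```r
# Hypothetical helper: download a binary file safely and fail loudly on problems
download_binary <- function(url, destfile) {
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  # mode = "wb" keeps the bytes unchanged on every platform, Windows included
  status <- download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  if (status != 0 || !file.exists(destfile) || file.size(destfile) == 0) {
    stop("Download failed or produced an empty file: ", url)
  }
  invisible(destfile)
}

# Usage (placeholder URL, not run here):
# download_binary("https://example.org/sample.bam",
#                 file.path("data", "raw", "sample.bam"))
```

The size check is deliberately crude. For files that matter, comparing a published checksum with tools::md5sum() is a stronger test.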
8. Encoding self-checker
When R shows garbled text ("mojibake") after reading a file, the usual cause is that the file's actual encoding differs from the encoding R assumes.
📝 Self-check
1. What is the most important difference between load("results.RData") and readRDS("results.rds")?
2. You are downloading a .bam file on Windows. Which call is correct?
3. A collaborator downloads and runs your R script. Which path style is most likely to break for them?