STEP 4 / 16

File I/O and Paths

CSV, TSV, Excel, RDS, RData, fst, HDF5, plus solid path handling so your code still runs after the project moves to another machine or folder.

1. Choosing the right file format

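Format | Read / Write | Cross-tool? | Speed | Use case
.csv / .tsv | read.csv / readr::read_csv | yes | | Exchange with Excel, Python, collaborators
.xlsx | readxl::read_xlsx / writexl::write_xlsx | yes | | Boss or wet lab works in Excel
.rds | readRDS / saveRDS | ❌ R only | | Keeps R object types; single object
.RData / .rda | load / save | ❌ R only | | Stores several objects at once
.fst | fst::read_fst / write_fst | ❌ R only | very fast | Large data.frames that need random access
.parquet | arrow::read_parquet | yes | very fast | Exchange with Python pandas / Spark
.h5 (HDF5) | rhdf5, HDF5Array | yes | very fast | Very large sparse matrices (scRNA, imaging)
.feather | arrow::read_feather | yes | very fast | Shared memory, fast cross-language exchange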
💡
3-second rule: "Sharing with collaborators? → csv or xlsx. R-only, single object? → rds. Over 1 GB and need speed? → fst or parquet. Large bioinformatics matrix? → h5."
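The rule of thumb above can be written down as a tiny helper. This is only a sketch: the function name suggest_format, its arguments, and the order in which the questions are asked are assumptions made for illustration; the 1 GB cutoff simply mirrors the rule in the text.

```r
# Hypothetical helper that encodes the 3-second rule from the text.
# The order of the checks is an assumption, not part of the original rule.
suggest_format <- function(sharing, size_gb, bio_matrix = FALSE) {
  if (bio_matrix) return("h5")               # large bioinformatics matrix
  if (size_gb > 1) return("fst or parquet")  # big and needs speed
  if (sharing) return("csv or xlsx")         # collaborators need to open it
  "rds"                                      # R-only, single object
}

suggest_format(sharing = TRUE,  size_gb = 0.01)
suggest_format(sharing = FALSE, size_gb = 5)
```

Treat the result as a starting point; the table above lists the trade-offs that a one-line rule cannot capture.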

2. CSV / TSV: the most common formats

# --- Base R ---
# Read
df <- read.csv("data/counts.csv",
               header = TRUE,
               row.names = 1,            # use the first column (gene id) as row names
               check.names = FALSE,      # keep names like "1A" instead of "X1A"
               stringsAsFactors = FALSE) # required on R < 4.0; the default since 4.0

# For TSV use read.delim (default sep = "\t")
df <- read.delim("data/counts.tsv")

# Write
# Note: row.names = FALSE drops row names; if they hold gene ids, move them into a column first
write.csv(df, "results/dge.csv", row.names = FALSE)
write.table(df, "results/dge.tsv", sep = "\t",
            quote = FALSE, row.names = FALSE)

# --- readr ---
# Often several times faster than base R, with smarter type guessing; returns a tibble
library(readr)

df <- read_csv("data/counts.csv")
df <- read_tsv("data/counts.tsv")

# For large files, col_types = speeds things up and avoids wrong type guesses
df <- read_csv("big.csv",
              col_types = cols(
                gene_id    = col_character(),
                count      = col_integer(),
                expression = col_double(),
                .default   = col_skip()   # skip every other column
              ))

# Write
write_csv(df, "results/dge.csv")   # UTF-8, no row names
write_tsv(df, "results/dge.tsv")

# --- data.table ---
# Usually the fastest option; suited to GB-scale files
library(data.table)

dt <- fread("data/counts.csv")         # auto-detects sep / header / types
dt <- fread("data/counts.tsv", sep = "\t")
dt <- fread("big.csv.gz")              # reads .gz directly (needs the R.utils package)

# Pick only some columns
dt <- fread("big.csv", select = c("gene_id", "padj", "log2FC"))

# Write
fwrite(dt, "results/dge.csv")
fwrite(dt, "results/dge.tsv", sep = "\t")
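To make the point about type guessing and check.names concrete, here is a minimal base-R sketch. The file contents and column names are hypothetical, and colClasses plays the role that col_types plays in readr.

```r
# Write a small hypothetical CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("sample_id,1A,2B",
             "007,15,3",
             "012,8,21"), path)

# Default behaviour: IDs are guessed as integers (leading zeros are lost)
# and column names that start with a digit are "repaired"
guessed <- read.csv(path)
str(guessed$sample_id)
names(guessed)

# Explicit column types plus check.names = FALSE keep the file as written
typed <- read.csv(path,
                  colClasses = c(sample_id = "character"),
                  check.names = FALSE)
typed$sample_id
names(typed)
```

Declaring the type of ID-like columns up front is usually cheaper than repairing them after the fact.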
⚠️
Common pitfalls:
  • Wrong delimiter: Excel in locales that use a comma as the decimal mark (much of Europe) exports semicolon-separated files. Read them with read.csv2() or read_csv2().
  • Wrong encoding: CSVs containing Chinese text saved on Windows may be Big5 or GBK rather than UTF-8, and come out garbled. Set fileEncoding to the file's real encoding, for example "UTF-8" or "BIG5".
  • BOM: UTF-8 CSVs saved from Excel start with a byte-order mark (bytes EF BB BF) in front of the first column name. read_csv strips it automatically; read.csv does not unless you pass fileEncoding = "UTF-8-BOM".
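Two of these pitfalls can be reproduced with temporary files in base R. The sketch below uses made-up data; it shows read.csv2() handling semicolons with decimal commas, and fileEncoding = "UTF-8-BOM" removing the byte-order mark.

```r
# 1) Semicolon-separated file with decimal commas (hypothetical data)
semi_path <- tempfile(fileext = ".csv")
writeLines(c("gene;log2FC", "TP53;1,5", "MYC;-2,25"), semi_path)

semi <- read.csv2(semi_path)   # read.csv2 defaults: sep = ";", dec = ","
semi$log2FC                    # numeric; decimal commas were converted

# 2) UTF-8 file that starts with a byte-order mark (EF BB BF)
bom_path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("gene,count\nTP53,10\n")), bom_path)

with_bom <- read.csv(bom_path, fileEncoding = "UTF-8-BOM")
names(with_bom)                # the BOM is gone from the first column name
```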

3. Excel files

# Read .xlsx
library(readxl)
df <- read_xlsx("data/clinical.xlsx",
                sheet = "samples",       # sheet name
                range = "A1:E50",        # cell range
                col_names = TRUE,        # first row holds the column names
                na = c("", "NA", "N/A")) # strings to treat as NA

# List all sheet names
excel_sheets("data/clinical.xlsx")

# Read every sheet into a named list
# (named all_sheets so it does not mask base::all)
sheets <- excel_sheets("data/clinical.xlsx")
all_sheets <- lapply(sheets, function(s) read_xlsx("data/clinical.xlsx", sheet = s))
names(all_sheets) <- sheets

# Write .xlsx: writexl is the simplest option
library(writexl)
write_xlsx(list(
  "DEG_results"   = dge_table,
  "Sample_info"   = sample_info
), "results/output.xlsx")

# Advanced: openxlsx can set formatting and add formulas
# library(openxlsx); write.xlsx(...)
💡
Collaborator lives in Excel? Put each result on its own sheet by passing a named list to writexl::write_xlsx. One workbook is far easier to handle than five separate csv files in an email.

4. RDS vs RData: don't mix them up!

📌 .rds (saveRDS / readRDS)

Saves exactly one object; you must assign the result when reading it back. The original object name is not stored.

saveRDS(dds, "dds.rds")
new_name <- readRDS("dds.rds")

Pros: explicit, controllable, does not pollute the environment. Prefer this.

📌 .RData (save / load)

Saves several objects together with their names; load() silently overwrites any existing variables that have the same names!

save(dds, sample_info, file = "all.RData")
load("all.RData")    # ⚠ overwrites
ls()                 # "dds" "sample_info"

Cons: unpredictable, hard to debug. Avoid it when you can.
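The difference is easy to see with a throwaway object. The sketch below uses a hypothetical variable, fdr_cutoff, and temporary files; it also shows one way to inspect an .RData file without side effects, by loading it into a scratch environment.

```r
rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

fdr_cutoff <- 0.05                      # hypothetical object to save
save(fdr_cutoff, file = rdata_path)
saveRDS(fdr_cutoff, rds_path)

fdr_cutoff <- 0.01                      # the value changes later in the session

# readRDS() returns the object; nothing else in the workspace changes
saved_cutoff <- readRDS(rds_path)

# load() into a scratch environment: inspect the file without touching the workspace
scratch <- new.env()
loaded_names <- load(rdata_path, envir = scratch)   # returns the names it created

# load() into the current environment silently replaces fdr_cutoff
cutoff_before_load <- fdr_cutoff
load(rdata_path)
c(before = cutoff_before_load, after = fdr_cutoff)
```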

# Best practice: bundle multiple objects into a list, then saveRDS
results <- list(
  dge        = dge_table,
  sample_info = metadata,
  parameters = list(fdr = 0.05, lfc = 1)
)
saveRDS(results, "results/analysis.rds")

# Read back
analysis <- readRDS("results/analysis.rds")
analysis$dge
analysis$parameters$fdr

5. Path management: file.path() and the here package

file.path() joins path components with "/" on every platform, Windows included: its separator is .Platform$file.sep, and R accepts "/" in Windows paths. Build paths with it rather than with paste0.

# ✅ Preferred
file.path("data", "raw", "counts.csv")
# "data/raw/counts.csv"

# ❌ Not recommended: gluing strings by hand makes it easy to drop a "/"
paste0("data/raw/", "counts.csv")       # works, but error-prone

# Extract parts
basename("data/raw/counts.csv")         # "counts.csv"
dirname("data/raw/counts.csv")          # "data/raw"
tools::file_ext("counts.csv")           # "csv"
tools::file_path_sans_ext("counts.csv") # "counts"

# Check existence
file.exists("data/counts.csv")
dir.exists("data/raw")

# List files matching a pattern
list.files("data", pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
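These helpers combine naturally when an output file name has to be derived from an input file name. A minimal sketch follows; the folder layout and the "_filtered" suffix are illustrative assumptions, and the output goes under tempdir() so that running it leaves no files in your project.

```r
# Hypothetical input path, built portably
input <- file.path("data", "raw", "counts.csv")

# Derive the output file name from the input file name
stem     <- tools::file_path_sans_ext(basename(input))
out_name <- paste0(stem, "_filtered.", tools::file_ext(input))

# Create the output folder, including any missing parent folders
out_dir <- file.path(tempdir(), "results", "tables")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write a tiny placeholder table and confirm the file is there
out_path <- file.path(out_dir, out_name)
write.csv(data.frame(gene = "TP53", count = 10L), out_path, row.names = FALSE)
file.exists(out_path)
```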

The here package: the most robust path solution

# Install once: install.packages("here")
library(here)

# here() locates the project root (the folder holding .Rproj or .here), then joins relative parts
here()                              # "E:/Charlene/.../my_rnaseq_project"
here("data", "raw", "counts.csv")   # ".../my_rnaseq_project/data/raw/counts.csv"

# Paths now resolve the same way whether a script lives in R/ or reports/,
# as long as R is started somewhere inside the project.

# Typical usage (assumes readr and ggplot2 are loaded, and that dds and p exist):
read_csv(here("data", "raw", "counts.csv"))
saveRDS(dds, here("results", "dds.rds"))
ggsave(here("results", "figures", "volcano.png"), p, width = 7, height = 6)

# No .Rproj? Put an empty .here file in the project root (run this from the root):
file.create(".here")
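The idea behind this kind of root detection can be sketched in a few lines of base R: start from a folder and walk upward until a marker file appears. The function below, find_project_root, is a hypothetical simplification written for illustration; it is not the implementation used by the here package, whose criteria are richer.

```r
# Simplified, hypothetical root finder: walk up until a marker is found
find_project_root <- function(start = getwd(), markers = c(".here", ".git")) {
  current <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    has_rproj  <- length(list.files(current, pattern = "\\.Rproj$")) > 0
    has_marker <- any(file.exists(file.path(current, markers)))
    if (has_rproj || has_marker) return(current)
    parent <- dirname(current)
    if (identical(parent, current)) stop("no project root found above ", start)
    current <- parent
  }
}

# Demo inside tempdir(): a project with a .here marker and a nested subfolder
proj <- file.path(tempdir(), "demo_project")
dir.create(file.path(proj, "R", "helpers"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(proj, ".here")))

# Starting deep inside the project still resolves to its root
root <- find_project_root(file.path(proj, "R", "helpers"))
counts_path <- file.path(root, "data", "raw", "counts.csv")
counts_path
```

Because the search starts from a folder inside the project, the approach cannot help when R is started outside the project tree.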
🚨
Three don'ts:
  1. Don't hard-code setwd("C:/Users/Charlene/...") in a script: that folder does not exist on a collaborator's machine, or on your own after a move.
  2. Don't use ~/Desktop/...: "~" expands differently across systems (on Windows, R usually maps it to the Documents folder), so the path is not portable.
  3. Don't write Windows paths with a single "\": it is an escape character in R strings. Use "/", file.path(), or here().
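If a Windows path has already been pasted into a script, it can be converted to forward slashes rather than retyped. A small sketch with a made-up path:

```r
# In an ordinary R string each backslash must be doubled;
# "\\" in source code is a single backslash character
win_path <- "C:\\Users\\me\\project\\data\\counts.csv"   # hypothetical path
nchar("\\")

# Replace every backslash with a forward slash (fixed = TRUE: no regex escaping)
portable <- gsub("\\", "/", win_path, fixed = TRUE)
portable
basename(portable)

# Where "~" points depends on the machine, which is why don't #2 exists
path.expand("~")
```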

6. Common bioinformatics file formats

File | Contents | R package
.fasta / .fa | Sequences (DNA/RNA/protein) | Biostrings::readDNAStringSet()
.fastq / .fq.gz | Raw sequencing reads + quality scores | ShortRead::readFastq()
.bam / .sam | Aligned reads | Rsamtools::scanBam()
.bed / .gff / .gtf | Genomic coordinates, annotation | rtracklayer::import()
.vcf / .vcf.gz | Variants (SNP/indel) | VariantAnnotation::readVcf() / vcfR::read.vcfR()
.bigWig / .bw | Signal along genomic coordinates (ChIP/ATAC) | rtracklayer::import.bw()
.h5 (HDF5) | Large matrices (scRNA, imaging) | HDF5Array, rhdf5
.mtx (Matrix Market) | Sparse matrix (10X scRNA) | Matrix::readMM()
.idat | Illumina array raw data | minfi::read.metharray()

# Example: read a GTF annotation file (install rtracklayer first)
# library(rtracklayer)
# gtf <- rtracklayer::import("Homo_sapiens.GRCh38.110.gtf.gz")
# head(gtf)             # a GRanges object (see the Bioconductor chapter)

# Example: read FASTA
# library(Biostrings)
# seqs <- readDNAStringSet("genome.fa")
# names(seqs); width(seqs)

# Example: read the 10X scRNA three-file set
# library(Matrix); library(Seurat)
# counts <- ReadMtx(mtx = "matrix.mtx.gz",
#                   features = "features.tsv.gz",
#                   cells    = "barcodes.tsv.gz")

cat("Demo only — install Bioconductor pkgs to actually run.\n")
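For readers without Bioconductor installed, the structure of a FASTA file can still be explored in base R. The parser below, read_fasta_simple, is a hypothetical teaching sketch; it is not a replacement for Biostrings::readDNAStringSet(), which validates sequences and scales to genome-sized files.

```r
# Hypothetical minimal FASTA reader: header lines start with ">",
# the sequence lines that follow are concatenated
read_fasta_simple <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines)]                 # drop blank lines
  is_header <- startsWith(lines, ">")
  if (!any(is_header)) stop("no FASTA header line found in ", path)

  # Assign every line to the record opened by the most recent header
  record <- factor(cumsum(is_header), levels = seq_len(sum(is_header)))
  chunks <- split(lines[!is_header], record[!is_header])
  seqs   <- vapply(chunks, paste, character(1), collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

# Demo with a made-up two-record file
fa <- tempfile(fileext = ".fa")
writeLines(c(">seq1 demo", "ACGT", "AC", ">seq2", "GGCC"), fa)

seqs <- read_fasta_simple(fa)
names(seqs)
nchar(seqs)   # sequence lengths, comparable to width() in Biostrings
```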

7. Compressed files and remote URLs

# Most read_* functions read .gz / .bz2 / .zip directly
# (fread needs the R.utils package for .gz / .bz2)
df <- readr::read_csv("data/big.csv.gz")           # gzip
dt <- data.table::fread("data/big.csv.bz2")       # bzip2

# Read straight from a URL
url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSEXXXX&file=matrix.txt.gz"
# df <- readr::read_tsv(url)                       # works for many sites

# For large files, download first with download.file
# download.file(url, destfile = "data/raw/matrix.tsv.gz", mode = "wb")
# (mode = "wb" is required on Windows, otherwise binary files get corrupted)

# Write compressed
gz <- gzfile("results/big.csv.gz", "w")
write.csv(df, gz, row.names = FALSE)
close(gz)
# Or simpler: data.table::fwrite(df, "results/big.csv.gz") compresses based on the .gz extension
⚠️
Windows alert: on Windows, download.file() defaults to text mode (mode = "w"), which rewrites \n as \r\n and corrupts binary files. R switches to "wb" on its own only when the URL ends in one of a few known extensions such as .gz or .zip, so files like .bam still break. Always pass mode = "wb".
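A download can be checked for corruption by comparing checksums. The sketch below stays offline by "downloading" a temporary gzip file through a file:// URL; the data and file names are made up, and with a real remote file you would compare against a checksum published by the provider instead.

```r
# Create a small gzip-compressed CSV to act as the remote file
src <- tempfile(fileext = ".csv.gz")
df  <- data.frame(gene = c("TP53", "MYC"), count = c(10L, 25L),
                  stringsAsFactors = FALSE)
gz <- gzfile(src, "w")
write.csv(df, gz, row.names = FALSE)
close(gz)

# Build a file:// URL so the sketch runs without network access
src_abs   <- normalizePath(src, winslash = "/")
prefix    <- if (.Platform$OS.type == "windows") "file:///" else "file://"
local_url <- paste0(prefix, src_abs)

# Binary mode keeps the bytes intact on every platform
dest <- tempfile(fileext = ".csv.gz")
download.file(local_url, destfile = dest, mode = "wb", quiet = TRUE)

# Identical checksums mean the copy is byte-for-byte the same
same_bytes <- unname(tools::md5sum(src) == tools::md5sum(dest))
same_bytes

# read.csv reads the gzip-compressed file directly
back <- read.csv(dest, stringsAsFactors = FALSE)
```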

8. Encoding self-checker

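Garbled text ("mojibake") when reading a file usually means that the file's actual encoding differs from the encoding R assumes by default. Common combinations are a Big5 or GBK file read as UTF-8, and a UTF-8 file with a BOM read by read.csv.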

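A rough first check for such a mismatch can be done in base R by looking for a BOM and testing whether the bytes form valid UTF-8. The helper encoding_hint below is a hypothetical sketch, not a full detector: it cannot tell Big5 from GBK, and it only inspects the first lines. Latin-1 bytes stand in for a non-UTF-8 file in the demo because that encoding is widely supported by iconv.

```r
# Hypothetical helper: a coarse hint about a text file's encoding
encoding_hint <- function(path, n_lines = 50L) {
  first3 <- readBin(path, what = "raw", n = 3L)
  if (identical(first3, as.raw(c(0xEF, 0xBB, 0xBF)))) return("UTF-8 with BOM")
  lines <- readLines(path, n = n_lines, warn = FALSE)
  if (all(validUTF8(lines))) "valid UTF-8 (or plain ASCII)" else "not UTF-8"
}

# Demo files written byte by byte: the word "Zürich" in three flavours
utf8_file   <- tempfile()
latin1_file <- tempfile()
bom_file    <- tempfile()
writeBin(as.raw(c(0x5A, 0xC3, 0xBC, 0x72, 0x69, 0x63, 0x68, 0x0A)), utf8_file)
writeBin(as.raw(c(0x5A, 0xFC, 0x72, 0x69, 0x63, 0x68, 0x0A)), latin1_file)
writeBin(as.raw(c(0xEF, 0xBB, 0xBF, 0x5A, 0x0A)), bom_file)

encoding_hint(utf8_file)
encoding_hint(latin1_file)
encoding_hint(bom_file)

# Once the real encoding is known, iconv() converts the text to UTF-8
fixed <- iconv(readLines(latin1_file, warn = FALSE), from = "latin1", to = "UTF-8")
validUTF8(fixed)
```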
📝 Self-check

1. What is the crucial difference between load("results.RData") and readRDS("results.rds")?

A. load is faster
B. readRDS can't save lists
C. load overwrites variables by name; readRDS requires assignment
D. They're identical

2. You are downloading a .bam file on Windows. Which call is correct?

A. download.file(url, "x.bam")
B. download.file(url, "x.bam", mode = "w")
C. download.file(url, "x.bam", mode = "wb")
D. read.bam(url)

3. A collaborator downloads and runs your R script. Which path style is most likely to break for them?

A. here("data", "counts.csv")
B. file.path("data", "counts.csv")
C. "data/counts.csv"
D. "C:/Users/Charlene/Desktop/proj/data/counts.csv"