REFERENCES

R for Bioinformatics References Index

References, citations, and sources for each chapter of the R for Bioinformatics Interactive Tutorial: original papers, vignettes and official documentation, best-practice reviews, and benchmarks, with DOI / URL links for further reading.

How to use this page

For every package, algorithm, and concept introduced in the tutorial, this page lists the primary literature and authoritative documentation. Tag legend:

📄 Paper: peer-reviewed publication (DOI / PubMed)

📘 Doc: official package documentation, vignette, or tutorial

Best Practice: authoritative review or community guideline

📊 Benchmark: independent method comparison

🗄️ Database: reference data resource

📚 Book: free or classic textbook

📑 Cheatsheet: quick-reference card from Posit

Page contents

⭐ Foundational books & reviews

Start here for a solid foundation in R, statistics, and bioinformatics. All four books can be read online legally and free of charge.

⚙️ Setup & environment

R itself, the RStudio (Posit) IDE, project conventions, and the BiocManager installer.

🔤 R language basics

🧱 Data structures

📂 File I/O & path management

🔧 tidyverse

🎨 ggplot2 visualization

📐 Biostatistics

⏳ Survival analysis

🧬 Bioconductor core

🧪 Bulk RNA-seq

🔬 Functional enrichment

🌋 Heatmap & volcano

🧩 Variants & GWAS

🔵 DNA methylation

📍 ChIP-seq & ATAC-seq

📦 Reproducibility & workflow

💬 Community & help

💡
Citation guidance: when you publish work using these packages, please cite the primary references above in addition to the package itself. citation("packageName") in R prints the recommended BibTeX entry.
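As a minimal sketch of the lookup described above, the snippet below uses the base "stats" package purely because it ships with every R installation; substitute the package you actually used.

```r
# Recommended citation for a package (here "stats", chosen only as a
# stand-in that is always installed).
cit <- citation("stats")
print(cit)            # human-readable citation text

bib <- toBibtex(cit)  # the same citation as a BibTeX entry
print(bib)
```

The BibTeX text can be pasted into a reference manager or a .bib file.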
📖
Lookup guide: all DOI links point to the publisher's page. To download a PDF, check PubMed, PMC, arXiv, or bioRxiv for an open-access version.

📌 Tutorial notes and details

Notes on R / Bioconductor tools and workflows, covering easy-to-miss pitfalls, version dependencies, and context-dependent caveats; some are errata, most are supplementary notes.

Notes

Choosing between DESeq2, edgeR, and limma-voom

All three are mainstream tools for bulk RNA-seq differential expression (DE), but they rest on different assumptions. DESeq2 and edgeR fit a negative-binomial GLM, whereas limma-voom converts counts to log-CPM values, uses voom to estimate precision weights from the mean-variance trend, and then fits a linear model. Schurch et al. 2016 (RNA), based on a 48-replicate experiment, found the three comparable when n < 12, with DESeq2 more conservative and edgeR more sensitive in small samples; for n > 12, limma-voom is markedly faster while producing results highly concordant with the NB methods. For complex designs (blocks / random effects) the limma family is recommended; for fast iteration over many contrasts, edgeR's quasi-likelihood pipeline (glmQLFit) is the mainstream recommendation. Do not feed TPM / FPKM into any of the three: all of them expect raw integer counts.
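A minimal sketch of two of the routes discussed here, edgeR's quasi-likelihood pipeline and limma-voom, run on the same object. It assumes the Bioconductor package edgeR is installed (it attaches limma as a dependency); the counts are simulated and the sample names are hypothetical, so the output has no biological meaning.

```r
library(edgeR)   # attaches limma as well

set.seed(1)
# Simulated raw integer counts: 500 genes x 6 samples (3 control, 3 treated).
counts <- matrix(rnbinom(500 * 6, mu = 100, size = 10), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:6)))
group  <- factor(rep(c("control", "treated"), each = 3))
design <- model.matrix(~ group)

y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, design)          # drop genes with too few counts
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)                  # TMM normalization factors

# Route 1: edgeR quasi-likelihood pipeline
y_disp <- estimateDisp(y, design)
fit_ql <- glmQLFit(y_disp, design)
res_ql <- glmQLFTest(fit_ql, coef = 2)
top_ql <- topTags(res_ql, n = 5)

# Route 2: limma-voom on the same filtered, normalized counts
v      <- voom(y, design)
fit_lm <- eBayes(lmFit(v, design))
top_lm <- topTable(fit_lm, coef = 2, number = 5)
```

Both routes start from the same raw integer counts and the same design matrix, which makes it easy to compare their gene rankings on real data.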

Source: Schurch et al. 2016 (RNA) · Law et al. 2016 (F1000 RNA-seq workflow)

Notes

Median-of-ratios is suited to differential expression; CPM / TPM are not

DESeq2's median-of-ratios (Anders & Huber 2010) and edgeR's TMM (Robinson & Oshlack 2010) estimate size factors / normalization factors that correct for between-sample differences in library size and composition; they are used inside the DE model and never alter the raw counts. TPM and FPKM normalize by gene length within each sample (TPM additionally scales each sample to sum to 10⁶; FPKM values do not share a fixed sum). They are suitable for visualization and for comparing genes within a sample, but they are not reliably comparable across samples, and they violate the integer-count assumption of DESeq2 / edgeR. A common mistake is rounding a TPM matrix to integers and feeding it into DESeq2, which breaks dispersion estimation. Keep raw integer counts for DE, and use the vst() or rlog() transformations only for plotting and heatmaps.
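To make the two kinds of normalization concrete, here is a hand-rolled base-R sketch on a toy matrix. It follows the median-of-ratios idea (per-gene geometric mean as the reference, then a per-sample median ratio) and is not a substitute for DESeq2's own estimator; the counts and gene lengths are invented for illustration.

```r
# Toy raw counts: 4 genes x 2 samples; sample B was sequenced twice as deeply.
counts <- matrix(c( 10,  20,
                    50, 100,
                   200, 400,
                     5,  10),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("g", 1:4), c("A", "B")))
gene_length_kb <- c(g1 = 1, g2 = 2, g3 = 4, g4 = 0.5)  # hypothetical lengths

# Median-of-ratios: per-gene geometric mean as reference, then the median
# ratio of each sample to that reference (computed on the log scale).
log_geo_mean <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(col) {
  usable <- is.finite(log_geo_mean) & col > 0
  exp(median((log(col) - log_geo_mean)[usable]))
})
depth_ratio <- unname(size_factors["B"] / size_factors["A"])

# TPM: divide by gene length first, then scale each sample to one million.
rate <- counts / gene_length_kb            # row-wise division by length
tpm  <- t(t(rate) / colSums(rate)) * 1e6
tpm_totals <- colSums(tpm)
```

The size factors recover the two-fold depth difference between the samples while leaving `counts` untouched, whereas the TPM matrix is a rescaled, non-integer quantity whose columns each sum to one million.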

Source: Anders & Huber 2010 (Genome Biol) · Robinson & Oshlack 2010 (TMM)

Notes

Pin the Bioconductor version with BiocManager and check it with BiocManager::valid()

Bioconductor releases twice a year (April / October), and each release is tied to a specific R version (e.g. Bioc 3.19 ⇄ R 4.4). Mixing packages from different releases can cause namespace errors or load failures. After starting R, run BiocManager::valid() to check that the installed packages are consistent with a single Bioc release; if it does not return TRUE, it reports the out-of-date or too-new packages, and you can follow its suggestion to repair them. Switch to a specific release with BiocManager::install(version = "3.19"); for team-wide reproducibility, pair this with renv or Docker (the Bioconductor or BioContainers images) to pin the entire stack.
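The check described above could be wrapped in a small helper like the one below. `check_bioc()` is a hypothetical name introduced here for illustration, and the sketch assumes BiocManager is installed and that the machine can reach the Bioconductor repositories.

```r
# Hypothetical helper: report the Bioconductor release in use and whether
# the installed packages are consistent with it.
check_bioc <- function() {
  message("Bioconductor release in use: ",
          as.character(BiocManager::version()))
  ok <- BiocManager::valid()   # TRUE, or a report of mismatched packages
  if (!isTRUE(ok)) print(ok)
  invisible(isTRUE(ok))
}

# check_bioc()   # uncomment to run; requires network access
```

Calling it at the top of an analysis script gives an early warning before a mixed-release library causes harder-to-diagnose load errors.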

Source: Bioconductor Install Guide · Huber et al. 2015 (Nat Methods)

Notes

An S4 DataFrame is not a base data.frame

Bioconductor's S4Vectors::DataFrame (capital D and F) is not base R's data.frame. It can hold arbitrary S4 objects (List, Rle, GRanges, etc.) as columns and carries object-level metadata via metadata(); the colData and rowData of SummarizedExperiment / SingleCellExperiment are DataFrames by default. A common mistake is coercing a DataFrame with as.data.frame() and writing the result back, which drops the S4 metadata and converts Rle / List columns to base types. Use df[df$type == "gene", ] for row subsetting (this preserves the DataFrame), and convert to a tibble / data.frame only when handing off to APIs that expect base data frames, such as ggplot2. Support for S4 columns in ggplot2 depends on the installed version, so check its release notes before relying on it.
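A short sketch of the subsetting and hand-off pattern, assuming the Bioconductor package S4Vectors is installed. The column names and the "hg38" annotation are hypothetical.

```r
library(S4Vectors)

df <- DataFrame(
  type  = c("gene", "exon", "gene"),
  score = Rle(c(1, 1, 2))          # run-length encoded column
)
metadata(df)$genome <- "hg38"      # hypothetical object-level annotation

genes <- df[df$type == "gene", ]   # row subsetting keeps the S4 DataFrame
class(genes)

plain <- as.data.frame(genes)      # base data.frame for plotting APIs
class(plain)
class(plain$score)                 # inspect what the Rle column became
```

Keeping `genes` as the working object and creating `plain` only at the plotting step avoids writing a lossy copy back into a SummarizedExperiment.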

Source: S4Vectors Bioc page · Lawrence et al. 2013 (PLoS Comp Bio, GenomicRanges)

Notes

In ggplot2, factor level order determines legend order and color mapping

When a variable is mapped to the fill / color aesthetic in ggplot2, the legend order and color assignment follow the column's factor levels, not the order in which values appear in the data. Character columns are ordered alphabetically: "Control" before "Treatment" looks reasonable, but "10mg" sorts before "2mg". Set the order explicitly with forcats::fct_relevel() or factor(x, levels = c(...)); the named-vector form scale_fill_manual(values = c("Control" = "grey", ...)) matches colors by name and so avoids relying on level order. Stacked bars follow the same level order, with the first level placed at the top of the stack by default so that it matches the legend.
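The ordering issue can be reproduced in a few lines. This sketch assumes ggplot2 is installed; the dose labels and colors are made up for illustration.

```r
dose <- c("Control", "2mg", "10mg", "2mg", "Control")

default_levels <- levels(factor(dose))   # sorted as strings
default_levels

# Explicit, biologically meaningful order.
dose_f <- factor(dose, levels = c("Control", "2mg", "10mg"))
levels(dose_f)

# Named vector: colors are matched by level name, not by position.
dose_cols <- c("Control" = "grey60", "2mg" = "#56B4E9", "10mg" = "#0072B2")

library(ggplot2)
p <- ggplot(data.frame(dose = dose_f), aes(x = dose, fill = dose)) +
  geom_bar() +
  scale_fill_manual(values = dose_cols)
```

Printing `p` draws the bars in the Control, 2mg, 10mg order, and the named color vector keeps each dose on its color even if the levels are later reordered.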

Source: ggplot2 book · Colour scales · forcats docs

Notes

Color-blind-friendly palettes: viridis / Okabe-Ito rather than rainbow / jet

About 8% of men and 0.5% of women have red-green color-vision deficiency, and the classic rainbow() / matplotlib jet palettes lose their ordinal information when converted to greyscale because their luminance is not monotonic. For continuous values (log2FC, expression, p-values) use scale_fill_viridis_c() (the viridis / magma / cividis options are all color-blind safe and monotonic in luminance). For categorical variables with ≤ 8 levels, use the Okabe & Ito 8-color palette (black, orange, sky blue, bluish green, yellow, blue, vermillion, reddish purple), available in R via palette.colors(palette = "Okabe-Ito"), which also appends a grey. Test palettes with colorBlindness::cvdPlot() or the colorblind-safe filter on the ColorBrewer site.
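Both palettes can be pulled from base R without extra packages, as in this small sketch (palette.colors() needs R 4.0 or later, hcl.colors() R 3.6 or later). Note that the palette name has to be passed by argument name, because the first positional argument of palette.colors() is the number of colors.

```r
# Okabe-Ito qualitative palette; unname() gives a plain character vector
# regardless of whether the installed R version returns names.
okabe_ito <- unname(palette.colors(palette = "Okabe-Ito"))
okabe_ito

# A viridis-style sequential palette with 7 steps.
vir <- hcl.colors(7, palette = "viridis")
vir
```

These vectors can be passed to scale_fill_manual(values = okabe_ito) for categories or scale_fill_gradientn(colours = vir) for continuous values in ggplot2.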

Source: Okabe & Ito · Color Universal Design · viridis CRAN page

Notes

Subtle differences between tibble and data.frame

tibble fixes three classic data.frame pitfalls: (1) it never converts strings to factors automatically (base R has defaulted stringsAsFactors to FALSE since 4.0, but scripts run on older R versions still hit this); (2) single-column subsetting tbl[, 1] always returns a tibble, whereas base df[, 1] defaults to drop = TRUE and returns a vector, which breaks downstream nrow() calls or matrix operations; (3) tibble does not partially match column names, so tbl$nam returns NULL with a warning, while a base data.frame silently treats it as $name. Many Bioconductor functions expect a base data.frame or a DataFrame, and tibbles do not keep row names. Before passing a tibble to constructors such as SummarizedExperiment(), convert it to a data.frame and restore the row names, for example with tibble::column_to_rownames().
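The subsetting and partial-matching differences are easy to see side by side. This sketch assumes the tibble package is installed; the gene names and counts are toy values.

```r
library(tibble)

df  <- data.frame(name = c("TP53", "BRCA1"), count = c(10L, 25L))
tbl <- as_tibble(df)

class(df[, 1])                  # base: drops to a vector
class(df[, 1, drop = FALSE])    # base: stays a data.frame when asked
class(tbl[, 1])                 # tibble: stays a tibble

df$nam                          # base: partial match on "name"
suppressWarnings(tbl$nam)       # tibble: no partial matching

# Hand-off to row-name-based APIs: move an identifier column into row names.
plain <- column_to_rownames(tbl, var = "name")
rownames(plain)
```

Using column_to_rownames() is one way to get a base data.frame with row names; assigning rownames(x) <- ids after as.data.frame() would work as well.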

Source: tibble · invariants vignette · R4DS · Data import

Notes

dplyr's *_if / *_at / *_all verbs have been superseded by across()

Since dplyr 1.0 (June 2020), the scoped verbs mutate_if(), summarise_at(), mutate_all(), and friends have been marked superseded (not deprecated, but no longer developed), and a single across() covers all three variants: mutate(across(where(is.numeric), scale)) or summarise(across(c(weight, height), \(x) mean(x, na.rm = TRUE))). Passing extra arguments through across(), as in across(..., mean, na.rm = TRUE), has itself been deprecated since dplyr 1.1.0, so wrap them in an anonymous function instead. R4DS 2e and current tutorials use across(); old scripts still run but are worth migrating for readability and consistency. The lifecycle badge is shown on the help page, ?mutate_if.
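A small before/after sketch of the migration, assuming dplyr 1.0 or later and R 4.1 or later (for the `\(x)` lambda and the native pipe). The `people` data frame is invented for illustration.

```r
library(dplyr)

people <- data.frame(
  id     = c("a", "b", "c"),
  weight = c(60, 70, NA),
  height = c(160, 170, 180)
)

# Replaces mutate_if(is.numeric, ...): center every numeric column.
centred <- people |>
  mutate(across(where(is.numeric), \(x) x - mean(x, na.rm = TRUE)))

# Replaces summarise_at(vars(weight, height), mean, na.rm = TRUE).
means <- people |>
  summarise(across(c(weight, height), \(x) mean(x, na.rm = TRUE)))
means
```

Wrapping the function in a lambda keeps arguments such as na.rm next to the call they belong to, which is the form current dplyr versions expect.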

Source: dplyr · Column-wise operations · dplyr 1.0 release notes

Notes

Implicit vs. explicit modes of renv::snapshot()

renv defaults to type = "implicit": the lockfile records only the packages that renv::dependencies() detects in the project's code (R, Rmd, Qmd files), which covers library() / require() calls as well as pkg::fun() references. The lockfile stays small, but packages that are only referenced dynamically (for example through a package name held in a variable) can be missed and turn up absent at deploy time. For R-package development or Quarto books, prefer renv::snapshot(type = "explicit"), which records the packages declared in DESCRIPTION, and keep its Depends / Imports fields up to date; research projects that want to freeze the entire library can use type = "all". Run renv::status() before each commit to check that the lockfile matches the installed library.
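To see what the static scan behind the implicit mode would pick up, renv::dependencies() can be pointed at a single file. The sketch below assumes renv is installed and uses a throwaway script written to a temporary file; the packages named inside it do not need to be installed, since the scan only parses the code.

```r
# Throwaway example script containing one library() call and one pkg::fun()
# reference.
script <- tempfile(fileext = ".R")
writeLines(c(
  "library(dplyr)",
  "p <- ggplot2::ggplot()"
), script)

deps <- renv::dependencies(script, quiet = TRUE)
unique(deps$Package)   # packages the scan associates with this file

# In a real project (not run here, because these act on the project itself):
# renv::snapshot(type = "explicit")  # record only what DESCRIPTION declares
# renv::snapshot(type = "all")       # record the whole project library
# renv::status()                     # compare lockfile, library, and code
```

Running the same scan on a project directory before a snapshot is a quick way to spot packages that the lockfile would otherwise leave out.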

Source: renv official vignette · ?renv::snapshot

Notes

sessionInfo() is not enough; use sessioninfo::session_info()

Base R's sessionInfo() lists attached packages and loaded namespaces, but it does not report package sources (CRAN / Bioconductor / GitHub), commit SHAs, or how the R and Bioconductor versions line up, which is not enough for reproducibility tracking. sessioninfo::session_info() (also used by devtools / pkgdown) additionally reports each package's source (e.g. Bioconductor 3.19 or Github (user/repo@sha)), its date, and whether it was attached or only loaded via a namespace, in a more readable layout. When submitting a paper or responding to reviewers, attach the output of session_info() rather than sessionInfo(); for long-running projects, consider keeping an renv lockfile (renv.lock) alongside it for dual tracking.
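One way to capture this record at the end of an analysis is sketched below. It assumes the sessioninfo package is installed and writes to a temporary file; in a real project the path would point next to the analysis outputs.

```r
si <- sessioninfo::session_info()

si$platform$version   # R version string

# Package names with the source each one was installed from.
pkg_sources <- data.frame(
  package = si$packages$package,
  source  = si$packages$source
)
head(pkg_sources)

# Save a plain-text record (temporary file here, for illustration only).
out <- tempfile(fileext = ".txt")
writeLines(capture.output(print(si)), out)
```

The saved text file can be attached to a submission or committed with the results, so the package sources are recorded alongside the figures they produced.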

Source: sessioninfo package site · Sandve et al. 2013 (10 rules reproducible research)