How to use this page
For every package, algorithm, and concept introduced in the tutorial, we list the primary literature and authoritative documentation. Tag legend:
- Paper: peer-reviewed publication (DOI / PubMed)
- Doc: official package documentation, vignette, or tutorial
- Best Practice: authoritative review or community guideline
- Benchmark: independent method comparison
- Database: reference data resource
- Book: free or classic textbook
- Cheatsheet: quick-reference card from Posit
⭐ Foundational books & reviews
Start here for a solid base in R, statistics, and bioinformatics. Most of these books can be read online legally for free.
- 📚 BOOK R for Data Science (2e). O'Reilly (2023).
- 📚 BOOK Advanced R (2e). CRC Press (2019).
- 📚 BOOK Modern Statistics for Modern Biology. Cambridge University Press (2019).
- 📚 BOOK Bioinformatics Data Skills. O'Reilly (2015).
- ⭐ REVIEW Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5, R80 (2004).
- ⭐ REVIEW Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12, 115–121 (2015).
⚙️ Setup & environment
R itself, the RStudio (Posit) IDE, project conventions, and the BiocManager installer.
- DOC R Project: download & install (CRAN).
- DOC Posit (formerly RStudio): RStudio Desktop free download.
- DOC R Installation and Administration manual.
- DOC An Introduction to R (W.N. Venables, D.M. Smith and the R Core Team).
- DOC BiocManager — installing & managing Bioconductor packages.
- DOC Rtools — Windows toolchain for compiling source packages.
- ⭐ TIPS What They Forgot to Teach You About R. online book (2022).
- PAPER Good Enough Practices in Scientific Computing. PLOS Comput Biol 13(6):e1005510 (2017).
🔤 R language basics
- DOC The R Language Definition (R Core Team).
- 📚 BOOK Advanced R — chapters on vectors, subsetting, control flow, functions.
- ⭐ STYLE The tidyverse style guide.
- ⭐ STYLE Google's R Style Guide (formerly internal).
- 📑 CHEAT Base R cheatsheet (Mhairi McNeill).
🧱 Data structures
- 📚 BOOK Advanced R — Vectors, Lists, Subsetting, S3 / S4 / R6.
- DOC tibble package documentation.
- PAPER Object-Oriented Programming, Functional Programming and R. Statistical Science 29(2):167–180 (2014).
- DOC methods package: formal classes & methods (S4).
- DOC data.table introduction vignette.
📂 File I/O & path management
- DOC readr: fast and friendly file I/O (tidyverse).
- DOC readxl, writexl: Excel I/O.
- DOC data.table::fread: very fast, multi-threaded CSV reader.
- DOC arrow: Apache Arrow / Parquet / Feather in R.
- DOC here: robust paths via project root.
- DOC fst: ultra-fast random access of large data frames.
- PAPER rhdf5: HDF5 interface for R. Bioconductor (2010+).
🔧 tidyverse
- PAPER Welcome to the tidyverse. J Open Source Softw 4(43):1686 (2019).
- PAPER Tidy Data. J Stat Softw 59(10) (2014).
- DOC dplyr: A Grammar of Data Manipulation.
- DOC tidyr: pivoting & tidy data.
- DOC purrr: functional programming over lists/vectors.
- DOC magrittr: pipe operator (%>%).
- 📑 CHEAT dplyr / tidyr / purrr cheatsheets (Posit).
🎨 ggplot2 visualization
- PAPER ggplot2: elegant graphics for data analysis. Springer-Verlag (2016, 3rd ed in progress).
- PAPER The Grammar of Graphics (2nd ed). Springer (2005).
- DOC ggplot2 reference site.
- DOC patchwork: compose multiple ggplots with + and /.
- DOC ggrepel: automatic non-overlapping text labels.
- DOC ggpubr, cowplot, ggsci, hrbrthemes: publication themes & helpers.
- ⭐ COLOR viridis: color-blind-safe, perceptually uniform palettes.
- ⭐ COLOR ColorBrewer — palette design for cartography & data viz.
📐 Biostatistics
- 📚 BOOK Modern Statistics for Modern Biology.
- 📚 BOOK Generalized Additive Models: An Introduction with R (2nd ed). CRC Press (2017).
- PAPER Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. JRSS B 57(1):289–300 (1995).
- PAPER Statistical significance for genome-wide studies. PNAS 100(16):9440–9445 (2003).
- DOC stats (base R statistics): t.test(), aov(), lm(), glm(), p.adjust().
- DOC broom: tidy model output (lm, glm, cox, ...).
- DOC qvalue: Storey q-values (Bioconductor).
⏳ Survival analysis
- PAPER Nonparametric Estimation from Incomplete Observations. JASA 53(282):457–481 (1958).
- PAPER Regression Models and Life-Tables. JRSS B 34(2):187–202 (1972).
- 📚 BOOK Modeling Survival Data: Extending the Cox Model. Springer (2000). ISBN 978-0-387-98784-2.
- DOC survival R package (Therneau).
- DOC survminer: KM plots, forest plots.
- ⭐ TUT Survival Analysis in R (online tutorial).
🧬 Bioconductor core
- PAPER SummarizedExperiment: a container for matrix-like assays. Bioconductor package (2016+).
- PAPER Software for Computing and Annotating Genomic Ranges. PLOS Comput Biol 9(8):e1003118 (2013).
- PAPER Biostrings: Efficient manipulation of biological strings. Bioconductor package.
- DOC AnnotationDbi & org.*.eg.db: gene ID mapping framework.
- DOC GenomicFeatures / TxDb.* & EnsDb.*: transcript databases.
- DOC rtracklayer: import/export GTF / BED / bigWig.
- DOC AnnotationHub & ExperimentHub: cloud annotation/experiment data.
- 📚 BOOK Bioconductor Course Materials.
🧪 Bulk RNA-seq
- PAPER Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550 (2014).
- PAPER edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140 (2010).
- PAPER limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43(7):e47 (2015).
- PAPER voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15:R29 (2014).
- PAPER Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research 4:1521 (2015).
- PAPER Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417–419 (2017).
- 📊 BENCH How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 22(6):839–851 (2016).
- DOC RNA-seq workflow: gene-level exploratory analysis and differential expression (Bioconductor).
🔬 Functional enrichment
- PAPER clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation 2(3):100141 (2021).
- PAPER clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16(5):284–287 (2012).
- PAPER Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102(43):15545–15550 (2005).
- PAPER Fast gene set enrichment analysis (fgsea). bioRxiv 060012 (2021).
- 🗄️ DB The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst 1(6):417–425 (2015).
- 🗄️ DB Gene Ontology: tool for the unification of biology. Nat Genet 25:25–29 (2000).
- 🗄️ DB KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28(1):27–30 (2000).
- 🗄️ DB The Reactome pathway knowledgebase. Nucleic Acids Res 50(D1):D687–D692 (2022).
- DOC msigdbr: MSigDB gene sets in tidy R format.
🌋 Heatmap & volcano
- PAPER Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32(18):2847–2849 (2016).
- 📚 BOOK ComplexHeatmap Complete Reference.
- DOC pheatmap: pretty heatmaps with one call.
- DOC EnhancedVolcano: publication-ready volcano plots.
- DOC circlize: circular visualization (chord diagrams, Circos plots).
- ⭐ DESIGN Fundamentals of Data Visualization. O'Reilly (2019).
🧩 Variants & GWAS
- PAPER The variant call format and VCFtools. Bioinformatics 27(15):2156–2158 (2011).
- PAPER vcfR: a package to manipulate and visualize variant call format data in R. Mol Ecol Resour 17(1):44–53 (2017).
- DOC VariantAnnotation: Bioconductor VCF S4 framework.
- PAPER The Ensembl Variant Effect Predictor. Genome Biology 17:122 (2016).
- PAPER PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575 (2007).
- PAPER A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28(24):3326–3328 (2012).
- ⭐ REVIEW Genome-wide association studies. Nat Rev Methods Primers 1:59 (2021).
- ⭐ TUT A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res 27:e1608 (2018).
- DOC qqman: Manhattan & Q-Q plots in R.
- 🗄️ DB The NHGRI-EBI GWAS Catalog. Nucleic Acids Res 51(D1):D977–D985 (2023).
🔵 DNA methylation
- PAPER Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30(10):1363–1369 (2014).
- PAPER ChAMP: updated methylation analysis pipeline for Illumina BeadChips. Bioinformatics 33(24):3982–3984 (2017).
- PAPER De novo identification of differentially methylated regions in the human genome. Epigenetics & Chromatin 8:6 (2015).
- PAPER Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11:587 (2010).
- PAPER methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biology 13:R87 (2012).
- PAPER BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology 13:R83 (2012).
- PAPER DNA methylation age of human tissues and cell types. Genome Biology 14:R115 (2013).
- PAPER methylclock: a Bioconductor package to estimate DNA methylation age. Bioinformatics 37(12):1759–1760 (2021).
- ⭐ TUT A cross-package Bioconductor workflow for analysing methylation array data. F1000Research 5:1281 (2016).
📍 ChIP-seq & ATAC-seq
- PAPER ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31(14):2382–2383 (2015).
- PAPER DiffBind: differential binding analysis of ChIP-seq peak data. Bioconductor / Cancer Research UK Cambridge Institute (2011).
- PAPER Model-based Analysis of ChIP-Seq (MACS). Genome Biology 9:R137 (2008).
- PAPER Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10:1213–1218 (2013).
- PAPER chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat Methods 14:975–978 (2017).
- 🗄️ DB JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res 52(D1):D174–D182 (2024).
- PAPER Single-cell chromatin state analysis with Signac. Nat Methods 18:1333–1341 (2021).
- PAPER ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet 53:403–411 (2021).
- ⭐ REVIEW From reads to insight: a hitchhiker's guide to ATAC-seq data analysis. Genome Biology 21:22 (2020).
- ⭐ STD ENCODE ChIP-seq / ATAC-seq pipeline standards.
📦 Reproducibility & workflow
- DOC renv: reproducible per-project R environments.
- DOC RStudio Projects.
- 📚 BOOK R Markdown: The Definitive Guide. CRC Press (2018).
- DOC Quarto — next-generation literate programming (Posit).
- 📚 BOOK Happy Git and GitHub for the useR.
- DOC targets: Make-like pipelines for R.
- DOC testthat: unit testing framework.
- DOC Rocker: Docker images for R / RStudio / Bioconductor.
- PAPER Ten Simple Rules for Reproducible Computational Research. PLOS Comput Biol 9(10):e1003285 (2013).
- PAPER An introduction to Docker for reproducible research. SIGOPS Oper. Syst. Rev. 49(1):71–79 (2015).
💬 Community & help
- DOC Bioconductor support forum — package authors typically reply within 24h.
- DOC Stack Overflow: [r] tag.
- DOC Posit Community forum.
- DOC Bioconductor Slack workspace (community.bioconductor.org).
- DOC R-Ladies global / chapters — peer-led R community.
- DOC Bioinformatics Stack Exchange.
- DOC browseVignettes("<package>"): built-in package tutorials, accessible offline.
Running citation("packageName") in R prints the recommended citation together with its BibTeX entry.
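As a small illustration of the citation lookup, the sketch below uses the stats package only because it ships with every R installation; substitute the package you actually used.

```r
# Look up how to cite a package; "stats" is used here only as an example.
cit <- citation("stats")
print(cit)             # human-readable reference

# Extract the BibTeX form, ready to paste into a .bib file.
bib <- toBibtex(cit)
print(bib)
```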
📌 Tutorial notes & details
Notes on R / Bioconductor tool nuances, version dependencies, and context-dependent caveats; some are errata, most are supplementary notes.
Choosing between DESeq2, edgeR, and limma-voom
All three are mainstream bulk RNA-seq DE tools but rest on different assumptions: DESeq2 and edgeR fit a negative-binomial GLM, while limma-voom converts counts to log-CPM, estimates mean-variance precision weights with voom, and then fits a weighted linear model. Schurch et al. 2016 (RNA), using 48 biological replicates per condition, found that edgeR and DESeq2 gave the best balance of true and false positives with fewer than 12 replicates per condition; in small samples DESeq2 tends to be the more conservative and edgeR the more sensitive. limma-voom is fast on large sample sizes, and its results are generally concordant with the NB methods. For complex designs (blocks / random effects) the limma family is recommended; for fast iteration over many contrasts, edgeR's quasi-likelihood pipeline (glmQLFit) is a widely recommended choice. Never feed TPM / FPKM into any of the three: they all expect raw, un-normalized counts, and DESeq2 strictly requires integers.
Source: Schurch et al. 2016 (RNA) · Law et al. 2016 (F1000 RNA-seq workflow)
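As a minimal sketch of the raw-counts rule, the base-R helper below rejects a matrix that has already been scaled. The helper name is_raw_counts and the toy numbers are illustrative assumptions; they are not part of DESeq2, edgeR, or limma.

```r
# Hypothetical helper: TRUE only for a non-negative, whole-number count matrix.
is_raw_counts <- function(m) {
  is.numeric(m) && !anyNA(m) && all(m >= 0) && all(m == round(m))
}

# Toy data: 3 genes x 2 samples (made-up numbers).
counts <- matrix(c(10, 0, 250,
                   31, 7, 1200),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Counts-per-million: divide each column by its library size, then scale.
cpm <- t(t(counts) / colSums(counts)) * 1e6

is_raw_counts(counts)  # raw counts pass the check
is_raw_counts(cpm)     # scaled values are rejected
```

A guard like this can sit at the top of an analysis script, before the count matrix is handed to any DE package.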
Median-of-ratios is suited to differential expression; CPM / TPM are not
DESeq2's median-of-ratios (Anders & Huber 2010) and edgeR's TMM (Robinson & Oshlack 2010) estimate size factors / normalization factors to correct between-sample library-size and compositional bias; they are used internally by the DE model and never alter the raw counts. TPM and FPKM normalize within each sample by gene length (TPM additionally rescales every sample to sum to 10⁶, which FPKM does not guarantee). They are suitable for visualization and for comparing genes within a sample, but they are not reliably comparable across samples and they violate the count assumption of DESeq2 / edgeR. A common mistake is rounding a TPM matrix to integers and feeding it into DESeq2, which completely breaks dispersion estimation. Keep raw integer counts for DE and use vst() or rlog() only for plotting and heatmaps.
Source: Anders & Huber 2010 (Genome Biol) · Robinson & Oshlack 2010 (TMM)
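To make the contrast concrete, here is a base-R sketch of both calculations on a toy matrix. It follows the textbook definitions of median-of-ratios and TPM rather than calling DESeq2 itself, and the counts and gene lengths are invented for illustration.

```r
# Toy counts: 4 genes x 3 samples; s2 is sequenced 2x as deeply as s1, s3 half as deeply.
counts <- matrix(c(100, 200,  50, 400,
                   200, 400, 100, 800,
                    50, 100,  25, 200),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

# Median-of-ratios: compare each sample to a per-gene geometric-mean reference.
log_geo_means <- rowMeans(log(counts))
keep <- is.finite(log_geo_means)          # genes with a zero count are skipped
size_factors <- apply(counts[keep, , drop = FALSE], 2, function(col) {
  exp(median(log(col) - log_geo_means[keep]))
})
print(size_factors)

# TPM: length-normalize within each sample, then rescale each sample to 1e6.
gene_length_kb <- c(1, 2, 0.5, 4)         # assumed gene lengths
rate <- counts / gene_length_kb           # reads per kilobase
tpm  <- t(t(rate) / colSums(rate)) * 1e6
print(colSums(tpm))
```

The size factors recover the relative sequencing depth of the three samples, while every TPM column sums to the same total, which is exactly why TPM hides library-size differences that the DE model needs to see.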
Pin the Bioconductor version with BiocManager: use BiocManager::valid()
Bioconductor releases twice a year (April / October), and each release is strictly tied to one R minor version (e.g. Bioc 3.19 ⇄ R 4.4). Mixing packages from different releases causes namespace failures and load errors. After starting R, run BiocManager::valid() to verify that installed packages come from the same Bioc release; if the result is not TRUE, follow the prompts to repair. Install a specific release with BiocManager::install(version = "3.19"); for team reproducibility, pair this with renv or the official Bioconductor / BioContainers Docker images to pin the entire stack.
Source: Bioconductor Install Guide · Huber et al. 2015 (Nat Methods)
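The check described above might be wrapped as follows. The function name check_bioc is a hypothetical convenience wrapper; it is only defined here, not run, because BiocManager::valid() may need network access to look up current package versions.

```r
# Hypothetical wrapper around the BiocManager validity check.
check_bioc <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    message("BiocManager is not installed; run install.packages('BiocManager').")
    return(invisible(NULL))
  }
  message("Bioconductor release in use: ", format(BiocManager::version()))
  # TRUE when all installed packages match the release; otherwise a report
  # of out-of-date or too-new packages.
  BiocManager::valid()
}

# Usage (uncomment to run in an interactive session):
# status <- check_bioc()
# if (!isTRUE(status)) print(status)
```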
S4 DataFrame is not base data.frame
Bioconductor's S4Vectors::DataFrame (capital D, F) is not base R's data.frame. It can hold arbitrary S4 objects (List, Rle, GRanges, etc.) as columns and supports a metadata() slot; the colData and rowData of SummarizedExperiment / SingleCellExperiment are DataFrames by default. A common mistake is coercing a DataFrame via as.data.frame() and writing it back, which drops S4 metadata and flattens Rle / List columns to strings. Use df[df$type == "gene", ] for row subsetting (preserves DataFrame) and only convert to tibble / data.frame when handing off to base APIs like ggplot2 — note that ggplot2 4.x is the first version to formally support S4 columns.
Source: S4Vectors Bioc page · Lawrence et al. 2013 (PLoS Comp Bio, GenomicRanges)
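The subsetting and hand-off pattern can be sketched like this. The block assumes S4Vectors is installed and skips itself otherwise; the column names and values are invented for illustration.

```r
if (requireNamespace("S4Vectors", quietly = TRUE)) {
  suppressPackageStartupMessages(library(S4Vectors))

  df <- DataFrame(
    type  = c("gene", "transcript", "gene"),
    score = Rle(c(1, 1, 2))             # run-length encoded column
  )

  genes <- df[df$type == "gene", ]      # row subsetting keeps the S4 class
  print(class(genes))

  plain <- as.data.frame(genes)         # explicit hand-off for plotting code
  print(class(plain))
} else {
  message("S4Vectors is not installed; skipping the DataFrame demo.")
}
```

Converting only at the last step keeps the S4 object intact for everything that still needs it.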
Factor level order in ggplot2 determines legend / color-mapping order
When a variable is mapped to ggplot2's fill / color aesthetic, the legend and color assignment follow the column's factor levels, not the data's row order. Strings get lexically sorted automatically — "Control" before "Treatment" looks reasonable, but "10mg" sorts before "2mg". Set the order explicitly with forcats::fct_relevel() or factor(x, levels = c(...)); the named-vector form scale_fill_manual(values = c("Control" = "grey", ...)) avoids reliance on level order. Stacked bars also stack from bottom up in the same level order.
Source: ggplot2 book · Colour scales · forcats docs
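The ordering problem can be reproduced in base R without drawing a plot. The dose labels and colors below are made up for illustration.

```r
dose <- c("2mg", "10mg", "Control", "10mg")

# Default: levels are the sorted unique strings, so "10mg" lands before "2mg".
print(levels(factor(dose)))

# Explicit levels give the order the legend should follow.
dose_f <- factor(dose, levels = c("Control", "2mg", "10mg"))
print(levels(dose_f))

# A named color vector is matched by name, so it does not depend on level order;
# it could be passed as scale_fill_manual(values = fill_colors).
fill_colors <- c("Control" = "grey60", "2mg" = "#56B4E9", "10mg" = "#D55E00")
print(fill_colors[levels(dose_f)])
```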
Color-blind-friendly palettes: viridis / Okabe-Ito rather than rainbow / jet
About 8% of men and 0.5% of women have red-green color-vision deficiency, and the classic rainbow() / matplotlib jet palettes lose their ordering when converted to greyscale because their luminance is not monotonic. For continuous values (log2FC, expression, p-values) use scale_fill_viridis_c() (the viridis / magma / cividis options are all color-blind safe and monotonic in luminance). For categorical variables with ≤ 8 levels, use the Okabe & Ito 8-color palette (black, orange, sky blue, bluish green, yellow, blue, vermillion, reddish purple), available in R ≥ 4.0 via palette.colors(palette = "Okabe-Ito"), which appends gray as a ninth color. Test palettes with colorBlindness::cvdPlot() or the online ColorBrewer simulator.
Source: Okabe & Ito · Color Universal Design · viridis CRAN page
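Both palettes can be inspected with base R alone. In this sketch hcl.colors(palette = "Viridis") is used as a base-R approximation of the viridis package palette, and the luma helper is a rough greyscale proxy written for this example, not a colorimetric standard.

```r
# Okabe-Ito qualitative palette. Name the argument: the first positional
# argument of palette.colors() is the number of colors, not the palette.
okabe_ito <- palette.colors(palette = "Okabe-Ito")
print(okabe_ito)

# Base-R approximation of viridis (no extra package needed).
viridis_like <- hcl.colors(7, palette = "Viridis")

# Hypothetical helper: approximate greyscale value of each color.
luma <- function(cols) {
  rgb <- col2rgb(cols)
  as.numeric(0.299 * rgb["red", ] + 0.587 * rgb["green", ] + 0.114 * rgb["blue", ])
}

print(round(luma(viridis_like)))  # compare how smoothly the values progress
print(round(luma(rainbow(7))))    # with the rainbow palette
```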
Subtle differences between tibble and data.frame
tibble fixes three classic data.frame quirks: (1) it does not coerce strings to factors automatically (base R 4.0+ defaults stringsAsFactors to FALSE, but earlier scripts and some read.* functions still bite); (2) single-column subsetting tbl[, 1] always returns a tibble, whereas base df[, 1] defaults to drop = TRUE and returns a vector, breaking downstream nrow() or matrix operations; (3) tibble disallows partial matching tbl$nam, while base data.frame silently treats it as $name. Many Bioconductor functions expect base data.frame or DataFrame — convert a tibble with as.data.frame() and restore rownames via row.names = ... before feeding it to SummarizedExperiment-style constructors.
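The base data.frame side of quirks (2) and (3) is easy to reproduce, as is the rownames step needed before a Bioconductor constructor. The sample table below is invented for illustration.

```r
df <- data.frame(sample_id = c("s1", "s2"), dose = c(2, 10))

print(class(df[, 1]))                # a plain vector: drop = TRUE is the default
print(class(df[, 1, drop = FALSE]))  # still a data.frame
print(df$samp)                       # partial matching silently returns sample_id

# Restore rownames before handing a sample table to a Bioconductor constructor.
coldata <- as.data.frame(df)
rownames(coldata) <- coldata$sample_id
print(coldata)
```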
dplyr's *_if / *_at / *_all have been superseded by across()
Since dplyr 1.0 (June 2020), the scoped verbs mutate_if(), summarise_at(), mutate_all(), and friends are marked superseded (not deprecated, but no further development); a single across() covers all three: mutate(across(where(is.numeric), scale)), summarise(across(c(weight, height), function(x) mean(x, na.rm = TRUE))). Passing extra arguments through across(..., na.rm = TRUE) is deprecated as of dplyr 1.1.0, so wrap them in an anonymous function instead. R4DS 2e and current tutorials use across(); old scripts still run but should be migrated for readability and consistency. The lifecycle badge is shown on the help page, ?mutate_if.
Source: dplyr · Column-wise operations · dplyr 1.0 release notes
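A side-by-side migration might look like the sketch below. It assumes dplyr 1.0 or later is installed and skips itself otherwise; the data frame is invented for illustration.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  suppressPackageStartupMessages(library(dplyr))

  df <- data.frame(id     = c("a", "b", "c"),
                   weight = c(60, 72, NA),
                   height = c(165, 180, 172))

  # Superseded scoped verb.
  old <- summarise_at(df, vars(weight, height), mean, na.rm = TRUE)

  # across() equivalent; extra arguments go inside an anonymous function.
  new <- summarise(df, across(c(weight, height), function(x) mean(x, na.rm = TRUE)))
  print(new)

  # scale() returns a one-column matrix, so convert back to a plain vector.
  scaled <- mutate(df, across(where(is.numeric), function(x) as.numeric(scale(x))))
  print(scaled)
} else {
  message("dplyr is not installed; skipping the across() demo.")
}
```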
Implicit and explicit modes of renv::snapshot()
renv defaults to type = "implicit": the lockfile records only the packages that renv::dependencies() finds by statically scanning the project's code (R, Rmd, Qmd files), which covers library() / require() calls as well as pkg::fun() references. The lockfile stays small, but packages that are loaded dynamically, for example library(pkg, character.only = TRUE) with a computed name, can be missed and end up absent at deploy time. For R-package development or Quarto books, prefer renv::snapshot(type = "explicit") while maintaining DESCRIPTION's Depends / Imports; for research projects that want to freeze the entire library, use type = "all". Run renv::status() before each commit to verify the lockfile matches the installed library.
Source: renv official vignette · ?renv::snapshot
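The snapshot-then-verify routine could be wrapped as below. The function name snapshot_project is hypothetical, and it is only defined here, not run, because renv::snapshot() rewrites the project's lockfile.

```r
# Hypothetical wrapper: snapshot with an explicit mode, then report drift.
snapshot_project <- function(type = c("implicit", "explicit", "all")) {
  type <- match.arg(type)
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("renv is not installed; run install.packages('renv').")
  }
  # prompt = FALSE writes renv.lock without asking for confirmation.
  renv::snapshot(type = type, prompt = FALSE)
  # Reports whether the lockfile, the library, and the code still agree.
  renv::status()
}

# Usage inside an renv project (uncomment to run):
# snapshot_project("explicit")
```

Keeping the mode as an explicit argument makes the choice visible in the script instead of relying on the default.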
sessionInfo() is not enough; use sessioninfo::session_info() instead
Base R's sessionInfo() lists attached packages and loaded namespaces but does not report package sources (CRAN / Bioconductor / GitHub), commit SHAs, or R / Bioconductor alignment information, which is insufficient for reproducibility tracking. sessioninfo::session_info() (also used by devtools / pkgdown) additionally reports the source of each package (e.g. Bioconductor 3.19 or Github (user/repo@sha)), its date, and whether it was attached or only loaded via a namespace, with much cleaner formatting. When submitting papers or responding to reviewers, attach session_info() rather than sessionInfo(); for long-running projects, pair it with an renv lockfile (renv.lock) for dual tracking.
Source: sessioninfo package site · Sandve et al. 2013 (10 rules reproducible research)
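One way to archive the session record alongside the results is sketched below. The function name record_session and the output path are assumptions; the function falls back to base sessionInfo() when the sessioninfo package is not installed.

```r
# Hypothetical helper: write the richest available session report to a file.
record_session <- function(path = "session-info.txt") {
  info <- if (requireNamespace("sessioninfo", quietly = TRUE)) {
    capture.output(print(sessioninfo::session_info()))
  } else {
    capture.output(print(sessionInfo()))   # base-R fallback, less detail
  }
  writeLines(info, path)
  invisible(path)
}

# Demo: write to a temporary file and show the first few lines.
out <- record_session(tempfile(fileext = ".txt"))
print(head(readLines(out), 5))
```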