Why install tools inside an "environment"?
Common bioinformatics meltdowns:
- Project A uses samtools 1.9, project B uses samtools 1.20, and the two give different results.
- A fresh BWA install pulls in the wrong zlib version and breaks a STAR install that was already working.
- A year after the paper is written you try to rerun the analysis, find the tools have moved on to later versions, and get noticeably different results.
The fix: give each project its own isolated conda environment and record it in environment.yml. Even five years later, a single conda env create -f environment.yml can rebuild it, provided the file pins the versions you used.
Installing Miniconda / Mamba
Miniconda (minimal conda)
Skip the full Anaconda distribution: it bundles many packages you will not need and makes for a heavy install. Miniconda ships only conda, Python, and their dependencies; install everything else as needed.
    # Download the installer, run it, then reload the shell configuration
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    source ~/.bashrc
Mamba (a faster conda)
conda's dependency solver can be very slow. mamba is a C++ reimplementation that is often 10× or more faster, and it has become a common default for NGS environments.
    conda install -n base -c conda-forge mamba -y
    # From here on, mamba is used almost exactly like conda
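Because mamba accepts nearly the same subcommands as conda, a setup script can prefer it when present and fall back to conda otherwise. A minimal sketch, assuming a POSIX-like shell; the variable name `SOLVER_CMD` is hypothetical and used only for illustration.

```shell
# Prefer mamba when it is on PATH, otherwise fall back to conda.
# SOLVER_CMD is a hypothetical variable name, not something conda defines.
if command -v mamba >/dev/null 2>&1; then
  SOLVER_CMD=mamba
else
  SOLVER_CMD=conda
fi
echo "using: $SOLVER_CMD"

# The same subcommands then work with either tool, for example:
#   "$SOLVER_CMD" create -n demo -c conda-forge -c bioconda samtools -y
```

The create line is left as a comment so the snippet can be pasted safely on a machine where neither tool is installed yet.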
Channels: Bioconda is the home of bioinformatics tools
Conda downloads packages from "channels". The three you need:
| Channel | Provides |
|---|---|
| conda-forge | High-quality general-purpose packages |
| bioconda | Bioinformatics tools |
| defaults | Official channel (removal recommended) |
    # Bioconda's official recommended setup: strict priority, conda-forge first.
    # Each --add puts the channel at the top of the list, so the last one added ranks highest.
    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge
    conda config --set channel_priority strict

    # Verify the resulting order
    conda config --show channels
    # channels:
    #   - conda-forge
    #   - bioconda
    #   - defaults
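The config commands above are stored as a few lines of YAML in ~/.condarc. The sketch below writes an equivalent file to a scratch location so the layout can be inspected without touching the real configuration. The path in `CONDARC_DEMO` is hypothetical; conda itself reads ~/.condarc.

```shell
# Sketch: reproduce the expected ~/.condarc content in a scratch file.
# CONDARC_DEMO is a hypothetical path chosen for this example only.
CONDARC_DEMO="${TMPDIR:-/tmp}/condarc_demo.yml"

cat > "$CONDARC_DEMO" <<'EOF'
channels:
  - conda-forge
  - bioconda
  - defaults
channel_priority: strict
EOF

# The first channel listed has the highest priority.
top_channel=$(awk '/^  - /{print $2; exit}' "$CONDARC_DEMO")
echo "highest-priority channel: $top_channel"
```

Comparing this against the real file (cat ~/.condarc) is a quick way to spot a channel order that drifted from the recommended one.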
A standard recipe for an RNA-seq environment
Build the standard RNA-seq toolset with one command
    mamba create -n rnaseq -c conda-forge -c bioconda \
        python=3.11 \
        fastqc multiqc \
        fastp trim-galore \
        star hisat2 salmon kallisto \
        samtools bcftools bedtools \
        subread \
        -y
    conda activate rnaseq
    conda env export > environment.yml
conda env export writes every installed package, including dependencies. A trimmed version that keeps only the top-level tools looks like this:
    name: rnaseq
    channels:
      - conda-forge
      - bioconda
    dependencies:
      - python=3.11
      - fastqc=0.12.1
      - multiqc=1.21
      - fastp=0.23.4
      - star=2.7.11b
      - salmon=1.10.2
      - samtools=1.20
      - bedtools=2.31.1
      - subread=2.0.6
      - pip
      # - pip:
      #     - <pip-only packages here>
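An unpinned entry in environment.yml quietly undoes the reproducibility the file is meant to give, so it can help to scan for them before committing. A minimal sketch, assuming the simple layout shown above (a top-level dependencies list with one package per line); `list_unpinned` is a hypothetical helper, not a conda command.

```shell
# list_unpinned FILE: print conda dependencies that carry no "=version" pin.
# Hypothetical helper; assumes the simple environment.yml layout shown above.
list_unpinned() {
  awk '
    /^dependencies:/ { in_deps = 1; next }   # start of the dependency list
    /^[^ #-]/        { in_deps = 0 }         # any other top-level key ends it
    in_deps && /^ *- / {
      pkg = $2
      if (pkg ~ /:$/) next                   # nested "pip:" section header
      if (pkg == "pip") next                 # the installer itself is left unpinned
      if (pkg !~ /=/) print pkg
    }
  ' "$1"
}

# Demo on a scratch file with one deliberately unpinned tool.
demo_yml="${TMPDIR:-/tmp}/environment_demo.yml"
cat > "$demo_yml" <<'EOF'
name: rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - samtools=1.20
  - bedtools
  - pip
EOF

unpinned=$(list_unpinned "$demo_yml")
echo "unpinned: ${unpinned:-none}"
```

Running list_unpinned environment.yml in the project directory lists anything that still needs a version before the file goes into git.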
Best practice: commit environment.yml to git alongside scripts/, and release it together with the analysis data when the paper is published. This is the standard way to meet FAIR and reproducibility expectations.
The 5 most common beginner pitfalls
🌳 conda troubleshooting flow
- Installed tools into base → always create a new env; keep base for mamba only.
- Solver too slow → use mamba, or set channel_priority strict.
- Tool not found → did you run conda activate? Check the path with which samtools.
- Env broken → rebuild it from environment.yml.
- conda itself not found → locate $HOME/miniconda3, add it to ~/.bashrc, then run conda init bash.
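The conda activate and which samtools checks above can be wrapped in a small function. A minimal sketch, assuming a POSIX-like shell; `check_tool` is a hypothetical helper. It reads `CONDA_DEFAULT_ENV`, which conda sets on activation, and uses `command -v` as a portable counterpart to which.

```shell
# check_tool NAME: report the active conda env and whether NAME is on PATH.
# Hypothetical helper for illustration; returns 0 if found, 1 otherwise.
check_tool() {
  tool="$1"
  # conda sets CONDA_DEFAULT_ENV when an environment is activated.
  echo "active env: ${CONDA_DEFAULT_ENV:-<none>}"
  if tool_path=$(command -v "$tool"); then
    echo "$tool found at: $tool_path"
    return 0
  fi
  echo "$tool not on PATH; run 'conda activate <env>' and retry" >&2
  return 1
}

check_tool samtools || echo "samtools is not visible in this shell"
```

If the reported path points outside the environment's bin directory, a copy from another install is shadowing the conda one.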
📝 Self-check
1. Why is environment.yml so critical for reproducibility?
2. What is the relationship between mamba and conda?
3. What is the standard recommended order for conda channels?
4. After installing samtools, which samtools finds nothing. What is the most likely cause?