Bash scripts vs. workflow managers
| Scenario | Bash script | Workflow manager |
|---|---|---|
| Add one new sample | Edit the script, rerun everything | Detected automatically; only the new sample runs |
| Resume after a mid-run failure | Track completed steps yourself | Resumes automatically |
| Parallelization | Roll your own | The DAG parallelizes automatically |
| Cluster deployment | Write an sbatch wrapper | Switch with a single flag |
| Reproducibility / sharing | Possible, but takes discipline | Reproducible by design |
Snakemake or Nextflow?
🐍 Snakemake
Python-friendly. Think Make meets Python: each rule declares "input → output → shell". Great for personal projects, lab pipelines, and anything you want to validate or post-process in Python.
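As a taste of that Python-friendliness, a rule body can be plain Python (`run:` instead of `shell:`). A minimal sketch; the file names and the pandas dependency are illustrative assumptions, not part of the example pipeline below:

```python
# Hypothetical rule: summarize a counts table in pure Python via run:.
rule summarize_counts:
    input:
        "results/counts.tsv"
    output:
        "results/counts_summary.txt"
    run:
        import pandas as pd  # assumed to be available in the active env
        df = pd.read_csv(input[0], sep="\t")
        with open(output[0], "w") as fh:
            fh.write(f"{len(df)} rows x {df.shape[1]} columns\n")
```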
🌊 Nextflow
Cross-platform and the industry standard. A Groovy DSL in which processes are connected by channels. Great for HPC / cloud deployment and for tapping into the nf-core ecosystem of standard pipelines.
Snakemake example: FASTQ → FastQC → MultiQC
```python
# Snakefile
SAMPLES = ["sampleA_R1", "sampleA_R2", "sampleB_R1", "sampleB_R2"]

rule all:
    input:
        "results/multiqc_report.html"

rule fastqc:
    input:
        "raw_data/{sample}.fastq.gz"
    output:
        "results/fastqc/{sample}_fastqc.html"
    threads: 4
    shell:
        "fastqc {input} -o results/fastqc --threads {threads}"

rule multiqc:
    input:
        expand("results/fastqc/{s}_fastqc.html", s=SAMPLES)
    output:
        "results/multiqc_report.html"
    shell:
        "multiqc results/fastqc -o results -n multiqc_report.html"
```
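Hard-coding SAMPLES stops scaling quickly; a common pattern is to derive it from a sample sheet instead. A minimal sketch, assuming a tab-separated config/samples.tsv with a `sample` column (see the project layout further down):

```python
# Drop-in replacement for the hard-coded SAMPLES list above.
import csv

with open("config/samples.tsv") as fh:
    SAMPLES = [row["sample"] for row in csv.DictReader(fh, delimiter="\t")]
```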
```bash
# Preview the DAG without executing anything
snakemake --dag | dot -Tpng > dag.png

# Run with up to 8 jobs in parallel
snakemake --cores 8

# Dry run: list the rules that would be executed
snakemake -n

# Switch to a Slurm cluster by adding a profile
snakemake --profile profiles/slurm --jobs 100

# Auto-create a conda env per rule
snakemake --use-conda --cores 8
```
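For `--use-conda` to do anything, each rule must point at its own environment file via the `conda:` directive. A minimal sketch; the env file contents and the FastQC version pin are illustrative assumptions:

```python
# envs/fastqc.yaml is assumed to contain something like:
#   channels: [bioconda, conda-forge]
#   dependencies: [fastqc=0.12.1]
rule fastqc:
    input:
        "raw_data/{sample}.fastq.gz"
    output:
        "results/fastqc/{sample}_fastqc.html"
    conda:
        "envs/fastqc.yaml"   # built automatically under --use-conda
    threads: 4
    shell:
        "fastqc {input} -o results/fastqc --threads {threads}"
```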
```text
┌─────────────────────────────────────────────┐
│           rule all (multiqc.html)           │
└──────────────────────┬──────────────────────┘
                       │
┌──────────────────────▼──────────────────────┐
│                 rule multiqc                │
└─────┬──────────┬───────────┬──────────┬─────┘
      │          │           │          │
      ▼          ▼           ▼          ▼
   fastqc     fastqc      fastqc     fastqc
 sampleA_R1 sampleA_R2  sampleB_R1 sampleB_R2

# Snakemake infers the dependency graph and runs the 4 fastqc jobs in parallel
```

Nextflow example: the same FASTQ → FastQC → MultiQC
```groovy
// main.nf - DSL2, Nextflow 22+
nextflow.enable.dsl=2

params.reads = "raw_data/*.fastq.gz"

process FASTQC {
    tag "$sample"
    cpus 4
    publishDir "results/fastqc", mode: 'copy'

    input:
    tuple val(sample), path(reads)

    output:
    path "*.html", emit: html
    path "*.zip",  emit: zip

    script:
    """
    fastqc $reads --threads $task.cpus
    """
}

process MULTIQC {
    publishDir "results", mode: 'copy'

    input:
    path qc_files

    output:
    path "multiqc_report.html"

    script:
    "multiqc . -n multiqc_report.html"
}

workflow {
    Channel.fromPath(params.reads)
        .map { fq -> tuple(fq.simpleName, fq) }
        .set { reads_ch }

    FASTQC(reads_ch)
    MULTIQC(FASTQC.out.zip.collect())
}
```
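The workflow above treats every FASTQ file as its own sample; for paired-end tools you would typically group R1/R2 with `fromFilePairs`. A minimal sketch, not wired into main.nf above:

```groovy
// Hypothetical: one channel element per sample, grouping _R1/_R2 files.
Channel
    .fromFilePairs("raw_data/*_R{1,2}.fastq.gz")
    .view()   // e.g. [sampleA, [sampleA_R1.fastq.gz, sampleA_R2.fastq.gz]]
```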
```bash
# Run locally; parallel by default
nextflow run main.nf

# Use docker / singularity containers
nextflow run main.nf -profile docker
nextflow run main.nf -profile singularity

# Deploy to Slurm by switching profiles
nextflow run main.nf -profile slurm

# Resume automatically from where a failed run stopped
nextflow run main.nf -resume

# Run an nf-core standard pipeline directly
nextflow run nf-core/rnaseq -profile test,docker
```
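Profiles such as docker, singularity, and slurm are not built in; they are assumed to be defined in the project's nextflow.config. A minimal sketch of what that might look like; the queue name is a hypothetical placeholder:

```groovy
// nextflow.config (sketch): profiles selected with -profile <name>
profiles {
    docker {
        docker.enabled = true
    }
    singularity {
        singularity.enabled = true
    }
    slurm {
        process.executor = 'slurm'
        process.queue    = 'general'   // hypothetical partition name
    }
}
```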
The standard layout of a workflow-based project
```text
project/
├── config/
│   ├── samples.tsv       # sample sheet
│   └── config.yaml       # pipeline parameters
├── workflow/
│   ├── Snakefile         # main workflow
│   ├── rules/            # rules split into separate files
│   └── envs/             # per-rule conda envs
├── profiles/
│   ├── local/            # run locally
│   └── slurm/            # HPC deployment
├── environment.yml       # main environment
├── README.md
└── .gitignore
```
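The Snakefile consumes config/config.yaml through the `configfile:` directive, which loads it into a plain `config` dict. A minimal sketch; the keys shown are hypothetical:

```python
# Top of workflow/Snakefile.
configfile: "config/config.yaml"

RAW_DIR = config["raw_dir"]          # hypothetical key, e.g. "raw_data"
THREADS = config.get("threads", 4)   # fall back to 4 if unset
```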
Commit workflow/, config/, and environment.yml together, and your paper ships with a complete, reproducible pipeline. This is the standard pattern in the nf-core and snakemake-workflows communities.
Hands-on: run a mini Snakemake
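A minimal, self-contained stand-in you can paste into an empty directory and run with `snakemake --cores 2`; every file is generated by the workflow itself, so no input data is needed (all names are illustrative):

```python
# mini Snakefile: write one greeting per name, then concatenate them.
NAMES = ["alice", "bob"]

rule all:
    input:
        "combined.txt"

rule greet:
    output:
        "greetings/{name}.txt"
    shell:
        "echo hello {wildcards.name} > {output}"

rule combine:
    input:
        expand("greetings/{name}.txt", name=NAMES)
    output:
        "combined.txt"
    shell:
        "cat {input} > {output}"
```

Running it a second time should report nothing to do; adding a name to NAMES triggers only the new greet job, which is exactly the incremental behavior from the comparison table at the top.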
📝 Self-check
1. What is the single most important advantage of a workflow manager over a Bash script?
2. What is the role of rule all: in Snakemake?
3. What does Nextflow's -resume flag do?
4. What is nf-core/rnaseq?