CHAPTER 13 / 14

Workflow Manager — Snakemake / Nextflow

Beyond Bash scripts: declarative input→output rules with built-in dependency tracking, parallelism, resume, and deployment.

Bash scripts vs. workflow managers

| Scenario | Bash script | Workflow manager |
|---|---|---|
| Add a new sample | Edit the script, rerun everything | Detects the change, runs only the new sample |
| Resume after a mid-run failure | Track completed steps yourself | Resumes automatically |
| Parallelization | Manage the DAG yourself | Parallelized automatically |
| Cluster deployment | Write sbatch wrappers | Switch with a single flag |
| Reproducibility / sharing | Possible, but takes discipline | Reproducible by design |
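The "resume" and "only run what changed" rows boil down to timestamp bookkeeping. Here is a minimal Python sketch (not Snakemake itself) of the per-step check a Bash pipeline would have to hand-roll, and that a workflow manager performs for every rule automatically; the file names are hypothetical:

```python
# Sketch: rerun a step only when its output is missing or older than its input.
import tempfile
from pathlib import Path

def needs_rerun(inp: Path, out: Path) -> bool:
    """The check a workflow manager applies to every rule, for free."""
    return (not out.exists()) or out.stat().st_mtime < inp.stat().st_mtime

with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp, "raw_data"); res = Path(tmp, "results")
    raw.mkdir(); res.mkdir()
    (raw / "sampleA.fastq.gz").touch()
    (raw / "sampleB.fastq.gz").touch()
    (res / "sampleA.done").touch()        # sampleA's output is already up to date

    todo = sorted(fq.name for fq in raw.glob("*.fastq.gz")
                  if needs_rerun(fq, res / (fq.name.split(".")[0] + ".done")))
    print(todo)  # -> ['sampleB.fastq.gz']
```

Only sampleB needs processing; with a plain Bash loop, both samples would rerun every time.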

Snakemake or Nextflow?

🐍 Snakemake

Python-friendly. Make-meets-Python syntax: each rule declares "input → output → shell". Great for personal projects, lab pipelines, and anything you want to validate or post-process in Python.

🌊 Nextflow

Cross-platform, industry standard. A Groovy DSL in which processes are connected by channels. Great for HPC / cloud deployment and for tapping into the nf-core ecosystem of standard pipelines.

💡

Recommendation: default to Snakemake if undecided (gentler learning curve). Pick Nextflow if you need HPC / cloud deployment or want to run nf-core standard pipelines.

Snakemake example: FASTQ → FastQC → MultiQC

# Snakefile
SAMPLES = ["sampleA_R1", "sampleA_R2", "sampleB_R1", "sampleB_R2"]

rule all:
    input:
        "results/multiqc_report.html"

rule fastqc:
    input:
        "raw_data/{sample}.fastq.gz"
    output:
        "results/fastqc/{sample}_fastqc.html"
    threads: 4
    shell:
        "fastqc {input} -o results/fastqc --threads {threads}"

rule multiqc:
    input:
        expand("results/fastqc/{s}_fastqc.html", s=SAMPLES)
    output:
        "results/multiqc_report.html"
    shell:
        "multiqc results/fastqc -o results -n multiqc_report.html"
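The `expand()` call in rule multiqc is just string substitution over the sample list, producing the concrete targets that Snakemake then traces back to individual fastqc jobs. A plain-Python sketch (`expand_pattern` is a hypothetical stand-in handling a single wildcard only):

```python
SAMPLES = ["sampleA_R1", "sampleA_R2", "sampleB_R1", "sampleB_R2"]

def expand_pattern(pattern: str, **lists):
    # Hypothetical single-wildcard stand-in for Snakemake's expand()
    key, values = next(iter(lists.items()))
    return [pattern.replace("{" + key + "}", v) for v in values]

targets = expand_pattern("results/fastqc/{s}_fastqc.html", s=SAMPLES)
print(targets[0])  # -> results/fastqc/sampleA_R1_fastqc.html
```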
# Preview the DAG without executing anything
snakemake --dag | dot -Tpng > dag.png

# Dry run: list the rules that would execute
snakemake -n

# Run with 8 parallel jobs
snakemake --cores 8

# Switch to a Slurm cluster: just add a profile
snakemake --profile profiles/slurm --jobs 100

# Auto-create per-rule conda environments
snakemake --use-conda --cores 8
          ┌──────────────────────────┐
          │  rule all (multiqc.html) │
          └──────────────┬───────────┘
                         │
          ┌──────────────▼───────────┐
          │ rule multiqc            │
          └──┬────┬────┬──────┬─────┘
             │    │    │      │
    ┌────────┘    │    │      └────────┐
    ▼             ▼    ▼               ▼
fastqc        fastqc fastqc        fastqc
sampleA_R1    sampleA_R2 sampleB_R1 sampleB_R2

# Snakemake infers the dependencies and runs the four fastqc jobs in parallel

Nextflow example: the same FASTQ → FastQC → MultiQC

// main.nf — DSL2 / Nextflow 22+

nextflow.enable.dsl=2

params.reads = "raw_data/*.fastq.gz"

process FASTQC {
    tag "$sample"
    cpus 4
    publishDir "results/fastqc", mode: 'copy'

    input:
    tuple val(sample), path(reads)

    output:
    path "*.html", emit: html
    path "*.zip",  emit: zip

    script:
    """
    fastqc $reads --threads $task.cpus
    """
}

process MULTIQC {
    publishDir "results", mode: 'copy'
    input: path qc_files
    output: path "multiqc_report.html"
    script: "multiqc . -n multiqc_report.html"
}

workflow {
    Channel.fromPath(params.reads)
           .map { fq -> tuple(fq.simpleName, fq) }
           .set { reads_ch }

    FASTQC(reads_ch)
    MULTIQC(FASTQC.out.zip.collect())
}
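The channel logic in the workflow block maps each FASTQ into a (sample, file) pair; Nextflow's `simpleName` is the file name with all extensions stripped, i.e. everything before the first dot. A Python sketch of the same transformation, with hypothetical file names:

```python
from pathlib import Path

def simple_name(p: Path) -> str:
    # Nextflow's file.simpleName: the name up to the first dot
    return p.name.split(".")[0]

files = [Path("raw_data/sampleA_R1.fastq.gz"), Path("raw_data/sampleB_R1.fastq.gz")]
reads_ch = [(simple_name(f), f) for f in files]
print([s for s, _ in reads_ch])  # -> ['sampleA_R1', 'sampleB_R1']
```

Each tuple then feeds one FASTQC task, and `tag "$sample"` labels the task with the first element.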
# Run locally (parallel by default)
nextflow run main.nf

# Use docker / singularity containers
nextflow run main.nf -profile docker
nextflow run main.nf -profile singularity

# Deploy to Slurm: just switch profiles
nextflow run main.nf -profile slurm

# Resume after a failure, reusing cached results
nextflow run main.nf -resume

# Run an nf-core standard pipeline directly
nextflow run nf-core/rnaseq -profile test,docker

The standard layout of a workflow-managed project

project/
├── config/
│   ├── samples.tsv          # sample sheet
│   └── config.yaml          # pipeline parameters
├── workflow/
│   ├── Snakefile            # main workflow
│   ├── rules/               # rules split into files
│   └── envs/                # per-rule conda envs
├── profiles/
│   ├── local/               # run locally
│   └── slurm/               # HPC deployment
├── environment.yml          # base environment
├── README.md
└── .gitignore
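The sample sheet in config/samples.tsv is typically read at the top of the Snakefile. A sketch using Python's csv module; the column names (`sample`, `fastq`) and the inline sheet are illustrative assumptions, not a fixed standard:

```python
import csv
import io

# Inline stand-in for config/samples.tsv; a real sheet lives on disk
SHEET = "sample\tfastq\nA\traw_data/sampleA_R1.fastq.gz\nB\traw_data/sampleB_R1.fastq.gz\n"

samples = {row["sample"]: row["fastq"]
           for row in csv.DictReader(io.StringIO(SHEET), delimiter="\t")}
print(sorted(samples))  # -> ['A', 'B']
```

Keeping sample metadata in a TSV rather than hard-coded lists means adding a sample is a one-line edit to config/, with no workflow code touched.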
💡

Commit workflow/, config/ and environment.yml together; your paper then ships a complete, reproducible pipeline. This is the standard pattern in the nf-core and snakemake-workflows communities.


📝 Self-check

1. The single most important advantage over a Bash script?

A. Prettier code
B. Automatic dependency resolution, incremental reruns, and cluster deployment
C. You don't need shell commands
D. Faster execution

2. The role of rule all: in Snakemake?

A. The first rule to execute
B. Holds all shell commands
C. A Snakemake reserved word
D. Lists the final outputs; Snakemake works backwards from them to the rules it needs

3. What does Nextflow's -resume do?

A. Continues from the failed step; completed processes are taken from the cache
B. Restarts the pipeline from scratch
C. Continues from the second sample
D. It is the default behaviour

4. What is nf-core/rnaseq?

A. A Snakemake example
B. An RNA-seq tool
C. The community-standard Nextflow RNA-seq pipeline from nf-core
D. An R package