The World of HPC — Login Node, Compute Node, Storage
An HPC (High-Performance Computing) cluster is many machines tied together by a scheduler such as Slurm. Typical components:
- Login node: where SSH lands you. Never run heavy compute here; use it only for browsing files and submitting jobs.
- Compute nodes: where the real work runs. You must request them through Slurm.
- Storage: usually tiered. /home is small but permanent, /scratch is large but periodically purged, /data is for shared / long-term storage.
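To see how the tiers are mounted on your own cluster, a quick sketch (the paths /home, /scratch, /data are typical names, not universal; your site may use others):

```bash
# Show size, usage, and mount point for each storage tier
# (/home, /scratch, /data are common names; check your site's docs)
df -h /home /scratch /data
```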
HPC rule #1: never run fastp, STAR, BWA, or similar tools on the login node. It is an entry point shared by everyone; you will grind it to a halt for all users and probably get a "please don't" email from the sysadmins.
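If you do need to test a command by hand, ask Slurm for an interactive shell on a compute node instead. A minimal sketch; the partition name normal is an assumption, so check sinfo for your site's partitions:

```bash
# Request an interactive shell on a compute node
# ("normal" is a placeholder partition name)
srun --partition=normal --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash
```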
Slurm — the most widely used HPC scheduler
A standard sbatch script looks like this:
```bash
#!/usr/bin/env bash
# run_job.sh — submit with: sbatch run_job.sh

#SBATCH --job-name=fastqc
#SBATCH --partition=normal
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --output=logs/fastqc_%j.out
#SBATCH --error=logs/fastqc_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=you@uni.edu
# %j is replaced by the job ID

set -euo pipefail

# Activate conda env
source ~/miniconda3/etc/profile.d/conda.sh
conda activate rnaseq

# Run pipeline
bash scripts/run_qc.sh
```
```bash
# Submit
sbatch run_job.sh
# Submitted batch job 12345678

# Check status
squeue -u $USER

# Tail stdout while the job is running
tail -f logs/fastqc_12345678.out

# Post-mortem resource usage after completion
sacct -j 12345678 --format=JobID,State,Elapsed,MaxRSS,ReqMem,NCPUS
```
Resource-planning tip: on the first run, request generous limits (e.g. 8 CPUs, 32 GB, 4 h). After the job completes, check sacct for the actual MaxRSS (peak memory) and Elapsed, then tighten the request next time. Queue priority is tied to the resources you ask for, so chronic over-requesting means your jobs wait longer. A concrete resubmission is sketched below.
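For example, if sacct reports MaxRSS around 9 GB and Elapsed just over an hour, the next submission can be tightened directly on the command line (the numbers below are illustrative, not from a real job; CLI flags override the #SBATCH lines in the script):

```bash
# Resubmit with tighter limits based on measured MaxRSS / Elapsed
# (12G and 2 h are illustrative values with ~20-30% headroom)
sbatch --mem=12G --time=02:00:00 run_job.sh
```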
Containers: packaging the whole analysis environment into a single file
conda solves the "software versions" problem; containers solve the "entire OS + system libraries + software" problem. For scientific reproducibility, containers are the gold standard: the BWA + samtools you bundle today runs bit-for-bit identically five years from now, on any machine.
| Tool | Use case | Key traits |
|---|---|---|
| Docker | laptop / cloud | needs root; unfriendly on HPC |
| Apptainer / Singularity | the HPC standard | no root needed; runs Docker images |
| BioContainers | official images for bioinformatics tools | kept in sync with Bioconda |
Hands-on examples
```bash
# Pull the samtools 1.20 image from BioContainers
docker pull quay.io/biocontainers/samtools:1.20--h50ea8bc_0

# Run a command (mount the current directory at /data inside the container)
docker run --rm -v $PWD:/data -w /data \
  quay.io/biocontainers/samtools:1.20--h50ea8bc_0 \
  samtools flagstat sampleA.bam
```
```bash
# Convert the Docker image to a .sif file (typical HPC workflow)
apptainer pull samtools.sif \
  docker://quay.io/biocontainers/samtools:1.20--h50ea8bc_0

# Run a command (your $HOME is auto-mounted)
apptainer exec samtools.sif samtools flagstat sampleA.bam

# Inside an sbatch script
#SBATCH --cpus-per-task=8
apptainer exec --bind /scratch samtools.sif \
  samtools sort -@ 8 in.bam -o out.bam
```
The production golden combo: Snakemake / Nextflow + Apptainer + Slurm. Each rule / process runs in its own container, the workflow engine pulls images automatically, and Slurm schedules the resources. This is what a modern bioinformatics production pipeline looks like.
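Driving that combo from the command line looks roughly like this; the slurm profile name is an assumption, since Slurm submission profiles are configured per site:

```bash
# Run the workflow with each rule inside its declared container;
# the "slurm" profile (assumed, site-specific) submits rules via sbatch
snakemake --use-singularity --jobs 50 --profile slurm
```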
Summary: 5 layers of guarantees for a reproducible analysis
① Data layer (raw_data)
md5 / sha256 checksums prove the files were not silently modified or corrupted during download.
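A minimal sketch of the habit (the file layout is illustrative):

```bash
# Record checksums once, verify any time later
md5sum raw_data/*.fastq.gz > raw_data/checksums.md5
md5sum -c raw_data/checksums.md5   # prints OK / FAILED per file
```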
② Code layer
Tracked in git: every analysis points to a commit hash, and every line has a recoverable history.
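One lightweight habit, sketched here with a hypothetical manifest file: stamp every run with the commit it was produced from.

```bash
# Record the exact commit an analysis ran from
# (results/run_manifest.txt is a hypothetical location)
git rev-parse HEAD >> results/run_manifest.txt
```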
③ Environment layer
A conda env captures software versions and dependencies; conda env create -f environment.yml rebuilds it in one step.
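A minimal sketch of the round trip; --from-history limits the export to the packages you explicitly installed:

```bash
# Export the environment spec, then rebuild it anywhere
conda env export --from-history > environment.yml
conda env create -f environment.yml
```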
④ Container layer
Freeze the entire OS + toolchain into a .sif file or image; it runs identically across machines and across years.
⑤ Workflow layer
The workflow binds rules, dependencies, and runtime environments together; a collaborator can clone the repo and run snakemake --use-singularity to reproduce the whole analysis.
Interactive exercise: simulate Slurm submission and monitoring
📝 Self-check
1. After SSH-ing into a cluster, which of the following should you never do?
2. What is the difference between sbatch and srun?
3. Why do HPC sites prefer Apptainer / Singularity over Docker?
4. To tune --mem for your next sbatch, which reported value matters most?
🎉 You've completed all 14 chapters of Linux for Bioinformatics
From your first pwd to deploying on Slurm + Apptainer + Snakemake, you can now:
- Organise NGS projects, navigate the filesystem, and use permissions / symlinks to manage big data.
- Build mini pipelines with grep / cut / awk / sed for GTF / BED / VCF.
- Manage reproducible environments with conda; run FastQC / MultiQC / samtools as standard.
- Wrap commands into Bash scripts, then upgrade to Snakemake / Nextflow workflows.
- Deploy pipelines on HPC with containers — reproducible across machines and years.
Next step: apply this to a small project of your own. Download an SRA public dataset, write an environment.yml, run FastQC + alignment + flagstat, and wrap everything in a Snakemake pipeline tracked in git. Once that's done, you are an independent junior bioinformatician.
Where to go next: differential expression (DESeq2 / edgeR) → multi-omics integration → single cell (scRNA-seq) → cloud-native (AWS Batch / Terra) → Git + GitHub Actions for analyses.