Why write a script?
❌ The pain of not having a script
Running fastp on 100 samples means 100 copy-pastes. A week later your PI asks which parameters you used, and you can't remember. Want to rerun the analysis a month later? Good luck.
✅ The benefits of writing a script
All commands, parameters, and versions are recorded in scripts/run_qc.sh, so a single bash scripts/run_qc.sh reruns all 100 samples. Rerunning, sharing, and debugging all become easy.
A standard skeleton for a bioinformatics Bash script
#!/usr/bin/env bash
# run_fastqc.sh: batch FastQC + MultiQC for an NGS project
set -euo pipefail
IFS=$'\n\t'

# ---- Config ----
RAW_DIR="raw_data"
OUT_DIR="results/fastqc"
LOG_DIR="logs"
THREADS=8

# ---- Setup ----
mkdir -p "$OUT_DIR" "$LOG_DIR"

# ---- Run ----
for fq in "$RAW_DIR"/*.fastq.gz; do
    sample=$(basename "$fq" .fastq.gz)
    echo "[$(date)] FastQC on $sample"
    fastqc "$fq" -o "$OUT_DIR" --threads "$THREADS" \
        >> "$LOG_DIR/fastqc.log" 2>&1
done

multiqc "$OUT_DIR" -o "$OUT_DIR" 2>&1 | tee "$LOG_DIR/multiqc.log"
echo "[$(date)] All done."
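#!/usr/bin/env bash     ← shebang: tells the system to run this file with bash
# title and purpose     ← header comment: what this script does
set -e                  # any command fails → exit immediately (fail fast)
set -u                  # using an undefined variable → exit (saves you from $undef_var)
set -o pipefail         # any stage of a pipe fails → the whole pipeline counts as failed
IFS=$'\n\t'             # keeps spaces from splitting filenames by accident (bioinformatics filenames often contain spaces)
RAW_DIR=...             ← put all paths and parameters at the top so they are easy to change
mkdir -p ...            # create the output and log directories up front
for fq in ...           # run FastQC on each FASTQ file
basename                # strip the path and the extension → sample name
echo + date             # the first thing a log needs is a timestamp
>> log 2>&1             # stdout and stderr both go to the log
multiqc ... | tee       # tee: output goes to the screen and the log at the same time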
Three life-savers:
set -e: stop on any failure (don't keep processing broken data).
set -u: stop when a variable is unset, for example a mistyped name (avoids rm -rf $TYPO/ expanding to rm -rf /).
set -o pipefail: fail if any stage of a pipe fails (by default only the last stage's exit status counts).
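To see what these options actually change, here is a small self-contained sketch that needs no bioinformatics tools. The function names (status_default, status_pipefail, status_unset_var) are made up for this illustration, and each case runs in a subshell so the options don't leak into your own session.

```bash
#!/usr/bin/env bash
# Demo: how pipefail and set -u change what Bash treats as a failure.

# Without pipefail, a pipeline's status is the status of its LAST stage only.
status_default() {
    ( set +o pipefail; false | cat >/dev/null; echo "$?" )
}

# With pipefail, a failure in any stage makes the whole pipeline fail.
status_pipefail() {
    ( set -o pipefail; false | cat >/dev/null; echo "$?" )
}

# With set -u, expanding an unset variable aborts the subshell before echo runs.
status_unset_var() {
    ( set -u; unset TYPO; echo "removing $TYPO/" ) >/dev/null 2>&1
    echo "$?"
}

echo "default:  $(status_default)"     # prints "default:  0"
echo "pipefail: $(status_pipefail)"    # prints "pipefail: 1"
echo "set -u:   $(status_unset_var)"   # a non-zero status
```

In the first case the failing `false` goes unnoticed, which is exactly how a crashed first stage of a pipe can silently produce an empty result file.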
Essential Bash syntax
# Set variables: NO spaces around =
THREADS=8
GENOME="reference/genome.fa"

# Always double-quote variables when you use them
echo "Using $THREADS threads"

# Command substitution
NOW=$(date +%Y%m%d)
NREADS=$(zcat sample.fq.gz | wc -l)   # this counts lines; a FASTQ record is 4 lines, so reads = lines / 4

# Default value if undefined
THREADS=${THREADS:-4}   # use 4 if THREADS is unset or empty
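Since wc -l returns a line count rather than a read count, a small helper can do the division. This is only a sketch: count_reads is a hypothetical name, it assumes a well-formed FASTQ with exactly 4 lines per record, and the two-read demo file is made-up data.

```bash
#!/usr/bin/env bash
set -euo pipefail

# count_reads FILE.fastq.gz -> prints the number of reads.
# Assumes exactly 4 lines per FASTQ record.
count_reads() {
    local fq="$1"
    local nlines
    # gzip -cd does the same job as zcat and behaves the same on Linux and macOS.
    nlines=$(gzip -cd "$fq" | wc -l)
    # $(( ... )) is integer arithmetic.
    echo $(( nlines / 4 ))
}

# Tiny demo on a throwaway two-read file.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
demo_fq="$tmpdir/demo.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$demo_fq"

echo "reads: $(count_reads "$demo_fq")"   # prints "reads: 2"
```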
# Iterate over fastq.gz files
for fq in raw_data/*.fastq.gz; do
    sample=$(basename "$fq" .fastq.gz)
    echo "Processing $sample..."
    fastqc "$fq" -o results/fastqc
done

# Read from a sample sheet (while is safer)
while IFS=$'\t' read -r sample fq1 fq2; do
    echo "=== $sample ==="
    fastp -i "$fq1" -I "$fq2" ...
done < samples.tsv
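Real sample sheets tend to have header lines and the occasional incomplete row, so the while loop usually needs a little validation. The sketch below assumes a three-column, tab-separated sheet with comment lines starting with #. process_sheet and the file names are hypothetical, and it only prints the fastp command it would run.

```bash
#!/usr/bin/env bash
set -euo pipefail

# process_sheet SHEET.tsv
# Expects: sample<TAB>fq1<TAB>fq2. Skips blank lines and lines starting with '#'.
process_sheet() {
    local sheet="$1"
    local sample fq1 fq2
    # The "|| [[ -n ... ]]" part keeps a last line that lacks a trailing newline.
    while IFS=$'\t' read -r sample fq1 fq2 || [[ -n "$sample" ]]; do
        if [[ -z "$sample" || "$sample" == \#* ]]; then
            continue
        fi
        if [[ -z "$fq1" || -z "$fq2" ]]; then
            echo "skip $sample: expected 3 tab-separated columns" >&2
            continue
        fi
        # Dry run: print instead of executing.
        echo "$sample: fastp -i $fq1 -I $fq2"
    done < "$sheet"
}

# Demo sheet: one header, one complete row, one row missing fq2.
sheet=$(mktemp)
printf '#sample\tfq1\tfq2\nA\tA_R1.fastq.gz\tA_R2.fastq.gz\nB\tB_R1.fastq.gz\n' > "$sheet"
process_sheet "$sheet"
```

Printing the command first is a cheap way to check the sheet before committing hours of compute to it.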
# Read a file line by line
while IFS= read -r line; do
    echo "line: $line"
done < samples.txt
# Important: read -r keeps backslashes literal; IFS= preserves leading and trailing whitespace
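The difference between read, read -r, and IFS= read -r is easiest to see on a single awkward line. The input string below is made up: it has leading and trailing spaces plus a literal backslash.

```bash
#!/usr/bin/env bash
# Feed the same line to three variants of read and compare what survives.
line_in='  data\new run  '

# Plain read: the backslash is consumed and outer whitespace is trimmed.
plain=$(printf '%s\n' "$line_in" | { read line; printf '[%s]' "$line"; })

# read -r: the backslash survives, outer whitespace is still trimmed.
with_r=$(printf '%s\n' "$line_in" | { read -r line; printf '[%s]' "$line"; })

# IFS= read -r: the line comes through exactly as written.
safe=$(printf '%s\n' "$line_in" | { IFS= read -r line; printf '[%s]' "$line"; })

printf 'read:         %s\n' "$plain"    # [datanew run]
printf 'read -r:      %s\n' "$with_r"   # [data\new run]
printf 'IFS= read -r: %s\n' "$safe"     # [  data\new run  ]
```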
# Does the file exist?
if [[ -f "$fq" ]]; then
    echo "$fq exists"
else
    echo "$fq missing, aborting"
    exit 1
fi

# Multiple conditions
if [[ "$THREADS" -gt 8 && -d "$OUT_DIR" ]]; then
    echo "high-thread run starting"
fi

# Common test operators
# -f regular file, -d directory, -e exists, -s non-empty
# -eq -ne -gt -lt (numbers); == != (strings)
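These operators combine nicely into an input check that runs before any heavy tool starts. The following is a minimal sketch; check_input is a hypothetical helper, and the files are created in a temporary directory only for the demo.

```bash
#!/usr/bin/env bash
set -euo pipefail

# check_input FILE -> returns 0 for a non-empty regular file, 1 otherwise.
check_input() {
    local f="$1"
    if [[ ! -e "$f" ]]; then
        echo "missing: $f" >&2
        return 1
    elif [[ ! -f "$f" ]]; then
        echo "not a regular file: $f" >&2
        return 1
    elif [[ ! -s "$f" ]]; then
        echo "empty file: $f" >&2
        return 1
    fi
    return 0
}

# Demo: one good file, one empty file, one that does not exist.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
echo "placeholder" > "$tmpdir/ok.fastq.gz"
: > "$tmpdir/empty.fastq.gz"

for f in "$tmpdir/ok.fastq.gz" "$tmpdir/empty.fastq.gz" "$tmpdir/nope.fastq.gz"; do
    # Calling the function inside "if" keeps set -e from stopping the script.
    if check_input "$f"; then
        echo "ok: $f"
    fi
done
```

An empty .fastq.gz is a common leftover from an interrupted download, which is why -s earns its place next to -f.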
# Define a function; "local" keeps variables from leaking out of it
run_fastqc() {
    local fq="$1"
    local outdir="$2"
    echo "[$(date)] FastQC on $fq"
    fastqc "$fq" -o "$outdir" --threads 4
}

# Call it
run_fastqc raw_data/sampleA.fastq.gz results/fastqc
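Functions can also hand a value back: print it to stdout and capture it with $( ). The sketch below assumes, purely for illustration, that files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz. sample_name is a hypothetical helper.

```bash
#!/usr/bin/env bash
set -euo pipefail

# sample_name PATH -> prints the sample ID.
# Assumed naming scheme: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
sample_name() {
    local path="$1"
    local name="${path##*/}"    # drop the directory part, like basename
    name="${name%.fastq.gz}"    # drop the extension
    name="${name%_R[12]}"       # drop a trailing _R1 or _R2 if present
    printf '%s\n' "$name"
}

# Capture the function's output in a variable.
s=$(sample_name "raw_data/sampleA_R1.fastq.gz")
echo "sample: $s"   # prints "sample: sampleA"
```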
# Read command-line arguments
# $1, $2, ...  the 1st, 2nd, ... argument
# $0 script name; $# number of arguments; $@ all arguments
if [[ "$#" -lt 2 ]]; then
    echo "Usage: $0 <input.fastq.gz> <outdir>"
    exit 1
fi
INPUT="$1"
OUTDIR="$2"
echo "Running on $INPUT, output to $OUTDIR"
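Once a script grows optional settings, positional arguments get hard to remember, and Bash's built-in getopts is one way to handle that. In this sketch the option letters (-t, -o), the defaults, and the parse_args wrapper are all illustrative choices, not something the skeleton above prescribes.

```bash
#!/usr/bin/env bash
set -euo pipefail

# parse_args [-t THREADS] [-o OUTDIR] INPUT
# Sets the globals THREADS, OUTDIR and INPUT; returns 2 on bad usage.
parse_args() {
    THREADS=4
    OUTDIR="results"
    local opt OPTIND=1   # local OPTIND lets the function be called more than once
    while getopts ":t:o:" opt; do
        case "$opt" in
            t) THREADS="$OPTARG" ;;
            o) OUTDIR="$OPTARG" ;;
            :) echo "option -$OPTARG needs a value" >&2; return 2 ;;
            \?) echo "unknown option -$OPTARG" >&2; return 2 ;;
        esac
    done
    shift $((OPTIND - 1))   # drop the options, keep the positional arguments
    if [[ "$#" -lt 1 ]]; then
        echo "Usage: [-t threads] [-o outdir] <input.fastq.gz>" >&2
        return 2
    fi
    INPUT="$1"
}

parse_args -t 8 -o results/fastqc raw_data/sampleA.fastq.gz
echo "Running on $INPUT with $THREADS threads, output to $OUTDIR"
```

In a real script you would call parse_args "$@" so the function sees the script's own arguments.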
4 life-saving moves when a script breaks
Pro tip: sprinkle echo "[$(date)] step X" before and after every key step; that gives you a log for free. Combined with >> logs/run.log 2>&1, you always know at which step the run died.
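The timestamped echo can be wrapped in a small function so every step logs the same way. This is one possible sketch: log and the LOG_FILE variable are hypothetical names, the temporary-file default stands in for something like logs/run.log, and the ERR trap is an extra on top of the tip.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical default; a real run might set LOG_FILE=logs/run.log instead.
LOG_FILE="${LOG_FILE:-$(mktemp)}"

# log MESSAGE -> timestamped line on screen, also appended to $LOG_FILE.
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

# When a command fails, record the line number before set -e stops the script.
trap 'log "ERROR: command failed at line $LINENO"' ERR

log "step 1: start QC"
# ... the real commands for this step would go here ...
log "step 1: done"
```

If a step fails, the last line in the log is the ERROR entry, so you can jump straight to that line of the script.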
Interactive: run a mini Bash script
Try: bash run_qc.sh, bash -x run_qc.sh, echo $?
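If you don't have a run_qc.sh at hand, the sketch below writes a tiny stand-in and runs it the same three ways. The name mini_qc.sh, its contents, and the deliberately failing ls are all made up for this demo.

```bash
#!/usr/bin/env bash
# Create a small script that fails on purpose, then inspect it.
workdir=$(mktemp -d)
cat > "$workdir/mini_qc.sh" <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
sample="sampleA"
echo "QC on $sample"
ls "/no/such/dir/$sample" > /dev/null 2>&1   # fails on purpose
echo "never reached"
EOF

# 1) Normal run: set -e stops the script at the failing ls.
normal_out=$(bash "$workdir/mini_qc.sh") && status=0 || status=$?
echo "stdout: $normal_out"

# 2) Exit status: 0 means success, anything else means failure.
echo "exit status: $status"

# 3) Trace run: bash -x prints each command, prefixed with +, to stderr.
trace_out=$(bash -x "$workdir/mini_qc.sh" 2>&1 >/dev/null) || true
echo "$trace_out"
```

The last line of the trace is the last command that ran, which here is the failing ls.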
📝 Self-check
1. Why do so many Bash scripts put set -euo pipefail right at the top?
2. What does NREADS=$(zcat sample.fq.gz | wc -l) do?
3. Which command best shows the line where a script is getting stuck?
4. Why quote a variable in Bash as "$fq" instead of $fq?