CHAPTER 12 / 14

Bash Scripting: Turning Commands into a Rerunnable Workflow

Variables, loops, conditionals, functions, set -euo pipefail, logging, debugging: the step up from running commands by hand to intermediate NGS analysis.

Why write a script?

❌ The pain of not scripting

You have 100 samples to run through fastp, so you copy and paste the command 100 times. A week later your PI asks which parameters you used, and you can't remember. A month later you want to rerun the analysis and have nothing to rerun it from.

✅ The benefits of scripting

Commands, parameters, and versions all live in scripts/run_qc.sh. A single bash scripts/run_qc.sh reruns all 100 samples. Rerunning, sharing, and debugging all become easy.

A standard skeleton for a bioinformatics Bash script

#!/usr/bin/env bash
# run_fastqc.sh — batch FastQC + MultiQC for an NGS project

set -euo pipefail
IFS=$'\n\t'

# ---- Config ----
RAW_DIR="raw_data"
OUT_DIR="results/fastqc"
LOG_DIR="logs"
THREADS=8

# ---- Setup ----
mkdir -p "$OUT_DIR" "$LOG_DIR"

# ---- Run ----
for fq in "$RAW_DIR"/*.fastq.gz; do
    sample=$(basename "$fq" .fastq.gz)
    echo "[$(date)] FastQC on $sample"
    fastqc "$fq" -o "$OUT_DIR" --threads "$THREADS" \
        >> "$LOG_DIR/fastqc.log" 2>&1
done

multiqc "$OUT_DIR" -o "$OUT_DIR" 2>&1 | tee "$LOG_DIR/multiqc.log"

echo "[$(date)] All done."
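#!/usr/bin/env bash      ← shebang: tells the system to run this file with bash
# title and purpose comment  ← header: what this script does

set -e          # any command fails → exit immediately (fail fast)
set -u          # using an undefined variable → exit (saves you from $undef_var)
set -o pipefail # any stage of a pipe fails → the whole pipeline counts as failed
IFS=$'\n\t'     # stops spaces from accidentally splitting filenames (NGS filenames often contain spaces)

RAW_DIR=...     ← keep all paths and parameters at the top so they are easy to change

mkdir -p ...    # create the output and log directories up front

for fq in ...   # run FastQC on every FASTQ file
  basename      # strip the directory and extension → sample name
  echo + date   # the first job of a log is to record the time
  >> log 2>&1   # send both stdout and stderr to the log

multiqc ... | tee # tee: show the output on screen and write it to the log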
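One gap in the skeleton's loop is worth a sketch: if the input directory holds no *.fastq.gz files, Bash leaves the pattern unexpanded and the loop runs once on the literal string. The guard below is one possible fix, assuming shopt -s nullglob is acceptable for the script; the temporary directory and the file names are made up for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo directory standing in for raw_data/ (hypothetical file names).
RAW_DIR=$(mktemp -d)
touch "$RAW_DIR/sampleA.fastq.gz" "$RAW_DIR/sample B.fastq.gz"

# nullglob: a pattern with no matches expands to nothing instead of itself.
shopt -s nullglob
fastqs=("$RAW_DIR"/*.fastq.gz)
bams=("$RAW_DIR"/*.bam)          # no matches, so this array is empty

# Stop early with a clear message rather than feeding a bogus path to a tool.
if [[ "${#fastqs[@]}" -eq 0 ]]; then
    echo "No FASTQ files in $RAW_DIR" >&2
    exit 1
fi

samples=()
for fq in "${fastqs[@]}"; do
    samples+=("$(basename "$fq" .fastq.gz)")
done

printf 'found sample: %s\n' "${samples[@]}"
echo "fastq files: ${#fastqs[@]}, bam files: ${#bams[@]}"
rm -rf "$RAW_DIR"
```

Collecting the matches in an array first also gives a file count for the log before any tool starts.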
⚠️

Three life-savers:
set -e: stop on any failure (don't keep processing broken data).
set -u: stop on unset variables such as a typoed name (avoid rm -rf $TYPO/).
set -o pipefail: fail if any stage of a pipe fails (by default only the last stage's exit status counts).
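A small experiment makes the difference visible. The sketch below uses false | cat as a stand-in for a pipeline whose first stage fails, and TYPO_DIR as a hypothetical misspelled variable; each failure is caught with || so the demo itself keeps running.

```shell
#!/usr/bin/env bash
# Compare exit statuses with and without the protective options.

# 1) pipefail: `false | cat` fails in its first stage only.
set +o pipefail
status_default=0
false | cat || status_default=$?      # cat succeeds, so the pipe "succeeds"

set -o pipefail
status_pipefail=0
false | cat || status_pipefail=$?     # now the failing stage is reported

# 2) nounset: expand a variable that was never set, inside a subshell
#    so that only the subshell is terminated.
unset TYPO_DIR
status_nounset=0
( set -u; echo "would delete: $TYPO_DIR/" ) 2>/dev/null || status_nounset=$?

echo "pipe status, default:  $status_default"
echo "pipe status, pipefail: $status_pipefail"
echo "unset variable under set -u: exit $status_nounset"
```

Without set -u, the echo in step 2 would print "would delete: /", which is exactly the accident the option is meant to prevent.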

Essential Bash syntax

# Set variables: NO spaces around =
THREADS=8
GENOME="reference/genome.fa"

# Always double-quote variables when you use them
echo "Using $THREADS threads"

# Command substitution
NOW=$(date +%Y%m%d)
NREADS=$(zcat sample.fq.gz | wc -l)  # line count; a FASTQ record is 4 lines, so reads = lines / 4

# Default value if undefined
THREADS=${THREADS:-4}  # falls back to 4 if THREADS is unset or empty
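Command substitution combines naturally with shell arithmetic. As a minimal sketch, assuming standard four-line FASTQ records (no wrapped sequences), the line count can be turned into a read count; the tiny input file is generated on the fly, and gzip -dc is used in place of zcat because some systems ship a zcat that only handles .Z files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a tiny gzipped FASTQ with 3 records (made-up reads for the demo).
fq=$(mktemp)
for i in 1 2 3; do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done | gzip -c > "$fq"

# Capture the decompressed line count with command substitution.
NLINES=$(gzip -dc "$fq" | wc -l)

# Each FASTQ record is 4 lines; $(( )) does integer arithmetic.
NREADS=$(( NLINES / 4 ))

# Default value: keep THREADS if the caller exported it, else fall back to 4.
THREADS=${THREADS:-4}

echo "lines=$NLINES reads=$NREADS threads=$THREADS"
rm -f "$fq"
```

Running it as THREADS=16 bash script.sh overrides the default without editing the file.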
# Iterate over fastq.gz files
for fq in raw_data/*.fastq.gz; do
    sample=$(basename "$fq" .fastq.gz)
    echo "Processing $sample..."
    fastqc "$fq" -o results/fastqc
done

# Read from a sample sheet (while is safer than for)
while IFS=$'\t' read -r sample fq1 fq2; do
    echo "=== $sample ==="
    fastp -i "$fq1" -I "$fq2" ...
done < samples.tsv
# Read a file line by line
while read -r line; do
    echo "line: $line"
done < samples.txt

# Important: read -r keeps backslashes literal; setting IFS avoids whitespace problems
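Real sample sheets usually carry a header and the odd comment line. The sketch below shows one way to skip both, assuming a hypothetical three-column, tab-separated sheet (sample, R1, R2); the sheet is written to a temporary file so the example runs on its own.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical 3-column sheet: sample <TAB> R1 <TAB> R2, with a header line.
sheet=$(mktemp)
printf 'sample\tfq1\tfq2\n'                                 >  "$sheet"
printf 'ctrl_1\traw/ctrl_1_R1.fq.gz\traw/ctrl_1_R2.fq.gz\n' >> "$sheet"
printf '# a comment line\n'                                 >> "$sheet"
printf 'treat 1\traw/t1_R1.fq.gz\traw/t1_R2.fq.gz\n'        >> "$sheet"

n=0
names=()
# tail -n +2 skips the header; IFS=$'\t' splits on tabs only, so
# "treat 1" stays one field even though it contains a space.
while IFS=$'\t' read -r sample fq1 fq2; do
    [[ -z "$sample" || "$sample" == \#* ]] && continue   # skip blanks/comments
    n=$((n + 1))
    names+=("$sample")
    echo "sample=$sample  R1=$fq1  R2=$fq2"
done < <(tail -n +2 "$sheet")

echo "parsed $n samples"
rm -f "$sheet"
```

One caveat: because a tab counts as whitespace in IFS, consecutive tabs collapse into one, so a sheet with empty columns needs a different parser.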
# Does the file exist?
if [[ -f "$fq" ]]; then
    echo "$fq exists"
else
    echo "$fq missing — abort"
    exit 1
fi

# Multiple conditions
if [[ "$THREADS" -gt 8 && -d "$OUT_DIR" ]]; then
    echo "high-thread run starting"
fi

# Common test operators
# -f file, -d dir, -e exists, -s non-empty
# -eq -ne -gt -lt (numbers); == != (strings)
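The file tests combine well into an input check that runs before any expensive step. Here is a minimal sketch; check_input is a hypothetical helper, and the three test paths are created (or deliberately not created) in a temporary directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_input is a hypothetical helper: returns 0 only for a non-empty file.
check_input() {
    local path="$1"
    if [[ ! -e "$path" ]]; then
        echo "missing: $path" >&2
        return 1
    elif [[ ! -s "$path" ]]; then
        echo "empty: $path" >&2
        return 2
    fi
    return 0
}

workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/good.fastq"
: > "$workdir/empty.fastq"          # zero-byte file

ok=0; bad=0
for f in "$workdir/good.fastq" "$workdir/empty.fastq" "$workdir/absent.fastq"; do
    # Calling the function inside `if` keeps set -e from aborting on failure.
    if check_input "$f"; then
        ok=$((ok + 1))
    else
        bad=$((bad + 1))
    fi
done

# Integers compare with -eq / -gt; && joins conditions inside [[ ]].
if [[ "$ok" -eq 1 && "$bad" -gt 1 ]]; then
    echo "1 usable file, $bad rejected"
fi
rm -rf "$workdir"
```

The -s test matters in practice: a truncated download often leaves a zero-byte file that passes -f and -e.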
function run_fastqc() {
    local fq="$1"
    local outdir="$2"
    echo "[$(date)] FastQC on $fq"
    fastqc "$fq" -o "$outdir" --threads 4
}

# Call the function
run_fastqc raw_data/sampleA.fastq.gz results/fastqc
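Functions can also hand a value back to the caller by printing it and letting the caller capture it with $( ). The sketch below derives a sample name from a path; sample_name is a hypothetical helper, and the _R1 / _R2 read-pair suffix is an assumed naming convention, not something every sequencing facility uses.

```shell
#!/usr/bin/env bash
set -euo pipefail

# sample_name is a hypothetical helper: path in, sample name out (on stdout).
sample_name() {
    local fq="$1"
    local name
    name=$(basename "$fq")
    name=${name%.gz}          # drop a trailing .gz, if present
    name=${name%.fastq}       # then .fastq or .fq
    name=${name%.fq}
    name=${name%_R[12]}       # then a read-pair suffix such as _R1 / _R2
    echo "$name"
}

# Capture the function's output with command substitution.
a=$(sample_name "raw_data/sampleA_R1.fastq.gz")
b=$(sample_name "raw_data/sampleA_R2.fastq.gz")
c=$(sample_name "raw data/ctrl 3.fq")

echo "$a | $b | $c"
```

Because every variable inside the function is declared local, calling it in a loop cannot overwrite a name or fq variable in the main script.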
# Read CLI args
# $1, $2, ...: the 1st, 2nd, ... argument
# $0 script name; $# number of arguments; $@ all arguments

if [[ "$#" -lt 2 ]]; then
    echo "Usage: $0 <input.fastq.gz> <outdir>"
    exit 1
fi

INPUT="$1"
OUTDIR="$2"
echo "Running on $INPUT, output to $OUTDIR"
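Optional arguments can reuse the ${var:-default} form. The following is a sketch under assumptions: main and usage are hypothetical names, and the third "threads" argument is invented for illustration. Wrapping the logic in main makes it easy to call with different argument lists.

```shell
#!/usr/bin/env bash
set -euo pipefail

usage() {
    echo "Usage: $0 <input.fastq.gz> [outdir] [threads]" >&2
}

# main is a hypothetical wrapper so the parsing logic can be called and tested.
main() {
    if [[ "$#" -lt 1 ]]; then
        usage
        return 1
    fi
    local input="$1"
    local outdir="${2:-results}"     # optional, defaults to "results"
    local threads="${3:-4}"          # optional, defaults to 4

    # Reject a thread count that is not a whole number.
    if ! [[ "$threads" =~ ^[0-9]+$ ]]; then
        echo "threads must be a whole number, got: $threads" >&2
        return 2
    fi
    echo "input=$input outdir=$outdir threads=$threads"
}

# In a real script the last line would be: main "$@"
main "sample A.fastq.gz"
main "sampleB.fastq.gz" "out/qc" 8
```

Sending the usage text to stderr keeps it out of any file or pipe that captures the script's normal output.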

4 life-savers when a script breaks

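bash -x run.sh
Trace every command as it runs, with variables expanded
bash -n run.sh
Check syntax only, without running anything
echo $?
Show the exit code of the last command
trap 'echo "ERR at line $LINENO"' ERR
Print the line number when a command fails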
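To see three of these tools working together, the sketch below writes a deliberately failing script to a temporary file, checks its syntax with bash -n, runs it, and captures the exit code. The script contents are made up for the demo; false stands in for a real command that fails.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a small script whose second step fails (quoted EOF: no expansion here).
script=$(mktemp)
cat > "$script" <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
trap 'echo "ERR at line $LINENO (exit $?)" >&2' ERR
echo "step 1"
false
echo "step 2 (never reached)"
EOF

# bash -n parses the file without executing it.
if bash -n "$script"; then
    echo "syntax check passed"
fi

# Run it, merging stderr into the captured output; `|| status=$?` keeps
# the outer script alive and records the exit code.
status=0
output=$(bash "$script" 2>&1) || status=$?

echo "$output"
echo "exit code: $status"
rm -f "$script"
```

The syntax check passes even though the script fails at run time, which is why bash -n complements the trap rather than replacing it.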
💡

Pro tip: sprinkle echo "[$(date)] step X" before and after every key step; it is a free log. Combined with >> logs/run.log 2>&1, you always know at which step the run died.
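The same idea can be wrapped in two small functions so the timestamps stay consistent. This is only a sketch: log and run_step are hypothetical helpers, the log goes to a temporary directory instead of logs/, and wc and ls stand in for real analysis commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

LOG_DIR=$(mktemp -d)              # stands in for logs/ in a real project
LOG_FILE="$LOG_DIR/run.log"

# log is a hypothetical helper: timestamped message to screen and log file.
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

# run_step logs the start and end of a command and appends its output.
run_step() {
    local name="$1"; shift
    log "START $name"
    "$@" >> "$LOG_FILE" 2>&1
    log "DONE  $name"
}

run_step "count lines" wc -l /dev/null
run_step "list tmp"    ls "$LOG_DIR"

echo "--- log contents ---"
cat "$LOG_FILE"
```

If a step fails, set -e stops the script before its DONE line is written, so the last START entry in the log names the step that died.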

Interactive: run a mini Bash script

💡

Try: bash run_qc.sh, bash -x run_qc.sh, echo $?

📝 Self-check

1. Why do so many Bash scripts start with set -euo pipefail?

A. It exits immediately on any error, undefined variable, or pipe failure
B. It enables verbose mode
C. It auto-installs missing tools
D. It is required Bash syntax

2. What does NREADS=$(zcat sample.fq.gz | wc -l) do?

A. Sets NREADS to the literal string "zcat sample.fq.gz | wc -l"
B. Nothing; it errors because spaces are missing
C. Captures the output of zcat sample.fq.gz | wc -l in NREADS
D. Writes the output to a file named NREADS

3. Which command best shows the line where a script gets stuck?

A. bash -e run.sh
B. bash -x run.sh
C. bash -n run.sh
D. bash --quiet run.sh

4. Why quote a variable as "$fq" instead of $fq?

A. It looks nicer
B. Quotes make scripts run faster
C. Bash requires it
D. It prevents a filename containing spaces from being split into multiple arguments