Ch 1: Why Linux for Bioinformatics

Why

Why does almost all of bioinformatics run on Linux?

A typical scenario when you receive an NGS dataset: 80 GB of FASTQ files and a 3 GB reference genome. You need to run alignment, samtools sort, Picard, and GATK. The whole pipeline takes about six hours, has to be logged, will be rerun, and may need to run on a cluster. Almost all of this work depends on Linux.

Linux's dominance in bioinformatics is no accident. It matches what NGS analysis demands in four respects:

🛠️

① Tool ecosystem

Nearly every mainstream tool (BWA, samtools, bedtools, GATK, Salmon, STAR, Bowtie2, bcftools…) treats Linux as its primary platform, and many are released only for Linux.

⚡

② Text-processing power

FASTQ, SAM, VCF, and GTF are all, at heart, large text files. With grep, awk, sed, and pipes, a single Linux command line can process TB-scale data.
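As a small illustration of the one-line style described above, the sketch below counts the reads in a gzipped FASTQ file and then counts the reads containing a short motif. The file demo.fastq.gz, its two reads, and the motif GGC are made up for this example; on real data only the two pipelines in the middle would be needed.

```shell
set -eu

# Build a tiny two-read FASTQ so the example runs anywhere (hypothetical data).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n' | gzip > "$workdir/demo.fastq.gz"

# Each FASTQ record is four lines, so reads = lines / 4.
# gzip -dc streams the decompressed text; nothing extra is written to disk.
read_count=$(gzip -dc "$workdir/demo.fastq.gz" | awk 'END { print NR / 4 }')
echo "reads: $read_count"

# Keep only the sequence line of each record (line 2 of every 4),
# then let grep -c count the sequences that contain the motif GGC.
motif_hits=$(gzip -dc "$workdir/demo.fastq.gz" | awk 'NR % 4 == 2' | grep -c 'GGC')
echo "reads containing GGC: $motif_hits"
```

Because every stage streams its input, the same pipeline works unchanged on a file far larger than memory; only the run time grows.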

🖧

③ Native to HPC and the cloud

Every system on the TOP500 supercomputer list runs Linux, as do nearly all bioinformatics compute nodes. Slurm, PBS, SGE, and AWS Batch are all used in Linux environments.

🔁

④ Reproducibility

conda, Docker, Apptainer, Snakemake, and Nextflow are all designed around Linux. Together they form the core infrastructure of reproducible research.

💡

Key idea: you learn Linux not to "know Linux" but to be able to work in a modern bioinformatics environment. For researchers from a wet-lab background seeing a terminal for the first time, the biggest obstacle is rarely the commands. It is the question "why type when I can click?" Once you understand the four reasons above, the motivation to learn becomes steady.

Concepts

GUI vs. CLI: why use the "black window"?

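Aspect	GUI (graphical interface)	CLI (command line)
Interaction	Mouse clicks, visual	Typed text commands
Batch processing	Repeated manual clicking	One loop handles 100 samples
Reproducibility	Hard to record what was clicked	The commands themselves are the documentation and the record
Remote execution	Often impossible or very slow	Just an SSH connection: lightweight and fast
Automation	Needs extra macros or software	Naturally becomes a script, cron job, or pipeline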
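To make the batch-processing row concrete, here is a minimal sketch of a loop over samples. The three placeholder files and the use of wc -c as the "analysis" are assumptions for illustration; a real project would call a tool such as FastQC at that point.

```shell
set -eu

# Hypothetical inputs: three empty placeholder files standing in for real FASTQ files.
workdir=$(mktemp -d)
mkdir -p "$workdir/raw_data" "$workdir/logs"
for sample in sampleA sampleB sampleC; do
    : > "$workdir/raw_data/${sample}_R1.fastq.gz"
done

# The loop itself: one pass over every R1 file, however many there are.
# wc -c (file size in bytes) is a stand-in for a real per-sample tool.
for fastq in "$workdir"/raw_data/*_R1.fastq.gz; do
    sample=$(basename "$fastq" _R1.fastq.gz)
    size=$(wc -c < "$fastq" | tr -d ' ')
    echo "$sample $size" >> "$workdir/logs/sizes.txt"
done

# Count how many samples were handled.
processed=$(wc -l < "$workdir/logs/sizes.txt" | tr -d ' ')
echo "processed $processed samples"
```

The loop body does not change whether the directory holds 3 files or 100, and the script doubles as a record of exactly what was run.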

⚠️

Many researchers wish they could "process FASTQ in Excel", but an Excel worksheet holds at most 1,048,576 rows. At four lines per read, that is only 262,144 reads, never mind a 30 GB fastq.gz. The Linux command line isn't retro: for large biological data it is one of the only viable ways to work.

Hands-on scenario

What does a typical NGS analysis project look like?

From week one, we teach Linux with a real NGS project layout rather than abstract file1.txt and file2.txt. The directory structure below runs through the entire course:

project/
├── raw_data/           # raw FASTQ (do not modify!)
│   ├── sampleA_R1.fastq.gz
│   ├── sampleA_R2.fastq.gz
│   ├── sampleB_R1.fastq.gz
│   └── sampleB_R2.fastq.gz
├── metadata/           # sample information, experimental design
│   └── samples.tsv
├── reference/          # reference genome and annotation
│   ├── genome.fa
│   └── genes.gtf
├── scripts/            # your own analysis scripts
│   └── run_fastqc.sh
├── results/            # all outputs
│   ├── fastqc/
│   ├── bam/
│   └── multiqc_report.html
├── logs/               # tool run logs
└── README.md           # description, versions, commands
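One possible way to create this skeleton from the command line is sketched below. It builds the tree inside a temporary directory (an assumption made so the example never touches real data); in practice you would run the mkdir line in your own project location.

```shell
set -eu

# Hypothetical location: a throwaway directory so the sketch never touches real data.
project=$(mktemp -d)/project

# mkdir -p creates parent directories as needed and is safe to rerun.
mkdir -p "$project/raw_data" "$project/metadata" "$project/reference" \
         "$project/scripts" "$project/results/fastqc" "$project/results/bam" \
         "$project/logs"

# Start the README that will record the description, versions, and commands.
printf '# project\n\nCreated: %s\n' "$(date +%Y-%m-%d)" > "$project/README.md"

# List the directories that now exist.
find "$project" -type d | sort
```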

💡

Core principle: raw_data is always read-only, all outputs go to results/, all logs go to logs/, and every command and tool version is recorded in README.md. Without this habit, even the best tools get swallowed by a chaotic directory.
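The habit can be sketched in a few commands. Everything here is illustrative: the project lives in a temporary directory, the FASTQ file is an empty placeholder, and wc -c stands in for a real analysis tool.

```shell
set -eu

# Hypothetical project in a temporary directory, with one placeholder FASTQ file.
project=$(mktemp -d)/project
mkdir -p "$project/raw_data" "$project/results" "$project/logs"
: > "$project/raw_data/sampleA_R1.fastq.gz"

# Remove write permission from the raw files for user, group, and others.
chmod a-w "$project/raw_data"/*

# Write the output to results/ and any error messages to logs/.
wc -c "$project/raw_data/sampleA_R1.fastq.gz" \
    > "$project/results/sizes.txt" 2> "$project/logs/sizes.log"

# Append the exact command to the README so the run can be repeated.
echo 'wc -c raw_data/sampleA_R1.fastq.gz > results/sizes.txt 2> logs/sizes.log' \
    >> "$project/README.md"

# The permission column should now show no "w" for the raw file.
ls -l "$project/raw_data"
```

Note that chmod a-w on the files guards against accidental overwriting; it does not stop someone with write access to the directory from deleting or renaming them.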

Course map

14-chapter learning map

Foundations (Ch 1–4): command line and file system

Learn to find your way around Linux, view files, rename them, set permissions, and use pipes to process large plain-text files.

Intermediate (Ch 5–8): text tools and environment management

Use grep / awk / sed on GTF, BED, and sample sheets; manage reproducible environments with conda + Bioconda.

Bioinformatics applications (Ch 9–11): real NGS workflows

Understand the FASTQ / FASTA / GTF / BED / SAM / BAM / VCF formats; practise with FastQC + MultiQC + samtools.

Advanced (Ch 12–14): automation, workflows, HPC

Turn commands into Bash scripts, move up to Snakemake / Nextflow, and finally deploy to Slurm and containers.

First interaction

Try it: your first terminal

Below is a simulated terminal. Type echo hello bioinformatics and press Enter, then try whoami and date. Every command has the same shape: command + space + arguments (whoami and date simply take no arguments here).
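For readers following along in a real shell rather than the simulator, the same commands look like this. The last line is an extra, added here only to show date with one argument; the output of whoami and date depends on your machine, so none is shown.

```shell
# Shape of every line: command + space + arguments
echo hello bioinformatics   # command: echo; arguments: hello, bioinformatics
whoami                      # no arguments: prints the current user name
date                        # no arguments: prints the current date and time
date +%Y-%m-%d              # one argument: a format string (year-month-day)
```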

📝 Self-check

1. Why do nearly all NGS tools prioritise Linux?

A. Linux looks nicer

B. Linux is free

C. HPC, cloud, containers, and the open-source tool ecosystem are all built around Linux

D. Linux has no viruses

2. How should raw_data/ be treated in an NGS project?

A. Decompress, rename, and delete files in place

B. Treat it as read-only; route all outputs to results/

C. Mix it with results/ for easy comparison

D. Copy it to the desktop as a backup

3. Why is a GUI a poor fit for processing 100 FASTQ files in batch?

A. GUIs can't read files

B. GUIs lack a mouse

C. They can't be automated precisely, are hard to reproduce, and require repeated manual clicking

D. GUIs are slower at processing numbers than CLIs