Why is this chapter the most valuable part of Linux for bioinformatics?
FASTQ, SAM, VCF, GTF, and BED are all large plain-text formats. The Unix philosophy, "do one thing well, then pipe," was made for exactly this kind of data:
- You never need to decompress an entire 30 GB fastq.gz before reading it
- Streaming ("process while reading") handles data with negligible memory
- Five small tools chain into a one-line mini pipeline
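As a taste of what "five small tools in one line" looks like, here is a minimal sketch. The file demo.fastq.gz and the reads inside it are invented for illustration; the block builds them itself so it is self-contained:

```shell
# Build a tiny 3-read FASTQ and compress it (stand-in for a real 30 GB file)
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nACGTTT\n+\nIIIIII\n@r3\nTTGGCC\n+\nIIIIII\n' \
  | gzip > demo.fastq.gz

# Five tools, one pipe: tally the most common leading 4-mers.
# zcat streams, head limits, awk extracts sequence lines, sort groups, uniq counts
zcat demo.fastq.gz | head -n 4000 \
  | awk 'NR%4==2 {print substr($0,1,4)}' | sort | uniq -c | sort -rn
# top line shows "2 ACGT": two of the three reads start with ACGT
```

Each stage streams into the next, so the whole chain never holds more than a line or two in memory, no matter how large the input is.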
A toolbox for inspecting large text files
Never open a 30 GB FASTQ with cat or vim: the terminal hangs and memory may blow up. Always use head, less, or zcat | head instead.
Pipes (|) and redirects (>, >>): the lifeline of text streaming
| (pipe)
Sends the stdout of the left command into the stdin of the right command, with no intermediate files.
zcat sample.fastq.gz | head -n 8
> (overwrite)
Writes command output to a file, overwriting any existing content.
grep "^>" genome.fa > chrom_headers.txt
>> (append)
Writes to a file, but appends to the end instead of overwriting. Commonly used for logs.
echo "[$(date)] step done" >> run.log
< (read from a file)
Feeds a file's contents to a command as stdin. Rare in everyday work, but important for here-documents and some tools.
wc -l < samples.tsv
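The here-document mentioned above can be sketched like this; the sample names are made up:

```shell
# A here-document feeds inline, multi-line text to a command as stdin,
# just as < would feed it a file
wc -l <<EOF
sampleA
sampleB
sampleC
EOF
# prints 3: the here-document supplied three lines on stdin
```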
stdout, stderr, and 2>&1
Every Linux program has three streams: stdin (0), stdout (1), and stderr (2). By default, > redirects only stdout; error messages stay on the screen.
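A minimal sketch of the difference. run_step is a made-up function standing in for a real tool that writes to both streams:

```shell
# Toy step that writes to both streams (hypothetical stand-in for a real tool)
run_step() {
  echo "processed 100 reads"          # goes to stdout (fd 1)
  echo "warning: low quality" >&2     # goes to stderr (fd 2)
}

run_step > run1.log           # only stdout is captured; the warning prints on screen
run_step > run2.log 2>&1      # 2>&1 duplicates stderr onto stdout: both are captured
```

After the first command, run1.log holds only the "processed" line; after the second, run2.log holds both lines, which is why the 2>&1 form is the usual way to log a whole pipeline run.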
Classic one-liners for dissecting FASTQ
Bioinformaticians reach for the one-liners below every week:
# FASTQ: each read = 4 lines; total reads = line count / 4
zcat sample.fastq.gz | wc -l
# Get the read count directly:
zcat sample.fastq.gz | awk 'END {print NR/4 " reads"}'
# Inspect the first read (4 lines)
zcat sample.fastq.gz | head -n 4
# Sample output:
# @SRR12345.1 1/1
# AGCTACGTACGTAGCTAGCTACGT...
# +
# IIIIIIIIIIIIIIIIIIIIIIII...
# Average length of the first 10000 reads
zcat sample.fastq.gz | head -n 40000 \
  | awk 'NR%4==2 {sum+=length($0); n++} END {print "avg:", sum/n}'
# Verify R1 and R2 have the same read count (paired-end)
echo "R1: $(zcat sampleA_R1.fastq.gz | wc -l) lines"
echo "R2: $(zcat sampleA_R2.fastq.gz | wc -l) lines"
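Comparing the two counts by eye works, but the shell can also do the comparison itself. This sketch fabricates two tiny demo files (demoA_R1/demoA_R2, two reads each) so it is self-contained:

```shell
# Two tiny paired-end files with 2 reads each (demo data, not real samples)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGG\n+\nHHHH\n' | gzip > demoA_R1.fastq.gz
printf '@r1/2\nCCAA\n+\nIIII\n@r2/2\nGGTT\n+\nHHHH\n' | gzip > demoA_R2.fastq.gz

# Capture each line count, then compare numerically
r1=$(zcat demoA_R1.fastq.gz | wc -l)
r2=$(zcat demoA_R2.fastq.gz | wc -l)
if [ "$r1" -eq "$r2" ]; then
  echo "OK: $((r1 / 4)) read pairs"      # → OK: 2 read pairs
else
  echo "MISMATCH: R1=$r1 lines, R2=$r2 lines" >&2
fi
```

A mismatch here usually means a truncated download or an interrupted trimming step, so this check is worth scripting at the start of any paired-end pipeline.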
Notice that none of these commands decompress first; zcat streams everything. Run against a 30 GB fastq.gz, these one-liners complete using only a few MB of RAM.
Pipe Builder: assemble your first one-liner
Pick a command for each stage below to see the full pipeline and its output:
Interactive terminal: analyse a FASTQ
Try: cat sample.fastq | head -n 4, cat sample.fastq | wc -l, cat sample.fastq | awk 'END{print NR/4 " reads"}', grep "^@SRR" sample.fastq | wc -l
📝 Self-check
1. After counting a FASTQ file's lines with wc -l, how do you convert that to a read count?
2. What does fastqc *.fastq.gz > run.log 2>&1 do?
3. Why should you not gunzip a .fastq.gz before analysing it?
4. What is the correct command to append a timestamp to run.log on each run?