CHAPTER 4 / 14

Text Files and Stream Processing (Pipes)

cat, less, head, tail, wc, and zcat, plus pipes (|) and redirects (> >>), answer "how many reads?" on a 30 GB FASTQ in a single line, without decompressing the file to disk.

Why is this chapter the most valuable part of Linux for bioinformatics?

FASTQ, SAM, VCF, GTF, BED: all of them are large plain-text files. The Unix philosophy of "each tool does one thing, and pipes chain them together" was made for this kind of data:

  • You never need to fully decompress a 30 GB fastq.gz before reading it
  • Streaming processes the data as it is read, so memory use stays small
  • Five small tools can be chained into a one-line "mini pipeline"
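The three points above can be sketched as one short pipeline. This is only an illustration under stated assumptions: the file demo.fastq.gz and its four reads are made up, and the question asked (which 4-base prefix is most common) is arbitrary.

```shell
# Build a tiny gzipped FASTQ (hypothetical data) in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTTTGA' '+' 'IIIIIIII' \
  '@read2' 'ACGTCCGA' '+' 'IIIIIIII' \
  '@read3' 'GGGTCCGA' '+' 'IIIIIIII' \
  '@read4' 'ACGTAAGA' '+' 'IIIIIIII' \
  | gzip > "$workdir/demo.fastq.gz"

# One line of small tools: decompress -> keep sequence lines ->
# take the first 4 bases -> group identical prefixes -> count -> rank.
# Nothing is written to disk between the stages.
top_prefix=$(zcat "$workdir/demo.fastq.gz" \
  | awk 'NR % 4 == 2' \
  | cut -c1-4 \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 1)

echo "most common 4-base prefix (count, prefix): $top_prefix"
```

Each stage only sees the lines handed to it by the previous one, which is why the same line would also work on a much larger file.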

A toolbox for viewing large text files

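cat file.txt                  Print the whole file (small files only!)
less file.txt                 Page through the file interactively
head -n 4 sample.fastq        Show the first lines
tail -n 20 run.log            Show the last lines
tail -f run.log               Follow a log as it grows
wc -l file.txt                Count lines
wc -c file.fa                 Count bytes
zcat file.fastq.gz            Read a .gz without decompressing it to disk
zcat file.fastq.gz | head     Show only the first lines of a .gz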
🛑

Never cat or vim a 30 GB FASTQ: the terminal hangs and memory may run out. For FASTQ, always use head or less, or zcat file.fastq.gz | head for compressed files.
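As a quick way to see how the viewers in the toolbox differ, the sketch below runs them on a small made-up file. The file name demo.txt and its ten lines are assumptions for the illustration only.

```shell
# Create a 10-line scratch file (hypothetical content: line1 ... line10).
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "line$i"
done > "$workdir/demo.txt"

first=$(head -n 1 "$workdir/demo.txt")     # first line only
last=$(tail -n 1 "$workdir/demo.txt")      # last line only
n_lines=$(wc -l < "$workdir/demo.txt")     # number of lines
n_bytes=$(wc -c < "$workdir/demo.txt")     # number of bytes, newlines included

# The same viewers work on compressed data once zcat is put in front.
gzip -c "$workdir/demo.txt" > "$workdir/demo.txt.gz"
gz_first=$(zcat "$workdir/demo.txt.gz" | head -n 1)

echo "first=$first last=$last lines=$n_lines bytes=$n_bytes gz_first=$gz_first"
```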

Pipe (|) and Redirect (> >>): the lifeline of text streams

| (Pipe)

Send the standard output of the command on the left to the standard input of the command on the right. No intermediate file is needed.

zcat sample.fastq.gz | head -n 8

> (Overwrite)

Write command output to a file, replacing whatever the file already contained.

grep "^>" genome.fa > chrom_headers.txt

>> (Append)

Write to a file, but add to the end instead of overwriting. Common for logs.

echo "[$(date)] step done" >> run.log

< (Read from a file)

Feed a file's contents to a command as standard input. It is used less often, but some tools depend on it, and its relative << introduces here-documents.

wc -l < samples.tsv
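The three redirects can be compared on a throwaway file. The sketch below is only an illustration: it works in a temporary directory so that no real run.log is touched, and the log messages are made up.

```shell
# Work in a scratch directory so no real run.log is overwritten.
workdir=$(mktemp -d)
log="$workdir/run.log"

echo "first run"  > "$log"     # > creates the file
echo "second run" > "$log"     # > again: "first run" is overwritten
after_overwrite=$(wc -l < "$log")

echo "third run" >> "$log"     # >> appends: "second run" is kept
after_append=$(wc -l < "$log")

# < feeds the file to stdin, so wc never sees a file name and prints
# only the number; given the file as an argument it prints the name too.
with_name=$(wc -l "$log")
bare_count=$(wc -l < "$log")

echo "lines after >:  $after_overwrite"
echo "lines after >>: $after_append"
echo "wc -l FILE:     $with_name"
echo "wc -l < FILE:   $bare_count"
```

Because the stdin form yields a bare number, it drops straight into arithmetic or into a test such as [ "$bare_count" -eq 2 ].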

stdout, stderr, and 2>&1

Every Linux program has three streams: stdin (0), stdout (1), and stderr (2). By default > redirects stdout only, so error messages still appear on the screen.

cmd > out.txt 2> err.txt    Save normal output and errors to separate files
cmd > run.log 2>&1          Send both to the same log
cmd &> run.log              bash shorthand for the line above
cmd 2> /dev/null            Discard error messages
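To watch the streams separate, the sketch below uses a small shell function, emit, which is purely hypothetical: it stands in for any tool that writes results to stdout and diagnostics to stderr. The last case illustrates a common pitfall, namely that 2>&1 copies wherever stdout points at that moment, so its position relative to > matters.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-in for a real tool: one line to stdout, one to stderr.
emit() {
  echo "result line"
  echo "warning line" >&2
}

emit > "$workdir/out.txt" 2> "$workdir/err.txt"   # separate files
emit > "$workdir/both.log" 2>&1                   # one combined log
emit > "$workdir/quiet.txt" 2> /dev/null          # errors discarded

# Wrong order: stderr is pointed at the old stdout (captured here by $(...))
# before stdout is moved to the file, so the warning escapes the file.
leaked=$(emit 2>&1 > "$workdir/wrong_order.txt")

echo "out.txt:         $(cat "$workdir/out.txt")"
echo "err.txt:         $(cat "$workdir/err.txt")"
echo "both.log lines:  $(wc -l < "$workdir/both.log")"
echo "escaped warning: $leaked"
```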

Classic one-liners for FASTQ

Bioinformaticians reach for the one-liners below every week:

# A FASTQ record is 4 lines, so read count = line count / 4
# Count the lines:
zcat sample.fastq.gz | wc -l
# Get the read count directly:
zcat sample.fastq.gz | awk 'END {print NR/4 " reads"}'

# Inspect the first read (4 lines)
zcat sample.fastq.gz | head -n 4

# Sample output:
# @SRR12345.1 1/1
# AGCTACGTACGTAGCTAGCTACGT...
# +
# IIIIIIIIIIIIIIIIIIIIIIII...

# Average length of the first 10000 reads (10000 reads = 40000 lines);
# the "if (n)" guard avoids a division by zero on an empty file
zcat sample.fastq.gz | head -n 40000 \
  | awk 'NR%4==2 {sum+=length($0); n++} END {if (n) print "avg:", sum/n}'

# Verify R1 and R2 have the same read count (paired-end):
# the two line counts must be equal
echo "R1: $(zcat sampleA_R1.fastq.gz | wc -l) lines"
echo "R2: $(zcat sampleA_R2.fastq.gz | wc -l) lines"
💡

Notice: none of these lines decompresses to disk first; zcat streams everything. Even on a 30 GB fastq.gz, each of these one-liners runs in only a few MB of RAM.
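The paired-end check above prints two numbers for a human to compare. A minimal sketch of the same check made automatic follows. The demo files and the helper name count_reads are hypothetical, and the check compares counts only: it does not verify that read names match between mates.

```shell
workdir=$(mktemp -d)

# Two tiny made-up mate files with two reads each.
printf '%s\n' '@r1/1' 'ACGT' '+' 'IIII' '@r2/1' 'GGCC' '+' 'IIII' \
  | gzip > "$workdir/demo_R1.fastq.gz"
printf '%s\n' '@r1/2' 'TTAA' '+' 'IIII' '@r2/2' 'CCGG' '+' 'IIII' \
  | gzip > "$workdir/demo_R2.fastq.gz"

# Hypothetical helper: print the read count, or fail without printing
# if the line count is not a multiple of 4 (a sign of a truncated file).
count_reads() {
  zcat "$1" | awk 'END { if (NR % 4 != 0) exit 1; print NR / 4 }'
}

# The condition holds only if both files parse and the counts agree.
if r1=$(count_reads "$workdir/demo_R1.fastq.gz") \
   && r2=$(count_reads "$workdir/demo_R2.fastq.gz") \
   && [ "$r1" -eq "$r2" ]; then
  status="OK"
else
  status="MISMATCH or truncated file"
fi

echo "R1=$r1 R2=$r2 status=$status"
```

Like the one-liners, the helper streams through zcat and awk, so memory use does not grow with file size.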

Interactive terminal: analysing a FASTQ

💡

Try:

  • cat sample.fastq | head -n 4
  • cat sample.fastq | wc -l
  • cat sample.fastq | awk 'END{print NR/4 " reads"}'
  • grep "^@SRR" sample.fastq | wc -l

📝 Self-check

1. After running wc -l on a FASTQ file, how do you convert the line count to a read count?

A. lines = reads
B. lines / 4 = reads
C. lines / 2 = reads
D. lines × 4 = reads

2. What does fastqc *.fastq.gz > run.log 2>&1 do?

A. Records only errors
B. Records only normal output
C. Writes both stdout and stderr to run.log
D. Prints to the screen and the file at the same time

3. Why should you not gunzip a .fastq.gz before analysing it?

A. zcat decompresses as a stream, needs no extra disk space, and is fast enough
B. gunzip corrupts FASTQ files
C. .gz files cannot be decompressed
D. Decompressing first is faster

4. Which command appends the time of each run to run.log?

A. date | run.log
B. date < run.log
C. date > run.log
D. date >> run.log