Ch 5: Search, Filter & Columns

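Search

`grep`: the workhorse of text search

grep "TP53" genes.gtf

Print every line containing TP53

grep -i "tp53" genes.gtf

Ignore case

grep -v "^#" file.vcf

Exclude comment lines (those starting with #)

grep -c "^@" sample.fastq

Print only the number of matching lines (not a reliable read count, because quality lines can also start with @)

grep -E "(TP53|BRCA1)" genes.gtf

Extended regex (alternation with | needs no backslashes)

grep -n "ERROR" run.log

Show line numbers

grep -A3 "NM_001" file.txt

Show each match plus the 3 lines after it (-B3 for the lines before, -C3 for both)

grep -r "TODO" scripts/

Search recursively

zgrep "@SRR" sample.fastq.gz

Search a .gz file directly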
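
The caveat on grep -c "^@" can be seen on a tiny example. The sketch below builds a made-up two-read FASTQ (file name and contents are illustrative only) in which one quality string happens to start with @, then compares the grep count with a count based on record position. It assumes the usual 4-lines-per-record FASTQ layout with unwrapped sequences.

```shell
workdir=$(mktemp -d)
fq="$workdir/toy.fastq"

# Two made-up reads; the quality string of the second read starts with '@'.
printf '@read1\nACGT\n+\nIIII\n' >  "$fq"
printf '@read2\nTTGA\n+\n@III\n' >> "$fq"

# Counts every line that starts with '@', including the quality line.
naive=$(grep -c "^@" "$fq")

# Counts only lines 1, 5, 9, ... which are the record headers in 4-line FASTQ.
by_record=$(awk 'NR % 4 == 1' "$fq" | wc -l)

echo "grep -c: $naive   header lines by position: $by_record"
```

On this toy file the two numbers disagree, which is why a position-based count is the safer choice when the goal is the number of reads.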

💡

In practice: grep -v "^#" cohort.vcf | wc -l counts the variant records in a VCF, excluding the header lines. It is one of the most common one-liners when starting out with VCF files.
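
To see the one-liner at work, here is a minimal sketch on a made-up VCF with two header lines and three records (the file and its contents are invented for illustration). It also shows grep -c as a way to count the header lines without wc.

```shell
workdir=$(mktemp -d)
vcf="$workdir/toy.vcf"

# A tiny, made-up VCF: one meta line, one column header, three variant records.
printf '##fileformat=VCFv4.2\n'      >  "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\n' >> "$vcf"
printf 'chr1\t100\t.\tA\tG\n'        >> "$vcf"
printf 'chr1\t250\t.\tC\tT\n'        >> "$vcf"
printf 'chr2\t75\t.\tG\tA\n'         >> "$vcf"

# -v inverts the match, so every line starting with '#' is dropped before counting.
n_variants=$(grep -v "^#" "$vcf" | wc -l)

# -c counts matching lines directly.
n_header=$(grep -c "^#" "$vcf")

echo "variants: $n_variants   header lines: $n_header"
```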

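Extracting columns

`cut`: the column slicer

cut -f1 genes.gtf

Take column 1

cut -f1,3 genes.gtf

Take columns 1 and 3

cut -f1-4 file.bed

Take columns 1 through 4

cut -d',' -f2 samples.csv

Use a comma as the delimiter

cut -c1-10 sequence.fa

Select by character position (characters 1 to 10 of each line)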

⚠️

Common pitfall: cut splits on a single TAB character by default. For files whose columns are separated by runs of spaces, use awk instead: awk '{print $1, $3}'.
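
The pitfall is easy to reproduce. In this sketch the sample sheet (a made-up file) is aligned with runs of spaces; cut finds no TAB, treats each line as a single field, and prints it unchanged, while awk splits on any run of whitespace.

```shell
workdir=$(mktemp -d)

# Made-up sample sheet whose columns are padded with spaces, not tabs.
printf 'S1   tumor    42\nS2   normal   17\n' > "$workdir/samples.txt"

# No TAB on the line, so cut prints the whole line as "field 1".
cut_out=$(cut -f1 "$workdir/samples.txt")

# awk's default field splitting handles any amount of whitespace.
awk_out=$(awk '{print $1, $3}' "$workdir/samples.txt")

printf 'cut -f1 gives:\n%s\n\nawk gives:\n%s\n' "$cut_out" "$awk_out"
```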

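Sorting and deduplication

`sort` + `uniq`: the counting toolkit

sort file.txt

Sort as text

sort -n nums.txt

Sort numerically

sort -k2 file.tsv

Sort by a given column (-k2 keys on column 2 through the end of the line; use -k2,2 for column 2 alone)

sort -k2,2 -k4,4n file.bed

Sort on multiple keys (column 2 as text, then column 4 numerically)

sort -u file.txt

Sort and deduplicate

sort | uniq -c | sort -nr

Frequency count, most common first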
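
A short sketch of why the n modifier matters in a multi-key sort. The three-interval BED file is made up; the keys chosen here (chromosome as text, then start coordinate as a number) are one common way to order BED records, not the only one.

```shell
workdir=$(mktemp -d)
bed="$workdir/toy.bed"

# Made-up intervals, deliberately out of order.
printf 'chr2\t900\t950\nchr1\t1000\t1100\nchr1\t200\t300\n' > "$bed"

# Key 1: chromosome compared as text. Key 2: start compared as a number.
sorted=$(sort -k1,1 -k2,2n "$bed")

# Without n the start is compared character by character, so 1000 sorts before 200.
text_sorted=$(sort -k1,1 -k2,2 "$bed")

printf 'numeric key:\n%s\n\ntext key:\n%s\n' "$sorted" "$text_sorted"
```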

Classic: count each feature type in a GTF

# Count gene/transcript/exon/CDS and other feature types in a GTF
grep -v "^#" genes.gtf | cut -f3 | sort | uniq -c | sort -nr

 1234567 exon
  256789 CDS
   89456 transcript
   23456 gene
    9870 five_prime_utr
    8765 three_prime_utr
    1234 start_codon
    1100 stop_codon
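
The sort in front of uniq -c is not optional: uniq only collapses duplicates that sit on adjacent lines. A minimal sketch on a made-up list of feature names shows the difference.

```shell
workdir=$(mktemp -d)

# Made-up column of feature types, with duplicates that are not all adjacent.
printf 'exon\ngene\nexon\nexon\ngene\n' > "$workdir/features.txt"

# Without sort, each run of identical lines becomes its own group.
unsorted_groups=$(uniq -c "$workdir/features.txt" | wc -l)

# With sort, identical values are adjacent, so there is one count per value.
counts=$(sort "$workdir/features.txt" | uniq -c | sort -nr)

echo "groups without sort: $unsorted_groups"
echo "$counts"
```

The file holds only two distinct values, yet the unsorted run reports more groups than that because exon and gene each appear in separate runs.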

Merging

`paste` & `join`: merging two files

`paste`

Merges two files side by side, line by line. A common use is lining up R1 and R2 read IDs to check that they match.

paste R1_ids.txt R2_ids.txt | head
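
Rather than inspecting the pasted columns by eye, the check can be automated. This is a sketch under the assumption that the two ID files have one ID per line in the same order; the file contents are made up, and the awk filter is an addition for illustration.

```shell
workdir=$(mktemp -d)

# Made-up ID lists; the third pair deliberately disagrees.
printf 'read1\nread2\nread3\n' > "$workdir/R1_ids.txt"
printf 'read1\nread2\nreadX\n' > "$workdir/R2_ids.txt"

# paste joins line N of each file with a TAB; awk reports rows whose columns differ.
mismatches=$(paste "$workdir/R1_ids.txt" "$workdir/R2_ids.txt" \
  | awk -F'\t' '$1 != $2 {print NR ": " $1 " vs " $2}')

echo "$mismatches"
```

An empty result would mean every pair of IDs matched.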

`join`

Like a SQL inner join: merges rows on a shared key column. Both files must first be sorted on that key, using the same ordering for both.

join -t$'\t' -1 1 -2 1 \
  <(sort -t$'\t' -k1,1 sampleA_counts.tsv) \
  <(sort -t$'\t' -k1,1 sampleB_counts.tsv)
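
A self-contained sketch of the inner-join behavior on two made-up count tables. It uses temporary files instead of process substitution so it also runs in plain sh, and it pins LC_ALL=C for both sort and join so the two tools agree on ordering; both choices are assumptions made for this example.

```shell
workdir=$(mktemp -d)
tab=$(printf '\t')

# Made-up gene counts; only MYC and TP53 appear in both samples.
printf 'TP53\t120\nBRCA1\t45\nMYC\t300\n' > "$workdir/sampleA_counts.tsv"
printf 'MYC\t280\nEGFR\t60\nTP53\t99\n'   > "$workdir/sampleB_counts.tsv"

# Sort each file on the key column only.
LC_ALL=C sort -t "$tab" -k1,1 "$workdir/sampleA_counts.tsv" > "$workdir/A.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$workdir/sampleB_counts.tsv" > "$workdir/B.sorted"

# Inner join on column 1: keys present in only one file are dropped.
joined=$(LC_ALL=C join -t "$tab" -1 1 -2 1 "$workdir/A.sorted" "$workdir/B.sorted")

echo "$joined"
```

BRCA1 and EGFR are absent from the output because each exists in only one table, which is what makes this an inner join.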

In practice

Extract every protein-coding gene name from a GTF

# 1. Drop header lines (those starting with #)
# 2. Keep only rows whose column 3 is "gene"
# 3. Keep rows that mention protein_coding
# 4. Pull gene_name "..." out of the attribute column (grep -P needs GNU grep)
# 5. Sort and deduplicate

grep -v "^#" genes.gtf \
  | awk -F'\t' '$3 == "gene"' \
  | grep "protein_coding" \
  | grep -oP 'gene_name "\K[^"]+' \
  | sort -u \
  > protein_coding_genes.txt

# Verify the result
wc -l protein_coding_genes.txt
head protein_coding_genes.txt

💡

This five-step pipeline finishes in seconds on a GTF with 30,000 genes and needs nothing beyond standard command-line tools (the -P option does require GNU grep). That is the power of Linux text processing.
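
Where GNU grep is not available (for example the default grep on macOS), the same idea can be sketched with sed in place of grep -oP. The toy GTF below is made up, including the gene_biotype attribute name, and the biotype test is restricted to the attribute column here as a slightly stricter variant of the pipeline above.

```shell
workdir=$(mktemp -d)
gtf="$workdir/toy.gtf"

# A made-up four-record GTF: two protein-coding genes, one exon, one miRNA gene.
{
  printf '#!genome-build toy\n'
  printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "TP53"; gene_biotype "protein_coding";\n'
  printf 'chr1\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "G1"; gene_name "TP53"; gene_biotype "protein_coding";\n'
  printf 'chr2\tsrc\tgene\t5\t80\t.\t-\t.\tgene_id "G2"; gene_name "MIR21"; gene_biotype "miRNA";\n'
  printf 'chr3\tsrc\tgene\t9\t90\t.\t+\t.\tgene_id "G3"; gene_name "BRCA1"; gene_biotype "protein_coding";\n'
} > "$gtf"

# Same five steps; sed -n with a capture group replaces grep -oP.
genes=$(grep -v "^#" "$gtf" \
  | awk -F'\t' '$3 == "gene" && $9 ~ /protein_coding/' \
  | sed -n 's/.*gene_name "\([^"]*\)".*/\1/p' \
  | sort -u)

echo "$genes"
```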


📝 Self-check

1. What does grep -v "^#" cohort.vcf | wc -l do?

A. Finds lines containing #

B. Counts comment lines

C. Excludes header lines (those starting with #) and counts the actual variants

D. Shows the last line of the VCF

2. Why doesn't cut -f2 return the correct second column of a CSV?

A. The default delimiter is TAB; add -d','

B. cut can't handle CSV

C. cut counts from 0

D. You must sort first

3. To count the intervals on each chromosome in a BED file and sort by frequency, highest first, which pipeline is correct?

A. cut -f1 a.bed | uniq

B. cut -f1 a.bed | uniq -c | sort -nr

C. cut -f1 a.bed | sort | uniq | sort -nr

D. cut -f1 a.bed | sort | uniq -c | sort -nr

4. What is the most important step before using join on two sample tables?

A. Split the columns with cut

B. Replace spaces with tabs

C. sort both files on the join key

D. Take the first 10 lines with head
