CHAPTER 11 / 14

Mapping and SAM/BAM Processing

bwa-mem2, hisat2, and minimap2 are the three main aligners; samtools sort, index, and flagstat are the three core BAM housekeeping commands.

Which aligner should you choose?

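Tool                 Best for                  Strengths
bwa-mem2             DNA short reads           Accurate; largest user community
HISAT2 / STAR        RNA-seq                   Splice-aware
minimap2             Long reads / multi-mode   Presets for many data types
salmon / kallisto    RNA-seq quantification    Very fast; produces no BAM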

From FASTQ to a sorted, indexed BAM

# DNA short reads with bwa-mem2
# 1. Build the index (one-off)
bwa-mem2 index reference/genome.fa

# 2. Align and sort in one pipe, then index the sorted BAM
mkdir -p results/bam
bwa-mem2 mem -t 8 -R "@RG\tID:sampleA\tSM:sampleA\tPL:ILLUMINA" \
    reference/genome.fa \
    raw_data/sampleA_R1.fastq.gz raw_data/sampleA_R2.fastq.gz \
  | samtools sort -@ 4 -o results/bam/sampleA.sorted.bam -
samtools index results/bam/sampleA.sorted.bam
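
# RNA-seq with HISAT2
# 1. Build the HISAT2 index. This plain build carries no splice-site
#    annotation; hisat2-build accepts --ss / --exon files for that.
hisat2-build reference/genome.fa reference/hisat2_index

# 2. Splice-aware RNA alignment, sorted in the same pipe.
#    Set --rna-strandness to match the library prep (RF shown here).
mkdir -p results/bam logs
hisat2 -p 8 -x reference/hisat2_index \
    -1 raw_data/sampleA_R1.fastq.gz -2 raw_data/sampleA_R2.fastq.gz \
    --rna-strandness RF \
    --summary-file logs/hisat2_sampleA.log \
  | samtools sort -@ 4 -o results/bam/sampleA.sorted.bam -
samtools index results/bam/sampleA.sorted.bam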

# Long reads with minimap2
# Nanopore: -x map-ont selects the preset; -a makes minimap2 emit SAM
minimap2 -t 8 -ax map-ont reference/genome.fa nanopore.fastq.gz \
  | samtools sort -@ 4 -o results/bam/nanopore.sorted.bam -
samtools index results/bam/nanopore.sorted.bam

# PacBio HiFi: use -ax map-hifi
# Splice-aware long-read alignment: use -ax splice
Tip: pipe the aligner straight into samtools sort. This avoids writing the intermediate SAM file, which is plain text and can be very large. The trailing - in samtools sort ... - tells samtools to read from stdin.
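One caveat with the pipe approach is worth sketching: by default the shell reports only the exit status of the last command, so a crashed aligner can go unnoticed while samtools sort happily writes a truncated BAM. The snippet below is a minimal illustration using bash's pipefail option. fake_aligner and fake_sorter are hypothetical stand-ins invented for this sketch so that it runs without any bioinformatics tools installed.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the real tools (not part of bwa-mem2 or samtools):
# fake_aligner prints one line and then fails, as a crashed aligner might.
fake_aligner() { printf 'partial output\n'; return 1; }
# fake_sorter consumes stdin and succeeds, like a sorter given truncated input.
fake_sorter()  { cat > /dev/null; }

# Without pipefail, the pipeline reports the status of the last command only.
set +o pipefail
if fake_aligner | fake_sorter; then status_default=0; else status_default=$?; fi

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
if fake_aligner | fake_sorter; then status_pipefail=0; else status_pipefail=$?; fi

echo "default: $status_default, pipefail: $status_pipefail"
```

In a real script, putting set -o pipefail (often together with set -e) near the top lets a failed alignment stop the run before samtools index is called on an incomplete file.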

samtools: the workhorse for BAM operations

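samtools view -h a.bam | head              Peek at the BAM contents (header included)
samtools view -c a.bam                     Total number of alignment records
samtools view -c -f 4 a.bam                Unmapped records (FLAG bit 4 set)
samtools view -c -F 4 a.bam                Mapped records (FLAG bit 4 not set; includes secondary and supplementary)
samtools sort -@ 8 -o sorted.bam a.bam     Sort by coordinate
samtools index sorted.bam                  Build the index (sorted.bam.bai)
samtools flagstat a.bam                    Alignment summary
samtools idxstats sorted.bam               Read counts per chromosome (needs the index)
samtools view -b -q 30 a.bam > mq30.bam    Keep only alignments with MAPQ >= 30
samtools view sorted.bam chr1:10000-20000  Extract a region (needs the index)
samtools markdup -@ 8 in.bam out.bam       Mark duplicates (input must be coordinate-sorted and already processed with samtools fixmate -m)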
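To make the -f 4 and -F 4 filters concrete, here is a small sketch that reproduces the same selection with awk on a toy SAM body, so it runs without samtools. The four records and the helper names sam_records, count_with_bit, and count_without_bit are made up for this illustration. The helpers only handle a single bit at a time, whereas samtools accepts combined masks.

```shell
# Toy SAM body (no header): read name, FLAG, then the other mandatory fields.
# These four records are invented for illustration.
sam_records() {
    printf 'r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n'
    printf 'r1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII\n'
    printf 'r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
    printf 'r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
}

# Count records whose FLAG has the given single bit set (like view -c -f BIT).
count_with_bit() {
    awk -F '\t' -v bit="$1" 'int($2 / bit) % 2 == 1 { n++ } END { print n + 0 }'
}

# Count records whose FLAG lacks the given single bit (like view -c -F BIT).
count_without_bit() {
    awk -F '\t' -v bit="$1" 'int($2 / bit) % 2 == 0 { n++ } END { print n + 0 }'
}

sam_records | count_with_bit 4      # unmapped records
sam_records | count_without_bit 4   # mapped records
```

The integer division by the bit value followed by a parity check is a portable way to test one bit in awk implementations that lack bitwise functions.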

SAM FLAG: alignment properties encoded as bit flags

FLAG is an integer in which each bit encodes one property. Common values:

1 (0x1)        paired
2 (0x2)        proper pair
4 (0x4)        unmapped
8 (0x8)        mate unmapped
16 (0x10)      reverse strand
256 (0x100)    secondary
1024 (0x400)   duplicate
2048 (0x800)   supplementary

Tip: decoding FLAGs by hand is tedious. Use the Broad Institute's Explain Flags page, or run samtools flags 99 in a terminal.
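For readers who want to see what such a decoder does, the bit test can be sketched in a few lines of POSIX shell. decode_flag is a hypothetical helper written for this page, and its lower-case bit names are shorthand chosen here rather than samtools' own labels. The values 32, 64, 128, and 512 are not in the list above; they are taken from the SAM specification (mate reverse strand, first in pair, second in pair, QC fail).

```shell
# decode_flag: print the names of the bits set in a SAM FLAG value.
# Hypothetical helper for illustration; names are this sketch's own shorthand.
decode_flag() {
    flag=$1
    out=""
    # value:name pairs, lowest bit first
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:read1 128:read2 \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}
        name=${pair#*:}
        # A bit is set when the bitwise AND with the flag is non-zero.
        if [ $(( flag & bit )) -ne 0 ]; then
            out="$out${out:+ }$name"
        fi
    done
    printf '%s\n' "$out"
}

decode_flag 99     # prints: paired proper_pair mate_reverse read1
decode_flag 2064   # prints: reverse supplementary
```

FLAG 99 is 1 + 2 + 32 + 64, which is why it shows up so often: it marks the first read of a properly paired fragment whose mate maps to the reverse strand.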


📝 Self-check

1. Why use HISAT2 / STAR for RNA-seq instead of bwa-mem?

A. bwa-mem doesn't support paired-end reads
B. HISAT2 is faster
C. RNA reads span exon junctions, so a splice-aware aligner is needed
D. bwa-mem can't read .gz files

2. Why pipe the aligner directly into samtools sort?

A. They can't run separately
B. It avoids the huge intermediate SAM file, saving disk space and I/O
C. samtools only reads stdin
D. Pipes are faster but produce different results

3. samtools view sorted.bam chr1:10000-20000 fails with "random access is not supported". Why?

A. samtools is too old
B. The region syntax is wrong
C. The BAM is not sorted
D. The .bai index is missing; run samtools index first

4. What does samtools view -c -F 4 a.bam count?

A. Number of mapped reads
B. Number of unmapped reads
C. Number of duplicates
D. The 4th read