STEP 3 / 9

序列比對 (Read Alignment)

把幾億條 short reads 釘到 3 Gb 的人類基因體上——速度、準確度、MAPQ 三者的權衡。

Pin hundreds of millions of short reads onto a 3 Gb human genome — the trade-off between speed, accuracy, and MAPQ.

一、為什麼需要比對?

定序機讀出的 reads 是「一堆 150bp 的隨機片段」,沒有座標、沒有方向。Alignment 的工作就是找出每條 read 在參考基因組上「最可能來自」的位置,產生 (chrom, pos, CIGAR, MAPQ) 等資訊,輸出為 BAM 檔。後續所有 variant calling 都基於這份座標映射。

對 30× WGS 的人類樣本而言,這意味著要處理約 6 億條 paired reads,每條都要在 3 Gb 的搜尋空間中定位——這是整個 pipeline 中計算量最大的單一步驟之一。

The sequencer outputs reads as "a pile of 150bp random fragments" with no coordinates or orientation. Alignment finds the most-likely origin of each read on the reference, producing (chrom, pos, CIGAR, MAPQ) info written to a BAM file. All downstream variant calling depends on this coordinate map.

For 30× human WGS, this means processing ~600 million paired reads, each searched against 3 Gb of reference — one of the most compute-intensive single steps in the entire pipeline.

二、SAM / BAM 格式解析

SAM 是文字版、BAM 是二進位壓縮版(同樣資訊但小 4–5 倍)。每行代表一條 read 的比對結果,包含 11 個必要欄位 + 任意數量的 tag。

SAM is text; BAM is compressed binary (same info, 4–5× smaller). Each line is one read's alignment, with 11 mandatory fields plus optional tags.

QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL tags... A00123:1:1101 99 chr1 1234567 60 150M = 1234817 400 ACGT... FFFF... NM:i:0 MD:Z:150 RG:Z:S1 FLAG 99 → paired, mapped in proper pair, mate reversed, R1 MAPQ 60 → uniquely mapped, very high confidence (max for BWA-MEM) CIGAR 150M → 150 matches, no indels NM 0 → 0 mismatches against reference

FLAG

位元欄位,12 個 bit 編碼狀態:paired? mapped? duplicate? secondary? 等。常用查詢工具:Picard FLAG decoder

Bitfield with 12 bits encoding states: paired? mapped? duplicate? secondary? Use Picard FLAG decoder to look up.

MAPQ

Mapping quality,0–60 scale。MAPQ ≥ 30 通常視為可靠;MAPQ = 0 表示這條 read 比對到多個位置(multi-mapping,repetitive region)。

Mapping quality, 0–60 scale. MAPQ ≥ 30 is generally considered reliable; MAPQ = 0 means the read maps to multiple locations (multi-mapping, repetitive region).

CIGAR

緊湊字串描述 read 與 ref 的關係:M=match/mismatch、I=insertion、D=deletion、S=soft-clip、H=hard-clip。如 50M2I98M=50 對齊+2 插入+98 對齊。

Compact string describing read vs ref: M=match/mismatch, I=insertion, D=deletion, S=soft-clip, H=hard-clip. E.g. 50M2I98M = 50 aligned + 2 inserted + 98 aligned.

三、主流 Aligner 比較

工具演算法速度準確度場景
BWA-MEMFM-index + SMEM~6–10 hrs⭐⭐⭐⭐標準 baseline,GATK 官方推薦FM-index + SMEM~6-10 hrsStandard baseline, GATK recommended
BWA-MEM2BWA-MEM 重寫,向量化 SIMD~2–3 hrs⭐⭐⭐⭐ 與 MEM 相同same as MEM2024 起新專案首選BWA-MEM rewrite, SIMD-vectorized~2-3 hrsDefault for new projects from 2024
Bowtie2FM-index, end-to-end / local~5–8 hrs⭐⭐⭐RNA-seq 偏好;WGS/WES 低於 BWAFM-index, end-to-end / local~5-8 hrsRNA-seq favored; lower than BWA for WGS/WES
DRAGEN硬體加速 (FPGA)~30 min⭐⭐⭐⭐⭐商業方案,極快但需專屬硬體Hardware-accelerated (FPGA)~30 minCommercial, very fast but needs dedicated hardware
minimap2k-mer chaining~2 hrs⭐⭐⭐⭐long reads 主場 (PacBio/ONT)k-mer chaining~2 hrsLong-read champion (PacBio/ONT)
NovoalignNeedleman-Wunsch~5–10 hrs⭐⭐⭐⭐商業,敏感度略勝 BWA 但慢Needleman-Wunsch~5-10 hrsCommercial, slightly more sensitive than BWA but slower
2024 預設選擇:BWA-MEM2 與 BWA-MEM 在 variant calling 結果上完全相同(演算法相同,只是 SIMD 加速),但執行時間縮短 2–3 倍。新專案直接用 BWA-MEM2,記憶體需求約 80 GB(建索引時 100 GB)。 2024 default: BWA-MEM2 produces identical variant-calling results to BWA-MEM (same algorithm, just SIMD-accelerated), but runs 2–3× faster. Use BWA-MEM2 for new projects; needs ~80 GB RAM (100 GB during indexing).

四、MAPQ 過濾模擬器

不同 MAPQ 閾值會保留多少 reads?對下游 variant calling 又有什麼影響?拖動滑桿觀察。

How many reads pass at each MAPQ threshold, and what does it mean for downstream calling? Drag the slider.

MAPQ 分布:通常呈雙峰(高品質單一比對 vs 低品質 multi-mapping)

五、Read Groups (RG):絕對不能省

Read group 是 BAM header 中的元資料,告訴 GATK:「這些 reads 來自哪個樣本、哪批 library、哪台儀器、哪條 lane」。GATK 多步驟(BQSR、HaplotypeCaller)強制要求 RG,沒有 RG 的 BAM 會直接報錯。

關鍵欄位:ID(unique,通常 flowcell.lane)、SM(sample name,最重要)、LB(library,給 dedup 用)、PL(platform,如 ILLUMINA)。

Read groups are BAM header metadata telling GATK: "these reads are from sample X, library Y, machine Z, lane W." GATK steps (BQSR, HaplotypeCaller) require RGs — a BAM without them errors out immediately.

Key fields: ID (unique, usually flowcell.lane), SM (sample name, most important), LB (library, used by dedup), PL (platform, e.g. ILLUMINA).

@RG ID:HXXX.1 SM:patient001 LB:patient001_lib1 PL:ILLUMINA PU:HXXX.1.ATCACG CN:BroadInstitute ID → unique per (flowcell, lane); appears in every read's RG:Z: tag SM → sample identity; same patient sequenced multiple times shares this LB → library; same SM but different LB tells dedup not to merge across libs PL → ILLUMINA / PACBIO / ONT — affects how BQSR models errors PU → platform unit (flowcell.lane.barcode); finer than ID
🚫
常見災難:所有樣本的 RG 都設成同一個 SM(例如忘了改 sample name)。後果:joint genotyping 時所有 reads 被當成同一個樣本,所有 variant 變成 homozygous,分析全錯。每次 alignment 都要核對 SM。 Common disaster: All samples sharing the same SM (e.g. forgetting to change the sample name). Consequence: joint genotyping treats everything as one sample, every variant looks homozygous, the entire analysis is wrong. Verify SM at every alignment.

六、實作範例

# BWA-MEM2 alignment:建索引後直接比對 + samtools 排序
REF=Homo_sapiens_assembly38.fasta
SM=patient001
ID=HXXX.1.ATCACG

bwa-mem2 mem \
  -t 16 \                                   # thread 數
  -K 100000000 \                            # 確定性處理 batch(GATK 推薦)
  -Y \                                       # soft-clip supplementary alignments
  -R "@RG\tID:${ID}\tSM:${SM}\tLB:${SM}_lib1\tPL:ILLUMINA" \  # RG 必填
  $REF \
  ${SM}_R1.trim.fastq.gz ${SM}_R2.trim.fastq.gz \
  | samtools sort -@ 8 -o ${SM}.sorted.bam -

samtools index ${SM}.sorted.bam
# 傳統 BWA-MEM(與 MEM2 結果完全相同,只是慢 2-3 倍)
bwa mem -t 16 -K 100000000 -Y \
  -R "@RG\tID:HXXX.1\tSM:patient001\tLB:patient001_lib1\tPL:ILLUMINA" \
  Homo_sapiens_assembly38.fasta \
  patient001_R1.trim.fastq.gz patient001_R2.trim.fastq.gz \
  | samtools sort -@ 8 -o patient001.sorted.bam -

# GATK best practice 用的 -K 與 -Y 是為了與下游 MarkDuplicates 行為一致
# samtools 常用後處理

# 1. 統計比對率與 MAPQ 分布
samtools flagstat patient001.sorted.bam

# 2. 過濾 MAPQ ≥ 20 並保留 properly paired
samtools view -b -q 20 -f 2 patient001.sorted.bam > patient001.q20.bam

# 3. 視覺化某區段(IGV 之外的快速方式)
samtools view patient001.sorted.bam chr1:1000000-1001000 | head

# 4. 計算特定區域的 coverage
samtools depth -a -r chr1:1000000-2000000 patient001.sorted.bam | \
  awk '{sum+=$3; n++} END {print "mean cov:", sum/n}'

七、常見問題

人類 WGS/WES 對 GRCh38 的比對率通常 ≥ 97%。低於 90% 是警訊,常見原因:(1) 跨物種污染(細菌、mycoplasma);(2) 沒有 trim 嚴重的 adapter;(3) 用錯參考(如 GRCh37 vs hg19 的 chr 命名差異)。samtools flagstat 是第一道檢查。

Human WGS/WES alignment to GRCh38 typically achieves ≥ 97%. Below 90% is a red flag — common causes: (1) cross-species contamination (bacteria, mycoplasma); (2) unstripped adapter contamination; (3) wrong reference (GRCh37 vs hg19 chromosome naming mismatch). samtools flagstat is the first check.

BWA-MEM 對 multi-mapping reads(同樣最佳分數比對到 ≥2 個位置)的 MAPQ 設為 0。這通常出現在:(1) repetitive regions(如 segmental duplications, telomeres, centromeres);(2) 假基因/同源基因家族(如 GBA / GBAP1)。MAPQ=0 reads 在 variant calling 時會被忽略,這在某些臨床基因(如 SMN1/SMN2)造成嚴重盲區,需特別處理。

BWA-MEM sets MAPQ to 0 for multi-mapping reads (same best-score alignment in ≥2 places). Common in: (1) repetitive regions (segmental duplications, telomeres, centromeres); (2) pseudogenes / paralog families (e.g. GBA/GBAP1). MAPQ=0 reads are ignored during variant calling, creating serious blind spots for some clinical genes (e.g. SMN1/SMN2) requiring special handling.

GRCh38 包含 261 個 ALT contigs(人類群體中常見的替代單倍型,如 HLA、KIR)。GATK Best Practices 建議使用 alt-aware alignment(BWA-MEM 加上 .alt 索引檔),這會大幅改善這些區域的 calling 準確度。但工作流複雜度也提升——若不打算分析 HLA/KIR,使用「primary assembly only」(無 ALT)也可接受。

GRCh38 includes 261 ALT contigs (alternative haplotypes common in human populations, e.g. HLA, KIR). GATK Best Practices recommends alt-aware alignment (BWA-MEM with .alt index), which substantially improves calling accuracy in these regions. But workflow complexity rises — if you don't plan to analyze HLA/KIR, "primary assembly only" (no ALT) is acceptable.

BWA-MEM 比對人類基因組約需 10 GB RAMBWA-MEM2 需要 80 GB(記憶體換速度——預先載入大型 SIMD-friendly index)。BWA-MEM2 建索引階段甚至需要 100 GB。雲端環境或筆電要先確認規格,否則會 OOM。

BWA-MEM uses ~10 GB RAM for human alignment. BWA-MEM2 needs 80 GB (memory-for-speed trade — pre-loading a large SIMD-friendly index). BWA-MEM2 indexing alone needs ~100 GB. Verify your cloud/laptop specs first or you'll OOM.

📝 自我檢測

1. 你跑完 BWA-MEM 後 samtools flagstat 顯示 mapping rate 只有 78%。最該優先檢查的是?

1. After BWA-MEM, samtools flagstat shows only 78% mapping rate. Most likely cause to check first?

A. BWA 版本不對A. Wrong BWA version
B. CPU thread 數設太少B. Too few CPU threads
C. 跨物種污染(細菌/mycoplasma)或 adapter 沒有 trim 乾淨C. Cross-species contamination (bacteria/mycoplasma) or unstripped adapters
D. Reference fasta 沒有索引D. Reference fasta wasn't indexed

2. 一個 read 的 MAPQ 是 0,正確的解讀是?

2. A read with MAPQ = 0 means:

A. 這條 read 在 reference 上有多個最佳比對位置(multi-mapping)A. The read maps equally well to multiple positions (multi-mapping)
B. 這條 read 完全沒有比對成功B. The read failed to align entirely
C. 這條 read 的 base quality 很差C. The read has poor base quality
D. 這條 read 來自 mitochondriaD. The read is from the mitochondria

3. 關於 Read Group (RG),以下何者錯誤?

3. Which statement about Read Groups (RG) is FALSE?

A. SM 欄位代表 sample name,必須在同一個樣本所有 BAM 中保持一致A. The SM field is the sample name and must match across all BAMs from one sample
B. RG 是可選的,沒設定 GATK 也能正常跑B. RG is optional; GATK runs fine without it
C. LB 欄位告訴 MarkDuplicates 哪些 reads 來自同一個 libraryC. The LB field tells MarkDuplicates which reads share a library
D. PL 欄位影響 BQSR 如何建模 base quality 錯誤D. The PL field affects how BQSR models base quality errors