一、為什麼需要比對?
定序機讀出的 reads 是「一堆 150bp 的隨機片段」,沒有座標、沒有方向。Alignment 的工作就是找出每條 read 在參考基因組上「最可能來自」的位置,產生 (chrom, pos, CIGAR, MAPQ) 等資訊,輸出為 BAM 檔。後續所有 variant calling 都基於這份座標映射。
對 30× WGS 的人類樣本而言,這意味著要處理約 6 億條 paired reads,每條都要在 3 Gb 的搜尋空間中定位——這是整個 pipeline 中計算量最大的單一步驟之一。
The sequencer outputs reads as "a pile of 150bp random fragments" with no coordinates or orientation. Alignment finds the most-likely origin of each read on the reference, producing (chrom, pos, CIGAR, MAPQ) info written to a BAM file. All downstream variant calling depends on this coordinate map.
For 30× human WGS, this means processing ~600 million paired reads, each searched against 3 Gb of reference — one of the most compute-intensive single steps in the entire pipeline.
二、SAM / BAM 格式解析
SAM 是文字版、BAM 是二進位壓縮版(同樣資訊但小 4–5 倍)。每行代表一條 read 的比對結果,包含 11 個必要欄位 + 任意數量的 tag。
SAM is text; BAM is compressed binary (same info, 4–5× smaller). Each line is one read's alignment, with 11 mandatory fields plus optional tags.
FLAG
位元欄位,12 個 bit 編碼狀態:paired? mapped? duplicate? secondary? 等。常用查詢工具:Picard FLAG decoder。
Bitfield with 12 bits encoding states: paired? mapped? duplicate? secondary? Use Picard FLAG decoder to look up.
MAPQ
Mapping quality,0–60 scale。MAPQ ≥ 30 通常視為可靠;MAPQ = 0 表示這條 read 比對到多個位置(multi-mapping,repetitive region)。
Mapping quality, 0–60 scale. MAPQ ≥ 30 is generally considered reliable; MAPQ = 0 means the read maps to multiple locations (multi-mapping, repetitive region).
CIGAR
緊湊字串描述 read 與 ref 的關係:M=match/mismatch、I=insertion、D=deletion、S=soft-clip、H=hard-clip。如 50M2I98M=50 對齊+2 插入+98 對齊。
Compact string describing read vs ref: M=match/mismatch, I=insertion, D=deletion, S=soft-clip, H=hard-clip. E.g. 50M2I98M = 50 aligned + 2 inserted + 98 aligned.
三、主流 Aligner 比較
| 工具 | 演算法 | 速度 | 準確度 | 場景 | |||
|---|---|---|---|---|---|---|---|
| BWA-MEM | FM-index + SMEM | ~6–10 hrs | ⭐⭐⭐⭐ | 標準 baseline,GATK 官方推薦 | FM-index + SMEM | ~6-10 hrs | Standard baseline, GATK recommended |
| BWA-MEM2 | BWA-MEM 重寫,向量化 SIMD | ~2–3 hrs | ⭐⭐⭐⭐ 與 MEM 相同same as MEM | 2024 起新專案首選 | BWA-MEM rewrite, SIMD-vectorized | ~2-3 hrs | Default for new projects from 2024 |
| Bowtie2 | FM-index, end-to-end / local | ~5–8 hrs | ⭐⭐⭐ | RNA-seq 偏好;WGS/WES 低於 BWA | FM-index, end-to-end / local | ~5-8 hrs | RNA-seq favored; lower than BWA for WGS/WES |
| DRAGEN | 硬體加速 (FPGA) | ~30 min | ⭐⭐⭐⭐⭐ | 商業方案,極快但需專屬硬體 | Hardware-accelerated (FPGA) | ~30 min | Commercial, very fast but needs dedicated hardware |
| minimap2 | k-mer chaining | ~2 hrs | ⭐⭐⭐⭐ | long reads 主場 (PacBio/ONT) | k-mer chaining | ~2 hrs | Long-read champion (PacBio/ONT) |
| Novoalign | Needleman-Wunsch | ~5–10 hrs | ⭐⭐⭐⭐ | 商業,敏感度略勝 BWA 但慢 | Needleman-Wunsch | ~5-10 hrs | Commercial, slightly more sensitive than BWA but slower |
四、MAPQ 過濾模擬器
不同 MAPQ 閾值會保留多少 reads?對下游 variant calling 又有什麼影響?拖動滑桿觀察。
How many reads pass at each MAPQ threshold, and what does it mean for downstream calling? Drag the slider.
MAPQ 分布:通常呈雙峰(高品質單一比對 vs 低品質 multi-mapping)
五、Read Groups (RG):絕對不能省
Read group 是 BAM header 中的元資料,告訴 GATK:「這些 reads 來自哪個樣本、哪批 library、哪台儀器、哪條 lane」。GATK 多步驟(BQSR、HaplotypeCaller)強制要求 RG,沒有 RG 的 BAM 會直接報錯。
關鍵欄位:ID(unique,通常 flowcell.lane)、SM(sample name,最重要)、LB(library,給 dedup 用)、PL(platform,如 ILLUMINA)。
Read groups are BAM header metadata telling GATK: "these reads are from sample X, library Y, machine Z, lane W." GATK steps (BQSR, HaplotypeCaller) require RGs — a BAM without them errors out immediately.
Key fields: ID (unique, usually flowcell.lane), SM (sample name, most important), LB (library, used by dedup), PL (platform, e.g. ILLUMINA).
六、實作範例
# BWA-MEM2 alignment:建索引後直接比對 + samtools 排序 REF=Homo_sapiens_assembly38.fasta SM=patient001 ID=HXXX.1.ATCACG bwa-mem2 mem \ -t 16 \ # thread 數 -K 100000000 \ # 確定性處理 batch(GATK 推薦) -Y \ # soft-clip supplementary alignments -R "@RG\tID:${ID}\tSM:${SM}\tLB:${SM}_lib1\tPL:ILLUMINA" \ # RG 必填 $REF \ ${SM}_R1.trim.fastq.gz ${SM}_R2.trim.fastq.gz \ | samtools sort -@ 8 -o ${SM}.sorted.bam - samtools index ${SM}.sorted.bam
# 傳統 BWA-MEM(與 MEM2 結果完全相同,只是慢 2-3 倍) bwa mem -t 16 -K 100000000 -Y \ -R "@RG\tID:HXXX.1\tSM:patient001\tLB:patient001_lib1\tPL:ILLUMINA" \ Homo_sapiens_assembly38.fasta \ patient001_R1.trim.fastq.gz patient001_R2.trim.fastq.gz \ | samtools sort -@ 8 -o patient001.sorted.bam - # GATK best practice 用的 -K 與 -Y 是為了與下游 MarkDuplicates 行為一致
# samtools 常用後處理 # 1. 統計比對率與 MAPQ 分布 samtools flagstat patient001.sorted.bam # 2. 過濾 MAPQ ≥ 20 並保留 properly paired samtools view -b -q 20 -f 2 patient001.sorted.bam > patient001.q20.bam # 3. 視覺化某區段(IGV 之外的快速方式) samtools view patient001.sorted.bam chr1:1000000-1001000 | head # 4. 計算特定區域的 coverage samtools depth -a -r chr1:1000000-2000000 patient001.sorted.bam | \ awk '{sum+=$3; n++} END {print "mean cov:", sum/n}'
七、常見問題
人類 WGS/WES 對 GRCh38 的比對率通常 ≥ 97%。低於 90% 是警訊,常見原因:(1) 跨物種污染(細菌、mycoplasma);(2) 沒有 trim 嚴重的 adapter;(3) 用錯參考(如 GRCh37 vs hg19 的 chr 命名差異)。samtools flagstat 是第一道檢查。
Human WGS/WES alignment to GRCh38 typically achieves ≥ 97%. Below 90% is a red flag — common causes: (1) cross-species contamination (bacteria, mycoplasma); (2) unstripped adapter contamination; (3) wrong reference (GRCh37 vs hg19 chromosome naming mismatch). samtools flagstat is the first check.
BWA-MEM 對 multi-mapping reads(同樣最佳分數比對到 ≥2 個位置)的 MAPQ 設為 0。這通常出現在:(1) repetitive regions(如 segmental duplications, telomeres, centromeres);(2) 假基因/同源基因家族(如 GBA / GBAP1)。MAPQ=0 reads 在 variant calling 時會被忽略,這在某些臨床基因(如 SMN1/SMN2)造成嚴重盲區,需特別處理。
BWA-MEM sets MAPQ to 0 for multi-mapping reads (same best-score alignment in ≥2 places). Common in: (1) repetitive regions (segmental duplications, telomeres, centromeres); (2) pseudogenes / paralog families (e.g. GBA/GBAP1). MAPQ=0 reads are ignored during variant calling, creating serious blind spots for some clinical genes (e.g. SMN1/SMN2) requiring special handling.
GRCh38 包含 261 個 ALT contigs(人類群體中常見的替代單倍型,如 HLA、KIR)。GATK Best Practices 建議使用 alt-aware alignment(BWA-MEM 加上 .alt 索引檔),這會大幅改善這些區域的 calling 準確度。但工作流複雜度也提升——若不打算分析 HLA/KIR,使用「primary assembly only」(無 ALT)也可接受。
GRCh38 includes 261 ALT contigs (alternative haplotypes common in human populations, e.g. HLA, KIR). GATK Best Practices recommends alt-aware alignment (BWA-MEM with .alt index), which substantially improves calling accuracy in these regions. But workflow complexity rises — if you don't plan to analyze HLA/KIR, "primary assembly only" (no ALT) is acceptable.
BWA-MEM 比對人類基因組約需 10 GB RAM。BWA-MEM2 需要 80 GB(記憶體換速度——預先載入大型 SIMD-friendly index)。BWA-MEM2 建索引階段甚至需要 100 GB。雲端環境或筆電要先確認規格,否則會 OOM。
BWA-MEM uses ~10 GB RAM for human alignment. BWA-MEM2 needs 80 GB (memory-for-speed trade — pre-loading a large SIMD-friendly index). BWA-MEM2 indexing alone needs ~100 GB. Verify your cloud/laptop specs first or you'll OOM.
📝 自我檢測
1. 你跑完 BWA-MEM 後 samtools flagstat 顯示 mapping rate 只有 78%。最該優先檢查的是?
1. After BWA-MEM, samtools flagstat shows only 78% mapping rate. Most likely cause to check first?
2. 一個 read 的 MAPQ 是 0,正確的解讀是?
2. A read with MAPQ = 0 means:
3. 關於 Read Group (RG),以下何者錯誤?
3. Which statement about Read Groups (RG) is FALSE?