一、為什麼需要 joint calling?
單樣本呼叫 (single-sample calling) 在每個樣本身上獨立做決策,會遭遇兩個系統性問題:
- 低 coverage 位點被漏掉:某樣本在某位點只有 5× 覆蓋、其中 1 條 read 支持 ALT,single-sample caller 會視為 noise 直接 0/0。但若 100 個樣本中有 10 個都在這位點看到「1/5」訊號,joint calling 能整合證據判定為真 variant。
- 無法區分「真 0/0」與「missing」:合併 100 個 single-sample VCF 時,缺席的位點到底是「真 ref」還是「沒測到」?無法回答。GVCF 透過 reference-confidence block 解決這問題。
結果:joint calling 對稀有變異的 sensitivity 顯著提升,且 cohort 越大效果越好——這也是為什麼 gnomAD、UK Biobank、All of Us 等百萬級資料集都用 joint calling。
Single-sample calling decides independently per sample and hits two systematic problems:
- Low-coverage sites are dropped: a sample with 5× coverage and 1 ALT-supporting read at a site is dismissed as noise and called 0/0. But if 10 of 100 samples show the same "1/5" signal, joint calling integrates the evidence and rescues the real variant.
- Can't distinguish true 0/0 from missing: merging 100 single-sample VCFs leaves "absent at this site = true ref or untested?" unanswerable. GVCF's reference-confidence blocks solve this.
The payoff: joint calling markedly improves sensitivity for rare variants, and effectiveness scales with cohort size — which is why gnomAD, UK Biobank, and All of Us all use it.
二、GATK 三步流程
| 步驟 | 工具 | 輸入 | 輸出 | 說明 |
|---|---|---|---|---|
| 1 | HaplotypeCaller -ERC GVCF | 每個 sample 一份 BAM | N 份 GVCF | (已在 Step 5 完成) per-sample 獨立呼叫 |
| 2 | GenomicsDBImport 或 CombineGVCFs | N 份 GVCF | GenomicsDB / combined GVCF | 把 N 份 GVCF 索引到單一 columnar 資料庫,方便快速查詢 |
| 3 | GenotypeGVCFs | GenomicsDB | cohort VCF | 掃過所有位點,根據 N 樣本的 likelihood 重新呼叫 genotype |
兩者目的相同,但 GenomicsDBImport(推薦)使用 Intel TileDB 後端,磁碟 I/O 與記憶體效率高很多,可處理數萬樣本;CombineGVCFs 是舊版作法,產生 plain text VCF,超過 ~200 樣本就吃不消。Broad Institute 內部 100,000-sample joint calling 即用 GenomicsDB。
Same purpose, but GenomicsDBImport (recommended) uses Intel's TileDB backend for far better disk and memory efficiency, scaling to tens of thousands of samples. CombineGVCFs is the older approach producing plain VCF text and chokes past ~200 samples. Broad Institute's 100,000-sample joint calling uses GenomicsDB.
三、互動模擬:cohort size 如何救稀有變異
調整 cohort 樣本數,看 single-sample calling 與 joint calling 對「allele frequency 不同」的變異的 recall (sensitivity) 差異。常見變異 (AF=0.1) 兩者差不多;但稀有變異 (AF<0.001) 只有大 cohort joint calling 才抓得到。
Adjust cohort size to compare single-sample vs joint-calling recall (sensitivity) across allele frequencies. Common variants (AF=0.1) — comparable. Rare variants (AF<0.001) — only large-cohort joint calling catches them.
四、scatter-gather 平行化
單一 GenotypeGVCFs job 對 30,000 樣本 × 全基因組會跑數天。實務上以 區段 scatter-gather 拆解:
- 把 reference 切成 ~5 Mb 片段(可用 GATK 的
SplitIntervals自動產生 ~50-100 片) - 每片段獨立執行 GenomicsDBImport + GenotypeGVCFs
- 用
GatherVcfs或MergeVcfs把所有片段合併
在雲端(GCP DataProc, Terra workflow, AWS Batch)很容易啟動 100+ 平行 jobs,把全基因組 cohort joint calling 從幾天降到幾小時。Broad 的 JointGenotyping.wdl 就是這個架構。
A single GenotypeGVCFs job on 30,000 samples × whole genome would take days. In practice, interval scatter-gather breaks it up:
- Split the reference into ~5 Mb chunks (GATK's
SplitIntervalsauto-generates ~50–100) - Run GenomicsDBImport + GenotypeGVCFs per chunk independently
- Use
GatherVcfsorMergeVcfsto combine all chunks
On cloud (GCP Dataproc, Terra workflows, AWS Batch), spinning up 100+ parallel jobs cuts whole-genome cohort joint calling from days to hours. Broad's JointGenotyping.wdl uses exactly this pattern.
切割
SplitIntervals 依 region 大小切;--scatter-count 100 自動均分 100 等份。
SplitIntervals divides by region size; --scatter-count 100 creates 100 even chunks.
平行
每片段獨立 import + genotype,可分散到不同 node;DB 建議放共用儲存或 cloud bucket。
Each chunk is imported and genotyped independently across nodes; store DB on shared storage or a cloud bucket.
合併
用 GatherVcfs (快, 假設區段不重疊) 或 MergeVcfs (穩, 處理重疊);統一壓縮為 bgzip + tabix index。
Use GatherVcfs (fast, assumes no overlap) or MergeVcfs (robust to overlaps); compress with bgzip + tabix.
五、Pipeline 指令範例
# 1. 準備 sample-name-map (一行一樣本: name TAB path)
cat > cohort.sample_map <<EOF
sample001 /data/gvcfs/sample001.g.vcf.gz
sample002 /data/gvcfs/sample002.g.vcf.gz
sample003 /data/gvcfs/sample003.g.vcf.gz
EOF
# 2. Import 到 GenomicsDB (per-interval)
gatk --java-options "-Xmx16g" GenomicsDBImport \
--genomicsdb-workspace-path my_database \
--batch-size 50 \
-L chr20 \
--sample-name-map cohort.sample_map \
--tmp-dir /tmp \
--reader-threads 5
# 注意:workspace 必須是「不存在的目錄」,工具會自己建立# Joint calling
gatk --java-options "-Xmx16g" GenotypeGVCFs \
-R Homo_sapiens_assembly38.fasta \
-V gendb://my_database \
-O cohort_chr20.vcf.gz \
--tmp-dir /tmp
# 加 dbSNP 註解 ID
gatk GenotypeGVCFs \
-R ref.fa \
-V gendb://my_database \
--dbsnp Homo_sapiens_assembly38.dbsnp138.vcf \
-O cohort_chr20.vcf.gz# Scatter-gather 全流程 (bash 簡化版)
REF=Homo_sapiens_assembly38.fasta
INTERVALS=wgs_calling_regions.hg38.interval_list
# Step A: 切成 100 份
gatk SplitIntervals -R $REF -L $INTERVALS \
--scatter-count 100 -O scatter_dir
# Step B: per-interval import + genotype 平行 (用 GNU parallel)
ls scatter_dir/*.interval_list | parallel -j 16 '
chunk=$(basename {} .interval_list)
gatk GenomicsDBImport \
--genomicsdb-workspace-path db_${chunk} \
--sample-name-map cohort.sample_map \
-L {} --batch-size 50
gatk GenotypeGVCFs \
-R '$REF' -V gendb://db_${chunk} \
-O cohort_${chunk}.vcf.gz
'
# Step C: 合併
ls cohort_*.vcf.gz | sort -V > vcf_list.txt
gatk GatherVcfs -I vcf_list.txt -O cohort_final.vcf.gz
gatk IndexFeatureFile -I cohort_final.vcf.gz六、FAQ
cohort 要多大才有意義?
可以「加入新樣本」到已存在的 GenomicsDB 嗎?
--genomicsdb-update-workspace-path(取代 --genomicsdb-workspace-path)即可附加。但 cohort 改變後仍需重跑 GenotypeGVCFs。--genomicsdb-update-workspace-path instead of --genomicsdb-workspace-path to append. The cohort has changed, so GenotypeGVCFs must be re-run.N+1 問題:有新樣本就要重跑全 cohort 嗎?
輸出 cohort VCF 多大?
七、小測驗
Q1. 為什麼合併 single-sample VCF 不等於 joint calling?
Q2. GenomicsDBImport 比 CombineGVCFs 適合大 cohort 的主因?
Q3. 一個 30× WGS sample joint-call 完,下一週又進來 1 個新樣本,最佳作法?