Step 6: Joint Genotyping — WGS/WES Tutorial

為什麼

一、為什麼需要 joint calling？

單樣本呼叫 (single-sample calling) 在每個樣本身上獨立做決策，會遭遇兩個系統性問題：

低 coverage 位點被漏掉：某樣本在某位點只有 5× 覆蓋、其中 1 條 read 支持 ALT，single-sample caller 會視為 noise 直接 0/0。但若 100 個樣本中有 10 個都在這位點看到「1/5」訊號，joint calling 能整合證據判定為真 variant。
無法區分「真 0/0」與「missing」：合併 100 個 single-sample VCF 時，缺席的位點到底是「真 ref」還是「沒測到」？無法回答。GVCF 透過 reference-confidence block 解決這問題。

結果：joint calling 對稀有變異的 sensitivity 顯著提升，且 cohort 越大效果越好——這也是為什麼 gnomAD、UK Biobank、All of Us 等百萬級資料集都用 joint calling。

Single-sample calling decides independently per sample and hits two systematic problems:

Low-coverage sites are dropped: a sample with 5× coverage and 1 ALT-supporting read at a site is dismissed as noise and called 0/0. But if 10 of 100 samples show the same "1/5" signal, joint calling integrates the evidence and rescues the real variant.
Can't distinguish true 0/0 from missing: merging 100 single-sample VCFs leaves "absent at this site = true ref or untested?" unanswerable. GVCF's reference-confidence blocks solve this.

The payoff: joint calling markedly improves sensitivity for rare variants, and effectiveness scales with cohort size — which is why gnomAD, UK Biobank, and All of Us all use it.

技術

二、GATK 三步流程

步驟	工具	輸入	輸出	說明
1	`HaplotypeCaller -ERC GVCF`	每個 sample 一份 BAM	N 份 GVCF	(已在 Step 5 完成) per-sample 獨立呼叫
2	`GenomicsDBImport` 或 `CombineGVCFs`	N 份 GVCF	GenomicsDB / combined GVCF	把 N 份 GVCF 索引到單一 columnar 資料庫，方便快速查詢
3	`GenotypeGVCFs`	GenomicsDB	cohort VCF	掃過所有位點，根據 N 樣本的 likelihood 重新呼叫 genotype

GenomicsDBImport vs CombineGVCFs

兩者目的相同，但 GenomicsDBImport（推薦）使用 Intel TileDB 後端，磁碟 I/O 與記憶體效率高很多，可處理數萬樣本；CombineGVCFs 是舊版作法，產生 plain text VCF，超過 ~200 樣本就吃不消。Broad Institute 內部 100,000-sample joint calling 即用 GenomicsDB。

Same purpose, but GenomicsDBImport (recommended) uses Intel's TileDB backend for far better disk and memory efficiency, scaling to tens of thousands of samples. CombineGVCFs is the older approach producing plain VCF text and chokes past ~200 samples. Broad Institute's 100,000-sample joint calling uses GenomicsDB.

互動模擬

三、互動模擬：cohort size 如何救稀有變異

調整 cohort 樣本數，看 single-sample calling 與 joint calling 對「allele frequency 不同」的變異的 recall (sensitivity) 差異。常見變異 (AF=0.1) 兩者差不多；但稀有變異 (AF<0.001) 只有大 cohort joint calling 才抓得到。

Adjust cohort size to compare single-sample vs joint-calling recall (sensitivity) across allele frequencies. Common variants (AF=0.1) — comparable. Rare variants (AF<0.001) — only large-cohort joint calling catches them.

Cohort 樣本數: 100

藍 = single-sample calling 的 recall（與 cohort 大小無關）；橘 = joint calling 的 recall（隨 cohort 增大而上升）。X 軸為 log scale 的 allele frequency。

Blue = single-sample recall (independent of cohort size). Orange = joint-calling recall (rises with cohort size). X-axis is log-scale allele frequency.

技術

四、scatter-gather 平行化

單一 GenotypeGVCFs job 對 30,000 樣本 × 全基因組會跑數天。實務上以 區段 scatter-gather 拆解：

把 reference 切成 ~5 Mb 片段（可用 GATK 的 SplitIntervals 自動產生 ~50-100 片）
每片段獨立執行 GenomicsDBImport + GenotypeGVCFs
用 GatherVcfs 或 MergeVcfs 把所有片段合併

在雲端（GCP DataProc, Terra workflow, AWS Batch）很容易啟動 100+ 平行 jobs，把全基因組 cohort joint calling 從幾天降到幾小時。Broad 的 JointGenotyping.wdl 就是這個架構。

A single GenotypeGVCFs job on 30,000 samples × whole genome would take days. In practice, interval scatter-gather breaks it up:

Split the reference into ~5 Mb chunks (GATK's SplitIntervals auto-generates ~50–100)
Run GenomicsDBImport + GenotypeGVCFs per chunk independently
Use GatherVcfs or MergeVcfs to combine all chunks

On cloud (GCP Dataproc, Terra workflows, AWS Batch), spinning up 100+ parallel jobs cuts whole-genome cohort joint calling from days to hours. Broad's JointGenotyping.wdl uses exactly this pattern.

🔪

切割

SplitIntervals 依 region 大小切；--scatter-count 100 自動均分 100 等份。

SplitIntervals divides by region size; --scatter-count 100 creates 100 even chunks.

⚡

平行

每片段獨立 import + genotype，可分散到不同 node；DB 建議放共用儲存或 cloud bucket。

Each chunk is imported and genotyped independently across nodes; store DB on shared storage or a cloud bucket.

🧷

合併

用 GatherVcfs (快, 假設區段不重疊) 或 MergeVcfs (穩, 處理重疊)；統一壓縮為 bgzip + tabix index。

Use GatherVcfs (fast, assumes no overlap) or MergeVcfs (robust to overlaps); compress with bgzip + tabix.

實務指令

五、Pipeline 指令範例

# 1. 準備 sample-name-map (一行一樣本: name TAB path)
cat > cohort.sample_map <<EOF
sample001	/data/gvcfs/sample001.g.vcf.gz
sample002	/data/gvcfs/sample002.g.vcf.gz
sample003	/data/gvcfs/sample003.g.vcf.gz
EOF

# 2. Import 到 GenomicsDB (per-interval)
gatk --java-options "-Xmx16g" GenomicsDBImport \
    --genomicsdb-workspace-path my_database \
    --batch-size 50 \
    -L chr20 \
    --sample-name-map cohort.sample_map \
    --tmp-dir /tmp \
    --reader-threads 5

# 注意：workspace 必須是「不存在的目錄」，工具會自己建立

# Joint calling
gatk --java-options "-Xmx16g" GenotypeGVCFs \
    -R Homo_sapiens_assembly38.fasta \
    -V gendb://my_database \
    -O cohort_chr20.vcf.gz \
    --tmp-dir /tmp

# 加 dbSNP 註解 ID
gatk GenotypeGVCFs \
    -R ref.fa \
    -V gendb://my_database \
    --dbsnp Homo_sapiens_assembly38.dbsnp138.vcf \
    -O cohort_chr20.vcf.gz

# Scatter-gather 全流程 (bash 簡化版)
REF=Homo_sapiens_assembly38.fasta
INTERVALS=wgs_calling_regions.hg38.interval_list

# Step A: 切成 100 份
gatk SplitIntervals -R $REF -L $INTERVALS \
    --scatter-count 100 -O scatter_dir

# Step B: per-interval import + genotype 平行 (用 GNU parallel)
ls scatter_dir/*.interval_list | parallel -j 16 '
    chunk=$(basename {} .interval_list)
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path db_${chunk} \
        --sample-name-map cohort.sample_map \
        -L {} --batch-size 50
    gatk GenotypeGVCFs \
        -R '$REF' -V gendb://db_${chunk} \
        -O cohort_${chunk}.vcf.gz
'

# Step C: 合併
ls cohort_*.vcf.gz | sort -V > vcf_list.txt
gatk GatherVcfs -I vcf_list.txt -O cohort_final.vcf.gz
gatk IndexFeatureFile -I cohort_final.vcf.gz

常見問題

六、FAQ

cohort 要多大才有意義？

家系分析 3-4 人就有用 (de novo identification)。族群關聯研究通常 ≥ 1,000 人才能達到稀有變異統計力。gnomAD v4 包含 ~80,000 WGS + 730,000 WES。實務上 100 人以上 joint calling 就能看到對 single-sample 顯著的 recall 提升。

Trio analyses (3-4 people) are useful for de novo identification. Population association studies typically need ≥1,000 to reach rare-variant power. gnomAD v4 includes ~80,000 WGS + 730,000 WES. In practice, joint calling with ≥100 samples shows clearly better recall than single-sample calling.

可以「加入新樣本」到已存在的 GenomicsDB 嗎？

可以。再次執行 GenomicsDBImport 並指定 --genomicsdb-update-workspace-path（取代 --genomicsdb-workspace-path）即可附加。但 cohort 改變後仍需重跑 GenotypeGVCFs。

Yes. Re-run GenomicsDBImport with --genomicsdb-update-workspace-path instead of --genomicsdb-workspace-path to append. The cohort has changed, so GenotypeGVCFs must be re-run.

N+1 問題：有新樣本就要重跑全 cohort 嗎？

是的，這是 GATK joint calling 的設計缺陷。每加入新樣本，joint genotyping 必須重跑（雖然 GVCF 本身不必重產）。GLnexus 與 DRAGEN 的 incremental joint calling 部分緩解此問題；大 cohort（>100k）通常分批 freeze。

Yes, this is GATK joint calling's design limitation. Adding samples requires re-running joint genotyping (though GVCFs themselves are not regenerated). GLnexus and DRAGEN's incremental joint calling partially mitigate this; very large cohorts (>100k) typically freeze in batches.

輸出 cohort VCF 多大？

100 樣本 30× WGS 約 100-200 GB（bgzip 壓縮）。隨 cohort 線性增長，10,000 樣本可達數 TB。長期儲存常分 chromosome 切片，並轉成 Hail Table 或 VariantStore (Apache Iceberg) 便於分析。

100 samples × 30× WGS: ~100–200 GB bgzipped. Grows linearly; 10,000 samples can reach multiple TB. Long-term storage often slices by chromosome and converts to Hail Tables or a VariantStore (Apache Iceberg) for analysis.

自我檢測

七、小測驗

Q1. 為什麼合併 single-sample VCF 不等於 joint calling？

Q2. GenomicsDBImport 比 CombineGVCFs 適合大 cohort 的主因？

Q3. 一個 30× WGS sample joint-call 完，下一週又進來 1 個新樣本，最佳作法？