Ch 7: Download, Compression & Integrity

Download

wget vs curl vs rsync: which one to choose?

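Tool	Best for	Key features
wget	Downloading a single file or a whole directory	-c resume; -r recursive; simple and reliable
curl	APIs, custom headers / POST	-o output file; -L follow redirects
rsync	Directory sync	Transfers only differences; resumable with --partial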

Practical examples

# Download the GENCODE GTF with resume support
wget -c "https://ftp.ebi.ac.uk/.../gencode.v44.annotation.gtf.gz"

# Download an entire FTP directory
wget -r -np -nH --cut-dirs=3 "ftp://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/"

# Download and rename with curl
curl -L -o gencode.gtf.gz "https://ftp.ebi.ac.uk/.../gencode.v44.gtf.gz"

# Query the Ensembl REST API for a gene and pretty-print the JSON
curl -s "https://rest.ensembl.org/lookup/symbol/human/TP53?expand=1" \
  -H 'Content-Type: application/json' | python3 -m json.tool

# Sync local results to the lab server (only differences are sent; --partial keeps interrupted files so a rerun can resume)
rsync -avzh --partial --progress \
  results/ \
  charlene@server:/data/projects/rnaseq/results/

# Pull a file from the cluster
rsync -avzh charlene@cluster:~/work/multiqc_report.html ./

# Public SRA download with sra-tools (install via conda first)
conda install -c bioconda sra-tools -y
prefetch SRR1234567
# fasterq-dump has no --gzip option, so compress the FASTQ output afterwards
fasterq-dump SRR1234567 --split-files
gzip SRR1234567*.fastq
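
Large downloads over unstable links sometimes need more than one attempt. The sketch below wraps `wget -c` in a small retry loop so that each new attempt continues from the partial file instead of starting over. The function name `fetch_with_retry` and its default values are hypothetical, not part of wget; wget also has its own `--tries` option, which may be enough on its own in many cases.

```shell
# fetch_with_retry URL [TRIES] [DELAY_SECONDS]
# Hypothetical helper (not part of wget): re-runs `wget -c` until it succeeds
# or the number of attempts is used up.
fetch_with_retry() {
  local url=$1
  local tries=${2:-5}
  local delay=${3:-30}
  local attempt=1
  while [ "$attempt" -le "$tries" ]; do
    # -c continues from the partial file left by the previous attempt
    if wget -c "$url"; then
      return 0
    fi
    echo "attempt $attempt of $tries failed for $url" >&2
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage (URL elided exactly as in the examples above):
# fetch_with_retry "https://ftp.ebi.ac.uk/.../gencode.v44.annotation.gtf.gz" 5 30
```

The function returns 0 on the first successful attempt and 1 if every attempt fails, so it can be combined with `&&` to run a checksum verification only after a complete download.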

Compression

gzip / bgzip / tar: essential compression skills for NGS

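gzip large.fastq

Compress (the original file is replaced by large.fastq.gz)

gunzip large.fastq.gz

Decompress

gzip -k file

Compress but keep the original file

zcat sample.fastq.gz | head

Stream the contents without writing a decompressed copy

bgzip sample.vcf

Use bgzip for VCF/GFF

tabix -p vcf sample.vcf.gz

Build a tabix index

tar -czvf bk.tar.gz dir/

Pack and compress a directory

tar -xzvf bk.tar.gz

Extract a tar.gz

pigz -p 8 sample.fastq

Multi-threaded gzip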
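
To make the commands above concrete, here is a minimal sketch on a tiny FASTQ file. The file names and the two reads are invented for illustration. It shows that `gzip -k` keeps the original, that reads in a compressed FASTQ can be counted without writing a decompressed copy, and that `tar -tzf` lists an archive without extracting it.

```shell
set -eu

# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A toy FASTQ with two reads (4 lines per read); contents are made up.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > demo.fastq

# -k keeps demo.fastq next to the new demo.fastq.gz
gzip -k demo.fastq

# Stream-decompress and count reads: number of lines divided by 4
n_reads=$(gzip -dc demo.fastq.gz | awk 'END { print NR / 4 }')
echo "reads: $n_reads"

# Pack a directory, then list the archive contents without extracting
mkdir run1
cp demo.fastq.gz run1/
tar -czf run1.tar.gz run1/
archive_listing=$(tar -tzf run1.tar.gz)
echo "$archive_listing"
```

Listing an archive right after creating it is a cheap sanity check before the original directory is deleted.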

⚠️

gzip vs bgzip: a regular gzip file is normally one compressed stream and cannot be read from the middle. bgzip ("block gzip") writes independent blocks of at most 64 KB each, which tabix can index for random access. Any VCF / GFF / BED file that will be indexed or queried by region should be compressed with bgzip rather than plain gzip. bgzip output is still readable by gunzip and zcat.
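
The block idea can be illustrated with plain gzip alone. In the sketch below, two chunks are compressed separately and then concatenated. The result is still a valid gzip file, and each chunk can be decoded without reading the other. This is only an analogy for how bgzip lays out its output: it does not run bgzip, and it does not reproduce the per-block metadata that tabix relies on. File names and contents are made up.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Compress two chunks separately: each becomes a complete gzip member.
printf 'chr1\t100\n' | gzip -c > block1.gz
printf 'chr1\t200\n' | gzip -c > block2.gz

# Concatenating members gives a file that is still valid gzip.
cat block1.gz block2.gz > joined.gz
joined=$(gzip -dc joined.gz)
echo "$joined"

# Each member can also be decoded alone, without reading the other.
second_only=$(gzip -dc block2.gz)
echo "$second_only"
```

An index over such a file only needs to record where each block starts, which is why a block-based layout permits jumping to a region while a single stream does not.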

Integrity

md5sum / sha256sum: make sure a 30 GB file is not corrupted

Large NGS downloads can be cut off mid-transfer, disks can develop bad sectors, and uploads can drop bytes. In each case the file still exists but its contents are silently broken, so verify checksums before starting any analysis.

# Compute md5 for a single file
md5sum sample.fastq.gz
# <32-hex-digit digest>  sample.fastq.gz

# sha256 is collision-resistant, unlike md5 (recommended for data you publish)
sha256sum sample.fastq.gz

# Write checksums for every fastq.gz into checksums.md5
md5sum *.fastq.gz > checksums.md5

# Verify later (even months later):
md5sum -c checksums.md5
# sampleA.fastq.gz: OK
# sampleB.fastq.gz: FAILED  <- this file is corrupted
find raw_data/ -name "*.fastq.gz" -print0 \
  | xargs -0 md5sum \
  > raw_data.checksums.md5

# When downloading public data, always compare against the checksum published by the source
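
The following sketch shows why a checksum is needed at all: a file can change while its name and size stay exactly the same. It records a digest, simulates a one-byte corruption, and then lets `md5sum -c` report the mismatch. The file name and contents are made up; a checksums.md5 supplied by a data provider would be checked the same way.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for a downloaded file (contents are made up).
printf 'ACGTACGTACGT\n' > sample.txt
md5sum sample.txt > checksums.md5
size_before=$(wc -c < sample.txt)

# Simulate silent corruption: change one byte, keep the size identical.
printf 'ACGTACGAACGT\n' > sample.txt
size_after=$(wc -c < sample.txt)

# md5sum -c exits non-zero when any file fails, so capture the status.
status=0
report=$(md5sum -c checksums.md5 2>/dev/null) || status=$?
echo "$report"
echo "size before/after: $size_before/$size_after, exit status: $status"
```

Because `md5sum -c` signals failure through its exit status, it can gate the rest of a pipeline, for example `md5sum -c checksums.md5 && ./run_analysis.sh` (the script name here is hypothetical).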

💡

SOP suggestion: ship every raw_data/ with a checksums.md5, run md5sum -c the day the data arrives, and once it passes, set the whole folder read-only. Re-run the check whenever you later suspect a file is damaged.
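
One possible way to script this SOP is sketched below. It builds a stand-in raw_data/ folder with made-up contents, verifies it, and removes write permission only if every file passes. In a real project the checksums.md5 would come from the sender rather than being generated locally, and the exact permission policy is an assumption that depends on the lab.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for a freshly received raw_data/ folder (contents are made up).
mkdir raw_data
printf 'reads A\n' | gzip -c > raw_data/sampleA.fastq.gz
printf 'reads B\n' | gzip -c > raw_data/sampleB.fastq.gz

# Normally the sender ships checksums.md5; here we create it ourselves.
( cd raw_data && md5sum *.fastq.gz > checksums.md5 )

# Day one: verify, and lock the folder only if every file is OK.
# --quiet prints nothing for files that pass.
if ( cd raw_data && md5sum -c --quiet checksums.md5 ); then
  chmod -R a-w raw_data
  locked=yes
else
  locked=no
fi

perms=$(ls -ld raw_data/sampleA.fastq.gz | cut -c1-10)
echo "locked=$locked perms=$perms"
```

Read-only files can still be verified at any later date, since `md5sum -c` only needs read access.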

Data sources

Public databases commonly used in bioinformatics

📦 SRA / ENA

Raw NGS reads, the main source of FASTQ files. Download with sra-tools (prefetch, fasterq-dump) or wget the FASTQ files directly from ENA.

🧬 GENCODE / Ensembl

Reference genomes and GTF/GFF annotations for human, mouse, and other model organisms. GENCODE / Ensembl annotations are the recommended choice.

🌐 UCSC Genome Browser

Provides chrom sizes, tracks, liftOver chain files, and more. Chromosomes are named chr1, not 1.

🧪 GEO

Processed microarray / RNA-seq / ATAC-seq datasets, identified by GSE accession numbers. A source of expression matrices and sample metadata.

💧 1000G / gnomAD

Population-scale human variant resources; the gold-standard source of VCF data.

🏥 TCGA / ICGC

Cancer genomics data; access is typically controlled (dbGaP, EGA).
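
The chr1 versus 1 naming difference mentioned for UCSC matters as soon as files from different sources are combined, because tools match chromosome names as plain strings. The sketch below converts a toy BED file from UCSC-style names to the prefix-free style used by Ensembl. The file names and coordinates are made up, and it assumes only primary chromosomes are present; alternate contigs and patches would need a proper mapping table instead.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Toy BED file with UCSC-style names (coordinates are made up).
printf 'chr1\t100\t200\nchrX\t5\t50\nchrM\t1\t10\n' > ucsc.bed

# Strip the "chr" prefix; the mitochondrial genome is chrM in UCSC
# naming and MT in Ensembl naming, so it is handled separately.
awk 'BEGIN { FS = OFS = "\t" }
     $1 == "chrM" { $1 = "MT"; print; next }
     { sub(/^chr/, "", $1); print }' ucsc.bed > ensembl.bed

cat ensembl.bed
```

Checking the first column of both files with `cut -f1 file | sort -u` before any overlap or join step is a quick way to catch a naming mismatch.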

Hands-on

📝 Self-check

1. Which tool is best for syncing a 200 GB local results/ directory to a remote server?

A. scp -r

B. wget

C. rsync -avzh --progress

D. cp -r

2. Why is a VCF file usually compressed with bgzip rather than plain gzip?

A. bgzip compresses more

B. bgzip is block-based, so tabix can index it for random access

C. gzip cannot be used on VCF

D. bgzip is the only one that does not corrupt files

3. You receive NGS data that comes with a checksums.md5. What is the best first step?

A. cat checksums.md5

B. md5sum *.fastq.gz

C. head checksums.md5

D. md5sum -c checksums.md5

4. You need to compress 100 fastq files on a 32-core server. Which option is fastest?

A. pigz -p 16 *.fastq

B. gzip *.fastq

C. tar -czf out.tar.gz *.fastq

D. zip *.fastq