Run the WGS/WES Pipeline

This section describes an end-to-end, config-driven workflow for WGS/WES data. The same pipeline supports germline and somatic analyses and can be switched via a single YAML configuration file.

Pipeline Inputs & Layout 

Recommended project layout:

project/
├── config/
│   └── settings.yaml
├── refs/
│   ├── GRCh38.fa
│   ├── GRCh38.fa.fai
│   ├── GRCh38.dict
│   ├── dbsnp.vcf.gz
│   ├── dbsnp.vcf.gz.tbi
│   ├── Mills_and_1000G_gold_standard.indels.vcf.gz
│   ├── Mills_and_1000G_gold_standard.indels.vcf.gz.tbi
│   ├── 1000G_phase1.indels.vcf.gz
│   ├── 1000G_phase1.indels.vcf.gz.tbi
│   ├── af-only-gnomad.vcf.gz
│   ├── af-only-gnomad.vcf.gz.tbi
│   └── targets_exome.bed  # WES only (+ optional padded BED)
├── fastq/
│   ├── SAMPLE1_R1.fastq.gz
│   ├── SAMPLE1_R2.fastq.gz
│   └── ...
├── work/      # intermediate outputs
└── results/   # final QC, BAM/CRAM, VCF, annotation

Example settings.yaml (edit to your environment):

mode: germline  # one of: germline | somatic_tn | somatic_tumor_only
reference: refs/GRCh38.fa
known_sites:
  dbsnp: refs/dbsnp.vcf.gz
  mills: refs/Mills_and_1000G_gold_standard.indels.vcf.gz
  indels: refs/1000G_phase1.indels.vcf.gz
population_af: refs/af-only-gnomad.vcf.gz
wes:
  targets_bed: refs/targets_exome.bed
  padded_bed: refs/targets_exome.padded100.bed  # optional
resources:
  threads: 16
  memory_gb: 64
containers:
  use_apptainer: true
  image: docker://broadinstitute/gatk:latest
bams:
  compress_to_cram: true
  cram_ref: refs/GRCh38.fa
scatter:
  intervals: 22  # chromosomes or interval shards for parallelization

Note

For WES, use the vendor-provided BED; consider generating a padded BED (+/- 50–100 bp) to capture near-target indels.

Step-by-Step Workflow 

1) Raw Read QC 

Run FastQC on all FASTQ pairs; aggregate with MultiQC.

mkdir -p results/qc/fastqc
fastqc fastq/*_R1.fastq.gz fastq/*_R2.fastq.gz -o results/qc/fastqc
multiqc results/qc -o results/qc

Gate: No catastrophic adapter content; base qualities reasonable; overrepresented sequences understood (e.g., adapters).

2) Adapter/Quality Trimming (optional but recommended for WES)

Use fastp (recommended) or Trimmomatic. Enable poly-G trimming for NovaSeq data if needed.

fastp \
  -i fastq/SAMPLE_R1.fastq.gz -I fastq/SAMPLE_R2.fastq.gz \
  -o work/SAMPLE_R1.trim.fq.gz -O work/SAMPLE_R2.trim.fq.gz \
  --detect_adapter_for_pe --thread 16 \
  --html results/qc/SAMPLE.fastp.html --json results/qc/SAMPLE.fastp.json

Outputs: Trimmed FASTQs + fastp QC reports (included in MultiQC).

3) Alignment to Reference 

Align with BWA-MEM2 (or BWA-MEM) against GRCh38/hg38 (ALT-aware). Always set Read Group (RG) tags.

RG='@RG\tID:SAMPLE\tSM:SAMPLE\tPL:ILLUMINA\tLB:LIB1\tPU:FLOWCELL.LANE'
bwa-mem2 mem -t 16 refs/GRCh38.fa \
  work/SAMPLE_R1.trim.fq.gz work/SAMPLE_R2.trim.fq.gz \
  | samtools view -b - \
  | samtools sort -@ 8 -o work/SAMPLE.sorted.bam
samtools index work/SAMPLE.sorted.bam

QC:

Alignment stats: samtools flagstat and (WES) Picard CollectHsMetrics / (WGS) Picard CollectWgsMetrics.
Coverage profiling: mosdepth or bedtools genomecov.

samtools flagstat work/SAMPLE.sorted.bam > results/qc/SAMPLE.flagstat.txt
# WES capture metrics
gatk CollectHsMetrics \
  -I work/SAMPLE.sorted.bam -O results/qc/SAMPLE.hs_metrics.txt \
  -R refs/GRCh38.fa --BAIT_INTERVALS refs/targets_exome.bed \
  --TARGET_INTERVALS refs/targets_exome.bed
# WGS metrics
gatk CollectWgsMetrics \
  -I work/SAMPLE.sorted.bam -O results/qc/SAMPLE.wgs_metrics.txt -R refs/GRCh38.fa

4) Duplicate Marking & BQSR 

Mark duplicates (UMI-aware if applicable). Perform Base Quality Score Recalibration (BQSR) using known sites.

gatk MarkDuplicatesSpark \
  -I work/SAMPLE.sorted.bam \
  -O work/SAMPLE.dedup.bam \
  -M results/qc/SAMPLE.dup_metrics.txt
samtools index work/SAMPLE.dedup.bam
gatk BaseRecalibrator \
  -I work/SAMPLE.dedup.bam -R refs/GRCh38.fa \
  --known-sites refs/dbsnp.vcf.gz \
  --known-sites refs/Mills_and_1000G_gold_standard.indels.vcf.gz \
  --known-sites refs/1000G_phase1.indels.vcf.gz \
  -O work/SAMPLE.recal_data.table
gatk ApplyBQSR \
  -I work/SAMPLE.dedup.bam -R refs/GRCh38.fa \
  --bqsr-recal-file work/SAMPLE.recal_data.table \
  -O work/SAMPLE.recal.bam
samtools index work/SAMPLE.recal.bam

Tip

To reduce storage, write CRAM instead of BAM: gatk PrintReads -I work/SAMPLE.recal.bam -O results/bam/SAMPLE.cram -R refs/GRCh38.fa

5) Variant Calling 

Germline (per-sample GVCF)

Call each sample in GVCF mode for joint genotyping later.

gatk HaplotypeCaller \
  -R refs/GRCh38.fa -I work/SAMPLE.recal.bam \
  -O results/gvcf/SAMPLE.g.vcf.gz -ERC GVCF

Joint genotyping (cohort)

gatk GenomicsDBImport \
  --genomicsdb-workspace-path work/cohort_gdb \
  --sample-name-map config/gvcf_samples.map \
  --reader-threads 8
gatk GenotypeGVCFs \
  -R refs/GRCh38.fa \
  -V gendb://work/cohort_gdb \
  -O results/vcf/cohort.unfiltered.vcf.gz

Note

For small cohorts, you may replace GenomicsDB with CombineGVCFs but GenomicsDB scales better.

Somatic (Tumor–Normal)

Use Mutect2 with matched normal, a germline resource (gnomAD), and a Panel of Normals (PoN) if available.

# Calling
gatk Mutect2 \
  -R refs/GRCh38.fa \
  -I work/TUMOR.recal.bam -tumor TUMOR \
  -I work/NORMAL.recal.bam -normal NORMAL \
  --germline-resource refs/af-only-gnomad.vcf.gz \
  --panel-of-normals refs/pon.vcf.gz \
  -O work/TUMOR.mutect2.unfiltered.vcf.gz
# Orientation bias model (important for FFPE)
gatk LearnReadOrientationModel \
  -I work/TUMOR.f1r2.tar.gz \
  -O work/TUMOR.read-orientation-model.tar.gz
# Contamination estimation
gatk GetPileupSummaries \
  -I work/TUMOR.recal.bam -V refs/af-only-gnomad.vcf.gz \
  -L refs/targets_exome.bed -O work/TUMOR.pileups.table
gatk GetPileupSummaries \
  -I work/NORMAL.recal.bam -V refs/af-only-gnomad.vcf.gz \
  -L refs/targets_exome.bed -O work/NORMAL.pileups.table
gatk CalculateContamination \
  -I work/TUMOR.pileups.table \
  -matched work/NORMAL.pileups.table \
  -O work/TUMOR.contamination.table \
  --tumor-segmentation work/TUMOR.segments.table
# Filtering
gatk FilterMutectCalls \
  -R refs/GRCh38.fa \
  -V work/TUMOR.mutect2.unfiltered.vcf.gz \
  --contamination-table work/TUMOR.contamination.table \
  --ob-priors work/TUMOR.read-orientation-model.tar.gz \
  -O results/vcf/TUMOR.somatic.filtered.vcf.gz

Somatic (Tumor-only)

Call without a matched normal. Use PoN and gnomAD to suppress artifacts and likely germline.

gatk Mutect2 \
  -R refs/GRCh38.fa \
  -I work/TUMOR.recal.bam -tumor TUMOR \
  --germline-resource refs/af-only-gnomad.vcf.gz \
  --panel-of-normals refs/pon.vcf.gz \
  -O work/TUMOR.to.unfiltered.vcf.gz
gatk LearnReadOrientationModel \
  -I work/TUMOR.f1r2.tar.gz \
  -O work/TUMOR.read-orientation-model.tar.gz
gatk GetPileupSummaries \
  -I work/TUMOR.recal.bam -V refs/af-only-gnomad.vcf.gz \
  -O work/TUMOR.pileups.table
gatk CalculateContamination \
  -I work/TUMOR.pileups.table -O work/TUMOR.contamination.table
gatk FilterMutectCalls \
  -R refs/GRCh38.fa \
  -V work/TUMOR.to.unfiltered.vcf.gz \
  --contamination-table work/TUMOR.contamination.table \
  --ob-priors work/TUMOR.read-orientation-model.tar.gz \
  -O results/vcf/TUMOR.somatic.filtered.vcf.gz

Tip

Building a PoN: Run Mutect2 on ≥30 normal samples (as “tumors” without normal), then gatk CreateSomaticPanelOfNormals -V normals.list -O refs/pon.vcf.gz

6) Variant Filtering 

Germline

Prefer VQSR for sufficiently large cohorts. Otherwise use hard filters.

VQSR (cohort)

# SNP model
gatk VariantRecalibrator \
  -R refs/GRCh38.fa -V results/vcf/cohort.unfiltered.vcf.gz \
  --resource:hapmap,known=false,training=true,truth=true,prior=15.0 refs/hapmap.vcf.gz \
  --resource:omni,known=false,training=true,truth=false,prior=12.0 refs/1000G_omni.vcf.gz \
  --resource:1000G,known=false,training=true,truth=false,prior=10.0 refs/1000G_high_conf.vcf.gz \
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 refs/dbsnp.vcf.gz \
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  -mode SNP -O work/cohort.snp.recal --tranches-file work/cohort.snp.tranches
gatk ApplyVQSR \
  -V results/vcf/cohort.unfiltered.vcf.gz \
  --recal-file work/cohort.snp.recal \
  --tranches-file work/cohort.snp.tranches \
  -mode SNP -O work/cohort.snp.recalibrated.vcf.gz
# INDEL model
gatk VariantRecalibrator \
  -R refs/GRCh38.fa -V work/cohort.snp.recalibrated.vcf.gz \
  --resource:mills,known=false,training=true,truth=true,prior=12.0 refs/Mills_and_1000G_gold_standard.indels.vcf.gz \
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 refs/dbsnp.vcf.gz \
  -an QD -an ReadPosRankSum -an FS -an SOR \
  -mode INDEL -O work/cohort.indel.recal --tranches-file work/cohort.indel.tranches
gatk ApplyVQSR \
  -V work/cohort.snp.recalibrated.vcf.gz \
  --recal-file work/cohort.indel.recal \
  --tranches-file work/cohort.indel.tranches \
  -mode INDEL -O results/vcf/cohort.filtered.vcf.gz

Hard filters (small cohorts)

# SNPs (example thresholds)
gatk VariantFiltration \
  -V results/vcf/cohort.unfiltered.vcf.gz \
  --filter-expression "QD < 2.0" --filter-name "QD2" \
  --filter-expression "FS > 60.0" --filter-name "FS60" \
  --filter-expression "MQ < 40.0" --filter-name "MQ40" \
  --filter-expression "MQRankSum < -12.5" --filter-name "MQRS-12.5" \
  --filter-expression "ReadPosRankSum < -8.0" --filter-name "RPRS-8" \
  --filter-expression "SOR > 3.0" --filter-name "SOR3" \
  -O work/cohort.snps.hardfiltered.vcf.gz
# INDELs (example thresholds)
gatk VariantFiltration \
  -V work/cohort.snps.hardfiltered.vcf.gz \
  --filter-expression "QD < 2.0" --filter-name "QD2" \
  --filter-expression "FS > 200.0" --filter-name "FS200" \
  --filter-expression "ReadPosRankSum < -20.0" --filter-name "RPRS-20" \
  --filter-expression "SOR > 10.0" --filter-name "SOR10" \
  -O results/vcf/cohort.filtered.vcf.gz

Somatic

Use FilterMutectCalls output; optionally apply FilterByOrientationBias for specific FFPE artifacts:

gatk FilterByOrientationBias \
  -V results/vcf/TUMOR.somatic.filtered.vcf.gz \
  -P G/T -P C/T \
  -O results/vcf/TUMOR.somatic.ffpe_filtered.vcf.gz

7) Annotation 

Annotate for consequence, population frequency, and clinical context.

VEP (recommended)

vep \
  -i results/vcf/*.vcf.gz \
  -o results/annotation/annotated.vep.vcf.gz \
  --vcf --compress_output bgzip \
  --assembly GRCh38 --offline --cache \
  --fork 8 --everything

ANNOVAR (alternative)

table_annovar.pl results/vcf/*.vcf.gz humandb/ \
  -buildver hg38 -out results/annotation/annovar \
  -remove -protocol refGene,gnomad211_exome,clinvar_202403 \
  -operation g,f,f -nastring . -polish -vcfinput

Warning

ClinVar and cancer knowledge bases are for interpretation. Do not use them as hard filters without a defined policy.

8) Cohort-Level QC & Reports 

Aggregate QC with MultiQC (FastQC, alignment metrics, coverage, duplication).
Confirm sample identity/sex and detect swaps with Somalier or VerifyBamID2.
Summarize callable bases and % bases ≥20×/30× (WES) or ≥30× (WGS).

Parallelization & Performance 

Scatter/gather by chromosomes or interval shards; keep read group boundaries intact.
Use MarkDuplicatesSpark for speed (ensure adequate I/O).
Tune BWA-MEM2 threads to CPU/NUMA; avoid oversubscription.

Example Slurm snippet:

#!/bin/bash
#SBATCH -J wgswes
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=64G
#SBATCH -t 24:00:00
module load apptainer
apptainer exec docker://broadinstitute/gatk:latest \
  gatk --version
# run HaplotypeCaller / Mutect2 shards here ...

WES-Specific Notes 

Report on-target %, fold-80 base penalty, mean target coverage, and % targets ≥ 20×/30×.
Use padded BED for calling; revert to strict BED for coverage/QC.
Capture-kit updates may change target definitions; pin versions in the repository.

Expected Outputs 

BAM/CRAM + index (post-BQSR), duplication metrics, alignment/coverage metrics.
VCF/BCF: Germline joint genotyped and filtered; or somatic filtered calls.
Annotation files (VEP/ANNOVAR outputs).
QC: MultiQC report, per-sample metrics, contamination estimates.

Troubleshooting 

Low on-target (WES): Verify BED compatibility, library quality, and adapter trimming.
High contamination: Re-evaluate sample swaps; remove contaminated samples from PoN.
Over-filtering (germline): Revisit hard-filter thresholds or use VQSR with adequate cohort size.
Somatic false positives (tumor-only): Ensure PoN is technology-matched; tighten population AF thresholds; apply orientation bias filtering.

Next Steps 

See Advanced Analysis for trios, joint calling strategies, CNV/mtDNA modules, and tumor purity considerations.
See Pipeline Structure and Containerisation for exact containers, pinned versions, and YAML-driven execution wrappers.
Consult References for methods papers and best-practice guidance.