WGS/WES Variant Calling Pipeline

Author: Akhilesh Kaushal

Version: 1.0.0

Keywords: WGS, WES, Germline, Somatic, GATK, Apptainer, Docker, Singularity

Overview

This document describes a reproducible pipeline for variant discovery from Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES) data. It supports both germline and somatic analyses, and somatic analyses can run with or without a matched normal sample. The workflow follows GATK Best Practices: read preprocessing, variant discovery, calibrated filtering, and quality control (QC), followed by standards-compliant annotation for downstream interpretation.

Use Cases

  • Germline: population-scale studies, rare disease genetics, trio/quad analyses, and cohort joint genotyping.

  • Somatic (Tumor/Normal): cancer genomics using matched tumor–normal pairs; optimized for sensitivity with artifact suppression.

  • Somatic (Tumor-only): analyses without normal tissue using Panel of Normals (PoN) and population allele frequency filtering.

Supported Data Types

  • WGS: 30× mean coverage typical (higher for somatic); no capture step, so coverage is comparatively uniform genome-wide.

  • WES: capture-based enrichment (e.g., Agilent SureSelect, IDT xGen). Requires target BED (and optional “padding” for near-target indels).
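
The optional padding mentioned above can be sketched in a few lines. This is a minimal illustration, not the pipeline's own implementation: the function name `pad_and_merge`, the 100 bp default, and the example coordinates are all assumptions made for the example. Intervals are treated as BED-style (0-based, half-open), padded on both sides, clamped to contig bounds when sizes are supplied, and merged where padding makes them overlap.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), BED-style 0-based half-open


def pad_and_merge(
    intervals: List[Interval],
    padding: int = 100,  # hypothetical default; choose per capture kit
    chrom_sizes: Optional[Dict[str, int]] = None,
) -> List[Interval]:
    """Pad each interval on both sides, clamp to contig bounds, merge overlaps."""
    padded = []
    for chrom, start, end in intervals:
        new_start = max(0, start - padding)
        new_end = end + padding
        if chrom_sizes and chrom in chrom_sizes:
            new_end = min(new_end, chrom_sizes[chrom])
        padded.append((chrom, new_start, new_end))

    merged: List[Interval] = []
    # Sorting is lexicographic on contig name, then numeric on coordinates.
    for chrom, start, end in sorted(padded):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up target coordinates, for illustration only.
targets = [("chr1", 1000, 1200), ("chr1", 1350, 1500), ("chr2", 50, 150)]
print(pad_and_merge(targets, padding=100))
# → [('chr1', 900, 1600), ('chr2', 0, 250)]
```

Note that the output is ordered by contig name as a string; a real interval file would normally be re-sorted to match the reference dictionary order.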

When to Choose WGS vs WES

  • WGS: comprehensive coverage (coding + noncoding), superior indel/SV detection, better uniformity; higher cost and storage.

  • WES: cost-effective for coding regions and higher effective coverage over exons, but limited noncoding insight and susceptibility to capture biases.

Design Principles

  • Standards-aligned: adheres to GATK4 Best Practices; no separate indel realignment step (dropped in GATK4, where HaplotypeCaller and Mutect2 perform local reassembly); BQSR performed using known sites.

  • Reproducible: containerized with Apptainer/Singularity or Docker; deterministic configuration captured in environment files.

  • Modular: pluggable steps for alignment, calling, filtering, and annotation; optional mitochondrial and CNV modules.

  • Scalable: HPC/Cloud-friendly; parallelization by sample/chromosome/scatter-gather.

  • Auditable: rich QC (FastQC/MultiQC, alignment metrics, coverage profiling) and provenance tracking.

High-Level Workflow

Common Preprocessing
  • Input FASTQ (gz)

  • Adapter/quality assessment (FastQC, MultiQC)

  • Alignment (BWA-MEM2 or BWA-MEM) to GRCh38/hg38 (ALT-aware, decoys if applicable)

  • Coordinate sorting and duplicate marking (Picard/GATK MarkDuplicates or MarkDuplicatesSpark; UMI-aware deduplication is possible)

  • Base Quality Score Recalibration (BQSR) using known variant sites (dbSNP, Mills indels, 1000G indels)
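
The preprocessing steps above map onto a short chain of commands. The sketch below only assembles and prints command strings; it does not run anything, and it is not the pipeline's actual driver. The function name, the read-group fields, and every file name are hypothetical placeholders. The tool options used (`bwa mem -R`, `samtools sort -o`, `MarkDuplicates -M`, `BaseRecalibrator --known-sites`, `ApplyBQSR --bqsr-recal-file`) are standard ones, but thread counts and resource settings would need tuning per site.

```python
import shlex
from typing import List


def preprocessing_commands(
    sample: str,
    fastq_r1: str,
    fastq_r2: str,
    reference: str,
    known_sites: List[str],
    threads: int = 8,
) -> List[str]:
    """Return shell command strings for align -> sort -> dedup -> BQSR."""
    # bwa expects literal "\t" separators in the read-group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA\\tLB:{sample}"
    align = shlex.join(
        ["bwa", "mem", "-t", str(threads), "-R", read_group,
         reference, fastq_r1, fastq_r2]
    )
    sort = shlex.join(
        ["samtools", "sort", "-@", str(threads), "-o", f"{sample}.sorted.bam", "-"]
    )
    markdup = shlex.join(
        ["gatk", "MarkDuplicates", "-I", f"{sample}.sorted.bam",
         "-O", f"{sample}.dedup.bam", "-M", f"{sample}.dup_metrics.txt"]
    )
    recal = ["gatk", "BaseRecalibrator", "-R", reference, "-I", f"{sample}.dedup.bam"]
    for vcf in known_sites:
        recal += ["--known-sites", vcf]
    recal += ["-O", f"{sample}.recal.table"]
    apply_bqsr = shlex.join(
        ["gatk", "ApplyBQSR", "-R", reference, "-I", f"{sample}.dedup.bam",
         "--bqsr-recal-file", f"{sample}.recal.table", "-O", f"{sample}.recal.bam"]
    )
    return [f"{align} | {sort}", markdup, shlex.join(recal), apply_bqsr]


# Placeholder file names, for illustration only.
commands = preprocessing_commands(
    "NA12878", "NA12878_R1.fastq.gz", "NA12878_R2.fastq.gz", "GRCh38.fa",
    ["dbsnp.vcf.gz", "mills_and_1000G_indels.vcf.gz"],
)
for command in commands:
    print(command)
```

Building each command as an argument list and quoting it with `shlex.join` keeps paths with spaces or special characters safe if the strings are later handed to a scheduler script.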

Germline Path
  • Per-sample calling: GATK HaplotypeCaller in GVCF mode

  • Cohort integration: GenomicsDBImport (or CombineGVCFs) → GenotypeGVCFs

  • Variant filtering: VQSR (recommended for sufficiently large cohorts) or hard filters (small cohorts)

  • Annotation: VEP/Funcotator/ANNOVAR with gnomAD, dbSNP, dbNSFP, and ClinVar (interpretation only), with optional HGVS notation
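
As a rough sketch of the germline path above, the per-sample and cohort steps can be laid out as command strings. Nothing is executed here; the function name, workspace path, interval, and file names are assumptions for the example, and filtering and annotation are left out. The options shown (`-ERC GVCF`, `--genomicsdb-workspace-path`, and the `gendb://` input prefix) are the standard GATK4 ones for this workflow.

```python
import shlex
from typing import Dict, List


def germline_commands(
    sample_bams: Dict[str, str],
    reference: str,
    interval: str,
    workspace: str = "cohort_db",  # hypothetical GenomicsDB workspace path
) -> List[str]:
    """Return command strings: per-sample GVCFs, import, then joint genotyping."""
    commands = []
    gvcfs = []
    for sample, bam in sample_bams.items():
        gvcf = f"{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append(shlex.join(
            ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam,
             "-O", gvcf, "-ERC", "GVCF"]
        ))

    # GenomicsDBImport needs at least one interval (-L).
    import_cmd = ["gatk", "GenomicsDBImport"]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]
    import_cmd += ["--genomicsdb-workspace-path", workspace, "-L", interval]
    commands.append(shlex.join(import_cmd))

    commands.append(shlex.join(
        ["gatk", "GenotypeGVCFs", "-R", reference,
         "-V", f"gendb://{workspace}", "-O", "cohort.vcf.gz"]
    ))
    return commands


# Placeholder sample names and paths, for illustration only.
germline = germline_commands(
    {"NA12878": "NA12878.recal.bam", "NA12891": "NA12891.recal.bam"},
    reference="GRCh38.fa",
    interval="chr20",
)
for command in germline:
    print(command)
```

In practice the import and genotyping steps would be repeated per interval shard and the resulting VCFs gathered afterwards.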

Somatic (Tumor/Normal) Path
  • Calling: GATK Mutect2 with matched normal, germline resource (e.g., gnomAD), and optional PoN

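  • Post-processing: LearnReadOrientationModel (from the F1R2 counts emitted by Mutect2), GetPileupSummaries followed by CalculateContamination, then FilterMutectCalls; the legacy FilterByOrientationBias (FFPE) is optional and largely superseded by the read-orientation model in current GATK releases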

  • Optional modules: mitochondrial calling (Mutect2 mitochondria mode), CNV (GATK ModelSegments/CallCopyRatioSegments)

Somatic (Tumor-only) Path
  • Calling: Mutect2 without matched normal

  • Artifact suppression: PoN (ideally ≥30 normals, technology-matched), gnomAD-based germline filtering, orientation bias model, contamination estimation

  • Caveat: higher false-positive risk and limited certainty of somatic status; careful post-filters and orthogonal validation recommended.
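
Both somatic paths above share the same command skeleton; the matched normal simply adds inputs. The sketch below assembles the command strings for either mode without executing them. It is an illustration under stated assumptions, not the pipeline's own code: the function name and all file names (including the gnomAD and common-sites resources) are hypothetical placeholders.

```python
import shlex
from typing import List, Optional


def somatic_commands(
    tumor_bam: str,
    tumor_name: str,
    reference: str,
    germline_resource: str,
    common_sites: str,
    pon: Optional[str] = None,
    normal_bam: Optional[str] = None,
    normal_name: Optional[str] = None,
) -> List[str]:
    """Return Mutect2 calling and filtering commands; the normal is optional."""
    paired = bool(normal_bam and normal_name)

    mutect2 = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam]
    if paired:
        mutect2 += ["-I", normal_bam, "-normal", normal_name]
    mutect2 += ["--germline-resource", germline_resource]
    if pon:
        mutect2 += ["--panel-of-normals", pon]
    mutect2 += ["--f1r2-tar-gz", f"{tumor_name}.f1r2.tar.gz",
                "-O", f"{tumor_name}.unfiltered.vcf.gz"]

    def pileups(bam: str, name: str) -> List[str]:
        # Pileups are collected at common population sites for contamination.
        return ["gatk", "GetPileupSummaries", "-I", bam, "-V", common_sites,
                "-L", common_sites, "-O", f"{name}.pileups.table"]

    steps = [
        mutect2,
        ["gatk", "LearnReadOrientationModel", "-I", f"{tumor_name}.f1r2.tar.gz",
         "-O", f"{tumor_name}.orientation-model.tar.gz"],
        pileups(tumor_bam, tumor_name),
    ]
    contamination = ["gatk", "CalculateContamination",
                     "-I", f"{tumor_name}.pileups.table"]
    if paired:
        steps.append(pileups(normal_bam, normal_name))
        contamination += ["-matched", f"{normal_name}.pileups.table"]
    contamination += ["-O", f"{tumor_name}.contamination.table"]
    steps.append(contamination)
    steps.append(
        ["gatk", "FilterMutectCalls", "-R", reference,
         "-V", f"{tumor_name}.unfiltered.vcf.gz",
         "--contamination-table", f"{tumor_name}.contamination.table",
         "--ob-priors", f"{tumor_name}.orientation-model.tar.gz",
         "-O", f"{tumor_name}.filtered.vcf.gz"]
    )
    return [shlex.join(step) for step in steps]


# Placeholder names and paths, for illustration only.
resources = dict(reference="GRCh38.fa", germline_resource="af-only-gnomad.vcf.gz",
                 common_sites="common_biallelic.vcf.gz", pon="pon.vcf.gz")
paired_cmds = somatic_commands("tumor.bam", "TUMOR", normal_bam="normal.bam",
                               normal_name="NORMAL", **resources)
tumor_only_cmds = somatic_commands("tumor.bam", "TUMOR", **resources)
for command in tumor_only_cmds:
    print(command)
```

In tumor-only mode the Panel of Normals and germline resource carry more of the filtering burden, which is why the caveat above recommends additional post-filters and orthogonal validation.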

Inputs and Outputs

Inputs
  • Paired-end FASTQ files (R1/R2), consistent sample naming

  • Reference genome (GRCh38/hg38 recommended), indices and dictionary

  • Known sites for BQSR (dbSNP; Mills/1000G indels)

  • WES: target BED (+ optional padded BED)

  • Somatic: matched normal BAM/FASTQ when available; PoN VCF (or cohort of normals to build one); gnomAD resource
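
Consistent sample naming is what allows R1/R2 files to be paired automatically. The sketch below assumes a naming convention of `<sample>_R1[_NNN].fastq.gz` / `<sample>_R2[_NNN].fastq.gz`; that pattern, the function name, and the example file names are assumptions for illustration, and a real sample sheet may encode pairing differently (for example, per lane).

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed convention: <sample>_R1.fastq.gz or <sample>_R1_001.fq.gz, etc.
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)_(?P<read>R[12])(?:_\d+)?\.f(?:ast)?q\.gz$"
)


def pair_fastqs(filenames: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Group FASTQ names by sample and require exactly one R1 and one R2."""
    by_sample: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            raise ValueError(f"Unrecognized FASTQ name: {name}")
        sample, read = match["sample"], match["read"]
        if read in by_sample[sample]:
            raise ValueError(f"Duplicate {read} for sample {sample}")
        by_sample[sample][read] = name

    pairs = {}
    for sample, reads in by_sample.items():
        missing = {"R1", "R2"} - set(reads)
        if missing:
            raise ValueError(f"Sample {sample} is missing {sorted(missing)}")
        pairs[sample] = (reads["R1"], reads["R2"])
    return pairs


# Made-up file names, for illustration only.
fastq_pairs = pair_fastqs([
    "NA12878_R1.fastq.gz", "NA12878_R2.fastq.gz",
    "tumorA_R2_001.fastq.gz", "tumorA_R1_001.fastq.gz",
])
print(fastq_pairs)
```

Failing loudly on an unmatched or duplicated read file is deliberate: a silently mispaired sample is much harder to detect after alignment.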

Outputs
  • Recalibrated BAM/CRAM + index and metrics

  • VCF/BCF (germline joint-called, or somatic filtered calls), with indices

  • QC reports (MultiQC), coverage summaries (e.g., mosdepth, Qualimap)

  • Annotated variant files (e.g., VEP/Funcotator outputs) for interpretation

Quality, Coverage, and QC

  • WGS: ≥30× mean depth for germline; somatic studies often require higher tumor depth (e.g., 60–100×) and ≥30× normal.

  • WES: mean on-target depth of 80–120× or higher; report on-target %, uniformity, and % of target bases at ≥20× and ≥30×.

  • Contamination: estimate via VerifyBamID2 (germline) or CalculateContamination (somatic); flag swaps with Somalier.

  • Sex chromosomes: handle PAR/non-PAR; report sex inference for consistency checks.

  • UMIs: if present, use UMI-aware collapsing/dedup (e.g., fgbio) to reduce PCR artifacts.
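
The depth metrics listed above (mean depth and the percentage of bases at ≥20× and ≥30×) reduce to simple arithmetic over per-base depths. The sketch below is illustrative only: the function name and the toy depth values are made up, and in practice these numbers would come from a tool such as mosdepth rather than an in-memory list.

```python
from typing import Dict, Sequence


def coverage_summary(depths: Sequence[int], min_mean: float) -> Dict[str, object]:
    """Summarize per-base depths against a minimum mean-depth requirement."""
    n = len(depths)
    if n == 0:
        raise ValueError("No positions supplied")
    mean_depth = sum(depths) / n
    return {
        "mean_depth": mean_depth,
        "pct_ge_20x": 100.0 * sum(d >= 20 for d in depths) / n,
        "pct_ge_30x": 100.0 * sum(d >= 30 for d in depths) / n,
        "passes_mean": mean_depth >= min_mean,
    }


# Toy per-base depths; min_mean=30 mirrors the WGS germline guideline above.
toy_depths = [35, 28, 42, 19, 31, 30, 25, 50]
summary = coverage_summary(toy_depths, min_mean=30)
print(summary)
# → {'mean_depth': 32.5, 'pct_ge_20x': 87.5, 'pct_ge_30x': 62.5, 'passes_mean': True}
```

For WES the same function would be applied to on-target positions only, with `min_mean` raised to the chosen exome threshold.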

Filtering Strategy Notes

  • VQSR is preferred for large cohorts (stable tranche modeling); for small studies, use hard filters with empirically chosen cutoffs.

  • Somatic filtering combines: artifact modeling (orientation/FFPE), contamination estimates, PoN, and population AF thresholds.

  • ClinVar is for interpretation, not for hard filtering criteria.
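
For small germline cohorts, hard filtering amounts to comparing a few site-level annotations against cutoffs. The sketch below shows the mechanics for SNPs. The thresholds are the starting values commonly cited in GATK's hard-filtering guidance, used here as an assumption rather than as this pipeline's settings; as noted above, cutoffs should be chosen empirically for each dataset. The function and variable names are hypothetical.

```python
import operator
from typing import Dict, List, Tuple

# Assumed starting cutoffs for SNPs: a site FAILS if the comparison is true.
SNP_HARD_FILTERS: Dict[str, Tuple[str, float]] = {
    "QD": ("<", 2.0),
    "FS": (">", 60.0),
    "MQ": ("<", 40.0),
    "MQRankSum": ("<", -12.5),
    "ReadPosRankSum": ("<", -8.0),
    "SOR": (">", 3.0),
}

_COMPARATORS = {"<": operator.lt, ">": operator.gt}


def hard_filter_failures(
    annotations: Dict[str, float],
    filters: Dict[str, Tuple[str, float]] = SNP_HARD_FILTERS,
) -> List[str]:
    """Return the names of the filters a variant fails (empty list means PASS)."""
    failed = []
    for name, (op, cutoff) in filters.items():
        value = annotations.get(name)
        if value is None:
            # Missing annotations are skipped rather than treated as failures.
            continue
        if _COMPARATORS[op](value, cutoff):
            failed.append(name)
    return failed


# Made-up annotation values, for illustration only.
print(hard_filter_failures({"QD": 1.5, "FS": 10.0, "MQ": 60.0, "SOR": 4.2}))
# → ['QD', 'SOR']
print(hard_filter_failures({"QD": 18.0, "FS": 1.2, "MQ": 60.0, "SOR": 0.8}))
# → []
```

Indels would use a separate threshold table, since strand-bias and position annotations behave differently for them.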

Reference and Resource Recommendations

  • Genome: GRCh38/hg38 with ALT contigs and decoys where applicable.

  • Known sites: dbSNP (latest), Mills and 1000G indels.

  • Population AF: gnomAD (exomes + genomes) as a germline resource in somatic workflows.

  • Cancer knowledge bases (annotation only): COSMIC, CIViC, OncoKB (license terms may apply).

Reproducibility & Execution

  • Apptainer/Singularity and Docker images are provided to pin tool versions.

  • YAML/JSON configuration captures references, intervals, parameters, and resource sizing.

  • Scales on Slurm/SGE/PBS or cloud batch services; supports scatter/gather by interval list.
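
Scatter/gather by interval list comes down to splitting intervals into shards of roughly equal size so that parallel jobs finish at similar times. The sketch below uses a greedy largest-first assignment; this is one simple balancing strategy chosen for illustration, not necessarily what the pipeline does, and the function name and interval values are assumptions.

```python
import heapq
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def scatter_intervals(intervals: List[Interval], n_shards: int) -> List[List[Interval]]:
    """Assign intervals to shards, always filling the currently smallest shard."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    shards: List[List[Interval]] = [[] for _ in range(n_shards)]
    # Heap of (total bases assigned so far, shard index).
    heap = [(0, index) for index in range(n_shards)]
    heapq.heapify(heap)
    by_size = sorted(intervals, key=lambda iv: iv[2] - iv[1], reverse=True)
    for interval in by_size:
        total, index = heapq.heappop(heap)
        shards[index].append(interval)
        heapq.heappush(heap, (total + interval[2] - interval[1], index))
    return shards


# Made-up interval sizes, for illustration only.
regions = [("chr1", 0, 1000), ("chr2", 0, 800), ("chr3", 0, 600),
           ("chr4", 0, 400), ("chr5", 0, 200)]
shards = scatter_intervals(regions, n_shards=2)
for shard in shards:
    print(sum(end - start for _, start, end in shard), shard)
```

Because shards are built by size rather than genomic position, the gather step needs to sort the merged output back into reference order.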

Data Stewardship & Compliance

  • Maintain sample sheets with immutable IDs, checksums for all inputs/outputs, and complete metadata.

  • Remove or avoid embedding PHI in filenames/headers; adhere to HIPAA/GDPR and institutional IRB policies.

  • Store large artifacts in object storage with lifecycle policies; keep a minimal, query-friendly variant store.
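
Checksumming inputs and outputs, as recommended above, is straightforward with the standard library. The sketch below streams each file through SHA-256 and writes a manifest in the two-space `<hash>  <name>` layout that `sha256sum -c` accepts. The function names and the manifest file name are assumptions for the example; the demo writes only to a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large BAM/VCF files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths: Iterable[Path], manifest_path: Path) -> str:
    """Write one '<hash>  <filename>' line per file and return the text."""
    lines = [f"{sha256sum(Path(p))}  {Path(p).name}" for p in paths]
    text = "\n".join(lines) + "\n"
    Path(manifest_path).write_text(text)
    return text


# Demo on a throwaway file inside a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    example = Path(workdir) / "example.txt"
    example.write_bytes(b"abc")
    manifest_text = write_manifest([example], Path(workdir) / "manifest.sha256")
    print(manifest_text, end="")
```

Storing the manifest alongside the sample sheet gives a cheap integrity check whenever data are moved between storage tiers.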

Limitations & Non-Goals

  • Tumor-only analyses cannot definitively distinguish all germline from somatic variation; interpret with caution.

  • Structural variant and repeat expansion calling are optional modules and may require specialized callers and validation.

  • Clinical reporting requires orthogonal confirmation and domain-specific review beyond this pipeline.

Note

This pipeline assumes high-quality libraries and appropriate experimental design. For FFPE, ultra-low input, or single-cell protocols, additional artifact controls and validation are recommended.