WGS/WES Variant Calling Pipeline
================================

:Author: Akhilesh Kaushal
:Version: 1.0.0
:Keywords: WGS, WES, Germline, Somatic, GATK, Apptainer, Docker, Singularity

Overview
--------

This documentation introduces a **comprehensive, reproducible** pipeline for variant discovery from **Whole Genome Sequencing (WGS)** and **Whole Exome Sequencing (WES)** data. It supports both **germline** and **somatic** analyses and can operate **with or without matched normal** samples. The workflow follows **GATK Best Practices**: read preprocessing, variant discovery, calibrated filtering, and quality control (QC), followed by standards-compliant annotation for downstream interpretation.

Use Cases
---------

- **Germline**: population-scale studies, rare disease genetics, trio/quad analyses, and cohort joint genotyping.
- **Somatic (Tumor/Normal)**: cancer genomics using matched tumor–normal pairs; optimized for sensitivity with artifact suppression.
- **Somatic (Tumor-only)**: analyses without normal tissue using **Panel of Normals (PoN)** and population allele frequency filtering.

Supported Data Types
--------------------

- **WGS**: 30× mean coverage is typical (higher for somatic); genome-wide and comparatively uniform because no capture step is involved.
- **WES**: capture-based enrichment (e.g., Agilent SureSelect, IDT xGen). Requires target BED (and optional “padding” for near-target indels).
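
The optional padding mentioned above can be illustrated with a minimal sketch that extends each target interval and merges any overlaps. The function name ``pad_and_merge`` and the 50 bp value in the usage note are illustrative assumptions, not pipeline settings; GATK tools also accept ``--interval-padding`` for a similar purpose.

.. code-block:: python

   from typing import Iterable, List, Tuple

   # (contig, 0-based start, end-exclusive), as in BED
   Interval = Tuple[str, int, int]


   def pad_and_merge(intervals: Iterable[Interval], padding: int) -> List[Interval]:
       """Extend each interval by `padding` bp on both sides and merge overlaps.

       Starts are clipped at 0; ends are NOT clipped to contig length here,
       so a real implementation would also need the reference dictionary.
       """
       padded = sorted(
           (chrom, max(0, start - padding), end + padding)
           for chrom, start, end in intervals
       )
       merged: List[Interval] = []
       for chrom, start, end in padded:
           if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
               # Overlapping or touching the previous interval: extend it.
               prev = merged[-1]
               merged[-1] = (chrom, prev[1], max(prev[2], end))
           else:
               merged.append((chrom, start, end))
       return merged

For example, ``pad_and_merge([("chr1", 100, 200), ("chr1", 250, 300)], 50)`` yields a single interval ``("chr1", 50, 350)`` because the padded targets overlap.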

When to Choose WGS vs WES
-------------------------

- **WGS**: comprehensive coverage (coding + noncoding), superior indel/SV detection, better uniformity; higher cost and storage.
- **WES**: cost-effective for coding regions and higher effective coverage over exons, but limited noncoding insight and subject to capture biases.

Design Principles
-----------------

- **Standards-aligned**: adheres to GATK4 Best Practices; no indel realignment (deprecated); **BQSR** performed using known sites.
- **Reproducible**: containerized with **Apptainer/Singularity** or **Docker**; deterministic configuration captured in environment files.
- **Modular**: pluggable steps for alignment, calling, filtering, and annotation; optional mitochondrial and CNV modules.
- **Scalable**: HPC/Cloud-friendly; parallelization by sample/chromosome/scatter-gather.
- **Auditable**: rich QC (FastQC/MultiQC, alignment metrics, coverage profiling) and provenance tracking.

High-Level Workflow
-------------------

**Common Preprocessing**
 - Input FASTQ (gz)
 - Adapter/quality assessment (**FastQC**, **MultiQC**)
 - Alignment (**BWA-MEM2** or **BWA-MEM**) to **GRCh38/hg38** (ALT-aware, decoys if applicable)
 - Sorting, duplicate marking (**Picard/GATK MarkDuplicates(Spark)**; UMI-aware dedup possible)
 - Base Quality Score Recalibration (**BQSR**) using known variant sites (dbSNP, Mills indels, 1000G indels)
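
The preprocessing steps above can be sketched as a small command builder. This is an illustration under assumptions, not the pipeline's actual wrapper: the function name, output file names, read-group fields, and the use of ``samtools sort`` for sorting are hypothetical choices. The commands are only constructed, not executed, and the alignment output is assumed to be piped into the sort step.

.. code-block:: python

   from typing import Dict, List


   def preprocessing_commands(sample: str, r1: str, r2: str, ref: str,
                              known_sites: List[str],
                              threads: int = 8) -> Dict[str, List[str]]:
       """Build argument lists for alignment, sorting, duplicate marking, and BQSR."""
       # bwa expects a literal backslash-t between read-group fields.
       read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
       sorted_bam = f"{sample}.sorted.bam"
       markdup_bam = f"{sample}.markdup.bam"
       recal_table = f"{sample}.recal.table"

       base_recal = ["gatk", "BaseRecalibrator", "-R", ref, "-I", markdup_bam,
                     "-O", recal_table]
       for vcf in known_sites:  # e.g. dbSNP, Mills indels, 1000G indels
           base_recal += ["--known-sites", vcf]

       return {
           # The stdout of "align" is meant to be piped into "sort".
           "align": ["bwa-mem2", "mem", "-t", str(threads), "-R", read_group,
                     ref, r1, r2],
           "sort": ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, "-"],
           "mark_duplicates": ["gatk", "MarkDuplicates", "-I", sorted_bam,
                               "-O", markdup_bam,
                               "-M", f"{sample}.markdup.metrics.txt"],
           "base_recalibrator": base_recal,
           "apply_bqsr": ["gatk", "ApplyBQSR", "-R", ref, "-I", markdup_bam,
                          "--bqsr-recal-file", recal_table,
                          "-O", f"{sample}.recal.bam"],
       }

Returning argument lists rather than shell strings keeps quoting explicit and makes each step easy to log for provenance.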

**Germline Path**
 - Per-sample calling: **GATK HaplotypeCaller** in **GVCF** mode
 - Cohort integration: **GenomicsDBImport** (or CombineGVCFs) → **GenotypeGVCFs**
 - Variant filtering: **VQSR** (recommended for sufficiently large cohorts) or **hard filters** (small cohorts)
 - Annotation: **VEP/Funcotator/ANNOVAR** with gnomAD, dbSNP, dbNSFP, and ClinVar (interpretation only), plus optional HGVS notation
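
The germline calling steps above can be sketched as follows, assuming one recalibrated BAM per sample and a single interval. The function name, the workspace name ``cohort_gdb``, and the output file names are hypothetical; filtering and annotation are omitted.

.. code-block:: python

   from typing import Dict, List


   def germline_commands(ref: str, bams: Dict[str, str], interval: str,
                         workspace: str = "cohort_gdb") -> List[List[str]]:
       """Build per-sample GVCF calling, GenomicsDB import, and joint genotyping."""
       commands: List[List[str]] = []
       gvcfs: List[str] = []
       for sample, bam in sorted(bams.items()):
           gvcf = f"{sample}.g.vcf.gz"
           gvcfs.append(gvcf)
           # -ERC GVCF switches HaplotypeCaller to reference-confidence mode.
           commands.append(["gatk", "HaplotypeCaller", "-R", ref, "-I", bam,
                            "-L", interval, "-ERC", "GVCF", "-O", gvcf])

       import_cmd = ["gatk", "GenomicsDBImport",
                     "--genomicsdb-workspace-path", workspace, "-L", interval]
       for gvcf in gvcfs:
           import_cmd += ["-V", gvcf]
       commands.append(import_cmd)

       # Joint genotyping reads the workspace through the gendb:// scheme.
       commands.append(["gatk", "GenotypeGVCFs", "-R", ref,
                        "-V", f"gendb://{workspace}", "-L", interval,
                        "-O", "cohort.vcf.gz"])
       return commands

In a scattered run, one such command set would be produced per interval and the resulting VCFs gathered afterwards.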

**Somatic (Tumor/Normal) Path**
 - Calling: **GATK Mutect2** with matched normal, **germline resource** (e.g., gnomAD), and optional **PoN**
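 - Post-processing: **LearnReadOrientationModel**, **GetPileupSummaries** → **CalculateContamination**, then **FilterMutectCalls**; the legacy **FilterByOrientationBias** tool (FFPE) is optional and is superseded by the read-orientation model in recent GATK4 releases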
 - Optional modules: mitochondrial calling (Mutect2 mitochondria mode), CNV (GATK ModelSegments/CallCopyRatioSegments)

**Somatic (Tumor-only) Path**
 - Calling: **Mutect2** without matched normal
 - Artifact suppression: **PoN** (ideally ≥30 normals, technology-matched), gnomAD-based germline filtering, orientation bias model, contamination estimation
 - **Caveat**: higher false-positive risk and limited certainty of somatic status; careful post-filters and orthogonal validation recommended.
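
The two somatic paths above differ mainly in whether a matched normal is supplied. The sketch below builds the Mutect2 command chain for either case; the function name, the intermediate file names, and the ``common_sites`` argument (a VCF of common biallelic sites used for pileup summaries) are assumptions made for illustration.

.. code-block:: python

   from typing import List, Optional


   def mutect2_commands(ref: str, tumor_bam: str, germline_resource: str,
                        common_sites: str, pon: Optional[str] = None,
                        normal_bam: Optional[str] = None,
                        normal_sample: Optional[str] = None) -> List[List[str]]:
       """Build Mutect2 calling and filtering commands (tumor-only if no normal)."""
       if (normal_bam is None) != (normal_sample is None):
           raise ValueError("normal_bam and normal_sample must be given together")

       call = ["gatk", "Mutect2", "-R", ref, "-I", tumor_bam,
               "--germline-resource", germline_resource,
               "--f1r2-tar-gz", "f1r2.tar.gz", "-O", "unfiltered.vcf.gz"]
       if normal_bam:
           call += ["-I", normal_bam, "--normal-sample", normal_sample]
       if pon:
           call += ["--panel-of-normals", pon]

       commands = [
           call,
           ["gatk", "LearnReadOrientationModel", "-I", "f1r2.tar.gz",
            "-O", "read-orientation-model.tar.gz"],
           ["gatk", "GetPileupSummaries", "-I", tumor_bam, "-V", common_sites,
            "-L", common_sites, "-O", "tumor.pileups.table"],
       ]
       contamination = ["gatk", "CalculateContamination",
                        "-I", "tumor.pileups.table", "-O", "contamination.table"]
       if normal_bam:
           commands.append(["gatk", "GetPileupSummaries", "-I", normal_bam,
                            "-V", common_sites, "-L", common_sites,
                            "-O", "normal.pileups.table"])
           contamination += ["--matched-normal", "normal.pileups.table"]
       commands.append(contamination)
       commands.append(["gatk", "FilterMutectCalls", "-R", ref,
                        "-V", "unfiltered.vcf.gz",
                        "--contamination-table", "contamination.table",
                        "--orientation-bias-artifact-priors",
                        "read-orientation-model.tar.gz",
                        "-O", "filtered.vcf.gz"])
       return commands

Calling the function without ``normal_bam`` gives the tumor-only chain, where the PoN and germline resource carry most of the artifact and germline suppression.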

Inputs and Outputs
------------------

**Inputs**
 - Paired-end **FASTQ** files (R1/R2), consistent sample naming
 - **Reference** genome (GRCh38/hg38 recommended), indices and dictionary
 - Known sites for BQSR (dbSNP; Mills/1000G indels)
 - **WES**: target BED (+ optional padded BED)
 - **Somatic**: matched normal BAM/FASTQ when available; **PoN** VCF (or cohort of normals to build one); gnomAD resource
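
Consistent sample naming can be checked before any compute is spent. The sketch below pairs R1/R2 files by sample; the ``<sample>_R1.fastq.gz`` / ``<sample>_R2.fastq.gz`` convention and the function name are assumptions for illustration, and a real sample sheet may use a different scheme.

.. code-block:: python

   import re
   from typing import Dict, Iterable, Tuple

   # Hypothetical naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
   FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


   def pair_fastqs(filenames: Iterable[str]) -> Dict[str, Tuple[str, str]]:
       """Group FASTQ file names into (R1, R2) pairs keyed by sample name."""
       found: Dict[str, Dict[str, str]] = {}
       for name in filenames:
           match = FASTQ_PATTERN.match(name)
           if match is None:
               raise ValueError(f"unexpected FASTQ name: {name}")
           found.setdefault(match["sample"], {})[match["read"]] = name

       pairs: Dict[str, Tuple[str, str]] = {}
       for sample, reads in sorted(found.items()):
           if set(reads) != {"1", "2"}:
               raise ValueError(f"sample {sample} is missing R1 or R2")
           pairs[sample] = (reads["1"], reads["2"])
       return pairs

Failing fast on an unmatched or oddly named file avoids silently aligning a sample as single-end.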

**Outputs**
 - Recalibrated **BAM/CRAM** + index and metrics
 - **VCF/BCF** (germline joint-called, or somatic filtered calls), with indices
 - QC reports (MultiQC), coverage summaries (e.g., **mosdepth**, **Qualimap**)
 - Annotated variant files (e.g., VEP/Funcotator outputs) for interpretation

Quality, Coverage, and QC
-------------------------

- **WGS**: ≥30× mean depth for germline; somatic studies often require higher tumor depth (e.g., 60–100×) and ≥30× normal.
- **WES**: 80–120× or higher mean on-target depth; report on-target %, uniformity, and % bases ≥20×/30×.
- **Contamination**: estimate via **VerifyBamID2** (germline) or **CalculateContamination** (somatic); detect sample swaps and relatedness mismatches with **Somalier**.
- **Sex chromosomes**: handle PAR/non-PAR; report sex inference for consistency checks.
- **UMIs**: if present, use UMI-aware collapsing/dedup (e.g., **fgbio**) to reduce PCR artifacts.
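
The depth metrics listed above (mean depth and % bases at or above 20×/30×) can be derived from a per-base depth histogram. The sketch below assumes the histogram has already been parsed into a ``{depth: number_of_bases}`` mapping; the function and key names are illustrative.

.. code-block:: python

   from typing import Dict, Iterable


   def coverage_summary(depth_hist: Dict[int, int],
                        thresholds: Iterable[int] = (20, 30)) -> Dict[str, float]:
       """Summarize a {depth: base_count} histogram over the target region."""
       total_bases = sum(depth_hist.values())
       if total_bases == 0:
           raise ValueError("empty depth histogram")

       summary = {
           "mean_depth": sum(d * n for d, n in depth_hist.items()) / total_bases
       }
       for threshold in thresholds:
           covered = sum(n for d, n in depth_hist.items() if d >= threshold)
           summary[f"pct_bases_ge_{threshold}x"] = 100.0 * covered / total_bases
       return summary

For WES, the histogram should be restricted to the target BED so that off-target bases do not dilute the on-target figures.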

Filtering Strategy Notes
------------------------

- **VQSR** is preferred for large cohorts, where tranche modeling is stable (GATK guidance suggests on the order of 30 or more exomes); for small studies, use **hard filters** with empirically chosen cutoffs.
- **Somatic** filtering combines: artifact modeling (orientation/FFPE), contamination estimates, PoN, and population AF thresholds.
- **ClinVar** is for **interpretation**, not for hard filtering criteria.
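
As an illustration of hard filtering, the sketch below checks INFO annotations against cutoffs. The values are the generic starting points commonly cited in GATK hard-filtering guidance, not thresholds tuned for this pipeline, and the function and table names are hypothetical. Missing annotations are treated as passing.

.. code-block:: python

   import operator
   from typing import Dict, List

   # (annotation, comparison that means "fail", cutoff); generic starting points.
   HARD_FILTERS = {
       "SNP": [
           ("QD", operator.lt, 2.0),
           ("FS", operator.gt, 60.0),
           ("SOR", operator.gt, 3.0),
           ("MQ", operator.lt, 40.0),
           ("MQRankSum", operator.lt, -12.5),
           ("ReadPosRankSum", operator.lt, -8.0),
       ],
       "INDEL": [
           ("QD", operator.lt, 2.0),
           ("FS", operator.gt, 200.0),
           ("ReadPosRankSum", operator.lt, -20.0),
       ],
   }


   def failed_filters(variant_type: str, info: Dict[str, float]) -> List[str]:
       """Return the names of hard filters that a variant fails (empty = PASS)."""
       failed = []
       for key, fails, cutoff in HARD_FILTERS[variant_type]:
           value = info.get(key)
           # An absent annotation is skipped rather than counted as a failure.
           if value is not None and fails(value, cutoff):
               symbol = "<" if fails is operator.lt else ">"
               failed.append(f"{key}{symbol}{cutoff}")
       return failed

Keeping SNP and indel cutoffs in separate tables mirrors the usual practice of filtering the two variant classes independently.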

Reference and Resource Recommendations
--------------------------------------

- **Genome**: GRCh38/hg38 with ALT contigs and decoys where applicable.
- **Known sites**: dbSNP (latest), Mills and 1000G indels.
- **Population AF**: **gnomAD** (exomes + genomes) as a germline resource in somatic workflows.
- **Cancer knowledge bases** (annotation only): **COSMIC**, **CIViC**, **OncoKB** (license terms may apply).

Reproducibility & Execution
---------------------------

- **Apptainer/Singularity** and **Docker** images are provided so that tool versions stay fixed across runs.
- YAML/JSON configuration captures references, intervals, parameters, and resource sizing.
- Scales on Slurm/SGE/PBS or cloud batch services; supports scatter/gather by interval list.
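
Scatter/gather can be sketched as a greedy assignment of contigs to shards of roughly equal total length. This is a simplified illustration with a hypothetical function name; it never splits a contig, whereas interval-level tools such as GATK ``SplitIntervals`` can produce finer-grained shards.

.. code-block:: python

   from typing import Dict, List


   def scatter_contigs(contig_lengths: Dict[str, int],
                       n_shards: int) -> List[List[str]]:
       """Assign contigs to shards, longest first, always to the lightest shard."""
       if n_shards < 1:
           raise ValueError("n_shards must be at least 1")
       shards: List[List[str]] = [[] for _ in range(n_shards)]
       totals = [0] * n_shards
       by_length = sorted(contig_lengths.items(), key=lambda kv: kv[1],
                          reverse=True)
       for contig, length in by_length:
           lightest = totals.index(min(totals))
           shards[lightest].append(contig)
           totals[lightest] += length
       # Drop shards that received nothing (more shards than contigs).
       return [shard for shard in shards if shard]

Each shard then maps to one scheduler job, and the per-shard outputs are concatenated in reference order during the gather step.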

Data Stewardship & Compliance
-----------------------------

- Maintain sample sheets with immutable IDs, checksums for all inputs/outputs, and complete metadata.
- Remove or avoid embedding PHI in filenames/headers; adhere to **HIPAA/GDPR** and institutional IRB policies.
- Store large artifacts in object storage with lifecycle policies; keep a minimal, query-friendly variant store.
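
Checksums for inputs and outputs can be recorded with the standard library alone. The sketch below streams each file through SHA-256 and writes a manifest in the ``<digest>  <filename>`` layout used by ``sha256sum``; the function names and the choice of SHA-256 are assumptions, since the pipeline's own checksum format is not specified here.

.. code-block:: python

   import hashlib
   from pathlib import Path
   from typing import Iterable


   def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
       """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
       digest = hashlib.sha256()
       with open(path, "rb") as handle:
           for chunk in iter(lambda: handle.read(chunk_size), b""):
               digest.update(chunk)
       return digest.hexdigest()


   def write_manifest(paths: Iterable[str], manifest_path: str) -> None:
       """Write one '<digest>  <filename>' line per file, sorted by path."""
       lines = [f"{sha256sum(p)}  {Path(p).name}" for p in sorted(map(str, paths))]
       Path(manifest_path).write_text("\n".join(lines) + "\n")

Reading in chunks keeps memory use flat for multi-gigabyte BAM/CRAM files.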

Limitations & Non-Goals
-----------------------

- Tumor-only analyses cannot definitively distinguish all germline from somatic variation; interpret with caution.
- Structural variant and repeat expansion calling are **optional** modules and may require specialized callers and validation.
- Clinical reporting requires orthogonal confirmation and domain-specific review beyond this pipeline.

What’s Next
-----------

- See :doc:`run_pipeline` for exact commands, parameters, and file layouts.
- Explore :doc:`advanced_analysis` for joint calling, trios, tumor purity handling, CNV/mtDNA options, and best-practice filters.
- Review :doc:`structure_and_containerisation` for environment reproducibility.
- Consult :doc:`references` for standards and key literature.

.. note::
   This pipeline assumes high-quality libraries and appropriate experimental design. For FFPE, ultra-low input, or single-cell protocols, additional artifact controls and validation are recommended.