Pipeline Structure and Containerisation

This WGS/WES pipeline is structured into modular stages supporting germline and somatic workflows, designed for flexibility, reproducibility, and scalability. It operates seamlessly on local servers, HPC clusters, and cloud platforms, and accommodates tumor-only, tumor–normal paired, trio-based, and large cohort analyses.

Modular Workflow Overview

Module 1: Data Input

  • Accepts:

    • Paired-end FASTQ files (R1, R2)

    • Sample metadata (TSV/CSV) with columns such as:

      • SampleID

      • Type

      • ReadGroup

      • BAM/FASTQ path

      • Target BED (for WES)

    • Optional Panel of Normals (PoN) VCF for somatic workflows

    • Optional germline resource VCF (e.g., gnomAD) for tumor-only somatic calling

  • File formats supported:

    • Compressed (.fastq.gz)

    • Uncompressed FASTQ

Module 2: Quality Control and Preprocessing

  • Tools:

    • FastQC

    • MultiQC

    • fastp

    • Cutadapt (optional)

    • Trimmomatic (optional)

  • Functions:

    • Adapter trimming

    • Low-quality base clipping

    • Poly-G trimming for NovaSeq data

    • Optional UMI extraction for UMI-aware workflows

  • QC metrics:

    • Per-base quality scores

    • GC content

    • Adapter contamination

Module 3: Alignment

  • Tool: BWA-MEM2 (or BWA-MEM)

  • Reference: GRCh38 or GRCh37 (ALT-aware and decoy-inclusive builds recommended)

  • Features:

    • Read group (@RG) tags automatically assigned from metadata

    • Optional alignment benchmarking with samtools stats

    • Contamination check with VerifyBamID2 or Picard CollectWgsMetrics

  • Output: Sorted BAM with alignment metrics

Module 4: Post-Alignment Processing

  • Tools:

    • GATK MarkDuplicatesSpark

    • Picard MarkDuplicates

  • Functions:

    • Sorting

    • Duplicate marking

    • Insert size calculation

    • Optical duplicate detection (Picard)

  • Outputs: Duplicate-marked BAM + metrics

Module 5: Base Quality Score Recalibration (BQSR)

  • Tool: GATK BaseRecalibrator

  • Uses known sites:

    • dbSNP

    • Mills and 1000G gold standard indels

    • (Optional) 1000G Phase1 indels

  • Somatic workflows: BQSR is optional but recommended for quality consistency

  • Output: Recalibrated BAM/CRAM

Module 6: Variant Calling

  • Germline:

    • GATK HaplotypeCaller in GVCF mode

    • GenotypeGVCFs for joint calling

  • Somatic:

    • GATK Mutect2 with matched normal (tumor–normal) or without normal (tumor-only)

    • Tumor-only mode uses PoN + gnomAD filtering

    • (Optional) mitochondrial mode for mtDNA variant calling

Module 7: Variant Filtering and Annotation

  • Somatic:

    • FilterMutectCalls

    • Orientation bias filtering (FilterByOrientationBias)

    • Contamination adjustment (CalculateContamination)

  • Germline:

    • VQSR or hard filters

  • Annotation:

    • Tools: VEP, ANNOVAR, snpEff

    • Databases: ClinVar, gnomAD, COSMIC, dbNSFP

    • (Optional) clinical overlays: OncoKB, CIViC

Module 8: Optional Analysis

  • CNV calling: CNVkit, GATK gCNV, ModelSegments

  • MSI status: MSIsensor, MANTIS

  • Mutational signatures: SigProfilerExtractor, deconstructSigs

  • Tumor mutational burden (TMB) calculation

  • Loss of heterozygosity (LOH) detection

Support for Germline and Somatic Calling

Germline Workflows

  • Targeted for: - Rare disease studies - Family-based inheritance analysis (trios/quads) - Population-scale cohorts

  • Supports: - Cohort joint calling - VQSR for large datasets - Mendelian filtering and de novo detection in trios

Somatic Workflows

  • Designed for cancer genomics

  • Supports: - Tumor–normal paired analysis - Tumor-only calling with artifact suppression - Cohort-level filtering for recurrent artifact removal

  • PoN integration to filter technical noise and recurrent sequencing artifacts

  • Germline resource VCF to flag common germline variants

Containerisation

Why containerisation?

  • Eliminates environment drift

  • Ensures tool version consistency

  • Makes workflows portable across platforms

Docker

  • Includes: - GATK - BWA-MEM2 - samtools/bcftools - FastQC, MultiQC - VEP, ANNOVAR (licensed copy required)

  • Version pinning in Dockerfile

  • Suitable for: - Local execution - Cloud-native workflows (AWS Batch, Terra, DNAnexus, Seven Bridges)

Apptainer (Singularity)

  • Designed for HPC without root privileges

  • Compatible with: - SLURM - PBS - LSF

  • Auto-mounts: - Input data directories - Output result directories

  • Container versions tracked in a manifest file

  • SHA256 checksum verification before execution

Reproducibility measures

  • Container image IDs stored in run logs

  • All tool versions reported in pipeline_versions.log

  • Environment variables and configuration saved per run

Scalability and Workflow Automation

Orchestration Support

  • Compatible with:

    • Nextflow

    • Snakemake

    • Cromwell/WDL

  • Features:

    • Checkpointing between modules

    • Resume capability after failure

    • Automatic logging of:

      • Tool commands

      • Version info

      • Resource usage

Parallelisation

  • Scatter–gather per sample or per chromosome

  • Interval-based parallelism for GATK

Cloud Support

  • Ready for: - Terra - DNAnexus - Seven Bridges

  • GA4GH-compliant WDL/CWL wrappers

Customisation

  • Configurable: - CPU/memory per step - Retry logic for failed jobs - Job monitoring and reporting

Future extensions

  • ML-based variant scoring

  • Panel-specific annotation overlays

  • WGS/WES harmonisation into joint multi-sample VCFs

Example Directory Layout

pipeline/
├── bin/           # Scripts
├── config/        # YAML/JSON configs
├── containers/    # Docker/Singularity manifests
├── docs/         # Documentation
├── modules/      # Pipeline stages
├── refs/         # Reference data
├── results/      # Final outputs
└── work/         # Intermediate files

Best Practices for Execution

  • Pin container versions and reference datasets per project

  • Store all pipeline logs and QC outputs in a dedicated logs/ directory

  • Verify checksums of references before starting

  • Maintain a central manifest of all PoN and germline resource files

  • Archive final VCFs with associated annotation databases for reproducibility