Pipeline Structure and Containerisation
This WGS/WES pipeline is organised into modular stages that support both germline and somatic workflows, with an emphasis on flexibility, reproducibility, and scalability. It runs on local servers, HPC clusters, and cloud platforms, and supports tumor-only, tumor–normal paired, trio-based, and large-cohort analyses.
Modular Workflow Overview
Module 1: Data Input
Accepts:
Paired-end FASTQ files (R1, R2)
Sample metadata (TSV/CSV) with columns such as: SampleID, Type, ReadGroup, BAM/FASTQ path, Target BED (for WES)
Optional Panel of Normals (PoN) VCF for somatic workflows
Optional germline resource VCF (e.g., gnomAD) for tumor-only somatic calling
File formats supported:
Compressed (.fastq.gz)
Uncompressed FASTQ
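A sample sheet like the one described above could be checked before any compute is spent. The sketch below is only illustrative: the exact column names (`FASTQ_R1`, `FASTQ_R2`) and the allowed `Type` values are assumptions, not the pipeline's documented schema.

```python
import csv
import io

# Assumed column names; the pipeline's real sample sheet may differ.
REQUIRED_COLUMNS = ("SampleID", "Type", "ReadGroup", "FASTQ_R1", "FASTQ_R2")
ALLOWED_TYPES = {"germline", "tumor", "normal"}  # hypothetical vocabulary


def read_sample_sheet(handle, delimiter="\t"):
    """Parse a sample sheet and return its rows as dicts, raising on problems."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["Type"] not in ALLOWED_TYPES:
            raise ValueError(f"line {line_no}: unknown Type {row['Type']!r}")
        for key in ("FASTQ_R1", "FASTQ_R2"):
            if not row[key].endswith((".fastq", ".fastq.gz")):
                raise ValueError(f"line {line_no}: {key} is not a FASTQ path")
        rows.append(row)
    return rows


example = io.StringIO(
    "SampleID\tType\tReadGroup\tFASTQ_R1\tFASTQ_R2\n"
    "S1\ttumor\tRG1\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n"
)
rows = read_sample_sheet(example)
```

Failing early on a malformed sheet is cheaper than discovering a missing read group after alignment.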
Module 2: Quality Control and Preprocessing
Tools:
FastQC
MultiQC
fastp
Cutadapt (optional)
Trimmomatic (optional)
Functions:
Adapter trimming
Low-quality base clipping
Poly-G trimming for NovaSeq data
Optional UMI extraction for UMI-aware workflows
QC metrics:
Per-base quality scores
GC content
Adapter contamination
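As an illustration of the trimming step, a fastp invocation might be assembled as below. The helper name, output naming scheme, and thread count are assumptions; only the fastp options themselves are taken from the tool's command line.

```python
def fastp_command(r1, r2, out_prefix, threads=4, poly_g=True):
    """Build an argument list for paired-end trimming with fastp."""
    cmd = [
        "fastp",
        "--in1", r1,
        "--in2", r2,
        "--out1", f"{out_prefix}_R1.trimmed.fastq.gz",  # hypothetical naming
        "--out2", f"{out_prefix}_R2.trimmed.fastq.gz",
        "--detect_adapter_for_pe",                      # adapter auto-detection
        "--json", f"{out_prefix}.fastp.json",
        "--html", f"{out_prefix}.fastp.html",
        "--thread", str(threads),
    ]
    if poly_g:
        # Poly-G tails are typical of two-colour chemistry (e.g. NovaSeq).
        cmd.append("--trim_poly_g")
    return cmd


cmd = fastp_command("S1_R1.fastq.gz", "S1_R2.fastq.gz", "S1")
```

The list can be passed directly to `subprocess.run(cmd, check=True)`; the JSON report is the file MultiQC would typically pick up.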
Module 3: Alignment
Tool: BWA-MEM2 (or BWA-MEM)
Reference: GRCh38 or GRCh37 (ALT-aware and decoy-inclusive builds recommended)
Features:
Read group (@RG) tags automatically assigned from metadata
Optional alignment benchmarking with samtools stats
Contamination check with VerifyBamID2; coverage metrics with Picard CollectWgsMetrics
Output: Sorted BAM with alignment metrics
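To show how read group tags could be derived from metadata, here is a minimal sketch. The function names and the default `PL`/`LB` values are assumptions; `-R` and `-t` are standard `bwa-mem2 mem` options.

```python
def read_group(sample_id, rg_id, platform="ILLUMINA", library=None):
    """Return an @RG header string with literal '\\t' separators, as -R expects."""
    fields = [
        ("ID", rg_id),
        ("SM", sample_id),
        ("PL", platform),
        ("LB", library or sample_id),  # assumption: fall back to the sample ID
    ]
    return "@RG" + "".join(f"\\t{key}:{value}" for key, value in fields)


def bwa_mem2_command(ref, r1, r2, rg, threads=8):
    """Build the aligner argument list; SAM output goes to stdout."""
    return ["bwa-mem2", "mem", "-t", str(threads), "-R", rg, ref, r1, r2]


rg = read_group("S1", "RG1")
cmd = bwa_mem2_command("GRCh38.fa", "S1_R1.fastq.gz", "S1_R2.fastq.gz", rg)
```

Because the aligner writes SAM to stdout, its output would normally be piped into `samtools sort` to produce the sorted BAM.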
Module 4: Post-Alignment Processing
Tools:
GATK MarkDuplicatesSpark
Picard MarkDuplicates
Functions:
Sorting
Duplicate marking
Insert size calculation
Optical duplicate detection (Picard)
Outputs: Duplicate-marked BAM + metrics
Module 5: Base Quality Score Recalibration (BQSR)
Tool: GATK BaseRecalibrator
Uses known sites:
dbSNP
Mills and 1000G gold standard indels
(Optional) 1000G Phase1 indels
Somatic workflows: BQSR is optional but recommended for quality consistency
Output: Recalibrated BAM/CRAM
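BQSR is a two-step process: build a recalibration table from the known sites, then apply it. A rough sketch of both commands follows; the helper name and output file names are hypothetical.

```python
def bqsr_commands(bam, ref, known_sites, out_prefix):
    """Return (BaseRecalibrator, ApplyBQSR) argument lists for one sample."""
    table = f"{out_prefix}.recal.table"  # hypothetical naming
    recalibrate = ["gatk", "BaseRecalibrator", "-I", bam, "-R", ref, "-O", table]
    for vcf in known_sites:
        # One --known-sites argument per resource (dbSNP, Mills, ...).
        recalibrate += ["--known-sites", vcf]
    apply = [
        "gatk", "ApplyBQSR",
        "-I", bam,
        "-R", ref,
        "--bqsr-recal-file", table,
        "-O", f"{out_prefix}.recal.bam",
    ]
    return recalibrate, apply


recalibrate, apply = bqsr_commands(
    "S1.dedup.bam", "GRCh38.fa", ["dbsnp.vcf.gz", "mills.vcf.gz"], "S1"
)
```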
Module 6: Variant Calling
Germline:
GATK HaplotypeCaller in GVCF mode
GenotypeGVCFs for joint calling
Somatic:
GATK Mutect2 with matched normal (tumor–normal) or without normal (tumor-only)
Tumor-only mode uses PoN + gnomAD filtering
(Optional) mitochondrial mode for mtDNA variant calling
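The difference between tumor–normal and tumor-only calling comes down to a few Mutect2 arguments. The sketch below assumes a hypothetical helper name and file layout; the flags are standard GATK4 Mutect2 options.

```python
def mutect2_command(ref, tumor_bam, out_vcf, normal_bam=None,
                    normal_sample=None, pon=None, germline_resource=None):
    """Build a Mutect2 argument list for paired or tumor-only calling."""
    cmd = ["gatk", "Mutect2", "-R", ref, "-I", tumor_bam]
    if normal_bam:
        if not normal_sample:
            raise ValueError("normal_sample (the SM tag) is required with normal_bam")
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    if pon:
        cmd += ["--panel-of-normals", pon]
    if germline_resource:
        cmd += ["--germline-resource", germline_resource]
    return cmd + ["-O", out_vcf]


paired = mutect2_command("GRCh38.fa", "T.bam", "T.vcf.gz",
                         normal_bam="N.bam", normal_sample="N1")
tumor_only = mutect2_command("GRCh38.fa", "T.bam", "T.vcf.gz",
                             pon="pon.vcf.gz", germline_resource="gnomad.vcf.gz")
```

In tumor-only mode the PoN and germline resource carry more of the filtering burden, since there is no matched normal to subtract.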
Module 7: Variant Filtering and Annotation
Somatic:
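FilterMutectCalls
Orientation bias filtering (FilterByOrientationBias; in current GATK4 releases this is handled by LearnReadOrientationModel, whose output is passed to FilterMutectCalls)
Contamination adjustment (CalculateContamination)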
Germline:
VQSR or hard filters
Annotation:
Tools: VEP, ANNOVAR, snpEff
Databases: ClinVar, gnomAD, COSMIC, dbNSFP
(Optional) clinical overlays: OncoKB, CIViC
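Where VQSR is not practical (small cohorts, exomes), germline hard filters are often applied to INFO annotations. The sketch below uses thresholds commonly cited as GATK starting points for SNPs; treat them as assumptions to tune per dataset, and the rule names as hypothetical.

```python
# (filter name, INFO key, comparison, threshold); starting points only.
SNP_RULES = [
    ("QD2", "QD", "<", 2.0),
    ("FS60", "FS", ">", 60.0),
    ("SOR3", "SOR", ">", 3.0),
    ("MQ40", "MQ", "<", 40.0),
    ("MQRankSum-12.5", "MQRankSum", "<", -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", "<", -8.0),
]


def failed_filters(info):
    """Return the names of the hard-filter rules that a SNP record fails."""
    failed = []
    for name, key, op, threshold in SNP_RULES:
        value = info.get(key)
        if value is None:
            continue  # annotation absent: the rule is not evaluated
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            failed.append(name)
    return failed


flags = failed_filters({"QD": 1.5, "FS": 10.0, "MQ": 60.0})
```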
Module 8: Optional Analysis
CNV calling: CNVkit, GATK gCNV, ModelSegments
MSI status: MSIsensor, MANTIS
Mutational signatures: SigProfilerExtractor, deconstructSigs
Tumor mutational burden (TMB) calculation
Loss of heterozygosity (LOH) detection
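TMB is usually reported as mutations per megabase of callable territory. The following is a minimal sketch that assumes non-overlapping BED intervals and leaves the choice of which mutations to count (for example, nonsynonymous only) to the caller; both function names are hypothetical.

```python
def bed_territory_bp(bed_lines):
    """Sum interval lengths from BED lines (0-based start, exclusive end)."""
    total = 0
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        _chrom, start, end = line.split()[:3]
        total += int(end) - int(start)
    return total


def tmb_per_mb(n_mutations, territory_bp):
    """Mutations per megabase over the given territory."""
    if territory_bp <= 0:
        raise ValueError("territory must be positive")
    return n_mutations / (territory_bp / 1_000_000)


territory = bed_territory_bp(["chr1\t0\t500000", "chr1\t1000000\t1500000"])
tmb = tmb_per_mb(12, territory)
```

For WES, the territory would typically be the target BED intersected with regions of adequate coverage, which this sketch does not compute.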
Support for Germline and Somatic Calling
Germline Workflows
Targeted for:
Rare disease studies
Family-based inheritance analysis (trios/quads)
Population-scale cohorts
Supports:
Cohort joint calling
VQSR for large datasets
Mendelian filtering and de novo detection in trios
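At its simplest, trio filtering compares the child's genotype with both parents. The sketch below covers only autosomal diploid genotypes given as tuples of allele indices; it ignores genotype quality, depth, and sex chromosomes, all of which a real de novo caller would need to consider. The function names are hypothetical.

```python
def is_mendelian_consistent(child_gt, father_gt, mother_gt):
    """True if the child could inherit one allele from each parent."""
    a, b = child_gt
    return (a in father_gt and b in mother_gt) or (a in mother_gt and b in father_gt)


def is_candidate_de_novo(child_gt, father_gt, mother_gt):
    """True if the child carries a non-reference allele seen in neither parent."""
    if None in child_gt or None in father_gt or None in mother_gt:
        return False  # missing calls cannot support a de novo claim
    parental = set(father_gt) | set(mother_gt)
    return any(allele != 0 and allele not in parental for allele in child_gt)


de_novo = is_candidate_de_novo((0, 1), (0, 0), (0, 0))
inherited = is_candidate_de_novo((0, 1), (0, 1), (0, 0))
```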
Somatic Workflows
Designed for cancer genomics
Supports:
Tumor–normal paired analysis
Tumor-only calling with artifact suppression
Cohort-level filtering for recurrent artifact removal
PoN integration to filter technical noise and recurrent sequencing artifacts
Germline resource VCF to flag common germline variants
Containerisation
Why containerisation?
Eliminates environment drift
Ensures tool version consistency
Makes workflows portable across platforms
Docker
Includes:
GATK
BWA-MEM2
samtools/bcftools
FastQC, MultiQC
VEP, ANNOVAR (licensed copy required)
Version pinning in Dockerfile
Suitable for:
Local execution
Cloud-native workflows (AWS Batch, Terra, DNAnexus, Seven Bridges)
Apptainer (Singularity)
Designed for HPC without root privileges
Compatible with:
SLURM
PBS
LSF
Auto-mounts:
Input data directories
Output result directories
Container versions tracked in a manifest file
SHA256 checksum verification before execution
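Checksum verification against a manifest can be sketched with the standard library alone. The manifest shape used here (a mapping from file path to expected hex digest) is an assumption, since the manifest format is not specified.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """Return the paths whose digest differs from the manifest entry."""
    return [
        path for path, expected in manifest.items()
        if sha256_of(path) != expected.lower()
    ]
```

A run would proceed only when `verify_manifest(...)` returns an empty list; the same routine applies equally to reference files.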
Reproducibility measures
Container image IDs stored in run logs
All tool versions reported in pipeline_versions.log
Environment variables and configuration saved per run
Scalability and Workflow Automation
Orchestration Support
Compatible with:
Nextflow
Snakemake
Cromwell/WDL
Features:
Checkpointing between modules
Resume capability after failure
Automatic logging of:
Tool commands
Version info
Resource usage
Parallelisation
Scatter–gather per sample or per chromosome
Interval-based parallelism for GATK
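Interval-based scatter can be as simple as cutting each contig into fixed-size windows and passing one window per job. This sketch emits 1-based inclusive `contig:start-end` strings of the kind GATK accepts via `-L`; the function name and chunk size are illustrative.

```python
def scatter_intervals(contig_lengths, chunk_size):
    """Split contigs into consecutive 1-based inclusive interval strings."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    intervals = []
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk_size - 1, length)
            intervals.append(f"{contig}:{start}-{end}")
            start = end + 1
    return intervals


shards = scatter_intervals({"chrM": 16569}, 10000)
```

Each shard is processed independently and the per-shard outputs are merged in the gather step. Fixed windows ignore gaps and target boundaries, so WES runs would more likely split the target BED instead.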
Cloud Support
Ready for:
Terra
DNAnexus
Seven Bridges
GA4GH-compliant WDL/CWL wrappers
Customisation
Configurable:
CPU/memory per step
Retry logic for failed jobs
Job monitoring and reporting
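Retry logic is normally delegated to the workflow engine, but the idea can be shown in a few lines. The sketch below retries a callable with exponential backoff; the function name, attempt count, and delays are assumptions, not the pipeline's configured defaults.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Injecting `sleep` keeps the function testable without real waiting. In practice, retries are best limited to transient failures such as node preemption, since a deterministic tool error will fail the same way on every attempt.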
Future extensions
ML-based variant scoring
Panel-specific annotation overlays
WGS/WES harmonisation into joint multi-sample VCFs
Example Directory Layout
pipeline/
├── bin/ # Scripts
├── config/ # YAML/JSON configs
├── containers/ # Docker/Singularity manifests
├── docs/ # Documentation
├── modules/ # Pipeline stages
├── refs/ # Reference data
├── results/ # Final outputs
└── work/ # Intermediate files
Best Practices for Execution
Pin container versions and reference datasets per project
Store all pipeline logs and QC outputs in a dedicated logs/ directory
Verify checksums of references before starting
Maintain a central manifest of all PoN and germline resource files
Archive final VCFs with associated annotation databases for reproducibility