Pipeline Structure and Containerisation
===========================================

This WGS/WES pipeline is structured into **modular stages** supporting **germline** and **somatic** workflows, designed for **flexibility**, **reproducibility**, and **scalability**. It runs on **local servers**, **HPC clusters**, and **cloud platforms**, and accommodates **tumor-only**, **tumor–normal paired**, **trio-based**, and **large cohort** analyses.

.. contents::
   :local:
   :depth: 2

Modular Workflow Overview
--------------------------

**Module 1: Data Input**

- **Accepts:**

  - Paired-end **FASTQ** files (``R1``, ``R2``)
  - Sample metadata (TSV/CSV) with columns such as:

    - ``SampleID``
    - ``Type``
    - ``ReadGroup``
    - ``BAM/FASTQ path``
    - ``Target BED`` (for WES)

  - Optional **Panel of Normals (PoN)** VCF for somatic workflows
  - Optional **germline resource** VCF (e.g., gnomAD) for tumor-only somatic calling

- **File formats supported:**

  - Compressed (``.fastq.gz``)
  - Uncompressed FASTQ

**Module 2: Quality Control and Preprocessing**

- **Tools:**

  - **FastQC**
  - **MultiQC**
  - **fastp**
  - **Cutadapt** *(optional)*
  - **Trimmomatic** *(optional)*

- **Functions:**

  - Adapter trimming
  - Low-quality base clipping
  - Poly-G trimming for NovaSeq data
  - Optional UMI extraction for UMI-aware workflows

- **QC metrics:**

  - Per-base quality scores
  - GC content
  - Adapter contamination

**Module 3: Alignment**

- **Tool:** **BWA-MEM2** (or **BWA-MEM**)
- **Reference:** **GRCh38** or **GRCh37** *(ALT-aware and decoy-inclusive builds recommended)*
- **Features:**

  - Read group (``@RG``) tags automatically assigned from metadata
  - Optional **alignment benchmarking** with ``samtools stats``
  - Contamination check with **VerifyBamID2**; coverage metrics with **Picard CollectWgsMetrics**

- **Output:** Sorted BAM with alignment metrics

**Module 4: Post-Alignment Processing**

- **Tools:**

  - **GATK MarkDuplicatesSpark**
  - **Picard MarkDuplicates**

- **Functions:**

  - Sorting
  - Duplicate marking
  - Insert size calculation
  - Optical duplicate detection (**Picard**)

- **Outputs:** Duplicate-marked BAM + metrics

**Module 5: Base Quality Score Recalibration (BQSR)**

- **Tool:** **GATK BaseRecalibrator**
- **Uses known sites:**

  - **dbSNP**
  - **Mills and 1000G gold standard indels**
  - *(Optional)* **1000G Phase1 indels**

- **Somatic workflows:** BQSR is optional but recommended for quality consistency
- **Output:** Recalibrated BAM/CRAM

**Module 6: Variant Calling**

- **Germline:**

  - **GATK HaplotypeCaller** in GVCF mode
  - **GenotypeGVCFs** for joint calling

- **Somatic:**

  - **GATK Mutect2** with matched normal (tumor–normal) or without normal (tumor-only)
  - Tumor-only mode uses PoN + gnomAD filtering
  - *(Optional)* **mitochondrial mode** for mtDNA variant calling

**Module 7: Variant Filtering and Annotation**

- **Somatic:**

  - ``FilterMutectCalls``
  - Orientation bias filtering (``FilterByOrientationBias``)
  - Contamination adjustment (``CalculateContamination``)

- **Germline:**

  - VQSR or hard filters

- **Annotation:**

  - **Tools:** **VEP**, **ANNOVAR**, **snpEff**
  - **Databases:** **ClinVar**, **gnomAD**, **COSMIC**, **dbNSFP**
  - *(Optional)* clinical overlays: **OncoKB**, **CIViC**

**Module 8: Optional Analysis**

- CNV calling: **CNVkit**, **GATK gCNV**, **ModelSegments**
- MSI status: **MSIsensor**, **MANTIS**
- Mutational signatures: **SigProfilerExtractor**, **deconstructSigs**
- Tumor mutational burden (TMB) calculation
- Loss of heterozygosity (LOH) detection

Support for Germline and Somatic Calling
------------------------------------------

**Germline Workflows**

- Targeted for:

  - Rare disease studies
  - Family-based inheritance analysis (trios/quads)
  - Population-scale cohorts

- Supports:

  - Cohort joint calling
  - VQSR for large datasets
  - Mendelian filtering and de novo detection in trios

**Somatic Workflows**

- Designed for cancer genomics
- Supports:

  - Tumor–normal paired analysis
  - Tumor-only calling with artifact suppression
  - Cohort-level filtering for recurrent artifact removal
  - PoN integration to filter technical noise and recurrent sequencing artifacts
  - Germline resource VCF to flag common germline variants

Containerisation
----------------

**Why containerisation?**

- Eliminates environment drift
- Ensures tool version consistency
- Makes workflows portable across platforms

**Docker**

- Includes:

  - GATK
  - BWA-MEM2
  - samtools/bcftools
  - FastQC, MultiQC
  - VEP, ANNOVAR (licensed copy required)

- Version pinning in Dockerfile
- Suitable for:

  - Local execution
  - Cloud-native workflows (AWS Batch, Terra, DNAnexus, Seven Bridges)

**Apptainer (Singularity)**

- Designed for HPC without root privileges
- Compatible with:

  - SLURM
  - PBS
  - LSF

- Auto-mounts:

  - Input data directories
  - Output result directories

- Container versions tracked in a **manifest file**
- SHA256 checksum verification before execution

**Reproducibility measures**

- Container image IDs stored in run logs
- All tool versions reported in ``pipeline_versions.log``
- Environment variables and configuration saved per run

Scalability and Workflow Automation
-----------------------------------

**Orchestration Support**

- Compatible with:

  - **Nextflow**
  - **Snakemake**
  - **Cromwell/WDL**

- Features:

  - Checkpointing between modules
  - Resume capability after failure
  - Automatic logging of:

    - Tool commands
    - Version info
    - Resource usage

**Parallelisation**

- Scatter–gather per sample or per chromosome
- Interval-based parallelism for GATK

**Cloud Support**

- Ready for:

  - Terra
  - DNAnexus
  - Seven Bridges

- GA4GH-compliant WDL/CWL wrappers

**Customisation**

- Configurable:

  - CPU/memory per step
  - Retry logic for failed jobs

- Job monitoring and reporting

**Future extensions**

- ML-based variant scoring
- Panel-specific annotation overlays
- WGS/WES harmonisation into joint multi-sample VCFs

Example Directory Layout
-------------------------

.. code-block:: text

   pipeline/
   ├── bin/           # Scripts
   ├── config/        # YAML/JSON configs
   ├── containers/    # Docker/Singularity manifests
   ├── docs/          # Documentation
   ├── modules/       # Pipeline stages
   ├── refs/          # Reference data
   ├── results/       # Final outputs
   └── work/          # Intermediate files

Best Practices for Execution
----------------------------

- Pin container versions and reference datasets per project
- Store all pipeline logs and QC outputs in a dedicated ``logs/`` directory
- Verify checksums of references before starting
- Maintain a central manifest of all PoN and germline resource files
- Archive final VCFs with associated annotation databases for reproducibility
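
The checksum verification recommended above could be sketched roughly as follows. This is an illustration only, not the pipeline's own implementation: the manifest name ``refs.sha256``, its ``<sha256>  <relative path>`` line format (as written by ``sha256sum`` in text mode), and the helper names ``sha256_of`` and ``verify_manifest`` are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> dict[str, bool]:
    """Check every '<sha256>  <relative path>' line of a manifest.

    Paths are resolved relative to the manifest's directory.
    Blank lines and '#' comments are skipped; missing files count as failures.
    (The manifest layout is an assumption for this sketch.)
    """
    results = {}
    for line in manifest.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        expected, name = line.split(maxsplit=1)
        target = manifest.parent / name
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results


if __name__ == "__main__":
    # Self-contained demo: build a tiny fake reference and a matching manifest.
    with tempfile.TemporaryDirectory() as tmp:
        ref = Path(tmp) / "genome.fa"  # hypothetical reference file
        ref.write_bytes(b">chr1\nACGT\n")
        manifest = Path(tmp) / "refs.sha256"  # hypothetical manifest name
        manifest.write_text(f"{sha256_of(ref)}  genome.fa\n")
        failed = [name for name, ok in verify_manifest(manifest).items() if not ok]
        print("all references verified" if not failed else f"mismatch: {failed}")
```

A run would typically call ``verify_manifest`` once before alignment starts and abort if any entry is ``False``. The same approach could be applied to container images or PoN files listed in a central manifest.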