Advanced Analysis
This section covers advanced workflows and analysis modules extending the standard WGS/WES pipeline. These include family-based (trio/quad) analyses, joint-calling strategies for large cohorts, copy-number and mitochondrial genome analysis, and somatic tumor purity assessment and adjustment.
Trio and Family-Based Analysis
Overview
Family-based designs (e.g., trios: proband + parents; quads: proband + parents + sibling) improve de novo variant detection, enable genotype phasing, and increase filtering accuracy.
Workflow
Joint Genotyping
Run HaplotypeCaller in GVCF mode for each family member, combine the per-sample GVCFs (CombineGVCFs or GenomicsDBImport), then run GenotypeGVCFs to produce a single multi-sample VCF.
Relationship Verification
Use KING, PLINK, or Somalier to verify the reported relationships and detect sample swaps.
De Novo Detection
Tools:
- GATK VariantAnnotator with -A PossibleDeNovo
- DeNovoGear
- TrioDeNovo
Example (GATK de novo annotation):
gatk VariantAnnotator \
-R refs/GRCh38.fa \
-V family.joint.vcf.gz \
-A PossibleDeNovo \
--pedigree family.ped \
-O family.denovo_annotated.vcf.gz
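The --pedigree argument expects a PED file. The following is a minimal sketch of writing one for a trio, assuming the standard six tab-separated PED columns (family ID, individual ID, father ID, mother ID, sex, phenotype). The helper name write_trio_ped and all sample IDs are hypothetical; the IDs must match the sample names in the VCF header.

```shell
# Hypothetical helper: prints a three-line PED file for one trio.
# Sex: 1 = male, 2 = female. Phenotype: 1 = unaffected, 2 = affected.
write_trio_ped() {
  # args: family_id proband_id father_id mother_id proband_sex
  printf '%s\t%s\t0\t0\t1\t1\n' "$1" "$3"                      # father (founder)
  printf '%s\t%s\t0\t0\t2\t1\n' "$1" "$4"                      # mother (founder)
  printf '%s\t%s\t%s\t%s\t%s\t2\n' "$1" "$2" "$3" "$4" "$5"    # affected proband
}

# Placeholder IDs, for illustration only.
write_trio_ped FAM01 PROBAND FATHER MOTHER 1 > family.ped
```

Parents are written as founders (father and mother IDs set to 0) and the proband is assumed to be affected; adjust the phenotype column if that does not hold.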
Mendelian Inconsistency Checks
Use the bcftools +mendelian plugin to identify genotypes that violate Mendelian inheritance.
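To make the idea of a Mendelian violation concrete, here is a small sketch of the underlying rule for a single autosomal site: the child must carry one allele found in the father and one found in the mother. This is an illustration only, not how the bcftools plugin is implemented; the function name is hypothetical, and missing genotypes and sex chromosomes are not handled.

```shell
# Hypothetical helper: checks one diploid, autosomal trio genotype.
is_mendelian_consistent() {
  # args: child, father, mother genotypes, e.g. 0/1 0/0 0/1 (phased "|" also accepted)
  awk -v c="$1" -v f="$2" -v m="$3" 'BEGIN {
    split(c, ca, "[/|]"); split(f, fa, "[/|]"); split(m, ma, "[/|]")
    ok = 0
    # Try every pairing of one paternal and one maternal allele.
    for (i = 1; i <= 2; i++) for (j = 1; j <= 2; j++) {
      if (ca[1] == fa[i] && ca[2] == ma[j]) ok = 1
      if (ca[2] == fa[i] && ca[1] == ma[j]) ok = 1
    }
    print (ok ? "consistent" : "violation")
  }'
}

is_mendelian_consistent 0/1 0/1 0/0   # allele 1 can come from the father
is_mendelian_consistent 0/1 0/0 0/0   # allele 1 is absent in both parents
```

A heterozygous child with two homozygous-reference parents is reported as a violation, which is exactly the pattern a de novo candidate produces.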
Filtering and Prioritization
- Restrict to high-confidence de novo calls (FILTER = PASS, depth > 10 in all family members).
- Annotate with VEP or ANNOVAR to assess functional impact and population allele frequency.
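The PASS and depth criteria above can be sketched as a plain awk filter over an uncompressed VCF stream. This is one possible implementation under stated assumptions: the function name is hypothetical, depth is read from the per-sample FORMAT/DP field, and records without a DP field are dropped.

```shell
# Hypothetical helper: keeps header lines plus records where FILTER is PASS
# and every sample has FORMAT/DP strictly greater than min_dp (default 10).
filter_denovo_candidates() {
  awk -F'\t' -v min_dp="${1:-10}" '
    /^#/ { print; next }
    $7 != "PASS" { next }
    {
      # Locate DP within the FORMAT column.
      n = split($9, fmt, ":"); dp = 0
      for (i = 1; i <= n; i++) if (fmt[i] == "DP") dp = i
      if (dp == 0) next
      # Sample columns start at field 10.
      for (s = 10; s <= NF; s++) {
        split($s, val, ":")
        if (val[dp] == "." || val[dp] == "" || val[dp] + 0 <= min_dp) next
      }
      print
    }'
}

# Usage sketch (file names follow the example above):
# gzip -dc family.denovo_annotated.vcf.gz | filter_denovo_candidates 10 > family.denovo_candidates.vcf
```

The filter does not inspect the de novo annotation itself, so it should be combined with a selection on the annotation added by VariantAnnotator.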
Tips
- Include only variants in high-confidence regions (e.g., GIAB).
- For exomes, restrict analysis to the capture targets ± padding.
Joint Calling for Large Cohorts
Advantages
- Consistent genotyping across samples.
- Improved variant quality metrics (VQSR modeling).
- Better sensitivity for rare variant discovery.
Best Practices
- Always call variants in GVCF mode per sample.
- Use GenomicsDBImport or CombineGVCFs for aggregation.
- For >500 samples, use scatter/gather and sharded GenomicsDB workspaces.
Example:
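# GenomicsDBImport requires at least one interval (-L); the interval list name is a placeholder.
gatk GenomicsDBImport \
--genomicsdb-workspace-path cohort_gdb \
--sample-name-map gvcf_samples.map \
-L genome_intervals.interval_list \
--reader-threads 8 \
--batch-size 50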
gatk GenotypeGVCFs \
-R refs/GRCh38.fa \
-V gendb://cohort_gdb \
-O cohort.unfiltered.vcf.gz
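The sample name map passed to --sample-name-map is a tab-separated file with one sample per line: the sample name, then the path to its GVCF. A minimal sketch for generating it follows; the helper name, the gvcfs directory, and the <sample>.g.vcf.gz naming scheme are assumptions, not part of the pipeline.

```shell
# Hypothetical helper: prints "sample<TAB>path" for every GVCF in a directory,
# assuming files are named <sample>.g.vcf.gz.
make_sample_map() {
  for gvcf in "$1"/*.g.vcf.gz; do
    [ -e "$gvcf" ] || continue          # skip the literal pattern if nothing matches
    sample="$(basename "$gvcf" .g.vcf.gz)"
    printf '%s\t%s\n' "$sample" "$gvcf"
  done
}

# Usage sketch:
# make_sample_map gvcfs > gvcf_samples.map
```

The sample name in the first column should match the sample name recorded in the GVCF header.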
Note
VQSR requires several thousand high-quality SNPs/indels for stable model training. For small cohorts, use hard filters.
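For the hard-filter route, the sketch below prints (rather than runs) a VariantFiltration command for a SNP-only VCF, so the thresholds can be reviewed before use. The thresholds follow GATK's generic hard-filtering recommendations for SNPs and should be treated as starting points; the helper name and file names are hypothetical.

```shell
# Hypothetical dry-run helper: prints a VariantFiltration command for SNPs.
hard_filter_snps_cmd() {
  in_vcf="$1"; out_vcf="$2"
  printf 'gatk VariantFiltration -V %s -O %s' "$in_vcf" "$out_vcf"
  # Each rule is "filter name|expression"; tune the thresholds per dataset.
  for rule in 'QD2|QD < 2.0' 'FS60|FS > 60.0' 'MQ40|MQ < 40.0' 'SOR3|SOR > 3.0' \
              'MQRankSum-12.5|MQRankSum < -12.5' 'ReadPosRankSum-8|ReadPosRankSum < -8.0'; do
    printf ' --filter-name "%s" --filter-expression "%s"' "${rule%%|*}" "${rule#*|}"
  done
  printf '\n'
}

# Placeholder file names, for illustration only.
hard_filter_snps_cmd cohort.snps.vcf.gz cohort.snps.filtered.vcf.gz
```

Indels need their own thresholds, so SNPs and indels are normally split first (for example with SelectVariants) and filtered separately.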
Copy Number Variation (CNV) Analysis
Germline CNVs
- Tools: GATK gCNV, XHMM, ExomeDepth (WES-specific).
- Requires a matched control cohort or panel of normals (PoN).
Example (GATK gCNV WES mode):
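# --bin-length 0 keeps exome targets intact instead of splitting them into bins.
gatk PreprocessIntervals \
-R refs/GRCh38.fa \
--bin-length 0 \
--interval-merging-rule OVERLAPPING_ONLY \
-L refs/targets_exome.bed \
-O targets.interval_list
gatk AnnotateIntervals \
-R refs/GRCh38.fa \
-L targets.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
-O targets.annotated.tsv
gatk CollectReadCounts \
-I SAMPLE.recal.bam \
-L targets.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
-O SAMPLE.counts.hdf5
# Remaining gCNV steps (not shown): FilterIntervals, DetermineGermlineContigPloidy,
# GermlineCNVCaller, PostprocessGermlineCNVCalls.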
Somatic CNVs
- Tool: GATK ModelSegments, in matched tumor-normal or tumor-only mode.
- Input: denoised read counts plus allelic counts at common SNP sites.
Example (somatic CNV):
gatk CollectReadCounts \
-I tumor.bam -L genome_intervals.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
-O tumor.counts.hdf5
gatk CollectAllelicCounts \
-I tumor.bam -R refs/GRCh38.fa \
-L snps.interval_list \
-O tumor.allelicCounts.tsv
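# cnv.pon.hdf5 is a placeholder for a panel of normals built with CreateReadCountPanelOfNormals.
gatk DenoiseReadCounts \
-I tumor.counts.hdf5 \
--count-panel-of-normals cnv.pon.hdf5 \
--standardized-copy-ratios tumor.standardizedCR.tsv \
--denoised-copy-ratios tumor.denoisedCR.tsv
gatk ModelSegments \
--denoised-copy-ratios tumor.denoisedCR.tsv \
--allelic-counts tumor.allelicCounts.tsv \
--output-prefix tumor \
-O cnv_segments/
gatk CallCopyRatioSegments \
-I cnv_segments/tumor.cr.seg \
-O cnv_segments/tumor.called.seg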
Mitochondrial Variant Analysis
Mitochondrial variants can be called from WGS (and sometimes WES) using GATK Mutect2 in mitochondrial mode.
Example:
gatk Mutect2 \
-R refs/GRCh38.fa \
-I SAMPLE.recal.bam \
-L chrM \
--mitochondria-mode \
-O SAMPLE.mitochondria.unfiltered.vcf.gz
gatk FilterMutectCalls \
-R refs/GRCh38.fa \
-V SAMPLE.mitochondria.unfiltered.vcf.gz \
--mitochondria-mode \
-O SAMPLE.mitochondria.filtered.vcf.gz
Note
For mitochondrial analysis, use a dedicated circularized mtDNA reference to minimize alignment artifacts.
Tumor Purity Estimation & Adjustment
Why it matters
Low tumor purity lowers the observed variant allele fractions (VAFs) of somatic variants, which reduces calling sensitivity.
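The effect of purity on VAF can be illustrated with the usual mixture model: expected VAF = (purity × mutated copies) / (purity × tumor copy number + 2 × (1 − purity)). The sketch below assumes a diploid normal compartment and a clonal variant; the function name is hypothetical.

```shell
# Hypothetical helper: expected VAF of a clonal somatic variant.
# Assumes the contaminating normal cells are diploid (copy number 2).
expected_vaf() {
  # args: purity (0-1), tumor total copy number, number of mutated copies
  awk -v p="$1" -v ct="$2" -v m="$3" \
    'BEGIN { printf "%.3f\n", (p * m) / (p * ct + 2 * (1 - p)) }'
}

expected_vaf 1.00 2 1   # pure tumor, heterozygous variant: 0.500
expected_vaf 0.30 2 1   # 30% purity, same variant: 0.150
expected_vaf 0.30 4 1   # 30% purity, one mutated copy out of four: 0.115
```

At 30% purity a heterozygous clonal variant is expected at a VAF of about 15%, which is why default thresholds tuned for high-purity samples can miss real variants.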
Tools
- PureCN (integrates CNV + VAF).
- FACETS, ABSOLUTE.
- Sequenza (tumor/normal only).
Example (FACETS tumor-normal):
Rscript run_facets.R \
--tumor tumor.bam \
--normal normal.bam \
--genome hg38 \
--output facets_results/
Adjustment Strategies
- Lower --min-af in Mutect2 for low-purity tumors.
- Use CNV-aware filtering to retain real variants with low VAF in amplified regions.
- Flag likely subclonal variants for separate interpretation.
Integration and Reporting
Combine SNVs/indels, CNVs, and mtDNA calls into a single integrated report.
Summarize per-sample key metrics: purity, ploidy, mutation burden, % genome altered.
Annotate variants for clinical relevance (COSMIC, ClinVar, OncoKB); these annotations support interpretation only.
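As one example of a per-sample metric, the percentage of the genome altered can be approximated from a called segment file. The sketch below assumes the tab-separated layout written by CallCopyRatioSegments (header lines starting with @, a CONTIG/START/END column header, and a final CALL column of +, -, or 0) and uses the total segmented length as the denominator. Definitions of this metric vary, so treat both choices as assumptions; the function name is hypothetical.

```shell
# Hypothetical helper: percent of segmented bases with a non-neutral call.
percent_genome_altered() {
  awk -F'\t' '
    /^@/ || $1 == "CONTIG" { next }          # skip SAM-style header and column names
    {
      len = $3 - $2 + 1                      # segment length, inclusive coordinates
      total += len
      if ($NF != "0") altered += len         # "+" or "-" counts as altered
    }
    END { if (total > 0) printf "%.1f\n", 100 * altered / total }'
}

# Usage sketch:
# percent_genome_altered < tumor.called.seg
```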
Next Steps
See Pipeline Structure and Containerisation for reproducibility and execution setup.
Refer to References for literature on family-based designs, CNV/mtDNA detection, and tumor purity modeling.