GGRaSP
GGRaSP (Gaussian Genome Representative Selector with Prioritization) is an R package that reduces a set of genomes to a representative subset, prioritizing the retention of genomes of interest to the user while minimizing the loss of genetic variation. GGRaSP can also cluster without supervision: it models the genomic relationships with a Gaussian mixture model to select an appropriate clustering threshold, which supports both generalizable high-throughput use and more dataset-specific use.
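The idea of threshold clustering followed by prioritized representative selection can be sketched in a few lines of Python. This is only an illustration, not GGRaSP's implementation or API: the function names, the toy distance matrix, the rank values, the connected-components clustering, and the medoid tie-break are all assumptions made for this example. The threshold is also set by hand here rather than estimated with a Gaussian mixture model.

```python
from typing import List, Sequence


def threshold_clusters(dist: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Group genomes into connected components, linking any pair whose
    distance is <= threshold (an assumed, simplified clustering rule)."""
    n = len(dist)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if not seen[j] and dist[i][j] <= threshold:
                    seen[j] = True
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


def pick_representative(members: List[int], dist: Sequence[Sequence[float]],
                        rank: Sequence[int]) -> int:
    """Prefer the lowest rank value (highest priority); break ties by the
    smallest total distance to the other cluster members, then by index."""
    return min(members, key=lambda i: (rank[i], sum(dist[i][j] for j in members), i))


def reduce_genomes(names: Sequence[str], dist: Sequence[Sequence[float]],
                   rank: Sequence[int], threshold: float) -> List[str]:
    """Return one representative genome name per cluster."""
    return [names[pick_representative(c, dist, rank)]
            for c in threshold_clusters(dist, threshold)]


# Hypothetical toy data: five genomes forming two tight groups.
names = ["A", "B", "C", "D", "E"]
dist = [
    [0.00, 0.01, 0.02, 0.30, 0.31],
    [0.01, 0.00, 0.015, 0.30, 0.30],
    [0.02, 0.015, 0.00, 0.29, 0.30],
    [0.30, 0.30, 0.29, 0.00, 0.01],
    [0.31, 0.30, 0.30, 0.01, 0.00],
]
rank = [2, 1, 2, 2, 2]  # genome "B" is given top priority (e.g. a type strain)

representatives = reduce_genomes(names, dist, rank, threshold=0.05)
print(representatives)
```

In this toy run, genome "B" is kept for the first group because of its higher priority, while the second group has no prioritized member and falls back to the distance-based tie-break.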
Key Features
- Rapidly simplify large datasets containing up to several thousand genomes.
- Optionally run without any a priori knowledge of the shape of the data.
- Generation of images, tables, and annotation files enabling detailed analysis of the phylogeny and GGRaSP clusters.
Sample Output
The capabilities of GGRaSP are demonstrated by generating a reduced list of 315 genomes from a genomic dataset of 4,600 Escherichia coli genomes, prioritizing selection by type strain and by genome completeness. The figure shows the original 4,600-genome set (A), the set clustered using the cut-off (B), and the set reduced to 315 representative genomes (C).
Publications
GGRaSP: a R-package for selecting representative genomes using Gaussian mixture models. Bioinformatics (Oxford, England). 2018-09-01; 34(17): 3032-3034.
Funding
This project has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Award Number U19AI110819.