Genomic Data Analysis and Visualization for Identification and Characterization of WGS
Enrique Garcia-Assad, Indresh Singh, Pratap Venepally, Jason Inman
J. Craig Venter Institute, Rockville, MD
Whole-genome sequencing (WGS) is becoming a commonplace tool in clinical microbiology. Applied directly to clinical samples, it could further reduce diagnostic times and thereby improve control and treatment, especially in outbreak events. A major bottleneck is the availability of fast and reliable bioinformatic tools to identify and characterize large numbers of microbial genomes. In this project, I applied the Northern Arizona SNP Pipeline (NASP) for single nucleotide polymorphism (SNP) identification and characterization, and used FigTree and PHYLOBar for phylogenetic analysis and visualization. NASP differs from other published SNP pipelines in the short-read aligners and SNP callers it supports, in its ability to call both monomorphic and polymorphic sites, and in its ability to integrate the results from multiple SNP callers and identify the consensus set of SNPs that define the population structure. Accurate and comprehensive analysis of SNPs in a reference population is critical in outbreak investigations, source attribution, and population genetics. NASP was originally developed for bacterial pathogens, but has also been used to analyze the population structure of fungal and viral pathogens. The NASP output can be used in genome-wide association studies (GWAS) to correlate genotype with phenotype, and in phylogenomics, which provides an understanding of the relatedness of microbial isolates across temporal and spatial scales. To test the methods, I ran the NASP analysis on a small set of 20 randomly selected E. coli genomes and later scaled up to 50, looking for relationships among the different strains. We downloaded the genomes from the NCBI GenBank website and subsequently ran the NASP analysis on the reads. The NASP SNP data were then formatted for FigTree and PHYLOBar for phylogenetic analysis.
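The idea of integrating results from multiple SNP callers into a consensus set can be illustrated with a minimal Python sketch. This is not NASP's code or its API: the function name consensus_snps, the caller names, the input layout (one dictionary of position-to-base calls per caller), and the min_agree threshold are all assumptions made for illustration only.

```python
from collections import Counter


def consensus_snps(calls_by_caller, min_agree=None):
    """Return {position: base} for sites where enough callers agree.

    calls_by_caller: hypothetical layout, {caller_name: {position: base}}.
    min_agree: number of callers that must report the same base;
               defaults to all callers (strict consensus).
    """
    if min_agree is None:
        min_agree = len(calls_by_caller)

    # Collect every position reported by at least one caller.
    positions = set()
    for calls in calls_by_caller.values():
        positions.update(calls)

    consensus = {}
    for pos in sorted(positions):
        # Count how many callers reported each base at this position.
        counts = Counter(
            calls[pos] for calls in calls_by_caller.values() if pos in calls
        )
        base, votes = counts.most_common(1)[0]
        if votes >= min_agree:
            consensus[pos] = base
    return consensus


# Toy calls from three hypothetical callers (made-up positions and bases).
example_calls = {
    "caller_a": {101: "A", 250: "T", 777: "G"},
    "caller_b": {101: "A", 250: "C", 777: "G"},
    "caller_c": {101: "A", 250: "T", 777: "G", 900: "T"},
}

strict = consensus_snps(example_calls)                 # all callers must agree
majority = consensus_snps(example_calls, min_agree=2)  # at least two must agree
print(strict)
print(majority)
```

With the strict default, only sites on which every caller agrees are kept; lowering min_agree trades stringency for sensitivity. A site reported by a single caller (position 900 in the toy data) is dropped in both modes.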