View the peer-reviewed version (peerj.com/articles/1025), which is the preferred citable publication unless you specifically need to cite this preprint.

McNair K, Edwards RA. 2015. GenomePeek-an online tool for prokaryotic genome and metagenome analysis. PeerJ 3:e1025 https://doi.org/10.7717/peerj.1025

GenomePeek - An online tool for prokaryotic and metagenome analysis

As more and more prokaryotic sequencing takes place, a method to quickly and accurately analyze this data is needed. Previous tools are mainly designed for metagenomic analysis and have limitations, such as long runtimes and significant false-positive error rates. The online tool GenomePeek (edwards.sdsu.edu/GenomePeek) was developed to analyze both single-genome and metagenome sequencing files quickly and with low error rates. GenomePeek uses a sequence assembly approach in which reads matching a set of conserved genes are extracted, assembled, and then aligned against a highly specific reference database. GenomePeek was found to be faster than traditional approaches while still keeping error rates low, and it offers unique data visualization options.
Introduction

With the cost of sequencing falling, microbial genomes are being sequenced at an increasing rate. Currently there are over 2,000 completed prokaryotic genomes in NCBI (Benson et al., 2009; Sayers et al., 2009), almost 15,000 prokaryote genomes in the SEED database (Overbeek, Disz, & Stevens, 2004), and about 75,000 more that are unassembled in the Sequence Read Archive. There are also about 35,000 metagenomes in NCBI and about 90,000 metagenomes available from MG-RAST (Meyer et al., 2008). While complete genome sequencing gives us detailed knowledge about a single prokaryotic species, metagenomic sequencing gives us a broad overview of the microbial environment (Dinsdale et al., 2008). Whether analyzing genomic or metagenomic sequencing, one of the main goals is to identify the taxonomic origin of the species present (Belda-Ferre et al., 2012; Mande, Mohammed, & Ghosh, 2012; Trindade-Silva et al., 2012; Carr, Shen-Orr, & Borenstein, 2013).

There are two typical approaches to identifying the species present in a metagenome. The most common methods use homology searches against a reference database of known taxonomic lineage (Altschul et al., 1997; Meyer et al., 2008; Segata et al., 2012). In contrast, ensemble approaches use signature data from all of the reads, such as protein domain frequencies or k-mer composition (Meinicke, Aßhauer, & Lingner, 2011; Silva, Dutilh, & Edwards, 2013). Homology-based methods generally use protein-level alignments because prokaryotic genomes are highly divergent and mutable. The problem with this approach is that metagenomic sequencing reads tend to be short relative to protein open reading frames. The average length of prokaryotic protein-coding genes is about 750 bp (Brocchieri & Karlin, 2005), whereas current sequencing technologies produce reads with average lengths between 100 and 500 bp. Regardless of the technology used, sequencing...