A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the massive amounts of metagenomic data, which comprise all DNA present in an environmental sample. A major obstacle in metagenomics is the difficulty of classifying accurately the short reads that current sequencing technology yields. We construct the unique N-mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the source genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST's tied top scores. We demonstrate that this approach scales to identifying any fragment from among hundreds of genomes. It also performs well at the strain, species, and genus levels and achieves strain resolution even though it classifies ubiquitous genomic fragments (from both gene and nongene regions). Cross-validation analysis demonstrates that species-level accuracy reaches 90% for highly represented species containing an average of 8 strains. We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.
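As a rough illustration of the idea in the abstract above, the following Python sketch trains N-mer count profiles and assigns a fragment to the genome with the highest log-likelihood. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`nmer_counts`, `train_profiles`, `classify`), the uniform prior over genomes, the pseudocount smoothing, and the toy sequences are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count overlapping N-mers (length-n substrings) in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train_profiles(genomes, n):
    """Map each genome name to (N-mer counts, total N-mer count)."""
    profiles = {}
    for name, seq in genomes.items():
        counts = nmer_counts(seq, n)
        profiles[name] = (counts, sum(counts.values()))
    return profiles


def classify(fragment, profiles, n, pseudocount=1.0):
    """Return the genome name with the highest log-likelihood for the fragment.

    Assumes a uniform prior over genomes and pseudocount smoothing;
    both are illustrative choices, not taken from the source.
    """
    vocab = 4 ** n  # number of possible N-mers over A, C, G, T
    frag_counts = nmer_counts(fragment, n)
    best_name, best_score = None, -math.inf
    for name, (counts, total) in profiles.items():
        denom = total + pseudocount * vocab
        score = sum(
            c * math.log((counts.get(w, 0) + pseudocount) / denom)
            for w, c in frag_counts.items()
        )
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Toy data (hypothetical): two short artificial "genomes".
toy_genomes = {"genome_a": "ACGT" * 20, "genome_b": "GGCCTTAA" * 10}
toy_profiles = train_profiles(toy_genomes, n=3)
print(classify("GTACGTAC", toy_profiles, n=3))
```

Because the classifier sums real-valued log-likelihoods, exact ties between genomes are unlikely, which is consistent with the abstract's point about avoiding tied top scores.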
Traditionally, studies in microbial genomics have focused on single genomes from cultured species, thereby limiting their scope to the small percentage of species that can be cultured outside their natural environment. Fortunately, recent advances in high-throughput sequencing and computational analyses have ushered in the new field of metagenomics, which aims to decode the genomes of microbes from natural communities without the need for cultivation. Although metagenomic studies have yielded a great deal of insight into bacterial diversity and coding capacity, several computational challenges remain due to the massive size and complexity of metagenomic sequence data. This paper reviews current tools and techniques that address challenges in 1) genomic fragment annotation, 2) phylogenetic reconstruction, 3) functional classification of samples, and 4) interpreting complementary metaproteomics and metametabolomics data. Also surveyed are important applications of metagenomic studies, including microbial forensics and the roles of microbial communities in shaping human health and soil ecology.
Her research interest is understanding how technology can be used to improve K-12 mathematics education. She is interested in developing applications for classroom use that factor in the computational resource limitations of urban public schools. Her future research will investigate methods for computer scientists to collaborate with educators to improve K-12 as well as computer science education.
William Mongan, Drexel University
Bill Mongan is a Ph.D. student at Drexel University in the Department of Computer Science. Concurrently, Bill is pursuing an MS in Science of Instruction in the School of Education at Drexel, with a concentration in Secondary Mathematics and Computer Science in Pennsylvania. His interests include educational outreach and exposing the K-12 environment to computer science as an application of science, technology, engineering, and math (STEM) education. Prior to studying at Drexel, Bill worked for the Upper Darby School District, working with students on both an educational and volunteer basis in the AP Computer Science program from 2002 to 2004. He served on the UDSD School Board Technology and Grant committee in 2001, and interviewed for a vacant UDSD School Board seat in 2000.
In recent years, oligo microarrays, more commonly known as DNA chips, have had a major impact on disease diagnosis, drug discovery, and gene identification. Microarrays contain N-mer DNA fragments, or oligos, in a series of "wells" placed across the chip, where each well contains thousands of copies of the same fragment and acts as a probe that detects the amount of a specific fragment. A recent use for microarrays is the identification of genomes, such as those of pathogens. In current techniques, probes that detect unique gene regions of particular species are selected to be placed on the microarray, under the assumption that if one gene unique to a pathogen species can be detected, then the pathogen can be classified. This approach is useful, but the technology relies on finding gene sequences that are divergent enough to be used as a genomic identifier and that are robust to cross-hybridization. In our work, we present a method to choose the probes that best distinguish two organisms. We accomplish this by choosing the oligo probes that maximize the level of divergence between the genomes, calculated by three different information-theoretic measures. We present results for 12-mer and 25-mer oligo pathogen probe sets and show that our method chooses probes that are less likely to cross-hybridize.
Index Terms: DNA, Information Theory, Microorganisms
MICROARRAYS AND PATHOGEN DETECTION
The number of published sequenced genomes has been increasing exponentially since 1995, with over 600 completed by 2007 [1]. This provides an expanding database for genome analysis, including phylogeny (evolutionary tree) studies, comparative gene analysis, etc. A new application of a priori genome databases is to characterize genomic fingerprints that identify a particular genome, especially if it is a pathogen. In the field of metagenomics, the study of genetic material recovered from environmental samples, scientists now have the potential to classify a species based on genomic features and compare it to signatures found in a database rather than using taxonomy and classical phenotypic features.
The two major methods for identifying pathogens are PCR (polymerase chain reaction) assays and microarray chips. In PCR methods, slight variations unique to certain genomes, such as SNPs (single nucleotide polymorphisms), are examined to characterize particular pathogens. Unfortunately, for genomes with high mutation rates, such as HIV, PCR needs more than a few identifiers [2]. Many pathogens contain mobile genetic elements that can potentially interfere with proper identification using single-locus detection systems, and therefore many loci are needed.
With the advent of microarrays, several methods have been developed [3,4,5]. In [4], each probe selected is a substring of a gene, which acts as its fingerprint. Their Findprobe program takes only genes as input and designs probes that satisfy homogeneity, sensitivity, and specificity constraints. In [5], a highly redundant probe pair is placed on the chip for each diagnostic region specified. Almost all previous micr...
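The probe-selection idea in the abstract of this excerpt (choose oligos that maximize the divergence between two genomes) can be sketched as follows. The excerpt does not name its three information-theoretic measures, so this example substitutes one plausible stand-in, the per-N-mer contribution to the Kullback-Leibler divergence between the two smoothed N-mer distributions. The function names, the pseudocount, and the toy sequences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count overlapping N-mers (length-n substrings) in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def rank_probes(target, background, n, k, pseudocount=1.0):
    """Return the k target N-mers with the largest KL-divergence contribution.

    The score p(w) * log(p(w) / q(w)) is an assumed stand-in for the
    measures used in the source, where p and q are the smoothed N-mer
    distributions of the target and background genomes.
    """
    vocab = 4 ** n  # number of possible N-mers over A, C, G, T
    t_counts = nmer_counts(target, n)
    b_counts = nmer_counts(background, n)
    t_total = sum(t_counts.values()) + pseudocount * vocab
    b_total = sum(b_counts.values()) + pseudocount * vocab

    scores = {}
    for w, c in t_counts.items():
        p = (c + pseudocount) / t_total
        q = (b_counts.get(w, 0) + pseudocount) / b_total
        scores[w] = p * math.log(p / q)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy data (hypothetical): TTTT is common to both sequences, ACGT is not.
toy_target = "ACGTACGTACGTTTTT"
toy_background = "TTTTTTTTTTTTACGA"
print(rank_probes(toy_target, toy_background, n=4, k=3))
```

N-mers that are frequent in the target but rare in the background score highest, so candidates shared by both genomes, which would be more prone to cross-hybridization, are pushed down the ranking.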