Symbiotic bacteria often help their hosts acquire nutrients from their diet, showing trends of coevolution and independent acquisition by hosts from the same trophic levels. While these trends hint at important roles for biotic factors, the effects of the abiotic environment on symbiotic community composition remain comparably understudied. In this investigation, we examined the influence of abiotic and biotic factors on the gut bacterial communities of fish from different taxa, trophic levels and habitats. Phylogenetic and statistical analyses of 25 16S rRNA libraries revealed that salinity, trophic level and possibly host phylogeny shape the composition of fish gut bacteria. When analysed alongside bacterial communities from other environments, fish gut communities typically clustered with gut communities from mammals and insects. Similar consideration of individual phylotypes (vs. communities) revealed evolutionary ties between fish gut microbes and symbionts of animals, as many of the bacteria from the guts of herbivorous fish were closely related to those from mammals. Our results indicate that fish harbour more specialized gut communities than previously recognized. They also highlight a trend of convergent acquisition of similar bacterial communities by fish and mammals, raising the possibility that fish were the first to evolve symbioses resembling those found among extant gut fermenting mammals. Data accessibility: DNA sequences of Poecilia reticulata gut bacteria are available under GenBank accession numbers JQ253406-JQ253517. Additional information regarding the metadata included in this paper is available in Tables S1 and S2 (Supporting information). Supporting information: Additional supporting information may be found in the online version of this article. Table S1: Habitats, lifestyles, and phylogenetic affiliations of fish gut bacteria and their closest GenBank relatives. Table S2: List of primers used in all studies included in this meta-analysis. Fig. S1: Ordinal classifications of sequences from whole libraries of fish that were derived from culture independent methods.
"Binning" (or taxonomic classification) of DNA sequence reads is an initial step to analyzing an environmental biological sample. Currently, a homology-based tool, BLAST, is one of the most commonly used tools to label DNA reads, but it is argued that BLAST will quickly lose its classification ability as the genome databases grow. In this paper, we compare the accuracies of a naïve Bayes classifier (NBC) and statistical language model to BLAST for binning reads and demonstrate that NBC obtains good performance for the low cost of computational complexity. On the other hand, the back-off n-gram language model can improve accuracy when only partial training data is available (such as in-progress sequencing projects). NBC demonstrates comparable performance to BLAST and can also be optimized on partial training datasets by adjusting the word feature size. A fivefold cross validation is conducted to compare each method's accuracy for determining novel genomes at different taxonomic levels, with NBC outperforming BLAST for species-level classification but BLAST outperforming NBC for genus-level and phyla-level classification. In conclusion, the NBC is a competitive taxonomic classifier, and language models can improve performance when only partial training data is available.
High-throughput sequencing technologies enable metagenome profiling, the simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data includes sequence fragments (“reads”) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between “known” and “unknown” taxa, which can be used with composition-based methods, as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an “unknown” class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially at the species level and for ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.
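The abstract does not describe how its detector works, so the sketch below shows only one simple way an “unknown” class could be attached to a composition-based scorer: reject a read when even its best per-word log-likelihood falls below a cutoff. The thresholding rule, the value of `THRESHOLD`, the word size, and the toy reference sequences are all assumptions made for illustration, not the paper's method.

```python
import math
from collections import Counter

K = 3  # word size (illustrative choice)

def profile(seq):
    """k-mer counts of a reference sequence."""
    return Counter(seq[i:i + K] for i in range(len(seq) - K + 1))

def mean_log_likelihood(read, counts):
    """Average per-word log-likelihood with add-one smoothing,
    so that reads of different lengths are comparable."""
    total = sum(counts.values()) + 4 ** K
    words = [read[i:i + K] for i in range(len(read) - K + 1)]
    return sum(math.log((counts[w] + 1) / total) for w in words) / len(words)

def classify_with_unknown(read, references, threshold):
    """Return the best-scoring label, or "unknown" when even the
    best score is below the (hypothetical) threshold."""
    scores = {label: mean_log_likelihood(read, profile(seq))
              for label, seq in references.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"

# Toy references: hypothetical sequences, not real genomes.
references = {
    "AT_rich": "ATATTAATATTTAATATAATTATATTAAATAT",
    "GC_rich": "GCGCCGGCGCCCGGCGCGGCCGCGCCGGGCGC",
}
THRESHOLD = -4.0  # hypothetical cutoff; in practice it would be tuned on held-out data

print(classify_with_unknown("ATTATAAT", references, THRESHOLD))    # prints AT_rich
print(classify_with_unknown("ACGTACGTAC", references, THRESHOLD))  # prints unknown
```

The second read shares no words with either reference, so its score sits at the smoothing floor and it is assigned to the “unknown” class rather than forced into a known taxon.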
We have developed a platform for exposing high school students to machine learning techniques for signal processing problems, making use of relatively simple mathematics and engineering concepts. Along with this platform we have created two example scenarios that motivate the students to learn the theory underlying their solutions. The first scenario features a recycling sorting problem in which the students must set up a system so that the computer can learn the different types of objects to recycle and automatically place them in the proper receptacle. The second scenario was motivated by a high school biology curriculum. The students are to develop a system that learns the different types of bacteria present in a pond sample. The system will then group the bacteria together based on similarity. One of the key strengths of this platform is that virtually any type of scenario may be built upon the concepts conveyed in this paper. This permits participation by students with a wide variety of educational motivations.
Metagenomic studies inherently involve sampling genetic information from an environment potentially containing thousands of distinctly different microbial organisms. This genetic information is sequenced, producing many short fragments (<500 base pairs (bp)); each is tentatively a small representative of the DNA coding structure. Any of the fragments may belong to any of the organisms in the sample, but the relationship is unknown a priori. Furthermore, most of these organisms have not been identified and correspondingly are not represented in any of the publicly available search databases. Our goal is to be able to predict the taxonomic classification of an organism based on the fragments obtained from an environmental sample that may include many (some previously unidentified) organisms. To elucidate the diversity and composition of the sample, we first use a supervised naïve Bayes classifier to score the fragments of known genomes, followed by an unsupervised clustering to group fragments from similar organisms together. We are then free to analyze each cluster separately. This is challenging since we are not interested in similar sequences, but in sequences that come from similar genomes, which are known to vary widely intra-genomically. Our dataset comprises an extremely challenging scenario involving clustering fragments at the phylum level, where none of the phyla have been previously seen or identified. We present two variations of our proposed approach, one based on ART and one based on K-means. We show that ART can cluster 500 bp fragments from 17 novel phyla with an overall isolation/grouping score that is 10% better than K-means and nearly 7 times better than chance.
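The unsupervised grouping step can be illustrated with a small K-means sketch. As a simplification, it clusters fragments directly on their normalized k-mer composition rather than on naïve Bayes scores, and it does not cover the ART variant. The word size, the toy fragments, and the choice of initial centroids are assumptions made so that the example is deterministic.

```python
from collections import Counter
from itertools import product

K = 2  # word size (illustrative choice)
WORDS = ["".join(p) for p in product("ACGT", repeat=K)]

def composition(fragment):
    """Normalized k-mer frequency vector of a fragment."""
    counts = Counter(fragment[i:i + K] for i in range(len(fragment) - K + 1))
    n = sum(counts.values())
    return [counts[w] / n for w in WORDS]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(vectors, centroids):
    """Index of the nearest centroid for each vector."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors]

def kmeans(vectors, initial_centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(vectors, centroids)

# Toy fragments: hypothetical sequences, not real data.
fragments = ["ATATTAAT", "TTAATATA", "GCGCCGGC", "CCGGCGCG"]
vectors = [composition(f) for f in fragments]

# Deterministic initialization for the example: first and third fragments.
labels = kmeans(vectors, [vectors[0], vectors[2]])
print(labels)  # prints [0, 0, 1, 1]
```

Fragments from the same toy source land in the same cluster even though their sequences differ, which mirrors the stated goal of grouping by genome of origin rather than by sequence similarity.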
Metagenomics is the study of genetic material recovered from environmental samples. Because few tools exist for metagenomic analysis, a natural step has been to utilize the popular homology tool, BLAST, to search for sequence similarity between sample fragments and an administered database. Most biologists use this method today without knowing BLAST's accuracy, especially when a particular taxonomic class is underrepresented in the database. The aim of this paper is to benchmark the performance of BLAST for taxonomic classification of metagenomic datasets in a supervised setting, meaning that the database contains microbes of the same class as the 'unknown' query fragments. We examine well- and under-represented genera and phyla in order to study their effect on the accuracy of BLAST. We conclude that for fine-resolution classes, such as genera, the accuracy of BLAST does not degrade much with underrepresentation, but for a highly variant class, such as phyla, performance degrades significantly. Our analysis includes five-fold cross validation to substantiate our findings.
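The abstract does not say how its folds were constructed, so the following is only a generic sketch of a five-fold split. It assumes that whole genomes, rather than individual fragments, are the unit held out; the genome identifiers and the round-robin assignment to folds are illustrative choices.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) pairs; each item appears in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical genome identifiers standing in for database entries.
genomes = [f"genome_{n}" for n in range(10)]

for train, test in k_fold_splits(genomes):
    # A real benchmark would build the database from `train`,
    # query it with fragments drawn from `test`, and record accuracy.
    print(len(train), test)
```

Averaging accuracy over the five held-out folds gives an estimate that does not depend on a single, possibly lucky, choice of test genomes.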
Researchers are perpetually amassing biological sequence data. The computational approaches employed by ecologists for organizing this data (e.g. alignment, phylogeny, etc.) typically scale nonlinearly in execution time with the size of the dataset. This often serves as a bottleneck for processing experimental data since many molecular studies are characterized by massive datasets. To keep up with experimental data demands, ecologists are forced to choose between continually upgrading expensive in-house computer hardware and outsourcing the most demanding computations to the cloud. Outsourcing is attractive since it is the least expensive option, but it does not necessarily allow direct user interaction with the data for exploratory analysis. Desktop analytical tools such as ARB are indispensable for this purpose, but they do not necessarily offer a convenient solution for the coordination and integration of datasets between local and outsourced destinations. Therefore, researchers are currently left with an undesirable tradeoff between computational throughput and analytical capability. To mitigate this tradeoff we introduce a software package that combines the interactive exploratory tools offered by ARB with the computational throughput of cloud-based resources. Our pipeline serves as middleware between the desktop and the cloud, allowing researchers to form local custom databases containing sequences and metadata from multiple resources and providing a method for linking data outsourced for computation back to the local database. A tutorial implementation of the toolkit is provided in the supporting information, S1 Tutorial. Availability: http://www.ece.drexel.edu/gailr/EESI/tutorial.php.