Metagenome Fragment Classification Using N-Mer Frequency Profiles

Rosen, Gail; Garbarine, Elaine; Caseiro, Diamantino; Polikar, Robi; Sokhansanj, Bahrad A.

doi:10.1155/2008/205969

Cited by 94 publications

(97 citation statements)

References 35 publications

“…Most of the current metagenomics classification programs either suffer from slow classification speed, a large index size, or both. For example, machine-learning-based approaches such as the Naive Bayes Classifier (NBC) (Rosen et al. 2008) and PhymmBL (Brady and Salzberg 2009, 2011) classify <100 reads per minute, which is too slow for data sets that contain millions of reads. In contrast, the pseudoalignment approach employed in Kraken (Wood and Salzberg 2014) processes reads far more quickly, more than 1 million reads per minute, but its exact k-mer matching algorithm requires a large index.…”

mentioning

confidence: 99%

Centrifuge: rapid and sensitive classification of metagenomic sequences

et al. 2016

Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together, these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI nonredundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer-based indexing schemes, which require far more extensive space.
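The abstract above credits Centrifuge's small index to the Burrows-Wheeler transform and the FM index. As a rough illustration of the idea only, the sketch below builds a toy FM index with a naive suffix sort and counts exact occurrences of a pattern by backward search. The function names `build_fm_index` and `count_occurrences` are hypothetical, the uncompressed occurrence table is an assumption made for readability, and nothing here reflects Centrifuge's actual data structures or its classification logic.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM index (assumed layout: C table plus a full occurrence table)."""
    text += "$"  # sentinel; assumed absent from the input and smaller than A, C, G, T
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to the sentinel).
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first_col[c]: number of characters in the text strictly smaller than c.
    counts = Counter(text)
    first_col, running = {}, 0
    for c in sorted(counts):
        first_col[c] = running
        running += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first_col, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count exact matches of a non-empty pattern using backward search."""
    first_col, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows matching the suffix so far
    for ch in reversed(pattern):
        if ch not in first_col:
            return 0
        lo = first_col[ch] + occ[ch][lo]
        hi = first_col[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("ACGTACGTAC")
print(count_occurrences("ACG", index))
```

Each pattern character costs two table lookups regardless of the text length, which is the property that makes this family of indexes attractive for read classification. A practical index would sample the occurrence table rather than store it in full.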

“…This classifier was first implemented for organism classification by Sandberg in 2001 on a small set of just 28 genomes, and has since been further extended to a larger database of 635 genomes by Rosen [9,16]. The outputted scores for each fragment are then submitted as features to an unsupervised clustering algorithm.…”

Section: Methods

mentioning

confidence: 99%

“…Most supervised classification methods for metagenomics employ either a homology-based alignment or a composition-based frequency model [8,9,10,11]. However, as mentioned above, either of these techniques can only identify the 1-2% of organisms (those that are known), and perhaps classify another 50-70% to a higher taxonomic level (such as order or phylum).…”

Section: Neural Network-based Taxonomic Clustering For Metagenomics

mentioning

confidence: 99%

Neural network-based taxonomic clustering for metagenomics

Essinger, Polikar, Rosen 2010

The 2010 International Joint Conference on Neural Networks (IJCNN)

Self Cite

Abstract: Metagenomic studies inherently involve sampling genetic information from an environment potentially containing thousands of distinct microbial organisms. This genetic information is sequenced, producing many short fragments (<500 base pairs (bp)); each is tentatively a small representative of the DNA coding structure. Any of the fragments may belong to any of the organisms in the sample, but the relationship is unknown a priori. Furthermore, most of these organisms have not been identified and are correspondingly not represented in any of the publicly available search databases. Our goal is to predict the taxonomic classification of an organism based on the fragments obtained from an environmental sample that may include many (some previously unidentified) organisms. To elucidate the diversity and composition of the sample, we first use a supervised naïve Bayes classifier to score the fragments against known genomes, followed by unsupervised clustering to group fragments from similar organisms together. We are then free to analyze each cluster separately. This is challenging since we are not interested in similar sequences, but in sequences that come from similar genomes, which are known to vary widely intra-genomically. Our dataset comprises an extremely challenging scenario involving clustering fragments at the phylum level, where none of the phyla have been previously seen or identified. We present two variations of our proposed approach, one based on ART and one on K-means. We show that ART can cluster 500 bp fragments from 17 novel phyla at an overall isolation/grouping that is 10% better than K-means and nearly 7 times better than chance.
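The scoring step described above, a naïve Bayes classifier over N-mer frequency profiles, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a uniform prior over genomes, add-one smoothing, and a plain A/C/G/T alphabet, and the names `train`, `score`, and `classify` as well as the two toy "genomes" are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def nmer_counts(seq, n):
    """Count overlapping N-mers in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train(genomes, n, pseudocount=1.0):
    """Return {label: {nmer: log P(nmer | label)}} with additive smoothing (assumed)."""
    vocab = ["".join(p) for p in product("ACGT", repeat=n)]
    model = {}
    for label, seq in genomes.items():
        counts = nmer_counts(seq, n)
        total = sum(counts[w] for w in vocab) + pseudocount * len(vocab)
        model[label] = {
            w: math.log((counts[w] + pseudocount) / total) for w in vocab
        }
    return model


def score(fragment, model, n):
    """Log-likelihood of the fragment under each genome, treating N-mers as independent."""
    frag = nmer_counts(fragment, n)
    return {
        label: sum(c * logp[w] for w, c in frag.items() if w in logp)
        for label, logp in model.items()
    }


def classify(fragment, model, n):
    """Pick the genome with the highest score (uniform prior assumed)."""
    scores = score(fragment, model, n)
    return max(scores, key=scores.get)


# Toy training data: hypothetical stand-ins for reference genomes.
genomes = {
    "at_rich": "ATATATATTTAAATATATAT",
    "gc_rich": "GCGCGGCCGCGCGGGCCCGC",
}
model = train(genomes, n=2)
print(classify("ATATTA", model, n=2))
```

The per-genome score vector returned by `score` is the kind of feature that a downstream unsupervised clustering step could consume. N-mers containing characters outside the assumed alphabet are simply skipped here.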

“…Most of the existing clustering methods are supervised and depend on the availability of reference data for training [15,3,19,5]. A metagenome may however, contain reads from unexplored phyla which cannot be labeled into one of the existing classes.…”

Section: Related Work

mentioning

confidence: 99%

“…The dominant patterns in the data are captured by its component distributions. Most mixture models assume an underlying normal distribution [19]. However, the distribution of word counts within a genome vary according to a Poisson distribution [17,18].…”

Section: Related Work

mentioning

confidence: 99%

A Two-Way Bayesian Mixture Model for Clustering in Metagenomics

Prabhakara, Acharya 2011

Lecture Notes in Computer Science

Abstract. We present a new and efficient Bayesian mixture model based on Poisson and Multinomial distributions for clustering metagenomic reads by their species of origin. We use the relative abundance of different words along a genome to distinguish reads from different species. The distribution of word counts within a genome is accurately represented by a Poisson distribution. The Multinomial mixture model is derived as a standardized Poisson mixture model. The Bayesian network efficiently encodes the conditional dependencies between word counts in a DNA sequence due to overlaps and hence is most consistent with the data. We present a two-way mixture model that captures the high dimensionality and sparsity associated with the data. Our method can cluster reads as short as 50 bp with accuracy over 80%. The Bayesian mixture models clearly outperform their Naive Bayes counterparts on datasets of varying abundances, divergences and read lengths. Our method attains accuracy comparable to that of the state-of-the-art Scimm and converges at least 5 times faster than Scimm in all the cases tested. The reduced time our method takes to obtain accurate results is highly significant and justifies its use for evaluating large metagenome datasets.
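To make the Poisson mixture idea concrete, the sketch below computes the posterior probability that a single read belongs to each mixture component, given per-word Poisson rates. It is deliberately the simpler variant that treats word counts as independent, which the abstract contrasts with its Bayesian network; it is not the authors' two-way model. The function names, rates, and mixing weights are hypothetical values chosen for illustration.

```python
import math


def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing count k at rate lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def responsibilities(counts, rates, weights):
    """Posterior component probabilities for one read.

    counts:  word counts observed in the read
    rates:   rates[j][w] is the assumed Poisson rate of word w in component j
    weights: mixing proportions, one per component
    """
    log_joint = [
        math.log(pi) + sum(poisson_log_pmf(k, lam) for k, lam in zip(counts, lam_j))
        for pi, lam_j in zip(weights, rates)
    ]
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Two hypothetical species with opposite preferences for two words.
rates = [[4.0, 0.5], [0.5, 4.0]]
weights = [0.5, 0.5]
print(responsibilities([5, 0], rates, weights))
```

In an expectation-maximization loop, this calculation would be the E-step, and the rates and weights would be re-estimated from the resulting responsibilities.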
