Improving MinHash via the containment index with applications to metagenomic analysis

Koslicki, David; Zabeti, Hooman

doi:10.1016/j.amc.2019.02.018

Cited by 32 publications

(71 citation statements)

References 15 publications

“…We use KMC [18] to generate k-mers for the reads and each reference genome, and then, we utilize an implementation of the theoretical concept of containment min hash [14] (called CMash) to estimate the percent of k-mers in each reference genome that are also present in the reads (the "Methods" section). Intuitively, this gives us an estimate of how likely it is for each reference genome to be present in the sample.…”

Section: Methods Overview (mentioning)

confidence: 99%

“…Alignment-based profiling is regarded as highly accurate, but aligning millions of reads against a reference database of tens to hundreds of gigabytes (GB) in size is computationally infeasible. Metalign minimizes computational cost with a high-speed, high-recall pre-filtering method based on the mathematical concept of containment min hash [14], which identifies a small number of candidate organisms that are potentially in the sample and creates a subset database consisting of these organisms. This pre-filtering approach reduced our comprehensive NCBI-based database of 243 GB, often by more than 100-fold, with some variance depending on the diversity of the sample.…”

Section: Introduction (mentioning)

confidence: 99%

Metalign: efficient alignment-based metagenomic profiling via containment min hash

et al. 2020

Self Cite

Metagenomic profiling, predicting the presence and relative abundances of microbes in a sample, is a critical first step in microbiome analysis. Alignment-based approaches are often considered accurate yet computationally infeasible. Here, we present a novel method, Metalign, that performs efficient and accurate alignment-based metagenomic profiling. We use a novel containment min hash approach to pre-filter the reference database prior to alignment and then process both uniquely aligned and multi-aligned reads to produce accurate abundance estimates. In performance evaluations on both real and simulated datasets, Metalign is the only method evaluated that maintained high performance and competitive running time across all datasets.
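As a rough illustration of the pre-filtering idea described in this abstract, the sketch below computes an exact containment index (the fraction of a reference genome's k-mers that also occur in the reads) and keeps only references above a cutoff. It is not Metalign's or CMash's code: it uses exact Python sets rather than a min hash sketch, and the function names, the `k=4` default, and the `cutoff=0.5` threshold are hypothetical choices made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(reference_kmers, sample_kmers):
    """Fraction of reference k-mers that also occur in the sample."""
    if not reference_kmers:
        return 0.0
    return len(reference_kmers & sample_kmers) / len(reference_kmers)


def prefilter(references, reads, k=4, cutoff=0.5):
    """Keep the names of references whose containment in the reads meets the cutoff.

    `references` maps a (hypothetical) genome name to its sequence;
    `cutoff` is an illustrative threshold, not a value taken from the paper.
    """
    sample = set()
    for read in reads:
        sample |= kmers(read, k)
    return [
        name
        for name, seq in references.items()
        if containment(kmers(seq, k), sample) >= cutoff
    ]


# Toy data, invented for this example.
references = {"genomeA": "ACGTACGTGA", "genomeB": "TTTTTTTTTT"}
reads = ["ACGTACG", "GTACGTGA"]
print(prefilter(references, reads))  # → ['genomeA']
```

Only the surviving candidates would then be passed to the (much more expensive) alignment step.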

“…First, KMC [17] is used to enumerate the k-mers in the reads, with the k-mers of the reference genomes having been pre-computed by KMC, and then intersect these sets. We then utilize the containment MinHash similarity metric (presented theoretically in [14] via an implementation by one of the coauthors (Koslicki) called CMash to efficiently estimate the similarity/containment index) between each reference genome and the input sample. The containment index is closely related to the Jaccard index, and, in this case, refers to the percent of k-mers in a reference genome that are also present in the reads.…”

Section: Database Pre-filtering With CMash (mentioning)

confidence: 99%

Metalign: Efficient alignment-based metagenomic profiling via containment min hash

LaPierre, Alser, Eskin et al. 2020

Preprint

Self Cite

Whole-genome shotgun sequencing enables the analysis of microbial communities in unprecedented detail, with major implications in medicine and ecology. Predicting the presence and relative abundances of microbes in a sample, known as "metagenomic profiling", is a critical first step in microbiome analysis. Existing profiling methods have been shown to suffer from poor false positive or false negative rates, while alignment-based approaches are often considered accurate but computationally infeasible. Here we present a novel method, Metalign, that addresses these concerns by performing efficient alignment-based metagenomic profiling. We use a containment min hash approach to reduce the reference database size dramatically before alignment and a method to estimate organism relative abundances in the sample by resolving reads aligned to multiple genomes. We show that Metalign achieves significantly improved results over existing methods on simulated datasets from a large benchmarking study, CAMI, and performs well on in vitro mock community data and environmental data from the Tara Oceans project. Metalign is freely available at https://github.com/nlapier2/Metalign, along with the results and plots used in this paper, and a docker image is also available at https://hub.docker.com/repository/docker/nlapier2/metalign.
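The citation statement for this preprint notes that the containment index is closely related to the Jaccard index. The sketch below is one simplified way to see that relationship: it samples the reference k-mers with a bottom-s sketch (the `size` k-mers with the smallest hash values), estimates containment as the fraction of sampled k-mers found in the sample, and converts the result to a Jaccard estimate via J = |A|·C / (|A| + |B| − |A|·C). This is an illustrative variant, not the CMash implementation; the function names, the SHA-1-based hash, and the plain Python set used for membership tests are all assumptions.

```python
import hashlib


def stable_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_sketch(kmer_set, size):
    """The `size` k-mers with the smallest hash values: a pseudo-random sample."""
    return sorted(kmer_set, key=stable_hash)[:size]


def estimate_containment(reference_kmers, sample_kmers, size=100):
    """Estimate |ref ∩ sample| / |ref| from a sample of the reference k-mers."""
    sketch = bottom_sketch(reference_kmers, size)
    if not sketch:
        return 0.0
    hits = sum(1 for kmer in sketch if kmer in sample_kmers)
    return hits / len(sketch)


def jaccard_from_containment(c, n_ref, n_sample):
    """Convert a containment value to a Jaccard value given both set sizes."""
    intersection = c * n_ref
    union = n_ref + n_sample - intersection
    return intersection / union if union else 0.0


# Toy k-mer sets, invented for this example.
ref = {"AAAA", "AAAC", "AACG", "ACGT"}
sample = {"AAAA", "ACGT", "TTTT"}
c = estimate_containment(ref, sample, size=10)
print(c, jaccard_from_containment(c, len(ref), len(sample)))
```

When `size` is at least the number of reference k-mers, the estimate equals the exact containment; smaller sketches trade accuracy for less work per reference genome.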

“…However, Blast is not designed for high-throughput metagenomic reads classification, and it is computationally expensive to get local alignments for hundreds of thousands and millions of reads. Other alignment-based techniques can be mapping-based, using methods like the Burrows-Wheeler transform (BWT) or variants of hash tables [11, 34]. While enabling fast queries, mapping tools need to spend “training time” to compress/prefilter the reference database.…”

Section: Related Work (mentioning)

confidence: 99%

Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life

2020

Background: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially-growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of “incremental learning” addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data. Results: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model’s knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4th of the non-incremental time with no accuracy loss. Conclusions: It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing or the need to access the existing database, and therefore saves storage as well as computation resources.
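To make the incremental-learning idea in this abstract concrete, here is a minimal sketch of a k-mer naïve Bayes classifier whose counts can be updated with new genomes without revisiting earlier training data. It is not the authors' implementation: the class name, the `k=3` default, the add-one smoothing, and the uniform class prior are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


class IncrementalKmerNB:
    """Toy multinomial naive Bayes over k-mers with incremental updates."""

    def __init__(self, k=3):
        self.k = k
        self.counts = defaultdict(Counter)  # label -> k-mer counts
        self.totals = Counter()             # label -> total k-mers seen
        self.vocab = set()                  # all k-mers seen so far

    def _kmers(self, seq):
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]

    def update(self, seq, label):
        """Add one training sequence; earlier sequences are never reprocessed."""
        found = self._kmers(seq)
        self.counts[label].update(found)
        self.totals[label] += len(found)
        self.vocab.update(found)

    def classify(self, read):
        """Return the label with the highest smoothed log-likelihood."""
        found = self._kmers(read)
        vocab_size = max(len(self.vocab), 1)
        best_label, best_score = None, -math.inf
        for label, label_counts in self.counts.items():
            denom = self.totals[label] + vocab_size  # add-one smoothing
            score = sum(math.log((label_counts[km] + 1) / denom) for km in found)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy sequences, invented for this example.
model = IncrementalKmerNB(k=3)
model.update("AAAAAAAAAA", "A_rich")
model.update("CCCCCCCCCC", "C_rich")
first = model.classify("AAAAA")

# Later, a new genome arrives: only its own k-mers are counted.
model.update("GGGGGGGGGG", "G_rich")
second = model.classify("GGGG")
print(first, second)
```

Because the model stores only per-label counts, adding a genome costs time proportional to that genome's length, which is the property the abstract attributes to incremental training.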
