Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the community-driven initiative for the Critical Assessment of Metagenome Interpretation (CAMI). In its second challenge, CAMI engaged the community to assess their methods on realistic and complex metagenomic datasets with long and short reads, created from ∼1,700 novel and known microbial genomes, as well as ∼600 novel plasmids and viruses. Altogether 5,002 results by 76 program versions were analyzed, representing a 22x increase in results.Substantial improvements were seen in metagenome assembly, some due to using long-read data. The presence of related strains still was challenging for assembly and genome binning, as was assembly quality for the latter. Taxon profilers demonstrated a marked maturation, with taxon profilers and binners excelling at higher bacterial taxonomic ranks, but underperforming for viruses and archaea. Assessment of clinical pathogen detection techniques revealed a need to improve reproducibility. Analysis of program runtimes and memory usage identified highly efficient programs, including some top performers with other metrics. The CAMI II results identify current challenges, but also guide researchers in selecting methods for specific analyses.
Background It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially-growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of “incremental learning” addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data. Results We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model’s knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4th of the non-incremental time with no accuracy loss. Conclusions It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing nor the access to the existing database and therefore save storage as well as computation resources.
Current metagenomic taxonomic classifiers cannot computationally keep up with the pace of training data generated from genome sequencing projects, such as the exponentially-growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data. We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4 th of the non-incremental time with no accuracy loss. In conclusion, it is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. 1/19Figure 1. Number of updates in the NCBI Bacteria Genome database A: Accumulative number of genomes updates per year; B: compared with last year, the number of new updates per yearreads in metagenome sequencing data -uses aligners, read mappers, classifiers and other "base techniques" to solve this problem [6][7][8][9][10]. Taxonomic classification is usually one of the first steps in a metagenomic pipeline [11]. Once these organisms are identified, they are then used in downstream analyses, such as alpha/beta diversity measures, ordination, feature selection, phenotype classification, etc.However, while many methods have been proposed for taxonomic classification [12,13], the accuracy of these methods using different training databases has not been fully tested. This is an important issue, because as new genome data are generated, training data sets, such as the commonly used the NCBI Reference Sequence Database (RefSeq) will change over time. Nasko et al. [14] recently demonstrated that more reads are classified (as opposed to be assigned to an unclassified/unknown class) by the Kraken classifier with newer database versions. Nasko's analysis also suggested that changes in RefSeq over time may influence but not necessarily reduce the misclassification rate, with genus/species false positives shown to be 1% and 8% respectively. These metrics were, however, calculated only on a selection of 10 genomes. Accordingly, given that there is an ongoing rapid expansion of genomic data of microbial diversity, it is imperative to update taxonomic classifiers as new genomes/genes are discovered. Simply failing to update the model will result in lower accuracy due to incomplete knowledge. In addition, the way that most researchers tra...
The Vehicle Routing Problem (VRP) is an NP hard problem where we need to optimize itineraries for agents to visit multiple targets. When considering real-world travel (road-network topology, speed limits and traffic), modern VRP solvers can only process small instances with a few hundred targets. We propose a framework (VRPDiv) that can scale any solver to support larger VRP instances with up to ten thousand targets (10k) by dividing them into smaller clusters. VRPDiv supports the multiple VRP scenarios and contains a pool of clustering algorithms from which it chooses the ideal one depending on properties of the instance. VRPDiv assigns agents based on cluster demand and targets compatibility (i.e. realizable time-windows and capacity limitations). We incorporate the framework into the Bing Maps Multi-Itinerary Optimization (MIO) 1 online service. This architecture allows MIO to scale up from solving instances with a few hundred to over 10k targets in under 10 minutes. We evaluate our framework on public datasets and publish a new dataset ourselves, as large enough instances supporting real-world travel were impossible to find. We investigate multiple clustering methods and show that choosing the correct one is critical with differences of up to 60% in quality. We compare with relevant baselines and report a 40% improvement in target allocation and a 9.8% improvement in itinerary durations. We compare with existing scores and report an average delta of 10%, with lower values (<5%) in instances with low workload (few targets per agent), which are acceptable for an online service.
In this paper, we describe multi-itinerary optimization (MIO)---a novel Bing Maps service that automates the process of building itineraries for multiple agents while optimizing their routes to minimize travel time or distance. MIO can be used by organizations with a fleet of vehicles and drivers, mobile salesforce, or a team of personnel in the field, to maximize workforce efficiency. It supports a variety of constraints, such as service time windows, duration, priority, pickup and delivery dependencies, and vehicle capacity. MIO also considers traffic conditions between locations, resulting in algorithmic challenges at multiple levels (e.g., calculating time-dependent travel-time distance matrices at scale and scheduling services for multiple agents). To support an end-to-end cloud service with turnaround times of a few seconds, our algorithm design targets a sweet spot between accuracy and performance. Toward that end, we build a scalable approach based on the ALNS metaheuristic. Our experiments show that accounting for traffic significantly improves solution quality: MIO finds efficient routes that avoid late arrivals, whereas traffic-agnostic approaches result in a 15% increase in the combined travel time and the lateness of an arrival. Furthermore, our approach generates itineraries with substantially higher quality than a cutting-edge heuristic (LKH), with faster running times for large instances.
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
334 Leonard St
Brooklyn, NY 11211
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.