MetaShot: an accurate workflow for taxon classification of host-associated microbiome from shotgun metagenomic data

Fosso, Bruno; Santamaría, Mónica; D’Antonio, Mattia; Lovero, Domenica; Corrado, Giacomo; Vizza, Enrico; Passaro, Nunzia; Garbuglia, Anna Rosa; Capobianchi, Maria Rosaria; Crescenzi, Marco; Valiente, Gabriel; Pesole, Graziano

doi:10.1093/bioinformatics/btx036

Cited by 21 publications

(33 citation statements)

References 12 publications

“…We have implemented the set cover approach to taxonomic annotation in a next release of the TANGO software (Clemente et al, 2011; Alonso et al, 2013), which belongs in the BioMaS (Fosso et al, 2015) and MetaShot (Fosso et al, 2017) pipelines. The new implementation of TANGO consists of the following: a first Python script for extracting the candidate matches for each read from the BLAST output, a second Python script for taxonomic annotation using the NCBI Taxonomy (Federhen, 2012, 2015), based on the ETE Toolkit (Huerta-Cepas et al, 2016), a third Python script for taxonomic annotation using the Greengenes taxonomy (McDonald et al, 2012), a fourth Python script for resolving any remaining ambiguities by finding an exact solution to a set cover problem with the least total size of subsets, based on Gurobi Optimizer (Gurobi Optimization, Inc., 2017), and a fifth Python script for obtaining the relative abundance profile of the metagenomic sample.…”

Section: Results (mentioning)

confidence: 99%

“…Annotating a read as coming from the LCA of the candidate sequences in a reference taxonomy (Huson and Weber, 2013) maximizes precision, as in that case there are no TN and no FN, but at the expense of specificity, because the number of FP in a reference taxonomy can be very large. Annotating a read as coming from an internal node with the largest F-measure value (Clemente et al, 2011; Alonso et al, 2013; Fosso et al, 2015, 2017) minimizes the classification error as a combination of precision and sensitivity.…”

Section: Introduction (mentioning)

confidence: 99%
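
The contrast in the statement above, between annotating a read at the LCA of its candidate sequences and annotating it at the node with the largest F-measure, can be illustrated with a minimal Python sketch. The toy taxonomy, the function names, and the way TP, FP and FN are counted (over the leaves below a node, against the read's candidate matches) are assumptions made for illustration; this is not the TANGO implementation.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy taxonomy (child -> parent); the root has parent None.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "A": "root", "B": "root",
    "a1": "A", "a2": "A", "a3": "A",
    "b1": "B", "b2": "B", "b3": "B",
}

def ancestors(node: str) -> List[str]:
    """Path from a node up to the root, the node itself first."""
    path = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path

def lca(nodes: List[str]) -> str:
    """Lowest common ancestor of a non-empty list of nodes."""
    common = set(ancestors(nodes[0]))
    for n in nodes[1:]:
        common &= set(ancestors(n))
    # The first common node on the upward path is the deepest one.
    return next(a for a in ancestors(nodes[0]) if a in common)

def leaves_under(node: str) -> Set[str]:
    """All leaves in the subtree rooted at the given node."""
    internal = {p for p in PARENT.values() if p is not None}
    return {n for n in PARENT if n not in internal and node in ancestors(n)}

def f_measure(node: str, matches: Set[str]) -> float:
    """F-measure of annotating a read at `node` (assumed counting scheme)."""
    under = leaves_under(node)
    tp = len(under & matches)   # matched leaves below the node
    fp = len(under - matches)   # unmatched leaves below the node
    fn = len(matches - under)   # matched leaves outside the node
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Candidate matches of one hypothetical read.
matches = {"a1", "a2", "a3", "b1"}
lca_node = lca(sorted(matches))
# Every node of the toy taxonomy lies under the LCA here, so all are candidates.
best_node = max(sorted(PARENT), key=lambda n: f_measure(n, matches))
```

With these toy inputs the LCA is the root, which covers every match but also two unmatched leaves, while the node with the largest F-measure is the internal node "A".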

Unbiased Taxonomic Annotation of Metagenomic Samples

Fosso, Pesole, Rosselló et al. 2017

Bioinformatics Research and Applications

The classification of reads from a metagenomic sample using a reference taxonomy is usually based on first mapping the reads to the reference sequences and then classifying each read at a node under the lowest common ancestor of the candidate sequences in the reference taxonomy with the least classification error. However, this taxonomic annotation can be biased by an imbalanced taxonomy and also by the presence of multiple nodes in the taxonomy with the least classification error for a given read. In this article, we show that the Rand index is a better indicator of classification error than the often used area under the receiver operating characteristic (ROC) curve and F-measure for both balanced and imbalanced reference taxonomies, and we also address the second source of bias by reducing the taxonomic annotation problem for a whole metagenomic sample to a set cover problem, for which a logarithmic approximation can be obtained in linear time and an exact solution can be obtained by integer linear programming. Experimental results with a proof-of-concept implementation of the set cover approach to taxonomic annotation in a next release of the TANGO software show that the set cover approach further reduces ambiguity in the taxonomic annotation obtained with TANGO without distorting the relative abundance profile of the metagenomic sample.
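
The set cover formulation mentioned in this abstract can be sketched with the standard greedy approximation, which repeatedly picks the subset covering the most still-uncovered elements. The mapping from taxa to the reads they could explain, the taxon and read names, and the function name are hypothetical; this simple version is not linear-time and is not the exact integer linear programming solution the abstract refers to.

```python
from typing import Dict, List, Set

def greedy_set_cover(candidates: Dict[str, Set[str]]) -> List[str]:
    """Pick taxa until every read is covered, largest gain first.

    `candidates` maps a taxon to the set of reads it could annotate
    (an assumed input format, for illustration only).
    """
    uncovered: Set[str] = set().union(*candidates.values())
    chosen: List[str] = []
    while uncovered:
        # Iterate in sorted order so ties are broken deterministically.
        best = max(sorted(candidates),
                   key=lambda taxon: len(candidates[taxon] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

# Hypothetical ambiguous reads: r1 and r3 each have two candidate taxa.
example = {
    "taxonX": {"r1", "r2", "r3"},
    "taxonY": {"r3", "r4"},
    "taxonZ": {"r1"},
}
cover = greedy_set_cover(example)
```

In this toy example two taxa suffice to explain all four reads, so the ambiguous reads r1 and r3 would both be assigned to "taxonX" rather than spread over three taxa.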

“…Increasing threshold to 100 again decreased the number of filtered eukaryotic reads (to 74-86%) with only slight improvement on the number of retained virus reads (0-9%). For the simulated metagenome (Fosso et al, 2017), setting the threshold to 50 results in filtering 99.93% of host reads and only 0.05% of viral reads (excluding the endogenous retroviral reads, which are filtered to a large extent). Thus, we selected 50 as a working threshold although we recognize that a more robust optimization can be performed.…”

Section: Unix Pipeline For Assembly Taxonomic Profiling and Binning (mentioning)

confidence: 99%

“…Pipelines in the third group, such as MetaPhlan2 (Truong et al, 2015), Kraken2 (Wood et al, 2019) and Centrifuge (Kim et al, 2016a), can perform composition analysis for all known taxa. There are also a number of tools, pipelines, and algorithms for virus discovery, including Genome Detective (Vilsker et al, 2019), VIP (Li et al, 2016), PathSeq (Kostic et al, 2011), SURPI (Ho and Tzanetakis, 2014), READSCAN (Naeem et al, 2013), VirusFinder (Wang et al, 2013) and MetaShot (Fosso et al, 2017). Most of these tools depend exclusively on nucleotide-level sequence alignments and can detect viruses with highly similar sequences to a known virus.…”

Section: Introduction (mentioning)

confidence: 99%

Novel NGS Pipeline for Virus Discovery from a Wide Spectrum of Hosts and Sample Types

Plyusnin, Kant, Jääskeläinen et al. 2020

Preprint

The study of microbiome data holds great potential for elucidating the biological and metabolic functioning of living organisms and their role in the environment. Metagenomic analyses have shown that humans, along with e.g. domestic animals, wildlife and arthropods, are colonized by an immense community of viruses. The current Coronavirus pandemic (COVID-19) heightens the need to rapidly detect previously unknown viruses in an unbiased way. The increasing availability of metagenomic data in this era of next-generation sequencing (NGS), along with increasingly affordable sequencing technologies, highlights the need for reliable and comprehensive methods to manage such data. In this article, we present a novel stand-alone pipeline called LAZYPIPE for identifying both previously known and novel viruses in host-associated or environmental samples and give examples of virus discovery based on it. LAZYPIPE is a Unix-based pipeline for automated assembling and taxonomic profiling of NGS libraries implemented as a collection of C++, Perl, and R scripts.

“…Thus, it allows for an unbiased diagnostic analysis. There is a variety of tools able to address NGS-based pathogen related questions with different focuses: either aiming to discover yet unknown genomes [5-22] or to detect known species in a sample [23-40]. Among both groups, there are different underlying algorithms, the main distinction running between alignment-based [15-17, 19, 23, 25, 26, 28-31, 33, 35-37, 39, 40] and alignment-free methods [6, 9, 12, 21, 32, 38].…”

Section: Introduction (mentioning)

confidence: 99%

PathoLive – Real-time pathogen identification from metagenomic Illumina datasets

Tausch, Loka et al. 2018

Preprint

Over the past years, NGS has been applied in time critical applications such as pathogen diagnostics with promising results. Yet, long turnaround times have to be accepted to generate sufficient data, as the analysis can only be performed sequentially after the sequencing has finished. Additionally, the interpretation of results can be further complicated by various types of contaminations, clinically irrelevant sequences, and the sheer amount and complexity of the data. We designed and implemented PathoLive, a real-time diagnostics pipeline which allows the detection of pathogens from clinical samples up to several days before the sequencing procedure is even finished and currently available tools may start to run. We adapted the core algorithm of HiLive, a real-time read mapper, and enhanced its accuracy for our use case. Furthermore, common contaminations, low-entropy areas, and sequences of widespread, nonpathogenic organisms are automatically marked beforehand using NGS datasets from healthy humans as a baseline. The results are visualized in an interactive taxonomic tree that provides an intuitive overview and detailed measures regarding the relevance of each identified potential pathogen. We applied the pipeline on a human plasma sample that was spiked in vitro with vaccinia virus, yellow fever virus, mumps virus, Rift Valley fever virus, adenovirus, and mammalian orthoreovirus. The sample was then sequenced on an Illumina HiSeq. All spiked agents were detected after the completion of only 12% of the sequencing procedure and were ranked more accurately throughout the run than by any of the tested tools on the complete data. We also found a large number of other sequences and these were correctly marked as clinically irrelevant in the resulting visualization. This tagging allows the user to obtain the correct assessment of the situation at first glance.
