2020
DOI: 10.1093/nar/gkaa265

To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics

Abstract: As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These …
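
To make the MinHash reference in the abstract concrete, here is a minimal sketch of a bottom-k MinHash estimate of the Jaccard similarity between two k-mer sets. It is an illustration under stated assumptions, not the method of any particular tool: the function names (`kmer_set`, `bottom_k_sketch`, `jaccard_estimate`), the use of SHA-1 as the hash, and the sketch size are all hypothetical choices made for this example.

```python
import hashlib


def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_k_sketch(kmers, size):
    """Keep only the `size` smallest hash values of the k-mer set."""
    return sorted(hash_kmer(x) for x in kmers)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Hypothetical toy sequences; real inputs would be genomes or read sets.
a = bottom_k_sketch(kmer_set("ACGTACGTTGCAACGT", 4), size=8)
b = bottom_k_sketch(kmer_set("ACGTACGTTGCATTTT", 4), size=8)
similarity = jaccard_estimate(a, b, size=8)
```

The point of the design is that each dataset is reduced to a fixed number of hash values, so two datasets can be compared without keeping their full k-mer sets in memory.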

Cited by 24 publications (19 citation statements)
References: 99 publications
“…k-mers and their frequencies can be obtained with a linear scan of a dataset. However, due to the massive size of the modern datasets and the exponential growth of the k-mers number (with respect to k), the extraction of k-mers is an extremely computationally intensive task, both in terms of running time and memory (Elworth et al., 2020), and several algorithms have been proposed to reduce the running time and memory requirements (see Section 1.2). Nonetheless, the extraction of all k-mers and their frequencies from a reads dataset is still highly demanding in terms of time and memory [e.g.…”
Section: Introduction (mentioning)
confidence: 99%
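
The linear scan described in the statement above can be sketched as follows. This is a minimal illustration, assuming reads are plain strings over the DNA alphabet; the function name `count_kmers` and the toy reads are hypothetical and not taken from the cited papers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in an iterable of reads with one linear scan."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k across the read, one position at a time.
        # Reads shorter than k contribute nothing (the range is empty).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical toy input; real datasets contain millions of reads.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, 3)

# Over the alphabet {A, C, G, T} there are at most 4**k distinct k-mers,
# which is the exponential growth in k that the statement refers to.
max_distinct = 4 ** 3
```

The scan itself is linear in the input size; the cost the statement points to comes from the table of counts, whose number of keys can grow up to 4^k (bounded by the number of k-mer positions in the data).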
“…The problem of exactly counting k-mers in datasets has been extensively studied, with several methods proposed for its solution (Audano and Vannberg, 2014; Kokot et al., 2017; Kurtz et al., 2008; Marçais and Kingsford, 2011; Melsted and Pritchard, 2011; Pandey et al., 2017; Rizk et al., 2013; Roy et al., 2014). Such methods are typically highly demanding in terms of time and memory when analyzing large high-throughput sequencing datasets (Elworth et al., 2020). For this reason, many methods have been recently developed to compute approximations of the k-mers abundances to reduce the computational cost of the task (e.g.…”
Section: Introduction (mentioning)
confidence: 99%
“…Several data analytics tasks require analysing massive data stream such as real-time IP traffic analysis [7], metagenomics [6], email/tweets/SMS, time-series data [22], web clicks and crawls, sensor/IoT readings [18]. In many of these applications we may not have enough space/memory available to store the entire data stream.…”
Section: Introduction (mentioning)
confidence: 99%
“…Sketching algorithms, or simply sketches, are compact randomized data structures that can be easily updated and queried to perform a time and memory efficient estimation of statistics of large data streams of tokens. Sketches have found applications in machine learning (Aggarwal and Yu, 2010), security analysis (Dwork et al, 2010), natural language processing (Goya et al, 2009), computational biology (Zhang et al, 2014;Leo Elworth et al, 2020), social networks (Song et al, 2009) and games (Harrison, 2010). Of particular interest is the problem of estimating the frequency of a token in the stream, also referred to as "point query", and more generally the problem of estimating the overall frequency of a collection of s ≥ 1 tokens, also referred to as "s-range query".…”
Section: Introduction (mentioning)
confidence: 99%
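
As an illustration of the "point query" mentioned in the statement above, here is a minimal Count-Min sketch, one well-known example of such a compact randomized structure. The statement does not name a specific sketch, so this choice is an assumption, as are the class name `CountMinSketch`, the default width and depth, and the use of salted SHA-256 to derive one hash per row.

```python
import hashlib


class CountMinSketch:
    """A small Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, token, row):
        # Derive a separate hash per row by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{token}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, token, count=1):
        """Add `count` occurrences of `token` to every row."""
        for row in range(self.depth):
            self.table[row][self._index(token, row)] += count

    def query(self, token):
        """Point query: the minimum counter across rows.

        Hash collisions can only inflate counters, so the result
        never underestimates the true frequency.
        """
        return min(
            self.table[row][self._index(token, row)]
            for row in range(self.depth)
        )


# Hypothetical stream of tokens (here, k-mers).
sketch = CountMinSketch(width=64, depth=3)
for token in ["ACG"] * 5 + ["CGT"] * 2:
    sketch.update(token)
estimate = sketch.query("ACG")
```

Memory use is fixed at `width * depth` counters regardless of how many distinct tokens the stream contains, at the price of a one-sided error: estimates may overshoot but never fall below the true frequency.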