To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics

Elworth, R. A. Leo; Wang, Qi; Kota, Pavan K.; Barberan, CJ; Coleman, Benjamin; Balaji, Advait; Gupta, Gaurav; Baraniuk, Richard G.; Shrivastava, Anshumali; Treangen, Todd J.

doi:10.1093/nar/gkaa265

Cited by 24 publications

(19 citation statements)

References 99 publications


“…k-mers and their frequencies can be obtained with a linear scan of a dataset. However, due to the massive size of the modern datasets and the exponential growth of the k-mers number (with respect to k), the extraction of k-mers is an extremely computationally intensive task, both in terms of running time and memory (Elworth et al., 2020), and several algorithms have been proposed to reduce the running time and memory requirements (see Section 1.2). Nonetheless, the extraction of all k-mers and their frequencies from a reads dataset is still highly demanding in terms of time and memory [e.g.…”

Section: Introduction (mentioning)

confidence: 99%

“…The problem of exactly counting k-mers in datasets has been extensively studied, with several methods proposed for its solution (Audano and Vannberg, 2014; Kokot et al., 2017; Kurtz et al., 2008; Marçais and Kingsford, 2011; Melsted and Pritchard, 2011; Pandey et al., 2017; Rizk et al., 2013; Roy et al., 2014). Such methods are typically highly demanding in terms of time and memory when analyzing large high-throughput sequencing datasets (Elworth et al., 2020). For this reason, many methods have been recently developed to compute approximations of the k-mers abundances to reduce the computational cost of the task (e.g.…”

Section: Introduction (mentioning)

confidence: 99%


SPRISS: approximating frequent k-mers by sampling reads, and applications

et al. 2022


Motivation: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis.

Results: In this work we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS employs a simple yet powerful reads sampling scheme, which extracts a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset.

Availability: SPRISS is available at https://github.com/VandinLab/SPRISS.

Supplementary information: Supplementary data are available at Bioinformatics online.
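The idea of counting k-mers in a sample of reads rather than in the whole dataset can be illustrated with a minimal Python sketch. This is not the SPRISS algorithm: it uses plain uniform sampling of reads with no accuracy guarantees, and the function names, the `sample_size` and `min_freq` parameters, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Exact k-mer counts obtained with a linear scan over every read."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequent_kmers_from_sample(reads, k, sample_size, min_freq, seed=0):
    """Estimate frequent k-mers from a uniform random sample of reads.

    Hypothetical helper: returns the k-mers whose relative frequency
    in the sample is at least min_freq, mapped to that frequency.
    """
    rng = random.Random(seed)
    sample = rng.sample(reads, min(sample_size, len(reads)))
    counts = count_kmers(sample, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        kmer: c / total
        for kmer, c in counts.items()
        if c / total >= min_freq
    }
```

Any exact k-mer counter could replace `count_kmers` here; the sampling step is what reduces the amount of data that has to be scanned.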


“…Several data analytics tasks require analysing massive data stream such as real-time IP traffic analysis [7], metagenomics [6], email/tweets/SMS, time-series data [22], web clicks and crawls, sensor/IoT readings [18]. In many of these applications we may not have enough space/memory available to store the entire data stream.…”

Section: Introduction (mentioning)

confidence: 99%

Improving Tug-of-War sketch using Control-Variates method

Pratap, Verma, Kulkarni

2021

SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21)


Computing space-efficient summaries, a.k.a. sketches, of large data is a central problem in streaming algorithms. Such sketches are used to answer post-hoc queries in several data analytics tasks. An algorithm for computing sketches is typically required to be fast, accurate, and space-efficient. A fundamental problem in the streaming algorithm framework is that of computing the frequency moments of data streams. The frequency moments of a sequence containing $f_i$ elements of type $i$ are the numbers $F_k = \sum_{i=1}^{n} f_i^k$. This is also called the $\ell_k$ norm of the frequency vector $(f_1, f_2, \ldots, f_n)$. Another important problem is to compute the similarity between two data streams by computing the inner product of the corresponding frequency vectors. The seminal work of Alon, Matias, and Szegedy [2], a.k.a. the Tug-of-war (or AMS) sketch, gives a randomized sublinear-space (and linear-time) algorithm for computing the frequency moments and the inner product between two frequency vectors corresponding to the data streams. However, the variance of these estimates typically tends to be large. In this work, we focus on minimizing the variance of these estimates. We use techniques from the classical Control-Variate method [16], which is primarily known for variance reduction in Monte-Carlo simulations, and as a result we are able to obtain significant variance reduction at the cost of a little computational overhead. We present a theoretical analysis of our proposal and complement it with supporting experiments on synthetic as well as real-world datasets.
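For orientation, a basic Tug-of-war sketch for the second frequency moment and for inner products might look like the Python sketch below. It covers only the plain AMS estimator, not the control-variate variance reduction proposed in the paper. The class name, the `num_counters` parameter, and the use of a BLAKE2 hash bit as a stand-in for a 4-wise independent sign function are assumptions made for illustration.

```python
import hashlib
from statistics import mean


def sign(item, j):
    """Pseudo-random +1/-1 sign for (item, counter j).

    Assumption: one hash bit stands in for the 4-wise independent
    sign family used in the analysis of the AMS sketch.
    """
    digest = hashlib.blake2b(f"{j}:{item}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


class TugOfWarSketch:
    def __init__(self, num_counters=64):
        # Each counter holds sum_i sign_j(i) * f_i for its own sign function.
        self.z = [0] * num_counters

    def update(self, item, count=1):
        """Process `count` occurrences of `item` from the stream."""
        for j in range(len(self.z)):
            self.z[j] += sign(item, j) * count

    def estimate_f2(self):
        """Estimate F_2 = sum_i f_i^2 as the average of squared counters."""
        return mean(z * z for z in self.z)

    def inner_product(self, other):
        """Estimate the inner product of two streams' frequency vectors."""
        return mean(a * b for a, b in zip(self.z, other.z))
```

Averaging over more counters lowers the variance of the estimate at the cost of more space and more work per update, which is the trade-off that variance-reduction techniques aim to improve.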


“…Sketching algorithms, or simply sketches, are compact randomized data structures that can be easily updated and queried to perform a time and memory efficient estimation of statistics of large data streams of tokens. Sketches have found applications in machine learning (Aggarwal and Yu, 2010), security analysis (Dwork et al., 2010), natural language processing (Goya et al., 2009), computational biology (Zhang et al., 2014; Leo Elworth et al., 2020), social networks (Song et al., 2009) and games (Harrison, 2010). Of particular interest is the problem of estimating the frequency of a token in the stream, also referred to as "point query", and more generally the problem of estimating the overall frequency of a collection of s ≥ 1 tokens, also referred to as "s-range query".…”

Section: Introduction (mentioning)

confidence: 99%

Learning-augmented count-min sketches via Bayesian nonparametrics

Dolera, Favaro, Peluchetti

2021

Preprint


The count-min sketch (CMS) is a time and memory efficient randomized data structure that provides estimates of tokens' frequencies in a data stream, i.e. point queries, based on random hashed data. Learning-augmented CMSs improve the CMS by learning models that better exploit data properties. In this paper, we focus on the learning-augmented CMS of Cai, Mitzenmacher and Adams (NeurIPS 2018), which relies on Bayesian nonparametric (BNP) modeling of a data stream via Dirichlet process (DP) priors. This is referred to as the CMS-DP, and it leads to BNP estimates of a point query as posterior means of the point query given the hashed data. While BNP has proved to be a powerful tool for developing robust learning-augmented CMSs, the ideas and methods behind the CMS-DP are tailored to point queries under DP priors, and they cannot be used for other priors or more general queries. In this paper, we present an alternative, and more flexible, derivation of the CMS-DP such that: i) it allows the use of the Pitman-Yor process (PYP) prior, which is arguably the most popular generalization of the DP prior; ii) it can be readily applied to the more general problem of estimating range queries. This leads to the development of a novel learning-augmented CMS under power-law data streams, referred to as the CMS-PYP, which relies on BNP modeling of the stream via PYP priors. Applications to synthetic and real data show that the CMS-PYP outperforms the CMS and the CMS-DP in the estimation of low-frequency tokens; this is known to be a critical feature in natural language processing, where it is indeed common to encounter power-law data streams.
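The baseline that these learning-augmented variants build on can be illustrated with a minimal Python sketch of a standard count-min sketch answering point queries. It does not implement the CMS-DP or CMS-PYP estimators. The class name, the default `width` and `depth`, and the use of BLAKE2 as the hash family are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, token, row):
        """Hash `token` into a bucket of the given row.

        Assumption: a row-salted BLAKE2 hash stands in for the
        pairwise-independent hash family of the standard analysis.
        """
        digest = hashlib.blake2b(f"{row}:{token}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, token, count=1):
        """Record `count` occurrences of `token`."""
        for row in range(self.depth):
            self.table[row][self._bucket(token, row)] += count

    def point_query(self, token):
        """Estimate the frequency of `token`.

        Collisions can only add to a counter, so with non-negative
        updates the minimum over rows never underestimates.
        """
        return min(
            self.table[row][self._bucket(token, row)]
            for row in range(self.depth)
        )
```

Because the estimate is biased upward by collisions, the relative error is largest for low-frequency tokens, which is the regime the abstract highlights.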

