Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems

Pan, Tony; Flick, Patrick; Jain, Chirag; Liu, Yongchao; Aluru, Srinivas

doi:10.1109/tcbb.2017.2760829

Cited by 19 publications

(18 citation statements)

References 43 publications


“…Since the key features of fastv rely on unique k-mer mapping and extension, it is important to obtain high quality unique k-mer sets for microorganisms of interest. Although a number of k-mer generation tools are currently available [30,31], none are suitable for our application because we must both generate unique k-mers for tens of thousands of viruses and/or microorganisms, and filter the k-mer keys based on the reference genome. These unmet needs have led us to develop UniqueKMER, a new unique k-mer generation tool.…”

Section: Uniquekmer: Efficient Unique K-mer Generation For Large Data (mentioning)

confidence: 99%
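
The requirement described in this statement (k-mers unique to one genome within a large set, filtered against a reference genome) can be illustrated with a small sketch. This is not UniqueKMER's actual algorithm; the function names `kmers` and `unique_kmers` are hypothetical, reverse complements are ignored, and everything is held in memory, which would not scale to tens of thousands of genomes.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield each k-length window of seq, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer

def unique_kmers(genomes, k, reference=""):
    """Map each genome name to the k-mers found in that genome only
    and absent from the reference sequence (hypothetical helper)."""
    owners = defaultdict(set)          # k-mer -> names of genomes containing it
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    banned = set(kmers(reference, k))  # k-mers to filter out
    result = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1 and kmer not in banned:
            result[next(iter(names))].add(kmer)
    return result
```

For example, `unique_kmers({"a": "ACGTA", "b": "CGTAC"}, 3)` keeps only the 3-mers that occur in exactly one of the two toy genomes.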

A Computational Toolset for Rapid Identification of SARS-CoV-2, other Viruses, and Microorganisms from Sequencing Data

Chen et al. 2020

Preprint


In this paper, we present a toolset and related resources for rapid identification of viruses and microorganisms from short-read or long-read sequencing data. We present fastv as an ultra-fast tool to detect microbial sequences present in sequencing data, identify target microorganisms, and visualize coverage of microbial genomes. This tool is based on the k-mer mapping and extension method. K-mer sets are generated by UniqueKMER, another tool provided in this toolset. UniqueKMER can generate complete sets of unique k-mers for each genome within a large set of viral or microbial genomes. For convenience, unique k-mers for microorganisms and common viruses that afflict humans have been generated and are provided with the tools. As a lightweight tool, fastv accepts FASTQ data as input, and directly outputs the results in both HTML and JSON formats. Prior to the k-mer analysis, fastv automatically performs adapter trimming, quality pruning, base correction, and other pre-processing to ensure the accuracy of k-mer analysis. Specifically, fastv provides built-in support for rapid SARS-CoV-2 identification and typing. Experimental results showed that fastv achieved 100% sensitivity and 100% specificity for detecting SARS-CoV-2 from sequencing data, and can distinguish SARS-CoV-2 from SARS, MERS, and other coronaviruses. This toolset is available at: https://github.com/OpenGene/fastv. As part of the OpenGene projects, fastv and UniqueKMER are open-sourced through the MIT license. Fastv is available at https://github.com/OpenGene/fastv, and UniqueKMER is available at https://github.com/OpenGene/UniqueKMER. The pre-computed unique k-mer resources are also provided in these repositories.

Key Points: This paper presents a new tool, fastv, for rapid identification of SARS-CoV-2, other viruses, and microorganisms. Another tool, UniqueKMER, is presented for generation of high-quality unique k-mers. Unique k-mer resources for tens of thousands of viruses and microorganisms have been precomputed and uploaded to the tools' repositories.

Supplementary Data: A pipeline for alignment-based SARS-CoV-2 identification was provided in Supplementary file 1.



“…Indeed, the availability of an arbitrary number of independent computation nodes allows to virtually extend to any size the data structure used to keep the k-mer statistics in memory, while using the network as a temporary buffer between the extraction phase and the aggregation phase. This is the approach followed by Kmernator [42] and Kmerind [43]. Both these tools are developed as MPI-based parallel applications and are able to handle data sets whose size is proportional to the overall memory of the MPI-based system where they are run.…”

Section: Distributed Systems (mentioning)

confidence: 99%
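
The two phases named in this statement, extraction followed by aggregation with the network in between, can be sketched in a single process. This is only an illustration under stated assumptions: the "nodes" are plain Python lists rather than MPI ranks, the helper names (`owner`, `extract`, `distributed_count`) are hypothetical, and neither Kmerind's nor Kmernator's API is used.

```python
import zlib
from collections import Counter

def owner(kmer, n_nodes):
    """Pick the node that aggregates this k-mer (a stable hash, assumed here)."""
    return zlib.crc32(kmer.encode("ascii")) % n_nodes

def extract(reads, k, n_nodes):
    """Extraction phase on one node: bucket each k-mer by its destination node."""
    outbox = [[] for _ in range(n_nodes)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            outbox[owner(kmer, n_nodes)].append(kmer)
    return outbox

def distributed_count(reads_per_node, k):
    """Simulate the all-to-all exchange, then aggregate on each owning node."""
    n = len(reads_per_node)
    outboxes = [extract(reads, k, n) for reads in reads_per_node]
    tables = []
    for j in range(n):
        table = Counter()
        for outbox in outboxes:        # node j receives bucket j from every node
            table.update(outbox[j])
        tables.append(table)
    return tables
```

Because every occurrence of a given k-mer is routed to the same node, the per-node tables are disjoint, and the total capacity grows with the number of nodes.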

Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics

et al. 2019


Background: Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mers counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collection of biological sequences, with arbitrary values of k.

Results: One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for a full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among the ones based on Big Data technologies, while exhibiting a very good scalability.

Conclusions: We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account for the algorithm design and implementation.


“…K-mer counting has been extensively studied over the past decade [8,58-65]. Counting is accomplished mainly through incremental updates to hash tables [8,58,64,65], including hash-based probabilistic data structures [60-62] such as Bloom Filters [66] and Count-Min Sketch [67], or through sorting and aggregation [59,63].…”

Section: Use Case and Related Work (mentioning)

confidence: 99%
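
The two exact counting strategies contrasted in this statement can be compared side by side in a minimal sketch. The function names are hypothetical, and the probabilistic structures the statement also mentions (Bloom filters, Count-Min Sketch) are left out.

```python
from itertools import groupby

def count_by_hashing(seq, k):
    """Incrementally update a hash table, one k-mer at a time."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def count_by_sorting(seq, k):
    """Sort all k-mers, then aggregate each run of equal neighbours."""
    all_kmers = sorted(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(all_kmers)}
```

Both functions return the same counts. The hashing variant touches memory at random positions, while the sorting variant materializes every k-mer first and then scans the sorted list sequentially.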

“…Intra-task parallelism is achieved generally via concurrent updates of a shared data structure [8,60], while inter-task parallelism via data partitioning followed by sequential computation for each partition. Partitioning minimizes subsequent synchronization and may occur on disk [58,59,63,64], or in memory [59,61,63,65]. Many-core accelerators, such as GPGPU, may also be employed [64] for compute intensive phases.…”

Section: Use Case and Related Work (mentioning)

confidence: 99%
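
The distinction drawn here between concurrent updates of a shared structure and partitioning followed by independent computation can be sketched with Python threads. This is an assumption-laden illustration: `intra_task` and `inter_task` are hypothetical names, partitions are kept in memory, and Python threads show the structure of each approach rather than real speedups.

```python
import threading
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def intra_task(reads, k, workers=2):
    """All workers update one shared table; a lock serializes the updates."""
    shared, lock = Counter(), threading.Lock()

    def work(read):
        local = kmers(read, k)
        with lock:
            shared.update(local)

    with ThreadPoolExecutor(workers) as pool:
        list(pool.map(work, reads))    # drain the iterator so all tasks finish
    return shared

def inter_task(reads, k, parts=2):
    """Partition k-mers by hash, then count each partition independently."""
    buckets = [[] for _ in range(parts)]
    for read in reads:
        for kmer in kmers(read, k):
            buckets[zlib.crc32(kmer.encode("ascii")) % parts].append(kmer)
    with ThreadPoolExecutor(parts) as pool:
        tables = list(pool.map(Counter, buckets))
    merged = Counter()
    for table in tables:               # partitions are disjoint: a plain union
        merged.update(table)
    return merged
```

The partitioned variant needs no lock during counting, which is the reduced synchronization the statement refers to.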


Performance extraction and suitability analysis of multi- and many-core architectures for next generation sequencing secondary analysis

Misra, Pan, Mahadik, et al. 2018

Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

Self Cite


High-throughput next generation sequencers (NGS) can rapidly read billions of short DNA fragments, called reads, at low cost. Moreover, their throughput is increasing and cost is decreasing at rates much faster than Moore's law. This demands commensurate acceleration for NGS secondary analysis, which processes the reads to identify variations between genomes. Conventional architectural improvements can at best improve performance at the rate of Moore's law even if the software tools efficiently utilize the underlying architecture. Unfortunately, most of the dozens of software products developed for this purpose fail to exploit the underlying architecture well. Therefore, to match the pace of development of the sequencers, we will need architecture that is more tailored for the computational requirements of NGS secondary analysis as well as software that uses the architecture optimally. To this end, in this work, we study the performance characteristics of NGS secondary analysis and investigate the suitability of modern Intel Xeon and Xeon Phi processors for the same. To keep the study manageable, we rely on recent studies that attribute a majority of the run-time to a few key kernels. We present detailed optimization efforts to accelerate these kernels on the latest Intel Xeon and Xeon Phi processors with the goal of extracting maximum performance. A comparison of our optimized implementations, along with published results on GPGPU implementations, shows that our …

* Kanak Mahadik was a research intern at Intel when she worked on this project.
