Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

Rheinländer, Astrid; Knobloch, Martin; Hochmuth, Nicky; Leser, Ulf

doi:10.1007/978-3-642-13818-8_36

Cited by 13 publications

(12 citation statements)

References 19 publications


“…In PeARL, we support Hamming and edit distance. We focus on edit distance based operations in this paper, but see [14] for the key ideas on Hamming distance-based queries. In general, the edit distance of r and s is computed in O(|r| * |s|) using dynamic programming.…”

Section: Preliminaries (mentioning)

confidence: 99%
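
The two distance measures named in the statement above can be illustrated with a minimal Python sketch. It uses the textbook dynamic-programming recurrence for edit distance, which fills a table of (|r| + 1) * (|s| + 1) cells and therefore matches the quoted O(|r| * |s|) bound. The function names are illustrative and not taken from PeARL or PETER.

```python
def hamming_distance(r: str, s: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(r) != len(s):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(r, s))


def edit_distance(r: str, s: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows."""
    # prev[j] holds the distance between the empty prefix of r and s[:j].
    prev = list(range(len(s) + 1))
    for i, a in enumerate(r, start=1):
        curr = [i]  # distance between r[:i] and the empty string
        for j, b in enumerate(s, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a from r
                            curr[j - 1] + 1,      # insert b into r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

Keeping only two rows reduces memory to O(|s|) while the running time stays O(|r| * |s|).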

“…These filters, namely prefix and edit distance pruning [16], character frequency pruning [1], and q-gram filtering [7], have been introduced in slightly different contexts before. Their concrete usage and efficiency for trie-based search and join queries is shown in [14]. Therefore, we only briefly summarize our search and join strategies in the following and concentrate on our novel parallelization scheme later.…”

Section: Algorithms (mentioning)

confidence: 99%
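
To make the filter names in the statement above concrete, the following Python sketch shows commonly used textbook forms of a length filter, a character frequency lower bound, and a q-gram count filter for an edit distance threshold k. These are assumptions for illustration: the exact formulations in [1], [7], and [14] may differ, and all function names are hypothetical. Each filter is a necessary condition only, so a pair that passes still has to be verified with a real edit distance computation.

```python
from collections import Counter


def frequency_lower_bound(r: str, s: str) -> int:
    """Lower bound on edit distance derived from character counts."""
    fr, fs = Counter(r), Counter(s)
    surplus_r = sum((fr - fs).values())  # characters r has in excess of s
    surplus_s = sum((fs - fr).values())  # characters s has in excess of r
    # One edit operation lowers each surplus by at most one.
    return max(surplus_r, surplus_s)


def qgrams(s: str, q: int) -> Counter:
    """Multiset of all overlapping substrings of length q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_qgram_filter(r: str, s: str, k: int, q: int) -> bool:
    """Count filter: one edit destroys at most q of the q-grams."""
    required = max(len(r), len(s)) - q + 1 - k * q
    if required <= 0:
        return True  # the bound is vacuous for short strings or large k
    shared = sum((qgrams(r, q) & qgrams(s, q)).values())
    return shared >= required


def may_be_within(r: str, s: str, k: int, q: int = 3) -> bool:
    """False means the pair certainly has edit distance greater than k."""
    if abs(len(r) - len(s)) > k:
        return False
    if frequency_lower_bound(r, s) > k:
        return False
    return passes_qgram_filter(r, s, k, q)
```

The filters are ordered from cheapest to most expensive, so most non-matching pairs are rejected before any q-grams are built.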

“…Whenever a new child of the current node is reached, we first check whether we can prune this node (see [14] for details on filtering). If all filters have been passed successfully, we compute the edit distance between the query and the prefix of the node.…”

Section: Algorithms (mentioning)

confidence: 99%
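
The traversal described in the statement above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PeARL algorithm: it uses a plain, uncompressed trie, applies only edit distance pruning, and extends one dynamic-programming row per trie edge instead of recomputing the distance for every prefix. The class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # character -> TrieNode
        self.is_terminal = False  # True if an indexed string ends here


def insert(root: TrieNode, word: str) -> None:
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_terminal = True


def search(root: TrieNode, query: str, k: int):
    """Return sorted (string, distance) pairs with edit distance <= k."""
    results = []

    def visit(node, prefix, prev_row):
        # prev_row[j] is the edit distance between prefix and query[:j].
        if node.is_terminal and prev_row[-1] <= k:
            results.append(("".join(prefix), prev_row[-1]))
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for j, qc in enumerate(query, start=1):
                cost = 0 if qc == ch else 1
                row.append(min(row[j - 1] + 1,
                               prev_row[j] + 1,
                               prev_row[j - 1] + cost))
            # Values in later rows never drop below min(row), so the
            # whole subtrie can be skipped once min(row) exceeds k.
            if min(row) <= k:
                prefix.append(ch)
                visit(child, prefix, row)
                prefix.pop()

    # Recursion depth equals string length; fine for short sequences.
    visit(root, [], list(range(len(query) + 1)))
    return sorted(results)
```

A small usage example: after inserting "ACGT", "ACGA", "TTTT", and "ACG" into an empty root, a search for "ACGT" with k = 1 reports the first, second, and fourth string, and the subtrie below the prefix "TT" is never entered.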

“…In order to retain these advantages for similarity-based queries, we store additional information at each node that enable early pruning of whole subtries. Previously, we demonstrated that these strategies effectively speed up similarity-based queries in PETER [14], a disk-based index structure and predecessor of PeARL.…”

Section: Introduction (mentioning)

confidence: 99%


Scalable Sequence Similarity Search and Join in Main Memory on Multi-cores

Rheinländer

Leser

2012

Euro-Par 2011: Parallel Processing Workshops

Self Cite


Abstract. Similarity-based queries play an important role in many large scale applications. In bioinformatics, DNA sequencing produces huge collections of strings that need to be compared and merged. One strategy to speed up similarity-based queries is parallelization on clusters using MapReduce. However, distributing data over a cluster also incurs high cost. At the same time, modern hardware offers parallelization through multi-cores and can be equipped with large main memories at low cost. We present PeARL, a data structure and algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries which are entirely held in main memory. Parallelization of searches and joins is performed using MapReduce as the underlying execution paradigm. We show that our data structure is capable of performing many real-world applications in sequence comparisons in main memory. Our evaluation reveals that PeARL reaches a significant performance gain compared to single-threaded solutions. However, the evaluation also shows that scalability should be further improved, e.g., by reducing sequential parts of the algorithms.



“…Beside command-line tools, approaches exist that use relational database management systems (DBMSs) to facilitate downstream analyses as DBMSs provide excellent data integration capabilities for heterogeneous data sources (e.g., Atlas [22]). Furthermore, approaches exist that already integrate some of the genome analysis steps into relational DBMSs [18,19]. However, an approach that efficiently integrates all genome analysis steps into a DBMS does not yet exist.…”

Section: Introduction (mentioning)

confidence: 99%

Toward efficient and reliable genome analysis using main-memory database systems

Dorok

Breß

Läpple

et al. 2014

Proceedings of the 26th International Conference on Scientific and Statistical Database Management


Improvements in DNA sequencing technologies allow to sequence complete human genomes in a short time and at acceptable cost. Hence, the vision of genome analysis as standard procedure to support and improve medical treatment becomes reachable. In this vision paper, we describe important data-management challenges that have to be met to make this vision come true. Besides genome-analysis performance, data-management capabilities such as data provenance and data integrity become increasingly important to enable comprehensible and reliable genome analysis. We argue to meet these challenges by using main-memory database technologies, which combine fast processing capabilities with extensive data-management capabilities. Finally, we discuss possibilities of integrating genome-analysis tasks into DBMSs and derive new research questions.


An efficient enhanced prefix hash tree model for optimizing the storage and image deduplication in cloud

Sujatha

Raj

2022

Concurrency and Computation


Summary: The popularity of the cloud storage space mainly attracted organizations to store their data in them. Therefore, the avoidance of duplicate data contents is unavoidable and several users share the cloud storage space for data storage, and sometimes this makes higher storage space utilization. Because of the extremely high duplicate copy, memory wastage arises in the case of multimedia data. Identifying the final duplicate copies in the cloud takes more time. To overcome this problem, we employ a significant storage optimization model for deduplication. The digital data hash value is stored by requiring an additional memory space. This study proposed an enhanced prefix hash tree (EPHT) method to optimize the image and text deduplication system to reduce the overhead caused by this procedure. The efficiency of the proposed approach is compared with the interpolation search technique using different levels of tree height (2, 4, 2, 8, 16) in terms of space and time complexity. The proposed EPHT technique shows improvements in terms of speed and space complexity when the number of levels in the EPHT increases.

