2010
DOI: 10.1007/978-3-642-13818-8_36

Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

Abstract: Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences. Especially DNA sequencing produces large collections of erroneous strings which need to be searched, compared, and merged. However, current RDBMS offer similarity operations only in a very limited and inefficient form that does not scale to the amount of data produced in Life Science projects. We present PETER, a prefix tr…

Cited by 13 publications (12 citation statements)
References 19 publications
“…In PeARL, we support Hamming and edit distance. We focus on edit distance based operations in this paper, but see [14] for the key ideas on Hamming distance-based queries. In general, the edit distance of r and s is computed in O(|r| * |s|) using dynamic programming.…”
Section: Preliminaries (mentioning)
confidence: 99%
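
The O(|r| * |s|) dynamic program mentioned in this statement can be illustrated with a minimal sketch of the textbook Levenshtein recurrence. This is not code from PeARL or PETER; the function name `edit_distance` and the unit costs for insertion, deletion, and substitution are assumptions made for illustration.

```python
def edit_distance(r: str, s: str) -> int:
    """Levenshtein distance between r and s with unit costs (assumed).

    Runs in O(|r| * |s|) time and keeps only two rows of the
    dynamic-programming table, so memory is O(|s|).
    """
    # prev[j] = distance between the empty prefix of r and s[:j]
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        # curr[0] = distance between r[:i] and the empty string
        curr = [i]
        for j, sc in enumerate(s, start=1):
            cost = 0 if rc == sc else 1
            curr.append(min(
                prev[j] + 1,         # delete rc from r
                curr[j - 1] + 1,     # insert sc into r
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("ACGT", "AGT")` evaluates to 1, since deleting the `C` is sufficient.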
“…These filters, namely prefix and edit distance pruning [16], character frequency pruning [1], and q-gram filtering [7], have been introduced in slightly different contexts before. Their concrete usage and efficiency for trie-based search and join queries is shown in [14]. Therefore, we only briefly summarize our search and join strategies in the following and concentrate on our novel parallelization scheme later.…”
Section: Algorithms (mentioning)
confidence: 99%
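
To make the idea of prefix and edit distance pruning on a trie concrete, the following is a hedged sketch of a generic technique: one row of the edit-distance table is computed per trie node, and a subtree is skipped once every entry in that row exceeds the threshold. It is not the implementation described in [14] or in the citing paper; the names `TrieNode`, `build_trie`, and `similarity_search` and the threshold parameter `k` are hypothetical.

```python
class TrieNode:
    """Node of a plain (uncompressed) prefix tree; illustrative only."""

    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if an indexed string ends here


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def similarity_search(root, query, k):
    """Return sorted (string, distance) pairs with edit distance <= k."""
    results = []

    def visit(node, prefix, row):
        # row[j] = edit distance between prefix and query[:j]
        if node.is_end and row[-1] <= k:
            results.append((prefix, row[-1]))
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(row[j] + 1,
                                   new_row[j - 1] + 1,
                                   row[j - 1] + cost))
            # Pruning: the row minimum never decreases along a path,
            # so no string below this node can be within distance k.
            if min(new_row) <= k:
                visit(child, prefix + ch, new_row)

    visit(root, "", list(range(len(query) + 1)))
    return sorted(results)
```

Because strings that share a prefix share the table rows computed for it, the common prefix is processed once per query rather than once per indexed string.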
“…Beside command-line tools, approaches exist that use relational database management systems (DBMSs) to facilitate downstream analyses as DBMSs provide excellent data integration capabilities for heterogeneous data sources (e.g., Atlas [22]). Furthermore, approaches exist that already integrate some of the genome analysis steps into relational DBMSs [18,19]. However, an approach that efficiently integrates all genome analysis steps into a DBMS does not yet exist.…”
Section: Introduction (mentioning)
confidence: 99%