Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ . Most existing method answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query time and index space complexities.In this paper, we show that the minimum signature size lower bound is τ +1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up the performance. We have conducted a comprehensive experimental study involving nine state-of-the-art algorithms. The experiment results clearly demonstrate the efficiency of our methods.
Abstract-An important facility to aid keyword search on XML data is suggesting alternative queries when user queries contain typographical errors. Query suggestion thus can improve users' search experience by avoiding returning empty result or results of poor qualities.In this paper, we study the problem of effectively and efficiently providing quality query suggestions for keyword queries on an XML document. We illustrate certain biases in previous work and propose a principled and general framework, XClean, based on the state-of-the-art language model. Compared with previous methods, XClean can accommodate different error models and XML keyword query semantics without losing rigor. Algorithms have been developed that compute the top-k suggestions efficiently. We performed an extensive experiment study using two large-scale real datasets. The experiment results demonstrate the effectiveness and efficiency of the proposed methods.
Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ . Most existing methods answering edit similarity queries employ schemes to generate string subsequences as signatures and generate candidates by set overlap queries on query and data signatures.In this paper, we show that for any such signature scheme, the lower bound of the minimum number of signatures is τ + 1, which is lower than what are achieved by existing methods. We then propose several asymmetric signature schemes, i.e., extracting different numbers of signatures for the data and query strings, which achieve this lower bound. A basic asymmetric scheme is first established on the basis of matching q-chunks and q-grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve the efficiency. We then generalize the basic scheme by incorporating novel ideas of floating q-chunks, optimal selection of q-chunks, and reducing the number of signatures using global ordering. As a result, the Super and Turbo families of schemes are developed together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experiment results clearly showcase the efficiency of our methods and demonstrate space and time characteristics of our proposed algorithms.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.