XML and other semi-structured data may have partially specified or missing schema information, motivating the use of a structural summary that can be computed automatically from the data. These summaries also serve as indices for evaluating the complex path expressions common to XML and semi-structured query languages. However, to answer all path queries accurately, summaries must encode information about long, seldom-queried paths, leading to increased size and complexity with little added value. We introduce the A(k)-indices, a family of approximate structural summaries. They are based on the concept of k-bisimilarity, in which nodes are grouped based on local structure, i.e., the incoming paths of length up to k. The parameter k thus smoothly varies the level of detail (and accuracy) of the A(k)-index. For small values of k, the size of the index is substantially reduced. Because the smaller A(k)-index is approximate, we describe techniques for efficiently extracting exact answers to regular path queries. Our experiments show that, for moderate values of k, path evaluation using the A(k)-index ranges from being very efficient for simple queries to competitive for most complex queries, while using significantly less space than comparable structures.
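The grouping by incoming paths of length up to k can be illustrated with a small partition-refinement sketch. This is not the construction algorithm from the paper; it is a minimal illustration that assumes a graph given as a node-to-label mapping plus a list of (parent, child) edges, and the function name `k_bisimulation_blocks` is hypothetical.

```python
from collections import defaultdict


def k_bisimulation_blocks(labels, edges, k):
    """Group nodes into k-bisimilarity classes based on incoming paths.

    labels: dict mapping node -> label (assumed input format)
    edges:  iterable of (parent, child) pairs (assumed input format)
    k:      number of refinement rounds (0 groups by label only)
    """
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    # Round 0: two nodes are 0-bisimilar when they carry the same label.
    block = dict(labels)

    for _ in range(k):
        # A node's new signature is its previous block plus the set of
        # previous blocks of its parents.
        signature = {
            node: (block[node], frozenset(block[p] for p in parents[node]))
            for node in labels
        }
        # Replace signatures with small integer ids to keep them compact.
        ids = {}
        block = {node: ids.setdefault(sig, len(ids))
                 for node, sig in signature.items()}

    groups = defaultdict(set)
    for node, block_id in block.items():
        groups[block_id].add(node)
    return sorted(sorted(group) for group in groups.values())


# Tiny example: two "b" nodes reached through differently labeled parents.
labels = {"r": "root", "a1": "a", "a2": "a", "b1": "b", "c1": "c", "b2": "b"}
edges = [("r", "a1"), ("r", "a2"), ("a1", "b1"), ("r", "c1"), ("c1", "b2")]

coarse = k_bisimulation_blocks(labels, edges, 0)
finer = k_bisimulation_blocks(labels, edges, 1)
```

With k = 0 the two "b" nodes share a block; with k = 1 they are separated because their parents carry different labels, which shows how a larger k yields a more detailed (and larger) summary.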
In this paper, we ask if the traditional relational query acceleration techniques of summary tables and covering indexes have analogs for branching path expression queries over tree- or graph-structured XML data. Our answer is yes: the forward-and-backward index already proposed in the literature can be viewed as a structure analogous to a summary table or covering index. We also show that it is the smallest such index that covers all branching path expression queries. While this index is very general, our experiments show that it can be so large in practice as to offer little performance improvement over evaluating queries directly on the data. Likening the forward-and-backward index to a covering index on all the attributes of several tables, we devise an index definition scheme to restrict the class of branching path expressions being indexed. The resulting index structures are dramatically smaller and perform better than the full forward-and-backward index for these classes of branching path expressions. This is roughly analogous to the situation in multidimensional or OLAP workloads, in which more highly aggregated summary tables can service a smaller subset of queries but can do so at increased performance. We evaluate the performance of our indexes on both relational decompositions of XML and a native storage technique. As expected, the performance benefit of an index is maximized when the query matches the index definition.
We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting, where a user selects the labeled examples. Active learning is important for record matching since manually identifying a suitable set of labeled examples is difficult. Previous algorithms that use active learning for record matching have serious limitations: the packages that they learn lack quality guarantees, and the algorithms do not scale to large input sizes. We present new algorithms for this problem that overcome these limitations. Our algorithms are fundamentally different from traditional active learning approaches and are designed from the ground up to exploit problem characteristics specific to record matching. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.
Incorporating the skyline operator inside the relational engine requires solving the cardinality estimation and cost estimation problems, both hitherto unaddressed. We propose robust techniques to estimate the cardinality and the computational cost of Skyline and, through an empirical comparison, show that our technique is substantially more effective than traditional approaches. Finally, we show through an implementation in Microsoft SQL Server that skyline queries can substantially benefit from our techniques.
Several methods have been proposed to evaluate queries over a native XML DBMS, where the queries specify both path and keyword constraints. These broadly consist of graph traversal approaches, optimized with auxiliary structures known as structure indexes, and approaches based on information-retrieval style inverted lists. However, no published literature addresses methods of combining structure indexes and inverted lists. We bridge this gap by proposing a strategy that combines the two forms of auxiliary indexes and a query evaluation algorithm for branching path expressions based on this strategy. Our technique is general and applicable to a wide range of choices of structure indexes and inverted list join algorithms. Our experiments over a native XML DBMS show the benefit of integrating the two forms of indexes. We also consider algorithmic issues in evaluating path expression queries when the notion of relevance ranking is incorporated. By integrating the above techniques with the Threshold Algorithm proposed by Fagin et al., we obtain instance-optimal algorithms to push down top-k computation.
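For readers unfamiliar with the Threshold Algorithm mentioned above, the following is a generic sketch of its core loop, not the integrated XML evaluation strategy described in the abstract. It assumes each ranked list is a dict from object to score, that every object appears in every list, and that the aggregate is a plain sum; the function name `threshold_top_k` is hypothetical.

```python
import heapq


def threshold_top_k(score_lists, k):
    """Return the k objects with the highest summed score.

    score_lists: list of dicts mapping object -> score; every object is
                 assumed to appear in every list (simplifying assumption).
    """
    # Sorted access: each list ordered by descending score.
    orders = [sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
              for scores in score_lists]
    totals = {}

    for depth in range(len(orders[0])):
        threshold = 0
        for order in orders:
            obj, score = order[depth]
            # The threshold sums the scores last seen under sorted access.
            threshold += score
            if obj not in totals:
                # Random access: fetch the object's score in every list.
                totals[obj] = sum(scores[obj] for scores in score_lists)

        best = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        # Stop once k seen objects score at least the threshold: no unseen
        # object can score higher than the threshold.
        if len(best) == k and best[-1][1] >= threshold:
            return best

    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


# Two ranked lists over three objects, with integer scores.
list_a = {"x": 9, "y": 8, "z": 1}
list_b = {"x": 7, "y": 9, "z": 2}
top1 = threshold_top_k([list_a, list_b], 1)
```

In this example the loop halts after two rounds of sorted access, before object "z" is ever read, which is the early-termination behavior that makes the algorithm attractive for top-k evaluation.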