Regular expressions have served as the dominant workhorse of practical information extraction for several years. However, there has been little work on reducing the manual effort involved in building high-quality, complex regular expressions for information extraction tasks. In this paper, we propose Re-LIE, a novel transformation-based algorithm for learning such complex regular expressions. We evaluate the performance of our algorithm on multiple datasets and compare it against the CRF algorithm. We show that ReLIE, in addition to being an order of magnitude faster, outperforms CRF under conditions of limited training data and cross-domain data. Finally, we show how the accuracy of CRF can be improved by using features extracted by ReLIE.
Several methods have been proposed to evaluate queries over a native XML DBMS, where the queries specify both path and keyword constraints. These broadly consist of graph traversal approaches, optimized with auxiliary structures known as structure indexes; and approaches based on information-retrieval style inverted lists. However, no published literature addresses methods of combining structure indexes and inverted lists. We bridge this gap by proposing a strategy that combines the two forms of auxiliary indexes and a query evaluation algorithm for branching path expressions based on this strategy. Our technique is general and applicable for a wide range of choices of structure indexes and inverted list join algorithms. Our experiments over a native XML DBMS show the benefit of integrating the two forms of indexes. We also consider algorithmic issues in evaluating path expression queries when the notion of relevance ranking is incorporated. By integrating the above techniques with the Threshold Algorithm proposed by Fagin et al., we obtain instance optimal algorithms to push down top k computation.
Abstract. Recently, the database research literature has seen an explosion of publications with the goal of using an RDBMS to store and/or query XML data. The problems addressed and solved in this area are diverse. This diversity renders it difficult to know how the various results presented fit together, and even makes it hard to know what open problems remain. As a first step to rectifying this situation, we present a classification of the problem space and discuss how almost 40 papers fit into this classification. As a result of this study, we find that some basic questions are still open. In particular, for the XML publishing of relational data and for "schema-based" shredding of XML documents into relations, there is no published algorithm for translating even simple path expression queries (with the // axis) into SQL when the XML schema is recursive.
There has been recent interest in using relational database systems to store and query XML documents. Each of the techniques proposed in this context works by (a) creating tables for the purpose of storing XML documents (also called relational schema generation), (b) storing XML documents by shredding them into rows in the created tables, and (c) converting queries over XML documents into SQL queries over the created tables. Since relational schema generation is a physical database design issue -dependent on factors such as the nature of the data, the query workload and availability of schemas -there have been many techniques proposed for this purpose. Currently, each relational schema generation technique requires its own query processor to efficiently convert queries over XML documents into SQL queries over the created tables. In this paper, we present an efficient technique whereby the same query-processor can be used for all such relational schema generation techniques. This greatly simplifies the task of relational schema generation by eliminating the need to write a special-purpose query processor for each new solution to the problem. In addition, our proposed technique enables users to query seamlessly across relational data and XML documents. This provides users with unified access to both relational and XML data without them having to deal with separate databases.
Sparse" data, in which relations have many attributes that are null for most tuples, presents a challenge for relational database management systems. If one uses the normal "horizontal" schema to store such data sets in any of the three leading commercial RDBMS, the result is tables that occupy vast amounts of storage, most of which is devoted to nulls. If one attempts to avoid this storage blowup by using a "vertical" schema, the storage utilization is indeed better, but query performance is orders of magnitude slower for certain classes of queries. In this paper, we argue that the proper way to handle sparse data is not to use a vertical schema, but rather to extend the RDBMS tuple storage format to allow the representation of sparse attributes as interpreted fields. The addition of interpreted storage allows for efficient and transparent querying of sparse data, uniform access to all attributes, and schema scalability. We show, through an implementation in PostgreSQL, that the interpreted storage approach dominates in query efficiency and ease-of-use over the current horizontal storage and vertical schema approaches over a wide range of queries and sparse data sets.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.