Structured document retrieval has established itself as a new research area in the overlap between Database Systems and Information Retrieval. This work proposes a filtering technique that can be added to the existing index structures of many structured document retrieval systems. The technique takes the contextual structure information of both the query and the document database into account and drastically reduces the occurrence sets returned by the original index structure, which improves the performance of query evaluation. A measure is introduced that quantifies the added value of the proposed index structure. Based on this measure, a heuristic is presented that includes only valuable context information in the index structure.

1 Introduction

With the growing importance of Information Retrieval in the presence of a vast amount of structured documents in formalisms like SGML ([6]) or the future WWW language XML ([18]), sophisticated and efficient indexing techniques for structured documents are becoming more and more important. In general, index structures are crucial for the efficiency of Database Systems (DBS) and Information Retrieval (IR) systems. With an appropriate index structure, irrelevant parts of the database can be disregarded in the search. Very sophisticated index structures have been proposed in DBS and IR research, some of them dedicated to a special class of data only, e.g. geographical data ([2]).

Index structures in DBS try to support access to data by organizing it in an appropriate way. The notion of ordering the data plays a key role in this task. Some data has a natural topology (like geographical data); for other data, the index structure defines a topology. One of the problems an efficient index structure must solve is to map this (usually multidimensional) topology onto the linear layout of the storage medium.
So far, index structures in IR have faced only the one-dimensional form of this problem: they implement a mapping from terms (i.e. words) in a set of documents to occurrences (i.e. offsets in the files storing the documents). The mapping problem becomes trivial because traditional IR treats text as a linear medium.
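The term-to-occurrence mapping described above can be illustrated with a minimal sketch of a traditional inverted index. This is not the index structure proposed in this work; the function name `build_inverted_index`, the sample file names, and the choice of character offsets and a simple word tokenizer are assumptions made only for illustration.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to a list of (doc_id, character offset) pairs.

    `documents` is a dict from a document identifier to its text.
    The text is treated as a linear medium: an occurrence is just an offset.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Assumption: a "term" is any maximal run of word characters.
        for match in re.finditer(r"\w+", text):
            index[match.group().lower()].append((doc_id, match.start()))
    return dict(index)


# Hypothetical two-document collection.
docs = {
    "a.txt": "structured documents need index structures",
    "b.txt": "an index maps terms to offsets",
}
index = build_inverted_index(docs)
print(index["index"])  # [('a.txt', 26), ('b.txt', 3)]
```

Because each occurrence is a single offset, the index carries no information about where in a document's structure the term appears, which is the limitation that structure-aware filtering is meant to address.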