Most of the work on XML query and search has stemmed from the publishing and database communities, mostly for the needs of business applications. Recently, the Information Retrieval community began investigating the XML search issue to answer information discovery needs. Following this trend, we present here an approach where information needs can be expressed in an approximate manner as pieces of XML documents or "XML fragments" of the same nature as the documents that are being searched. We present an extension of the vector space model for searching XML collections via XML fragments and ranking results by relevance. We describe how we have extended a fulltext search engine to comply with this model. The value of the proposed method is demonstrated by the relative high precision of our system, which was among the top performers in the recent INEX workshop. Our results indicate that certain queries are more appropriate than others for the extended vector space model. Specifically, queries with relatively specific contexts but vague information needs are best situated to reap the benefit of this model. Finally our results show that one method may not fit all types of queries and that it could be worthwhile to use different solutions for different applications.
Queries over XML documents challenge search engines to return the most relevant XML components that satisfy the query concepts. In a previous work[6] we described an algorithm to retrieve the most relevant XML components that performed relatively well in INEX'03. In this paper we show an improvement to that algorithm by introducing a document pivot that compensates for missing terms statistics in small components. Using this new algorithm we achieved improvements of 30%-50% in the Mean Average Precision over the previous algorithm. We then describe a general mechanism to apply existing Automatic Query Refinement (AQR) methods on top of our XML retrieval algorithm and demonstrate a particular such method that achieved top results in INEX'04.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.