Abstract. Prior-art search is a critical step in the examination procedure of a patent application. This study explores automatic query generation from patent documents to facilitate the time-consuming and labor-intensive search for relevant patents. For this task, it is essential to identify discriminative terms in different sections of a query patent, terms that enable us to distinguish relevant patents from non-relevant ones. To this end, we investigate the distribution of terms occurring in different sections of the query patent and compare it with that of the rest of the collection using language-modeling estimation techniques. We experiment with term weighting based on the Kullback-Leibler divergence between the query patent and the collection, and also with parsimonious language model estimation. Both techniques promote words that are common in the query patent but rare in the collection. We also incorporate the classification assigned to patent documents into our model, exploiting the available human judgments in the form of a hierarchical classification. Experimental results show the effectiveness of the generated queries, particularly in terms of recall, while the patent description proved to be the most useful source for extracting terms.
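The KL-divergence term weighting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it scores each term of the query document by its contribution to the KL divergence between the document and collection language models, so that terms frequent in the document but rare in the collection rank highest. The tokenized inputs and the assumption that the query document is part of the collection (so no term has zero collection frequency) are simplifications for this sketch.

```python
from collections import Counter
from math import log

def kl_term_scores(query_doc_tokens, collection_tokens):
    """Rank terms of a query document by their contribution to
    KL(P(.|D) || P(.|C)):  score(t) = P(t|D) * log(P(t|D) / P(t|C)).
    Terms common in the document but rare in the collection score highest."""
    d = Counter(query_doc_tokens)
    c = Counter(collection_tokens)
    d_len = sum(d.values())
    c_len = sum(c.values())
    scores = {}
    for t, tf in d.items():
        p_d = tf / d_len
        # Assumes the query document is in the collection, so c[t] > 0.
        p_c = c[t] / c_len
        scores[t] = p_d * log(p_d / p_c)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy example: "blade" is frequent in the document and rare
# in the collection, so it should be promoted to the top of the ranking.
doc = "engine turbine blade blade cooling".split()
collection = ("engine system method device " * 50).split() + doc
ranked = kl_term_scores(doc, collection)
```

Negative scores (terms less prominent in the document than in the collection) can simply be discarded when forming the query.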
Retrieving topically relevant text passages in documents has been studied extensively, but finding non-factoid, multiple-sentence answers to web queries is a different task that is becoming increasingly important for applications such as mobile search. As the first stage of developing retrieval models for "answer passages", we describe the process of creating a test collection of questions and multiple-sentence answers based on the TREC GOV2 queries and documents. This annotation shows that most of the description-length TREC queries do in fact have passage-level answers. We then examine the effectiveness of current passage retrieval models at finding passages that contain answers. We show that the existing methods are not effective for this task, and also observe that the relative performance of these methods in retrieving answers does not correspond to their performance in retrieving relevant documents.
Passage-based retrieval models have been studied for some time and have been shown to have some benefits for document ranking. Finding passages that are not only topically relevant but are also answers to the users' questions would have a significant impact on applications such as mobile search. To develop models for answer passage retrieval, we need appropriate test collections and evaluation measures. Making annotations at the passage level is, however, expensive and can have poor coverage. In this paper, we describe the advantages of document summarization measures for evaluating answer passage retrieval and show that these measures have high correlation with existing measures and human judgments.
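A standard family of document summarization measures is ROUGE-style n-gram recall, which scores a retrieved passage by how much of a reference answer's content it covers. The sketch below is an assumed illustration of this general idea (ROUGE-n recall with clipped counts), not the specific measures evaluated in the paper.

```python
from collections import Counter

def rouge_n_recall(candidate_tokens, reference_tokens, n=2):
    """ROUGE-n recall: the fraction of the reference answer's n-grams
    that also occur in the candidate passage (with clipped counts)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref = ngrams(reference_tokens)
    if not ref:
        return 0.0
    cand = ngrams(candidate_tokens)
    # Clip each n-gram's credit at its count in the candidate.
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / sum(ref.values())
```

Because such measures compare a passage against a short reference text rather than requiring exhaustive passage-level relevance annotation, they sidestep the coverage problem noted above.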
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.