Modern large retrieval environments tend to overwhelm their users with their large output. Since not all documents are of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. Alternatively, novel measures based on graded relevance judgments may be developed. This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devalue late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques, based on the cumulative gain they are able to yield. These novel measures are defined and discussed and their use is demonstrated in a case study using TREC data: sample system run results for 20 queries in TREC-7. As a relevance base we used novel graded relevance judgments on a four-point scale. The test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences. The graphs based on the measures also provide insight into the performance of IR techniques and allow interpretation, for example, from the user point of view.
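The three measures described in the abstract above can be illustrated with a minimal Python sketch. The abstract does not state the discount function, so the logarithmic discount used here (dividing the gain at rank i by log_b(i) once i reaches the base b) is an assumption, as are the function names, the default base of 2, and the sample relevance scores on a 0 to 3 scale.

```python
import math


def cumulated_gain(gains):
    """Running sum of relevance scores along the ranked list."""
    total = 0.0
    result = []
    for gain in gains:
        total += gain
        result.append(total)
    return result


def discounted_cumulated_gain(gains, base=2):
    """Running sum where late-ranked gains are devalued.

    Assumption: gains at ranks below `base` are not discounted,
    and gains at rank i >= base are divided by log_base(i).
    """
    total = 0.0
    result = []
    for rank, gain in enumerate(gains, start=1):
        discount = math.log(rank, base) if rank >= base else 1.0
        total += gain / discount
        result.append(total)
    return result


def normalized(actual, ideal):
    """Relative-to-the-ideal performance at each ranked position."""
    return [a / i if i > 0 else 0.0 for a, i in zip(actual, ideal)]


# Hypothetical graded judgments (0 to 3) for one ranked result list.
gains = [3, 2, 3, 0, 1, 2]
# Ideal ordering: the same judged documents sorted by descending relevance.
ideal_gains = sorted(gains, reverse=True)

cg = cumulated_gain(gains)
dcg = discounted_cumulated_gain(gains)
ncg = normalized(cg, cumulated_gain(ideal_gains))
```

In a real evaluation the ideal vector would be built from all judged documents for the query, not only from the retrieved ones; the sketch reuses the retrieved list only to stay self-contained.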
This paper proposes evaluation methods based on the use of non-dichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents. This is desirable from the user point of view in modern large IR environments. The proposed methods are (1) a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance, and (2) two novel measures computing the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. We then demonstrate the use of these evaluation methods in a case study on the effectiveness of query types, based on combinations of query structures and expansion, in retrieving documents of various degrees of relevance. The test was run with a best match retrieval system (InQuery) in a text database consisting of newspaper articles. The results indicate that the tested strong query structures are most effective in retrieving highly relevant documents. The differences between the query types are practically essential and statistically significant. More generally, the novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.
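The idea of separate recall bases per degree of relevance can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: a document is assumed to count as relevant for a given level only when its grade equals that level, and the function name, the grade scale, and the sample data are hypothetical.

```python
def pr_points_for_level(ranked_grades, level, recall_base_size):
    """Return (recall, precision) after each rank for one relevance level.

    Assumption: only documents whose grade equals `level` count as hits,
    and `recall_base_size` is the number of such documents in the collection.
    """
    if recall_base_size <= 0:
        raise ValueError("recall base must contain at least one document")
    hits = 0
    points = []
    for rank, grade in enumerate(ranked_grades, start=1):
        if grade == level:
            hits += 1
        points.append((hits / recall_base_size, hits / rank))
    return points


# Hypothetical grades (0 to 3) of the top five retrieved documents.
ranked_grades = [3, 1, 0, 3, 2]
# Curve for highly relevant documents, assuming four exist in the collection.
points = pr_points_for_level(ranked_grades, level=3, recall_base_size=4)
```

Running the function once per level gives one P-R curve per degree of relevance, which makes it visible when a method does well on marginally relevant documents but poorly on highly relevant ones.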
This article proposes evaluation methods based on the use of nondichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents. This is desirable from the user point of view in modern large IR environments. The proposed methods are (1) a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance, and (2) generalized recall and precision based directly on multiple grade relevance assessments (i.e., not dichotomizing the assessments). We demonstrate the use of the traditional and the novel evaluation measures in a case study on the effectiveness of query types, based on combinations of query structures and expansion, in retrieving documents of various degrees of relevance. The test was run with a best match retrieval system (InQuery) in a text database consisting of newspaper articles. To gain insight into the retrieval process, one should use both graded relevance assessments and effectiveness measures that enable one to observe the differences, if any, between retrieval methods in retrieving documents of different levels of relevance. In modern times of information overload, one should pay attention, in particular, to the capability of retrieval methods to retrieve highly relevant documents.
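Generalized recall and precision, as mentioned in the abstract above, can be sketched by summing graded relevance scores instead of counting binary hits. The abstract gives no formulas, so this is an assumed formulation: relevance scores are taken to lie in [0, 1], generalized precision is the summed score of the retrieved documents divided by their number, and generalized recall is the same sum divided by the total score available in the collection. All names and values are hypothetical.

```python
def generalized_precision_recall(retrieved_scores, collection_score_total):
    """Return (generalized precision, generalized recall).

    Assumption: each score is a graded relevance value in [0, 1], and
    `collection_score_total` is the sum of scores over the whole collection.
    """
    gained = sum(retrieved_scores)
    n = len(retrieved_scores)
    g_precision = gained / n if n else 0.0
    g_recall = gained / collection_score_total if collection_score_total > 0 else 0.0
    return g_precision, g_recall


# Hypothetical scores for four retrieved documents; the collection is
# assumed to hold a total relevance score of 5.0 for this query.
gp, gr = generalized_precision_recall([1.0, 0.5, 0.0, 0.5], 5.0)
```

With binary scores (0 or 1) both measures reduce to ordinary precision and recall, which is why they can be read as a direct generalization.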
Evaluation is central in research and development of information retrieval (IR). In addition to designing and implementing new retrieval mechanisms, one must also show through rigorous evaluation that they are effective. A major focus in IR is IR mechanisms' capability of ranking relevant documents optimally for the users, given a query. Searching for information in practice involves searchers, however, and is highly interactive. When human searchers have been incorporated in evaluation studies, the results have often suggested that better ranking does not necessarily lead to better search task, or work task, performance. Therefore, it is not clear which system or interface features should be developed to improve the effectiveness of human task performance. In the present article, we focus on the evaluation of task-based information interaction (TBII). We give special emphasis to learning tasks to discuss TBII in more concrete terms. Information interaction is here understood as behavioral and cognitive activities related to task planning, searching information items, selecting between them, working with them, and synthesizing and reporting. These five generic activities contribute to task performance and outcome and can be supported by information systems. In an attempt toward task-based evaluation, we introduce program theory as the evaluation framework. Such evaluation can investigate whether a program consisting of TBII activities and tools works and how it works and, further, provides a causal description of program (in)effectiveness. Our goal in the present article is to structure TBII on the basis of the five generic activities and consider the evaluation of each activity using the program theory framework. Finally, we combine these activity-based program theories in an overall evaluation framework for TBII. Such an evaluation is complex due to the large number of factors affecting information interaction. 
Instead of presenting tested program theories, we illustrate how the evaluation of TBII should be accomplished using the program theory framework in the evaluation of systems and behaviors, and their interactions, comprehensively in context.