2002
DOI: 10.1002/asi.10137
Using graded relevance assessments in IR evaluation

Abstract: This article proposes evaluation methods based on the use of nondichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents. This is desirable from the user point of view in modern large IR environments. The proposed methods are (1) a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance, and (2) generalized rec…
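To make the second proposal concrete, here is a minimal sketch of generalized precision and recall computed over graded relevance scores rescaled to [0, 1]: retrieved documents contribute their relevance grade rather than a binary hit. The function names and toy data are illustrative, not the paper's own code.

```python
# Minimal sketch: generalized precision and recall over graded relevance
# scores in [0, 1], crediting systems for retrieving highly relevant
# documents. Names and data are illustrative.

def generalized_precision(retrieved, grades):
    """Sum of graded relevance of retrieved docs / number retrieved."""
    if not retrieved:
        return 0.0
    return sum(grades.get(d, 0.0) for d in retrieved) / len(retrieved)

def generalized_recall(retrieved, grades):
    """Sum of graded relevance of retrieved docs / total graded relevance
    in the recall base (all docs with a nonzero grade)."""
    total = sum(grades.values())
    if total == 0:
        return 0.0
    return sum(grades.get(d, 0.0) for d in retrieved) / total

# Example: grades on a 0-3 scale, rescaled to [0, 1].
raw = {"d1": 3, "d2": 1, "d3": 2, "d4": 0}
grades = {d: g / 3 for d, g in raw.items() if g > 0}
ranked = ["d1", "d4", "d3"]
print(generalized_precision(ranked, grades))  # (1.0 + 0 + 2/3) / 3
print(generalized_recall(ranked, grades))     # (1.0 + 2/3) / 2.0
```

With binary judgements both functions collapse to ordinary precision and recall, which is why the generalized versions are a strict extension of the classic measures.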

Cited by 174 publications (118 citation statements). References 30 publications.
“…We were only interested in data sets with more than two relevance levels, firstly because calibration for binary relevance labels does not make much sense, and secondly because the difference between various learning algorithms can be more pronounced in the multi-label case. Note also that the general consensus in the IR community is that graded relevance labels are superior to the binary setup when large document collections are involved (Kekäläinen and Järvelin 2002; Sakai 2007). The features were normalized query-wise in the LETOR data sets (OHSUMED, MQ2007, MQ2008), so we did not preprocess them.…”
Section: Data Sets (mentioning)
confidence: 99%
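The query-wise normalization this excerpt mentions is a standard preprocessing step for LETOR-style learning-to-rank data: each feature is rescaled within a single query's candidate documents rather than over the whole collection. A minimal sketch, assuming features are stored as a dict mapping query id to per-document feature vectors; the layout and names are assumptions for illustration.

```python
# Sketch: query-wise min-max feature normalization for LETOR-style data.
# Assumed layout: {query_id: [feature_vector_per_document, ...]}.

def normalize_querywise(features_by_query):
    """Rescale each feature to [0, 1] separately within every query."""
    normalized = {}
    for qid, rows in features_by_query.items():
        n_feats = len(rows[0])
        lo = [min(r[j] for r in rows) for j in range(n_feats)]
        hi = [max(r[j] for r in rows) for j in range(n_feats)]
        normalized[qid] = [
            [(r[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j in range(n_feats)]
            for r in rows
        ]
    return normalized

feats = {"q1": [[1.0, 10.0], [3.0, 10.0]]}
print(normalize_querywise(feats))  # {'q1': [[0.0, 0.0], [1.0, 0.0]]}
```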
“…This assumes binary relevance. Graded relevance assessment measures such as the Discounted Cumulative Gain (DCG) measure can also be tackled (Kekäläinen and Järvelin, 2002).…”
Section: Areas Of Learning In IR Supportable By MCQs (mentioning)
confidence: 99%
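The DCG measure named here discounts the graded relevance gain at each rank by a logarithm of the rank, so highly relevant documents retrieved early count most. A minimal sketch of the original Järvelin-Kekäläinen formulation, in which ranks whose log falls below 1 are left undiscounted; the toy run is made up for illustration.

```python
import math

def dcg(grades, k=None, base=2):
    """Discounted cumulative gain over a ranked list of graded relevance
    values. Ranks whose log_base is below 1 (including rank 1) are left
    undiscounted, per the original formulation."""
    if k is not None:
        grades = grades[:k]
    return sum(g / max(1.0, math.log(rank, base))
               for rank, g in enumerate(grades, start=1))

# Toy run: graded relevance (0-3 scale) of documents in ranked order.
run = [3, 2, 3, 0, 1, 2]
ideal = sorted(run, reverse=True)   # best possible ordering of the same docs
print(dcg(run))                     # ≈ 8.10
print(dcg(run) / dcg(ideal))        # normalized DCG, ≈ 0.93
```

Dividing by the DCG of the ideal ordering yields the normalized variant, which makes scores comparable across queries with different numbers of relevant documents.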
“…They can be negative or positive. Typically the relevance judgements are binary (0 or 1), but there is increasing interest in using graded relevance (Kekäläinen and Järvelin, 2002). The distribution of relevant documents differs significantly between collections (Hawking and Robertson, 2003), a further aspect any practitioner should think about when choosing a test collection.…”
Section: Benchmarking And Test Collections (mentioning)
confidence: 99%