2005
DOI: 10.1007/11562382_1

The Reliability of Metrics Based on Graded Relevance

Abstract: NTCIR was the first large-scale IR evaluation conference to construct test collections with graded relevance assessments: the NTCIR-1 test collections from 1998 already featured relevant and partially relevant documents. In this paper, I first describe a few graded-relevance measures that originated from NTCIR (and a few variants) which are used across different NTCIR tasks. I then provide a survey on the use of graded relevance assessments and of graded relevance measures in the past NTCIR tasks which primari…

Citations: cited by 6 publications (16 citation statements)
References: 21 publications
“…At AIRS 2005, Sakai [8,14] reported on stability and swap experiments that included IR metrics based on graded relevance, namely, Q-measure, R-measure and normalised (Discounted) Cumulative Gain (n(D)CG) [7]. But this study was limited to IR metrics for the task of finding as many relevant documents as possible.…”
Section: Previous Work (mentioning)
confidence: 98%
“…As for the task of finding one highly relevant document, two previous studies based on the stability and the swap methods reported that P-measure [12] and O-measure [10] may be more stable and sensitive than RR. As Table 1 shows, all of these studies involving graded-relevance metrics [8,10,12,14] used either the NTCIR-3 or the NTCIR-5 data, but not both.…”
Section: Previous Work (mentioning)
confidence: 98%
“…Evaluation measures for the iUnit Ranking subtask are nDCG k (k = 3, 5, 10, and 20) [1] and the Q-measure [21]. Metric normalized discounted cumulative gain (nDCG) is commonly used as a measure to evaluate document retrieval.…”
Section: iUnit Ranking Subtask (mentioning)
confidence: 99%
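
For readers unfamiliar with the nDCG at cutoff k mentioned in the last statement, the following is a minimal sketch of one common formulation, not necessarily the variant used in the cited papers. It assumes a log2 rank discount and raw gain values; the function names `dcg_at_k` and `ndcg_at_k` and the example gains are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k ranks.

    Assumption: the gain at rank r is divided by log2(r + 1), one common
    choice of discount; other variants exist.
    """
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains: Sequence[float],
              judged_gains: Sequence[float],
              k: int) -> float:
    """nDCG@k: DCG of the system ranking divided by DCG of an ideal ranking.

    ranked_gains: gain of each retrieved item, in system rank order.
    judged_gains: gains of all judged items for the topic (any order);
                  sorting them in decreasing order gives the ideal list.
    """
    ideal_dcg = dcg_at_k(sorted(judged_gains, reverse=True), k)
    if ideal_dcg == 0:
        # No relevant items were judged, so the score is defined as 0 here.
        return 0.0
    return dcg_at_k(ranked_gains, k) / ideal_dcg


# Hypothetical graded gains: 3 = highly relevant, 1 = partially relevant.
for k in (3, 5, 10, 20):
    print(k, ndcg_at_k([1, 3, 0, 2], [3, 2, 1, 0], k))
```

A ranking that already lists items in decreasing order of gain scores 1.0 at every cutoff, and swapping a highly relevant item to a lower rank reduces the score because of the rank discount.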