2007
DOI: 10.1016/j.ipm.2006.07.020

On the reliability of information retrieval metrics based on graded relevance

Cited by 86 publications (87 citation statements): 3 supporting, 83 mentioning, 0 contrasting.
References 14 publications.
“…Büttcher et al. (2007) also used Precision at l judged documents, which relies on condensed lists just like Q′, AP′ and nDCG′. However, Precision is not a satisfactory metric for us because: (1) it ignores the ranks of retrieved relevant documents; (2) it does not average well, especially with a large document cutoff; (3) with a small document cutoff, it gives unreliable results, as systems are evaluated on a small number of observations, i.e., documents near the top of the ranked list (Sakai 2007f).…”
Section: Related Work
Confidence: 99%
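To make criticism (1) concrete, here is a minimal Python sketch (not from any of the cited papers; the document IDs and judgments are hypothetical) of Precision at k computed over a condensed list, i.e. the ranking with unjudged documents removed. Any permutation of the documents within the cutoff yields the same score:

```python
# Minimal sketch: Precision@k over a "condensed" list (unjudged documents
# removed), as in the condensed-list metrics discussed above.
# All documents and judgments below are hypothetical.

def condensed(ranking, judged):
    """Drop unjudged documents, keeping the original order."""
    return [d for d in ranking if d in judged]

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant.
    Any permutation of the top k gives the same score, which is
    criticism (1) above: ranks within the cutoff are ignored."""
    top = ranking[:k]
    return sum(1 for d in top if d in relevant) / k

judged = {"d1", "d2", "d3", "d5", "d8"}           # pool of judged docs
relevant = {"d1", "d5"}                           # judged relevant docs
run = ["d1", "d4", "d2", "d5", "d6", "d3", "d8"]  # system output

cond = condensed(run, judged)                     # ['d1','d2','d5','d3','d8']
print(precision_at_k(cond, relevant, 3))          # 0.666..., relevant at ranks 1, 3
print(precision_at_k(["d2", "d1", "d5"], relevant, 3))  # 0.666..., relevant at ranks 2, 3
```

Both orderings score identically under Precision@3, even though rank-sensitive metrics such as AP or nDCG would score them differently.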
“…Studies have also investigated the stability of graded relevance-based measures and compared them with measures based on binary relevance. Sakai (2007) found that NDCG was as stable and sensitive as MAP, while Radlinski and Craswell (2010) found that NDCG is more stable when a small number of queries is used. Testing query set sizes from 5 to 30, the authors showed that MAP results could be misleading, since the worse ranking sometimes appeared statistically significantly better.…”
Section: R-Precision (R-Prec)
Confidence: 99%
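A rough sketch of the kind of query-subsampling comparison described above, under the assumption that per-query scores are available for two systems (all numbers below are synthetic, and this is a simplified proxy for the analyses in the cited studies, not their actual procedure): draw query subsets of increasing size and measure how often the systems' relative order flips.

```python
# Hypothetical sketch of a query-subsampling stability check: draw small
# query subsets and measure how often two systems' relative order (by mean
# metric score) flips. All scores here are synthetic.
import random

def flip_rate(scores_a, scores_b, subset_size, trials=1000, seed=0):
    """Fraction of random query subsets on which the sign of the mean
    score difference between two systems disagrees with the sign
    obtained on the full query set."""
    rng = random.Random(seed)
    n = len(scores_a)
    full_sign = sum(a - b for a, b in zip(scores_a, scores_b)) >= 0
    flips = 0
    for _ in range(trials):
        idx = rng.sample(range(n), subset_size)
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if (diff >= 0) != full_sign:
            flips += 1
    return flips / trials

# Synthetic per-query scores (e.g. nDCG or AP) for two systems, 30 topics:
gen_a, gen_b = random.Random(1), random.Random(2)
sys_a = [gen_a.random() for _ in range(30)]
sys_b = [gen_b.random() for _ in range(30)]
for k in (5, 10, 20, 30):
    print(f"subset size {k}: flip rate {flip_rate(sys_a, sys_b, k):.3f}")
```

A metric whose flip rate decays quickly as the subset grows is more stable in the sense discussed above; at the full query set size the flip rate is zero by construction.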
“…As we consider DTI prediction to be a per-drug ranking problem, we evaluate ranked lists for each drug separately and use normalized discounted cumulative gain (nDCG) as the evaluation metric, which was shown to be the best graded-relevance ranking metric with respect to stability and sensitivity [43]. For a ranked list cut off at position p, DCG is calculated as shown below. There are several advantages to using nDCG as the evaluation metric for DTI.…”
Section: Normalized Discounted Cumulative Gain
Confidence: 99%
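The DCG formula the excerpt refers to is not reproduced in it. A standard formulation is given below; the exponential-gain variant shown is one of two common choices, and the citing paper may instead use the linear-gain variant with $rel_i$ in place of $2^{rel_i} - 1$:

$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p},$$

where $rel_i$ is the graded relevance of the document at rank $i$ and $\mathrm{IDCG}_p$ is the DCG of the ideal ranking (judged documents sorted by decreasing relevance), so that $\mathrm{nDCG}_p \in [0, 1]$.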