2007
DOI: 10.1016/j.ipm.2006.07.020

On the reliability of information retrieval metrics based on graded relevance

Cited by 86 publications (87 citation statements): 3 supporting, 83 mentioning, 0 contrasting.
References 14 publications.
“…Büttcher et al. (2007) also used Precision at l judged documents, which relies on condensed lists just like Q′, AP′ and nDCG′. However, Precision is not a satisfactory metric for us because: (1) it ignores the ranks of retrieved relevant documents; (2) it does not average well, especially with a large document cutoff; (3) with a small document cutoff, it gives unreliable results, as systems are evaluated on a small number of observations, i.e., documents near the top of the ranked list (Sakai 2007f).…”
Section: Related Work
Confidence: 99%
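To make criticism (1) concrete, here is a minimal Python sketch (not from any of the cited papers; the document IDs and judgments are hypothetical) of Precision at k computed over a condensed list, i.e. the ranking with unjudged documents removed. Any permutation of the documents within the cutoff yields the same score:

```python
# Minimal sketch: Precision@k over a "condensed" list (unjudged documents
# removed), as in the condensed-list metrics discussed above.
# All documents and judgments below are hypothetical.

def condensed(ranking, judged):
    """Drop unjudged documents, keeping the original order."""
    return [d for d in ranking if d in judged]

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant.
    Any permutation of the top k gives the same score, which is
    criticism (1) above: ranks within the cutoff are ignored."""
    top = ranking[:k]
    return sum(1 for d in top if d in relevant) / k

judged = {"d1", "d2", "d3", "d5", "d8"}           # pool of judged docs
relevant = {"d1", "d5"}                           # judged relevant docs
run = ["d1", "d4", "d2", "d5", "d6", "d3", "d8"]  # system output

cond = condensed(run, judged)                     # ['d1','d2','d5','d3','d8']
print(precision_at_k(cond, relevant, 3))          # 0.666..., relevant at ranks 1, 3
print(precision_at_k(["d2", "d1", "d5"], relevant, 3))  # 0.666..., relevant at ranks 2, 3
```

Both orderings score identically under Precision@3, even though rank-sensitive metrics such as AP or nDCG would score them differently.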
“…Studies have also investigated the stability of graded relevance-based measures and compared them with measures based on binary relevance. Sakai (2007) found that NDCG was as stable and sensitive as MAP, while Radlinski and Craswell (2010) found that NDCG is more stable when a small number of queries is used. Testing query set sizes from 5 to 30, the authors showed that MAP results could be misleading, since the worse ranking sometimes appeared statistically significantly better.…”
Section: R-Precision (R-Prec)
Confidence: 99%
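A rough sketch of the kind of query-subsampling comparison described above, under the assumption that per-query scores are available for two systems (all numbers below are synthetic, and this is a simplified proxy for the analyses in the cited studies, not their actual procedure): draw query subsets of increasing size and measure how often the systems' relative order flips.

```python
# Hypothetical sketch of a query-subsampling stability check: draw small
# query subsets and measure how often two systems' relative order (by mean
# metric score) flips. All scores here are synthetic.
import random

def flip_rate(scores_a, scores_b, subset_size, trials=1000, seed=0):
    """Fraction of random query subsets on which the sign of the mean
    score difference between two systems disagrees with the sign
    obtained on the full query set."""
    rng = random.Random(seed)
    n = len(scores_a)
    full_sign = sum(a - b for a, b in zip(scores_a, scores_b)) >= 0
    flips = 0
    for _ in range(trials):
        idx = rng.sample(range(n), subset_size)
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if (diff >= 0) != full_sign:
            flips += 1
    return flips / trials

# Synthetic per-query scores (e.g. nDCG or AP) for two systems, 30 topics:
gen_a, gen_b = random.Random(1), random.Random(2)
sys_a = [gen_a.random() for _ in range(30)]
sys_b = [gen_b.random() for _ in range(30)]
for k in (5, 10, 20, 30):
    print(f"subset size {k}: flip rate {flip_rate(sys_a, sys_b, k):.3f}")
```

A metric whose flip rate decays quickly as the subset grows is more stable in the sense discussed above; at the full query set size the flip rate is zero by construction.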
“…As we consider DTI prediction to be a per-drug ranking problem, we evaluate ranked lists for each drug separately and use normalized discounted cumulative gain (nDCG) as the evaluation metric, which was shown to be the best graded-relevance ranking metric with respect to stability and sensitivity [43]. For a ranked list cut off at position p, DCG is calculated as shown below. There are several advantages to using nDCG as the evaluation metric for DTI.…”
Section: Normalized Discounted Cumulative Gain
Confidence: 99%
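The DCG formula the excerpt refers to is not reproduced in it. A standard formulation is given below; the exponential-gain variant shown is one of two common choices, and the citing paper may instead use the linear-gain variant with $rel_i$ in place of $2^{rel_i} - 1$:

$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p},$$

where $rel_i$ is the graded relevance of the document at rank $i$ and $\mathrm{IDCG}_p$ is the DCG of the ideal ranking (judged documents sorted by decreasing relevance), so that $\mathrm{nDCG}_p \in [0, 1]$.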