Evaluating Relevance Judgments with Pairwise Discriminative Power

Chu, Zhumin; Mao, Jiaxin; Zhang, Fan; Liu, Yiqun; Sakai, Tetsuya; Zhang, Min; Ma, Shaoping

doi:10.1145/3459637.3482428

Cited by 2 publications

(4 citation statements)

References 25 publications


“…In future work, we plan to extend our study to other IR metrics like bpref [11] and infAP [57] for KGC evaluation. Other factors over meta-evaluation on metrics will also be examined, such as pairwise discriminative power [16], user satisfaction [15,29,59], and diversity [4,40].…”

Section: Discussion (mentioning)

confidence: 99%

Re-thinking Knowledge Graph Completion Evaluation from an Information Retrieval Perspective

Zhou, Chen, et al. 2022

Preprint


Knowledge graph completion (KGC) aims to infer missing knowledge triples based on known facts in a knowledge graph. Current KGC research mostly follows an entity ranking protocol, wherein the effectiveness is measured by the predicted rank of a masked entity in a test triple. The overall performance is then given by a micro(-average) metric over all individual answer entities. Due to the incomplete nature of the large-scale knowledge bases, such an entity ranking setting is likely affected by unlabelled top-ranked positive examples, raising questions on whether the current evaluation protocol is sufficient to guarantee a fair comparison of KGC systems. To this end, this paper presents a systematic study on whether and how the label sparsity affects the current KGC evaluation with the popular micro metrics. Specifically, inspired by the TREC paradigm for large-scale information retrieval (IR) experimentation, we create a relatively "complete" judgment set based on a sample from the popular FB15k-237 dataset following the TREC pooling method. According to our analysis, it comes as a surprise that switching from the original labels to our "complete" labels results in a drastic change of system ranking of a variety of 13 popular KGC models in terms of micro metrics. Further investigation indicates that the IR-like macro(-average) metrics are more stable and discriminative under different settings, meanwhile, less affected by label sparsity. Thus, for KGC evaluation, we recommend conducting TREC-style pooling to balance between human efforts and label completeness, and reporting also the IR-like macro metrics to reflect the ranking nature of the KGC task.


“…Additionally, a comparison conducted by Yang et al (2018) reveals that preference judgment is more reliable than other paradigms. Chu et al (2021) propose a combined evaluation metric named pairwise discriminative power (PDP) to evaluate the quality of relevance judgment collections with both pair-wise signals and point-wise signals. A novel combined metric proposed by Arabzadeh et al (2023) is applicable for instant search rather than offline search.…”

Section: Relevance Judgment (mentioning)

confidence: 99%

“…For example, to the best of our knowledge, there is no universal grading scheme (i.e., how many levels to use and what those levels mean) in point-wise relevance judgment (Xie et al, 2020). Different numerical scales will significantly affect evaluation performance in various scenarios (Chu et al, 2021), as they determine the granularity of judgment and the interpretation of each level, which hurts the reliability of point-wise relevance judgment in practice.…”

Section: Introduction (mentioning)

confidence: 99%

“…Additionally, investigating why users find it easier and more efficient to make relevance judgments in the pair-wise paradigm than in the point-wise paradigm is of interest. With efforts being made to integrate pair-wise and pointwise relevance judgments for improved annotation efficiency (Chu et al, 2021; Yan et al, 2022), a neurological understanding of these paradigms can inform the design of more user-friendly annotation tasks.…”

Section: Introduction (mentioning)

confidence: 99%


Comparing point‐wise and pair‐wise relevance judgment with brain signals

Zhu, Xie, et al. 2024

Asso for Info Science & Tech

Self Cite


How to collect relevance judgment has long been an important problem in Information Retrieval (IR). A popular method is to collect relevance judgment in a point‐wise manner, in which assessors examine and give an absolute relevance score for each item independently of the others. As an alternative, pair‐wise relevance judgment, also named preference judgment, allows an assessor to compare two items side‐by‐side and express their preference for one over the other. Previous work has explored the differences between these two paradigms of relevance judgments from many different aspects. Most of these works are conducted through explicit/implicit feedback. However, few works investigate the underlying neurological mechanisms of the two paradigms. In this paper, we conduct a lab study to investigate and compare point‐wise and pair‐wise relevance judgment in image search scenarios. We study the neurological mechanisms of the two paradigms through an event‐related potential (ERP) analysis of the users' brain signals while viewing images during a search process. We have obtained several observations, such as search engine users tend to pay more attention to preferred items in the point‐wise paradigm but unpreferred items in the pair‐wise paradigm. Furthermore, we test the adoption of brain signals as implicit feedback for predicting pair‐wise relevance judgment, highlighting the feasibility of leveraging brain signals to understand users' relevance judgments.
