The leitmotiv throughout this thesis is IR evaluation. We discuss different issues related to effectiveness measures and the novel solutions that we propose to address these challenges. We start by providing a formal definition of utility-oriented measurement of retrieval effectiveness, based on the representational theory of measurement. The proposed theoretical framework contributes to a better understanding of the complexities of the problem, separating those due to the inherent difficulty of comparing systems from those due to the expected numerical properties of measures. We then propose AWARE, a probabilistic framework for dealing with the noise and inconsistencies introduced when relevance labels are gathered from multiple crowd assessors. By modeling relevance judgements and crowd assessors as sources of uncertainty, we directly combine the performance measures computed on the ground truth generated by each crowd assessor, instead of adopting a classification technique to merge the labels at the pool level. Finally, we investigate evaluation measures able to account for user signals. We propose a new user model based on Markov chains, which allows the user to scan the result list with many degrees of freedom. We exploit this Markovian model to inject user models into precision, defining a new family of evaluation measures, and we embed this model as the objective function of an LtR algorithm to improve system performance.

Nomenclature

MP     Markov Precision
MV     Majority Vote
nCG    normalized Cumulated Gain
nDCG   normalized Discounted Cumulated Gain
nMCG   normalized Markov Cumulated Gain
RBP    Rank-Biased Precision
SERP   Search Engine Result Page
SMART  System for the Mechanical Analysis and Retrieval of Text
TREC   Text REtrieval Conference

With the development of IR systems, it became necessary to design a framework to evaluate and compare different retrieval strategies. Indeed, progress and innovation are driven by experiments, but experimentation is useless without an objective evaluation measure that allows researchers to detect improvements and identify successful strategies.

In Chapter 4 we propose our upstream approach called Assessor-driven Weighted Averages for Retrieval Evaluation (AWARE) [Ferrante et al., 2017]. AWARE is defined as an upstream approach because it directly combines the scores of the evaluation measures computed from the relevance labels of each assessor, instead of merging the labels and then computing the measures. The focus is thus shifted from the documents and the labels to the evaluation measures. This allows us to account for the error introduced by incorrect labels and to develop a framework that estimates performance measures in a way that is more robust to crowd assessor variability.

So far, we have provided a formal definition of utility-oriented measurement of retrieval effectiveness and developed an approach to estimate performance measures in the presence of noise due to crowd assessor variability. Thus the effective...
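To make the contrast between the downstream view (merge the labels, then compute the measure) and the AWARE-style upstream view described above (compute the measure on each assessor's labels, then combine the scores) concrete, the following Python sketch compares the two for average precision. The helper names (average_precision, majority_vote_ap, aware_ap) and the uniform assessor weights are illustrative assumptions introduced here, not part of the thesis; in the actual framework the weights assigned to the assessors are not fixed to a constant.

```python
import numpy as np

def average_precision(run, qrels):
    """Average precision of a ranked list `run` given binary labels `qrels` (dict: doc -> 0/1)."""
    hits, score = 0, 0.0
    total_rel = sum(qrels.values())
    if total_rel == 0:
        return 0.0
    for i, doc in enumerate(run, start=1):
        if qrels.get(doc, 0) == 1:
            hits += 1
            score += hits / i
    return score / total_rel

def majority_vote_ap(run, assessor_qrels):
    """Downstream approach: merge the assessors' labels by majority vote, then compute AP once."""
    docs = set().union(*[set(q) for q in assessor_qrels])
    merged = {d: int(sum(q.get(d, 0) for q in assessor_qrels) > len(assessor_qrels) / 2)
              for d in docs}
    return average_precision(run, merged)

def aware_ap(run, assessor_qrels, weights=None):
    """Upstream (AWARE-style) sketch: compute AP on each assessor's labels,
    then combine the per-assessor scores; weights are uniform here as a placeholder."""
    if weights is None:
        weights = np.full(len(assessor_qrels), 1.0 / len(assessor_qrels))
    scores = np.array([average_precision(run, q) for q in assessor_qrels])
    return float(np.dot(weights, scores))
```

The point of the sketch is only where the combination happens: over labels in the first case, over measure scores in the second.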
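The Markovian user model mentioned in the opening paragraph can be pictured as a random walk over the ranks of the result list. The sketch below is a deliberately simple illustration under assumed parameters (a single p_forward probability and a stationary-distribution weighting of precision at each rank); it is not the definition of Markov Precision (MP) given in the thesis, only a way to see how a Markov chain lets the user move through the ranking with more freedom than a strictly top-down scan.

```python
import numpy as np

def simple_transition_matrix(n, p_forward=0.8):
    """Illustrative (assumed) transition matrix over result ranks 1..n: from rank i the user
    moves one position forward with probability p_forward and one position backward
    otherwise, reflecting at the ends. Not the transition model used in the thesis."""
    P = np.zeros((n, n))
    for i in range(n):
        fwd = min(i + 1, n - 1)
        back = max(i - 1, 0)
        P[i, fwd] += p_forward
        P[i, back] += 1.0 - p_forward
    return P

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalised to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    v = np.abs(v)
    return v / v.sum()

def markov_weighted_precision(rels, p_forward=0.8):
    """Hedged sketch: weight precision at each rank by how often the random walk
    visits that rank in the long run, then combine into a single score."""
    rels = np.asarray(rels, dtype=float)          # binary relevance of the ranked list
    n = len(rels)
    pi = stationary_distribution(simple_transition_matrix(n, p_forward))
    prec_at_k = np.cumsum(rels) / np.arange(1, n + 1)
    return float(np.dot(pi, prec_at_k))

# Example: a ranking of ten documents with relevant ones at ranks 1, 3 and 6.
# print(markov_weighted_precision([1, 0, 1, 0, 0, 1, 0, 0, 0, 0]))
```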