2015
DOI: 10.1002/asi.23416
The twist measure for IR evaluation: Taking user's effort into account

Abstract: We present a novel measure for ranking evaluation, called Twist (τ). It is a measure for informational intents, which handles both binary and graded relevance. τ stems from the observation that searching is currently taken for granted and it is natural for users to assume that search engines are available and work well. As a consequence, users may take for granted the utility they gain from finding relevant documents, which is the focus of traditional measures. On the contrary, th…


Cited by 9 publications (6 citation statements)
References 52 publications
“…Twist (Ferro, Silvello, Keskustalo, Pirkola, & Järvelin, ) is a measure for informational intents, which handles both binary and graded relevance. Twist adopts a user model where the user scans the ranked list from top to bottom until s/he stops, and returns an estimate of the effort required by the user to traverse the ranked list.…”
Section: Grid of Points Measures and Setup
confidence: 99%
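The user model quoted above (a top-to-bottom scan of the ranked list until the user stops, with the measure estimating the effort spent in the traversal) can be illustrated with a toy sketch. This is a deliberately simplified illustration, not the actual τ formula from the paper; the stopping rank and the one-unit-per-document effort model are assumptions:

```python
def traversal_effort(relevance, stop_rank):
    """Toy effort estimate for a user scanning ranks 1..stop_rank.

    relevance: 0/1 judgments in ranked order (assumption: binary
    relevance, one unit of effort per document examined).
    Returns the fraction of examined documents that were non-relevant,
    i.e. the effort wasted before the user stops scanning.
    """
    examined = relevance[:stop_rank]
    wasted = sum(1 for r in examined if r == 0)
    return wasted / len(examined)

# A run with relevant documents at ranks 1 and 3; the user stops at rank 4:
# two of the four examined documents were non-relevant.
print(traversal_effort([1, 0, 1, 0, 1], 4))  # 0.5
```

Under this sketch a run that places all relevant documents above the stopping rank wastes no effort, which mirrors the intuition that effort-based measures reward front-loaded rankings.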
“…Finally, it would be interesting to experiment with what happens in the case of graded-relevance judgments. Not only is this a natural setting for nDCG and ERR, it also opens up other evaluation measures such as Graded Average Precision (GAP) and its extensions [Ferrante et al, 2014b; Robertson et al, 2010] or effort-based measures such as Twist [Ferro et al, 2016b].…”
Section: Discussion
confidence: 99%
“…We stem from [Angelini et al, 2014;Ferro et al, 2016b] for defining the basic concepts of topics, documents, ground-truth, run, and judged run. To the best of our knowledge, these basic concepts have not been explicitly defined in previous works [Amigó et al, 2013;Busin and Mizzaro, 2013;Maddalena and Mizzaro, 2014;Moffat, 2013].…”
Section: Preliminary Definitions
confidence: 99%
“…For T09, T10, T13, T14, and T15, we perform a lenient mapping of the relevance judgments by considering as relevant both highly relevant and relevant documents. • Graded: normalized Discounted Cumulated Gain (nDCG) [30], Expected Reciprocal Rank (ERR) [13], and Twist [23]. For T07, we calculate nDCG using binary relevance by setting gain to 0 for non-relevant documents and to 5 for relevant.…”
Section: Methods
confidence: 99%
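The nDCG computation described for T07 above (binary judgments mapped to a gain of 0 for non-relevant and 5 for relevant documents) can be sketched as follows. The log2 discount used here is the common formulation; the cited experiments do not state their exact discount base, so treat this as an illustrative implementation rather than the one used in that work:

```python
import math

def dcg(gains):
    # Discounted cumulated gain: the document at 1-based rank i
    # contributes gain / log2(i + 1), so rank 1 is undiscounted.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    # Normalize by the DCG of the ideal (descending-gain) ordering.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Binary judgments mapped as described: non-relevant -> 0, relevant -> 5.
judgments = [1, 0, 1, 1, 0]
gains = [5 * r for r in judgments]
print(round(ndcg(gains), 3))
```

Note that because nDCG is a ratio, scaling all gains by the same constant (here, 5) leaves the score unchanged; the mapping mainly matters when mixing graded and binary collections under one gain scale.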