Toward an anatomy of IR system component performances

Ferro, Nicola; Silvello, Gianmaria

doi:10.1002/asi.23910

Cited by 26 publications

(25 citation statements)

References 26 publications


“…The effectiveness of IR systems heavily depends on a large number of configurations that need to be tuned [28,57]. Configurations range from the choice of different system components, e.g., stopword lists, stemming methods, retrieval models, to model parameters.…”

Section: Optimization (mentioning)

confidence: 99%

Effective collection construction for information retrieval evaluation and optimization

2020

SIGIR Forum


The availability of test collections in the Cranfield paradigm has significantly benefited the development of models, methods and tools in information retrieval. Such test collections typically consist of a set of topics, a document collection and a set of relevance assessments. Constructing these test collections requires effort on several fronts, such as topic selection, document selection, relevance assessment, and relevance label aggregation. The work in the thesis provides a fundamental way of constructing and utilizing test collections in information retrieval in an effective, efficient and reliable manner. To that end, we have focused on four aspects. We first study the document selection issue when building test collections. We devise an active sampling method for efficient large-scale evaluation [Li and Kanoulas, 2017]. Different from past sampling-based approaches, we account for the fact that some systems are of higher quality than others, and we design the sampling distribution to over-sample documents from these systems. At the same time, the estimated evaluation measures are unbiased, and assessments can be used to evaluate new, novel systems without introducing any systematic error. Then a natural further step is determining when to stop the document selection and assessment procedure. This is an important but understudied problem in the construction of test collections. We consider both the gain of identifying relevant documents and the cost of assessing documents as the optimization goals. We handle the problem under the continuous active learning framework by jointly training a ranking model to rank documents, and estimating the total number of relevant documents in the collection using a "greedy" sampling method [Li and Kanoulas, 2020]. The next stage of constructing a test collection is assessing relevance. 
We study how to denoise relevance assessments by aggregating from multiple crowd annotation sources to obtain high-quality relevance assessments. This helps to boost the quality of relevance assessments acquired in a crowdsourcing manner. We assume a Gaussian process prior on query-document pairs to model their correlation. The proposed model, CrowdGP, shows good performance in terms of inferring true relevance labels. Besides, it allows predicting relevance labels for new tasks that have no crowd annotations, which is a new functionality of CrowdGP. Ablation studies demonstrate that the effectiveness is attributed to the modelling of task correlation based on the auxiliary information of tasks and the prior relevance information of documents to queries. After a test collection is constructed, it can be used to either evaluate retrieval systems or train a ranking model. We propose to use it to optimize the configuration of retrieval systems. We use a Bayesian optimization approach to model the effect of a δ-step in the configuration space on the effectiveness of the retrieval system, by suggesting the use of different similarity functions (covariance functions) for continuous and categorical values, and examine their ability to effectively and efficiently guide the search in the configuration space [Li and Kanoulas, 2018]. Beyond the algorithmic and empirical contributions, work done as part of this thesis also contributed to the research community as the CLEF Technology Assisted Reviews in Empirical Medicine Tracks in 2017, 2018, and 2019 [Kanoulas et al., 2017, 2018, 2019]. Awarded by: University of Amsterdam, Amsterdam, The Netherlands. Supervised by: Evangelos Kanoulas. Available at: https://dare.uva.nl/search?identifier=3438a2b6-9271-4f2c-add5-3c811cc48d42.


“…The factors used in an ANOVA analysis do not have to be the components of a test collection. Ferro and Silvello [21,22] systematically varied the components of an IR system: stop list, stemmer, ranking model, and so on, by using the grid-of-points approach proposed by Ferro and Harman [19]. The analysis allowed the researchers to understand the relative impact of each system component on performance.…”
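As a reading aid, the grid-of-points idea in the statement above can be sketched as follows: enumerate every combination of component choices, score each resulting system, and compare the mean score per level of one factor. This is only a minimal sketch of main-effect means, not a full ANOVA and not the cited authors' implementation. The component names and the `placeholder_score` function are hypothetical assumptions, and the scores are synthetic, not experimental results.

```python
from itertools import product
from statistics import mean

# Hypothetical component choices; placeholders, not the grid used in the cited work.
COMPONENTS = {
    "stop_list": ["none", "short", "long"],
    "stemmer": ["none", "porter"],
    "model": ["bm25", "tfidf"],
}


def grid_of_points(components):
    """Return every system configuration as a dict, one per combination."""
    names = list(components)
    return [dict(zip(names, combo)) for combo in product(*components.values())]


def main_effects(configs, scores, factor):
    """Mean score per level of one factor, averaged over all other factors."""
    by_level = {}
    for config, score in zip(configs, scores):
        by_level.setdefault(config[factor], []).append(score)
    return {level: mean(values) for level, values in by_level.items()}


def placeholder_score(config):
    """Synthetic stand-in for an effectiveness measure; NOT real data."""
    return 0.1 * (config["stemmer"] == "porter") + 0.05 * (config["model"] == "bm25")


configs = grid_of_points(COMPONENTS)          # 3 * 2 * 2 configurations
scores = [placeholder_score(c) for c in configs]
stemmer_effect = main_effects(configs, scores, "stemmer")
```

In a real study the placeholder would be replaced by a measured effectiveness score per configuration and topic, and the level means would feed a proper variance decomposition.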

Section: ANOVA (mentioning)

confidence: 99%

Using Collection Shards to Study Retrieval Performance Effect Sizes

Ferro

Kim

Sanderson

2019

ACM Trans. Inf. Syst.

Self Cite


Despite the bulk of research studying how to more accurately compare the performance of IR systems, less attention is devoted to better understanding the different factors that play a role in such performance and how they interact. This is the case of shards, i.e., partitioning a document collection into sub-parts, which are used for many different purposes, ranging from efficiency to selective search or making test collection evaluation more accurate. In all these cases, there is empirical knowledge supporting the importance of shards, but we lack actual models that allow us to measure the impact of shards on system performance and how they interact with topics and systems. We use the general linear mixed model framework and present a model that encompasses the experimental factors of system, topic, shard, and their interaction effects. This detailed model allows us to more accurately estimate differences between the effect of various factors. We study shards created by a range of methods used in prior work and better explain observations noted in prior work in a principled setting and offer new insights. Notably, we discover that the topic*shard interaction effect, in particular, is a large effect almost globally across all datasets, an observation that, to our knowledge, has not been measured before.
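As a reading aid, one plausible form of a model with system, topic, and shard factors and their interactions is sketched below. It assumes main effects plus all two-way interactions; the symbols are illustrative assumptions, and the exact specification should be taken from the cited paper.

```latex
y_{ijk} = \mu + \tau_i + \alpha_j + \beta_k
        + (\tau\alpha)_{ij} + (\tau\beta)_{ik} + (\alpha\beta)_{jk}
        + \varepsilon_{ijk}
```

Here $y_{ijk}$ is the score of system $j$ on topic $i$ and shard $k$, $\mu$ is the grand mean, $\tau_i$, $\alpha_j$, and $\beta_k$ are the topic, system, and shard effects, the parenthesized terms are interaction effects (with $(\tau\beta)_{ik}$ the topic*shard interaction highlighted in the abstract), and $\varepsilon_{ijk}$ is the error term.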


“…the Grid of Points (GoP), arising from the combinatorial composition of several open-source publicly available components such as stop lists, stemmers, and IR models, and run against 6 different public test collections shared by the Text REtrieval Conference (TREC) international evaluation initiative. Thanks to this GoP, in [8] we presented the deep statistical analyses we run and the insights we gathered about the individual contributions of single IR components to the overall performances of fully working IR systems.…”

Section: Motivations (mentioning)

confidence: 99%

An InfoVis Tool for Interactive Component-Based Evaluation

Rocco,

Silvello

2019

Preprint

Self Cite

