2015
DOI: 10.1007/s10791-015-9273-z

Topic set size design

Abstract: Traditional pooling-based information retrieval (IR) test collections typically have n = 50–100 topics, but it is difficult for an IR researcher to say why the topic set size should really be n. The present study provides details on principled ways to determine the number of topics for a test collection to be built, based on a specific set of statistical requirements. We employ Nagata's three sample size design techniques, which are based on the paired t test, one-way ANOVA, and confidence intervals, respectively.
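The sample size design techniques the abstract names build on classical power analysis. As a hedged illustration of the paired t-test case only (a sketch with SciPy, not Sakai's exact procedure, which also covers one-way ANOVA and confidence-interval widths): given a target effect size, significance level, and power, find the smallest topic count n whose noncentral-t power meets the target.

```python
import math

from scipy import stats


def topic_set_size(effect_size: float,
                   alpha: float = 0.05,
                   power: float = 0.80) -> int:
    """Smallest n (number of topics) such that a two-sided paired t-test
    at level `alpha` detects a standardized mean difference `effect_size`
    with at least the requested power.  Illustrative sketch only."""
    for n in range(2, 10_000):
        df = n - 1
        t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
        nc = math.sqrt(n) * effect_size           # noncentrality parameter
        # Power = P(|T| > t_crit) under the noncentral t distribution.
        achieved = (1 - stats.nct.cdf(t_crit, df, nc)
                    + stats.nct.cdf(-t_crit, df, nc))
        if achieved >= power:
            return n
    raise ValueError("no n below 10000 reaches the requested power")
```

For a medium standardized effect (d = 0.5) at alpha = 0.05 and power 0.80, this lands in the mid-30s, close to the traditional 50-topic collections; for a small effect (d = 0.2) it demands hundreds of topics, which is the kind of gap the paper's design tables make explicit.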


Cited by 37 publications (24 citation statements)
References 32 publications (45 reference statements)
“…More in detail, we can note how ERR and ERR@20 do not have enough power to detect the LUG effects, both stemmers and n-grams, on both news and web search tasks, and how they also lack some power for detecting stop-list effects when n-grams are employed. This confirms that ERR and its variants are not robust measures, because they require more topics than other measures to detect reliable effect sizes (Sakai).…”
Section: Power Analysis and Measures Analysis (supporting, confidence: 61%)
“…Past work has investigated the ideal size of test collections and how many topics are needed for a reliable evaluation. While traditional TREC test collections employ 50 topics, a number of researchers have claimed that 50 topics are not sufficient for a reliable evaluation (Jones & van Rijsbergen, 1975; Voorhees, 2009; Urbano, Marrero, & Martín, 2013; Sakai, 2016c). Many researchers have reported that wide and shallow judging is preferable to narrow and deep judging (Sanderson & Zobel, 2005; Carterette & Smucker, 2007; Bodoff & Li, 2007).…”
Section: How Many Topics Are Needed? (mentioning, confidence: 99%)
“…In his follow-up studies, Sakai investigated the effect of score standardization (2016b) in topic set design (2016a) and provided guidelines for test collection design under a given fixed budget (2016c). Sakai, Shang, Lu, and Li (2015) applied the method of Sakai (2016c) to decide the number of topics for the evaluation measures of a Short Text Conversation task. Sakai and Shang (2016) explored how many topics and IR systems are needed for a reliable topic set size estimation.…”
Section: How Many Topics Are Needed? (mentioning, confidence: 99%)
“…In Topic Set Size Design, Sakai (2016) investigates the issue of how many topics should be selected for test collection-based evaluation. Previous analysis suggested that a minimum of 50 topics should be used to obtain stable effectiveness estimates, and this number has been used for many years as a heuristic to guide test collection construction, even when the context of the test collection (including the type of search task being evaluated, and the range of effectiveness metrics being used) varied.…”
Section: Overview of Papers (mentioning, confidence: 99%)
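The excerpt above notes that the 50-topic heuristic was applied regardless of task or metric. One way to see why a fixed n cannot serve every setting is to compute the power that n = 50 actually buys under a two-sided paired t-test at different effect sizes (an illustrative sketch; the effect-size values are assumptions, not figures from the paper):

```python
import math

from scipy import stats


def paired_t_power(n: int, effect_size: float, alpha: float = 0.05) -> float:
    """Power of a two-sided paired t-test with n topics and a standardized
    mean difference `effect_size`, via the noncentral t distribution."""
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    nc = math.sqrt(n) * effect_size           # noncentrality parameter
    return (1 - stats.nct.cdf(t_crit, df, nc)
            + stats.nct.cdf(-t_crit, df, nc))


# Fifty topics give comfortable power for a medium effect (d = 0.5) ...
print(paired_t_power(50, 0.5))
# ... but much less for a small effect (d = 0.2), so the "stable" heuristic
# depends entirely on the effect sizes the task and metric produce.
print(paired_t_power(50, 0.2))
```

This is the gap that principled topic set size design closes: instead of a one-size heuristic, n is derived from the statistical requirements (alpha, power, minimum detectable effect) the evaluator actually cares about.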