2015
DOI: 10.1007/s10791-015-9273-z

Topic set size design

Abstract: Traditional pooling-based information retrieval (IR) test collections typically have n = 50–100 topics, but it is difficult for an IR researcher to say why the topic set size should really be n. The present study provides details on principled ways to determine the number of topics for a test collection to be built, based on a specific set of statistical requirements. We employ Nagata's three sample size design techniques, which are based on the paired t test, one-way ANOVA, and confidence intervals, respectively.
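The sample size design techniques the abstract names build on classical power analysis. As a hedged illustration of the paired t-test case only (a sketch with SciPy, not Sakai's exact procedure, which also covers one-way ANOVA and confidence-interval widths): given a target effect size, significance level, and power, find the smallest topic count n whose noncentral-t power meets the target.

```python
import math

from scipy import stats


def topic_set_size(effect_size: float,
                   alpha: float = 0.05,
                   power: float = 0.80) -> int:
    """Smallest n (number of topics) such that a two-sided paired t-test
    at level `alpha` detects a standardized mean difference `effect_size`
    with at least the requested power.  Illustrative sketch only."""
    for n in range(2, 10_000):
        df = n - 1
        t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
        nc = math.sqrt(n) * effect_size           # noncentrality parameter
        # Power = P(|T| > t_crit) under the noncentral t distribution.
        achieved = (1 - stats.nct.cdf(t_crit, df, nc)
                    + stats.nct.cdf(-t_crit, df, nc))
        if achieved >= power:
            return n
    raise ValueError("no n below 10000 reaches the requested power")
```

For a medium standardized effect (d = 0.5) at alpha = 0.05 and power 0.80, this lands in the mid-30s, close to the traditional 50-topic collections; for a small effect (d = 0.2) it demands hundreds of topics, which is the kind of gap the paper's design tables make explicit.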


Cited by 37 publications (24 citation statements)
References 32 publications (45 reference statements)
“…More in detail, we can note how ERR and ERR@20 do not have enough power to detect the LUG effects, both stemmers and n-grams, on both news and web search tasks, and how they also lack some power for detecting stop-list effects when n-grams are employed. This confirms that ERR and its variants are not robust measures, because they require more topics than other measures to detect reliable effect sizes (Sakai).…”
Section: Power Analysis and Measures Analysis (supporting, confidence: 61%)
“…Past work has investigated the ideal size of test collections and how many topics are needed for a reliable evaluation. While traditional TREC test collections employ 50 topics, a number of researchers have claimed that 50 topics are not sufficient for a reliable evaluation (Jones & van Rijsbergen, 1975; Voorhees, 2009; Urbano, Marrero, & Martín, 2013; Sakai, 2016c). Many researchers have reported that wide and shallow judging is preferable to narrow and deep judging (Sanderson & Zobel, 2005; Carterette & Smucker, 2007; Bodoff & Li, 2007).…”
Section: How Many Topics Are Needed? (mentioning, confidence: 99%)
“…In his follow-up studies, Sakai investigated the effect of score standardization (2016b) in topic set design (2016a) and provided guidelines for test collection design under a given fixed budget (2016c). Sakai, Shang, Lu, and Li (2015) applied the method of Sakai (2016c) to decide the number of topics for the evaluation measures of a Short Text Conversation task. Sakai and Shang (2016) explored how many topics and IR systems are needed for a reliable topic set size estimation.…”
Section: How Many Topics Are Needed? (mentioning, confidence: 99%)
“…In Topic Set Size Design, Sakai (2016) investigates the issue of how many topics should be selected for test collection-based evaluation. Previous analysis suggested that a minimum of 50 topics should be used to obtain stable effectiveness estimates, and this number has been used for many years as a heuristic to guide test collection construction, even when the context of the test collection (including the type of search task being evaluated, and the range of effectiveness metrics being used) varied.…”
Section: Overview of Papers (mentioning, confidence: 99%)
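The excerpt above notes that the 50-topic heuristic was applied regardless of task or metric. One way to see why a fixed n cannot serve every setting is to compute the power that n = 50 actually buys under a two-sided paired t-test at different effect sizes (an illustrative sketch; the effect-size values are assumptions, not figures from the paper):

```python
import math

from scipy import stats


def paired_t_power(n: int, effect_size: float, alpha: float = 0.05) -> float:
    """Power of a two-sided paired t-test with n topics and a standardized
    mean difference `effect_size`, via the noncentral t distribution."""
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    nc = math.sqrt(n) * effect_size           # noncentrality parameter
    return (1 - stats.nct.cdf(t_crit, df, nc)
            + stats.nct.cdf(-t_crit, df, nc))


# Fifty topics give comfortable power for a medium effect (d = 0.5) ...
print(paired_t_power(50, 0.5))
# ... but much less for a small effect (d = 0.2), so the "stable" heuristic
# depends entirely on the effect sizes the task and metric produce.
print(paired_t_power(50, 0.2))
```

This is the gap that principled topic set size design closes: instead of a one-size heuristic, n is derived from the statistical requirements (alpha, power, minimum detectable effect) the evaluator actually cares about.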