Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2006
DOI: 10.1145/1148170.1148263

A statistical method for system evaluation using incomplete judgments

Cited by 124 publications
(100 citation statements)
References 7 publications
“…This ensures a sufficiently large training set for the initial classifier without losing much in the query selection performance. (Footnote 2: The results on P@100 are not reported due to lack of space.) As seen in Table 2, the required subset sizes for τ = {0.7, 0.8, 0.9} are statistically significantly smaller than those required for random sampling.…”
Section: Results of the Web Data (mentioning)
confidence: 99%
“…However, other sources of uncertainty could be considered. Recent research is particularly concerned with measuring uncertainty of the systems performance due to (i) partial relevance judgments [2,4,20] and (ii) errors in the relevance judgments made by human assessors [6,12]. Our future work will expand the theoretical model to incorporate additional sources of uncertainty and explore more general cost models for constructing test collections.…”
Section: Discussion (mentioning)
confidence: 99%
“…Besides sampling queries, it is also possible to sample subsets of documents to be labeled for a given query. Carterette et al [18] use document sampling to decide which of two ranking functions achieves higher precision at k. Aslam et al [19] use document sampling to obtain unbiased estimates of mean average precision and mean R-precision. Carterette and Smucker [20] study statistical significance testing from reduced document sets.…”
Section: Related Work (mentioning)
confidence: 99%
“…With small numbers of relevance assessments (say 1 or 2) it is difficult to use BPREF measure as there are too few pairs to usefully distinguish between objects leading to severe overfitting. There is a question on the stability of the BPREF function at low sampling rates - correlations between systems rankings tend to deteriorate at this level (Aslam et al, 2006). However, it is not clear that using little evidence is much use for optimisation problems in IR (see section 6.1 above).…”
Section: Equation (mentioning)
confidence: 99%