Proceedings of the 15th ACM International Conference on Information and Knowledge Management - CIKM '06 2006
DOI: 10.1145/1183614.1183699

Estimating corpus size via queries

Abstract: We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approa…
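The abstract does not spell out the form of the estimator, so the following is only a minimal sketch of one way a pool-based estimator of this kind could work, not the authors' implementation. It assumes that each document is weighted by the reciprocal of the number of pool queries it matches, so that every matched document contributes a total weight of exactly 1 across the whole pool. The toy corpus and the names `search`, `count_pool_matches`, `query_weight`, and `estimate_matched_docs` are hypothetical and chosen for illustration.

```python
import random

# Hypothetical toy corpus: document id -> set of terms it contains.
CORPUS = {0: {"a", "b"}, 1: {"b"}, 2: {"c"}, 3: {"z"}}

# Query pool of known size, fixed in advance (single-term queries here).
POOL = ["a", "b", "c"]


def search(query):
    """Stand-in for a query interface: ids of documents matching `query`."""
    return [doc_id for doc_id, terms in CORPUS.items() if query in terms]


def count_pool_matches(doc_id):
    """Number of pool queries that the given document matches."""
    return sum(1 for query in POOL if query in CORPUS[doc_id])


def query_weight(query):
    """Sum of 1 / (pool matches) over the documents returned for `query`."""
    return sum(1.0 / count_pool_matches(doc_id) for doc_id in search(query))


def estimate_matched_docs(num_samples, rng):
    """Estimate how many documents match at least one query in POOL.

    Queries are drawn uniformly from the pool; the mean query weight,
    scaled by the pool size, is unbiased for the number of matched documents
    under the weighting assumed above.
    """
    total = sum(query_weight(rng.choice(POOL)) for _ in range(num_samples))
    return len(POOL) * total / num_samples


estimate = estimate_matched_docs(num_samples=20000, rng=random.Random(0))
print(f"estimated matched documents: {estimate:.3f}")
```

In this toy setting, three of the four documents match at least one pool query, and summing `query_weight` over the entire pool recovers that count exactly; sampling only a subset of queries trades exactness for fewer calls to the query interface.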

Cited by 46 publications (77 citation statements)
References 13 publications (22 reference statements)
“…For example, in [25] the authors develop an algorithm to estimate the size of any set of documents defined by certain conditions based on previously executed queries. Whereas [26] describes an algorithm to estimate the corpus size for a meta-search engine in order to better direct queries to search engines.…”
Section: Related Work (mentioning)
confidence: 99%
“…The average degree of the graph is 2. The sample degrees taken by RN, RE, and RW sampling methods are (1, 1, 1, 1, 2, 8), (1, 8, 1, 8, 2, 4), and (4, 3, 8, 1, 8, 1), respectively. The estimations for RN, RE, and RW samples are:…”
Section: RN, RE, and RW Sampling (mentioning)
confidence: 99%
“…Regardless of the causes, a common challenge is to reveal the properties of such datasets when we do not own the entire data. In the past, extensive research was carried out to explore the profile of search engines [17] and other data collections [4,6,33]. Most of them focused on obtaining uniform random node (RN) samples, such as uniform random web pages from the Web [11] and search engines [2], and uniform random bloggers from online social networks [9].…”
Section: Introduction (mentioning)
confidence: 99%
“…By applying our techniques continuously, an advertiser can track the popularity of her keywords over time. 6 Search engine evaluation and ImpressionRank sampling.…”
Section: Google Toolbar (mentioning)
confidence: 99%
“…The limited access to search engines' indices via their public interfaces make the problem of evaluating the quality of these indices very challenging. Previous work [5,3,6,4] has focused on generating random uniform sample pages from the index. The samples have then been used to estimate index quality metrics, like index size and index freshness.…”
Section: Google Toolbar (mentioning)
confidence: 99%