Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007
DOI: 10.1145/1277741.1277910

Estimating collection size with logistic regression

Abstract: Collection size is an important feature for representing the content summary of a collection, and it plays a vital role in collection selection for distributed search. In uncooperative environments, collection size estimation algorithms estimate the sizes of collections through their search interfaces. This paper proposes the heterogeneous capture (HC) algorithm, in which the capture probabilities of documents are modeled with logistic regression. With heterogeneous capture probabilities, the HC algorithm estim…

Citations: cited by 10 publications (22 citation statements)
References: 10 publications
“…The size estimation of a corpus is also a key feature in the selection of search engines in federated search and distributed search [22,25]. In our scenario, the estimation of the vertical size is crucial since this statistic is needed in the best performing resource selection methods such as ReDDe.…”
Section: Vertical Size Estimation; citation type: mentioning
confidence: 99%
“…Request permissions from permissions@acm.org. JCDL'13, July 22-26, 2013, Indianapolis, Indiana, USA. which implies having access to the query logs and to detailed statistical descriptors of the verticals.…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…Multiple capture-recapture (Shokouhi et al 2006) is an extension of this approach to account for any number of samples. This has been used to estimate the size of different populations by observing the overlap between different samples (Xu et al 2007). A single urn model is used to assess the overall population size, by relating the total observed sample to the number of distinct items sampled.…”
Section: Single Urn Models For Protein Interaction Data; citation type: mentioning
confidence: 99%
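
The overlap idea in the statement above can be illustrated with the classic two-sample capture-recapture (Lincoln-Petersen) estimator. This is a minimal sketch for orientation only, not the multiple capture-recapture or HC method of the cited papers; the function name `lincoln_petersen` and the document IDs are hypothetical, and the estimator assumes every item is equally likely to be sampled.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate population size from two independent samples of item IDs.

    Assumes equal capture probability for every item, which is the
    simplification that heterogeneous-capture approaches relax.
    """
    first, second = set(first_sample), set(second_sample)
    overlap = len(first & second)  # items seen in both samples
    if overlap == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    # N is approximately |first| * |second| / |first ∩ second|
    return len(first) * len(second) / overlap


# Hypothetical document IDs returned by two rounds of probe queries.
sample_a = range(0, 100)    # 100 documents
sample_b = range(80, 180)   # 100 documents, 20 of them already seen
estimate = lincoln_petersen(sample_a, sample_b)
```

A small overlap between the two samples implies a large collection, and a large overlap implies a small one.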
“…Also, in the federated search engines, this information is helpful in the selection of search engines to satisfy the information needs of a posed query. This is also useful in the resource/collection selection in the distributed search [19]. In addition to these advantages, knowing about the size of a data collection can give an insight over some useful statistics which can be interesting for public sectors and governments.…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…This tendency to know the sizes of data sources is increased by the competition among businesses on the Web in which the data coverage is critical. In the context of quality assessment of search engines [7], search engine selection in the federated search engines, and in the resource/collection selection in the distributed search field [19], this information is also helpful. In addition, it can give an insight over some useful statistics for public sectors like governments.…”
Citation type: mentioning
confidence: 99%