Proceedings of the 10th ACM Workshop on Web Information and Data Management 2008
DOI: 10.1145/1458502.1458510

A comparison of techniques for estimating IDF values to generate lexical signatures for the web

Abstract: For bounded datasets such as the TREC Web Track, the computation of term frequency (TF) and inverse document frequency (IDF) is not difficult. However, since IDF cannot be directly calculated for the entire web, it must be estimated. We see a need to estimate accurate IDF values to generate TF-IDF based lexical signatures (LSs) of web pages. Future applications for generating such LSs require a real-time IDF computation. Therefore, we conducted a comparison study of different methods to estimate IDF values of we…

Citations: cited by 9 publications (6 citation statements)
References: 15 publications (22 reference statements)
“…We used the "screen scraping" approach and queried the Yahoo BOSS API to determine document frequency values for all terms and used numbers published by the website www.worldwidewebsize.com in October 2009 to estimate the size of the Yahoo index. We have shown in [23] that this approach is feasible and performs very well compared to other methods.…”
Section: LS Generation of Web Pages (mentioning)
confidence: 87%
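The approach quoted above (per-term document frequencies combined with an estimate of the index size) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lexical_signature`, the `doc_freq` dictionary, the `index_size` value, and all numbers are hypothetical stand-ins. In a real pipeline the document frequencies would come from search engine result counts and the index size from a published estimate.

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, index_size, k=5):
    """Return the k terms of `text` with the highest TF-IDF score.

    doc_freq   -- hypothetical mapping term -> estimated document frequency
    index_size -- hypothetical estimate of the number of indexed documents
    """
    terms = re.findall(r"[a-z]+", text.lower())
    tf = Counter(terms)
    scores = {}
    for term, count in tf.items():
        df = doc_freq.get(term)
        if not df:
            # No estimate (or an estimate of zero): skip the term
            # rather than guess an IDF value for it.
            continue
        scores[term] = count * math.log(index_size / df)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Made-up values for illustration only.
doc_freq = {"the": 900_000, "lexical": 1_200, "signature": 8_000, "web": 400_000}
index_size = 1_000_000
text = "The lexical signature of the web page: the lexical signature."
print(lexical_signature(text, doc_freq, index_size, k=2))
```

Terms without a document frequency estimate are dropped here; smoothing them instead would be an equally reasonable design choice.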
“…As a common approach, researchers use search engines to estimate the document frequency of a term ([13,19,31,39]). Even though the obtained values are only estimates ([1]), our earlier work [20] has shown that this approach actually works well compared to using a modern text corpus.…”
Section: Lexical Signature Generation (mentioning)
confidence: 99%
“…Computing IDF values requires knowledge about: 1) the size of the entire corpus (the Internet) in terms of number of documents and 2) the number of documents the term appears in. A related study (6) investigates different techniques for creating IDF values for web pages.…”
Section: Experiments Design (mentioning)
confidence: 99%
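The two quantities named in the statement above map onto one common textbook formulation of IDF. The symbols are assumptions introduced here for illustration: N is the estimated corpus size, df(t) is the number of documents containing term t, and tf(t, d) is the frequency of t in document d. The cited studies may use a variant, for example with smoothing or a different logarithm base.

```latex
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

For the web, neither N nor df(t) can be observed directly, which is why both must be estimated.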