2002
DOI: 10.1108/10662240210422503

Methodologies for crawler based Web surveys

Abstract: There have been many attempts to study the content of the web, either through human or automatic agents. Five different previously used web survey methodologies are described and analysed, each justifiable in its own right, but a simple experiment is presented that demonstrates concrete differences between them. The concept of crawling the web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of dupl…
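The abstract's discussion of duplicate-page identification is truncated above, so the paper's own algorithm is not visible here. As an illustrative assumption only, a common approach in crawler surveys is to fingerprint each page's normalised content and treat pages with identical digests as duplicates; the helper names below are hypothetical.

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """Digest of a page's normalised content.

    Lower-casing and collapsing whitespace is a crude normalisation;
    a real survey might strip markup or boilerplate first (assumption).
    """
    normalised = " ".join(html.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> dict[str, str]:
    """Keep one URL per distinct fingerprint; drop exact-content duplicates."""
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for url, html in pages.items():
        fp = content_fingerprint(html)
        if fp not in seen:
            seen.add(fp)
            unique[url] = html
    return unique
```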

Cited by 50 publications (32 citation statements)
References 54 publications (76 reference statements)
“…Finding a representative sample of websites is not trivial (14). For simplicity we randomly sampled 300 websites from dmoz.org as our initial set of URLs.…”
Section: Experiments Design (mentioning)
confidence: 99%
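As a minimal sketch of the sampling step quoted above, assuming the directory (e.g. dmoz.org) has already been exported to a local one-URL-per-line listing; the file name and sample size are illustrative, not the citing authors' exact procedure.

```python
import random


def sample_seed_sites(listing_path: str, k: int = 300,
                      seed: int | None = None) -> list[str]:
    """Draw k sites uniformly at random from a one-URL-per-line directory export."""
    with open(listing_path, encoding="utf-8") as fh:
        sites = [line.strip() for line in fh if line.strip()]
    rng = random.Random(seed)
    return rng.sample(sites, k)


# Hypothetical usage with a local export of the directory:
# seeds = sample_seed_sites("dmoz_sites.txt", k=300)
```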
“…National biases in total indexed sizes for sites would thus be hard to prove, because showing that a search engine had a higher site limit for one country than another would involve finding and crawling a sample of very large sites. One paper has surveyed techniques for collecting data through crawlers, including random selection techniques (Thelwall, 2002a). It recommended random sampling of a subset of domains by domain name, which should generate relatively random samples of websites that are not dependent on commercial search engines.…”
Section: Literature Review (mentioning)
confidence: 99%
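A hedged sketch of the domain-name sampling idea attributed to Thelwall (2002a) above: draw registered domain names at random and keep only those that respond to an HTTP request. The registry listing file and the liveness probe are assumptions for illustration, not the paper's exact method.

```python
import random
import urllib.error
import urllib.request


def sample_live_domains(registry_path: str, k: int, seed: int | None = None,
                        timeout: float = 5.0) -> list[str]:
    """Randomly sample registered domain names and keep those serving a website."""
    with open(registry_path, encoding="utf-8") as fh:
        domains = [line.strip() for line in fh if line.strip()]
    rng = random.Random(seed)
    rng.shuffle(domains)

    live: list[str] = []
    for domain in domains:
        if len(live) >= k:
            break
        try:
            # Probe the site root; any HTTP response counts as "live" here.
            with urllib.request.urlopen(f"http://{domain}/", timeout=timeout):
                live.append(domain)
        except (urllib.error.URLError, ValueError, OSError):
            continue  # unreachable or malformed entries are skipped
    return live
```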
“…An alternative approach is random IP address sampling (e.g. O'Neill et al., 1997; Lawrence & Giles, 1999) within the range allocated to a country, but the virtual server facility now makes this approach ineffective (Thelwall, 2002a). The virtual server HTTP facility was introduced partly to combat the problem of an insufficient number of IP addresses to allocate one to each domain name.…”
Section: Step 1: Sampling Sites and Crawling Valid Sites (mentioning)
confidence: 99%
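To make the virtual-hosting point concrete, here is a minimal sketch of IP-address sampling within an allocated range (the CIDR block and the port probe are illustrative assumptions). Because many name-based virtual hosts share one IP address and the server selects the site from the HTTP Host header, connecting to a sampled address reaches at most the default site, which is why the quoted passage calls the approach ineffective.

```python
import ipaddress
import random
import socket


def sample_ip_addresses(cidr: str, k: int, seed: int | None = None) -> list[str]:
    """Draw k addresses uniformly at random from a CIDR block (e.g. a national allocation)."""
    network = ipaddress.ip_network(cidr)
    rng = random.Random(seed)
    picks = rng.sample(range(network.num_addresses), k)
    return [str(network[i]) for i in picks]


def reachable_web_hosts(addresses: list[str], port: int = 80,
                        timeout: float = 3.0) -> list[str]:
    """Keep addresses with something listening on the web port.

    Note: one listening address may hide many name-based virtual hosts,
    since the site served depends on the Host header, not the IP address.
    """
    live = []
    for addr in addresses:
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                live.append(addr)
        except OSError:
            continue
    return live
```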
“…As shown in [22], finding a representative sample of websites is not trivial. Since it is our objective to build a local universe of web sites and our study is focused on their textual content, we decided to randomly sample 300 websites from the Open Directory Project.…”
Section: Selecting the Test Corpus (mentioning)
confidence: 99%
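Since this last study focuses on the textual content of the sampled sites, a small sketch of that follow-on step may help; the fetching and tag-stripping below are illustrative assumptions rather than the citing authors' pipeline.

```python
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def fetch_site_text(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its visible text as one string."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, "replace")
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```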