2002
DOI: 10.1108/10662240210422503

Methodologies for crawler based Web surveys

Abstract: There have been many attempts to study the content of the web, either through human or automatic agents. Five different previously used web survey methodologies are described and analysed, each justifiable in its own right, but a simple experiment is presented that demonstrates concrete differences between them. The concept of crawling the web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of dupl…
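The abstract's discussion of duplicate-page identification is truncated above, so the paper's own algorithm is not visible here. As an illustrative assumption only, a common approach in crawler surveys is to fingerprint each page's normalised content and treat pages with identical digests as duplicates; the helper names below are hypothetical.

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """Digest of a page's normalised content.

    Lower-casing and collapsing whitespace is a crude normalisation;
    a real survey might strip markup or boilerplate first (assumption).
    """
    normalised = " ".join(html.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> dict[str, str]:
    """Keep one URL per distinct fingerprint; drop exact-content duplicates."""
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for url, html in pages.items():
        fp = content_fingerprint(html)
        if fp not in seen:
            seen.add(fp)
            unique[url] = html
    return unique
```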

Cited by 50 publications (32 citation statements)
References 54 publications (76 reference statements)
“…Finding a representative sample of websites is not trivial (14). For simplicity we randomly sampled 300 websites from dmoz.org as our initial set of URLs.…”
Section: Experiments Design (mentioning)
confidence: 99%
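As a minimal sketch of the sampling step quoted above, assuming the directory (e.g. dmoz.org) has already been exported to a local one-URL-per-line listing; the file name and sample size are illustrative, not the citing authors' exact procedure.

```python
import random


def sample_seed_sites(listing_path: str, k: int = 300,
                      seed: int | None = None) -> list[str]:
    """Draw k sites uniformly at random from a one-URL-per-line directory export."""
    with open(listing_path, encoding="utf-8") as fh:
        sites = [line.strip() for line in fh if line.strip()]
    rng = random.Random(seed)
    return rng.sample(sites, k)


# Hypothetical usage with a local export of the directory:
# seeds = sample_seed_sites("dmoz_sites.txt", k=300)
```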
“…National biases in total indexed sizes for sites would thus be hard to prove, because showing that a search engine had a higher site limit for one country than another would involve finding and crawling a sample of very large sites. One paper has surveyed techniques for collecting data through crawlers, including random selection techniques (Thelwall, 2002a). It recommended random sampling of a subset of domains by domain name, which should generate relatively random samples of websites that are not dependent on commercial search engines.…”
Section: Literature Review (mentioning)
confidence: 99%
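A hedged sketch of the domain-name sampling idea attributed to Thelwall (2002a) above: draw registered domain names at random and keep only those that respond to an HTTP request. The registry listing file and the liveness probe are assumptions for illustration, not the paper's exact method.

```python
import random
import urllib.error
import urllib.request


def sample_live_domains(registry_path: str, k: int, seed: int | None = None,
                        timeout: float = 5.0) -> list[str]:
    """Randomly sample registered domain names and keep those serving a website."""
    with open(registry_path, encoding="utf-8") as fh:
        domains = [line.strip() for line in fh if line.strip()]
    rng = random.Random(seed)
    rng.shuffle(domains)

    live: list[str] = []
    for domain in domains:
        if len(live) >= k:
            break
        try:
            # Probe the site root; any HTTP response counts as "live" here.
            with urllib.request.urlopen(f"http://{domain}/", timeout=timeout):
                live.append(domain)
        except (urllib.error.URLError, ValueError, OSError):
            continue  # unreachable or malformed entries are skipped
    return live
```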
“…An alternative approach is random IP address sampling (e.g. O'Neill et al., 1997; Lawrence & Giles, 1999) within the range allocated to a country, but the virtual server facility now makes this approach ineffective (Thelwall, 2002a). The virtual server HTTP facility was introduced partly to combat the problem of an insufficient number of IP addresses to allocate one to each domain name.…”
Section: Step 1: Sampling Sites and Crawling Valid Sites (mentioning)
confidence: 99%
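To make the virtual-hosting point concrete, here is a minimal sketch of IP-address sampling within an allocated range (the CIDR block and the port probe are illustrative assumptions). Because many name-based virtual hosts share one IP address and the server selects the site from the HTTP Host header, connecting to a sampled address reaches at most the default site, which is why the quoted passage calls the approach ineffective.

```python
import ipaddress
import random
import socket


def sample_ip_addresses(cidr: str, k: int, seed: int | None = None) -> list[str]:
    """Draw k addresses uniformly at random from a CIDR block (e.g. a national allocation)."""
    network = ipaddress.ip_network(cidr)
    rng = random.Random(seed)
    picks = rng.sample(range(network.num_addresses), k)
    return [str(network[i]) for i in picks]


def reachable_web_hosts(addresses: list[str], port: int = 80,
                        timeout: float = 3.0) -> list[str]:
    """Keep addresses with something listening on the web port.

    Note: one listening address may hide many name-based virtual hosts,
    since the site served depends on the Host header, not the IP address.
    """
    live = []
    for addr in addresses:
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                live.append(addr)
        except OSError:
            continue
    return live
```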
“…As shown in [22], finding a representative sample of websites is not trivial. Since it is our objective to build a local universe of web sites and our study is focused on their textual content, we decided to randomly sample 300 websites from the Open Directory Project.…”
Section: Selecting the Test Corpus (mentioning)
confidence: 99%
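Since this last study focuses on the textual content of the sampled sites, a small sketch of that follow-on step may help; the fetching and tag-stripping below are illustrative assumptions rather than the citing authors' pipeline.

```python
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def fetch_site_text(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its visible text as one string."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, "replace")
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```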