Proceedings of the Seventeenth Conference on Hypertext and Hypermedia 2006
DOI: 10.1145/1149941.1149972

Evaluation of crawling policies for a web-repository crawler

Abstract: We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live ver…

Cited by 24 publications (18 citation statements)
References 40 publications
“…Ten repositories running EPrints were randomly selected from the Registry of Open Access Repositories (ROAR) [27]. The same methodology from previous reconstruction experiments was used to determine reconstruction success [19,16,22]: the repositories were crawled using the Heritrix web crawler [23] and then reconstructed with Warrick, and the reconstructions were compared to the crawled sites. Of course, the reconstructions were only able to recover the client-side representation of the sites and none of the server components.…”
Section: Reconstructing 10 Digital Repositories
confidence: 99%
“…For 19 weeks, the Monarch Repository was crawled with Heritrix and reconstructed with Warrick (using the Comprehensive policy, which attempts to recover all lost resources [19]) at the end of each week, as in the previous reconstruction experiment. The crawls were matched with the reconstructions to produce an accurate assessment of how much of the website was being successfully reconstructed each week.…”
Section: Setup
confidence: 99%
“…For example, we reconstructed an academic conference website that was lost due to a fire [12]. On behalf of the Library of Congress, we reconstructed a Congressman's website when it was suddenly shut down due to allegations of misconduct [8].…”
Section: Background and Related Work
confidence: 99%
“…Seven different hosts are currently deployed, allowing seven reconstructions to run concurrently. Warrick is executed using the Exhaustive policy [12], which means all four web repositories are initially asked to list all URLs they have stored for a website (we call these lister queries). This discovery process is somewhat limited for large websites, since search engines will reveal at most 1000 of the URLs they have cached.…”
Section: Reconstruction Mechanics
confidence: 99%
“…The crawler selects a URL from the frontier, downloads the web resource, collects URLs from the downloaded resource, and adds the new URLs to the frontier. The crawler proceeds in this manner until the frontier is empty or some other condition causes it to stop [15], [16]. Duplicate and near-duplicate web page detection therefore becomes an important step in web crawling [5].…”
Section: Introduction
confidence: 99%
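The frontier loop described in this citation statement is the standard crawl loop; a minimal sketch follows. The `fetch` and `extract_links` callables are hypothetical stand-ins for a real downloader and link extractor, not part of the cited paper.

```python
from collections import deque

def crawl(seed_url, fetch, extract_links, max_pages=100):
    """Minimal frontier-driven crawl loop.

    fetch(url) -> page content; extract_links(page, url) -> iterable
    of URLs found in that page (both assumed, caller-supplied).
    """
    frontier = deque([seed_url])   # URLs waiting to be crawled
    seen = {seed_url}              # duplicate detection on exact URLs
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()           # select a URL from the frontier
        page = fetch(url)                  # download the web resource
        pages[url] = page
        for link in extract_links(page, url):  # collect URLs from the page
            if link not in seen:           # skip already-seen URLs
                seen.add(link)
                frontier.append(link)      # add new URLs to the frontier
    return pages
```

The `seen` set handles only exact-duplicate URLs; detecting near-duplicate *pages*, as the statement notes, requires content-level comparison (e.g. fingerprinting) on top of this loop.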