2003
DOI: 10.1145/958942.958945

Effective page refresh policies for Web crawlers

Abstract: In this article, we study how we can maintain local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to maintain the copies up-to-date. Since polling the sources takes significant time and r…
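To make the setting in the abstract concrete, here is a minimal sketch of the polling loop it describes: the crawler refetches each page at a fixed interval and compares checksums to detect changes, since the sources send no notifications. The URLs, interval, and helper names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (illustrative, not the paper's crawler): keep local copies of
# remote pages fresh by polling them at a fixed interval, because the remote
# sources do not notify us of changes.
import hashlib
import time
import urllib.request

def fetch(url: str) -> bytes:
    """Download the current remote copy of a page."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def poll_forever(urls: list[str], interval_s: float) -> None:
    local: dict[str, bytes] = {}    # local copies kept by the crawler
    digests: dict[str, str] = {}    # checksums used to detect changes
    while True:
        for url in urls:
            content = fetch(url)
            digest = hashlib.sha256(content).hexdigest()
            if digests.get(url) != digest:   # remote page changed since last poll
                local[url] = content
                digests[url] = digest
        time.sleep(interval_s)               # re-poll after a fixed interval
```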

Cited by 204 publications (192 citation statements)
References 28 publications
“…In measurement literature, the majority of effort was spent on the behavior of web pages, including analysis of server logs [27], page-modification frequency during crawling [2], [4], [17], [24], RSS feed dynamics [34], and content change between consecutive observations [1], [14], [26]. Problems related to estimation of F_U(x) have also emerged in prediction of future updates [5], [6], [13], [18], [31], [38], with a good survey in [25], and user lifetime measurement in decentralized P2P networks [3], [33], [37], [40].…”
Section: Related Work
confidence: 99%
“…Junghoo Cho and Hector Garcia-Molina, in 'Effective Page Refresh Policies for Web Crawlers' [8], study how to maintain local copies of remote data sources fresh when the source data is updated autonomously and independently. In particular, the authors study the problem of a web crawler that maintains local copies of remote web pages for web search engines.…”
Section: Related Work
confidence: 99%
“…In [8], Cho et al. estimate the frequency of page changes based on a Poisson process. In other studies [6,7], they propose efficient policies to improve the freshness of web pages. In [9], they propose a crawl strategy to download the most important pages first, based on different metrics (e.g., similarity between pages and queries, rank of a page, etc.).…”
Section: Related Work
confidence: 99%
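The statement above refers to estimating a page's change frequency under a Poisson model. A minimal sketch of one such estimator follows, assuming the page is polled at regular intervals and that each poll only reveals whether the page changed since the previous poll; the estimator shown, lambda_hat = -ln(1 - X/n) / I, is a standard one for this observation model and is not necessarily the exact estimator used in [8].

```python
# Minimal sketch (an assumption-laden illustration, not the paper's exact
# estimator): estimate a page's change rate lambda under a Poisson change
# model from n equally spaced polls, X of which detected a change.
import math

def estimate_change_rate(n_polls: int, n_changes_detected: int,
                         poll_interval: float) -> float:
    """Return lambda_hat, the estimated number of changes per unit time.

    Under a Poisson model, the chance that one poll interval contains at
    least one change is p = 1 - exp(-lambda * I), so lambda = -ln(1 - p) / I.
    Plugging in p_hat = X / n gives the estimator below.
    """
    if n_changes_detected >= n_polls:
        raise ValueError("every poll saw a change; this naive estimator diverges")
    p_hat = n_changes_detected / n_polls
    return -math.log(1.0 - p_hat) / poll_interval

# Example: 30 daily polls, 10 of which found the page changed.
# Prints roughly 0.405 changes per day.
print(estimate_change_rate(30, 10, 1.0))
```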
“…Frequency [7] selects pages to be archived according to their frequency of changes estimated by the Poisson model [8]. Hot pages that change too often are penalized to maximize the freshness of pages.…”
Section: Pattern-based Web Crawling
confidence: 99%
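The penalization of hot pages mentioned above follows from the expected-freshness formula under a Poisson change model: a page with change rate lam that is refreshed every I time units is fresh for an expected fraction (1 - exp(-lam*I)) / (lam*I) of the time. The sketch below uses that formula with a hypothetical greedy budget allocation (not the paper's algorithm) to show why pages that change too often end up with few refreshes.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): under a Poisson
# change model, a page with change rate lam refreshed every I time units has
# expected freshness (1 - exp(-lam * I)) / (lam * I). Greedily spending a
# fixed refresh budget on the page with the largest marginal gain shows why
# very hot pages are penalized: extra refreshes barely help them.
import math

def expected_freshness(lam: float, refreshes_per_day: float) -> float:
    """Expected fraction of time the local copy is up to date."""
    if refreshes_per_day <= 0:
        return 0.0
    interval = 1.0 / refreshes_per_day
    return (1.0 - math.exp(-lam * interval)) / (lam * interval)

def allocate_refreshes(change_rates: list[float], budget: int) -> list[int]:
    """Assign `budget` refreshes per day across pages by marginal freshness gain."""
    refreshes = [0] * len(change_rates)
    for _ in range(budget):
        gains = [
            expected_freshness(lam, r + 1) - expected_freshness(lam, r)
            for lam, r in zip(change_rates, refreshes)
        ]
        refreshes[gains.index(max(gains))] += 1
    return refreshes

# Pages changing 0.1, 1, and 50 times per day; 8 refreshes per day in total.
# Prints [2, 5, 1]: the page changing 50 times a day gets only one refresh,
# because refreshing it barely improves overall freshness.
print(allocate_refreshes([0.1, 1.0, 50.0], 8))
```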