1998
DOI: 10.1016/s0169-7552(98)00108-1
Efficient crawling through URL ordering

Cited by 570 publications (296 citation statements)
References 5 publications
“…A few systems that gather specialized content have been very successful. Cho et al compare several crawl ordering schemes based on link degree, perceived prestige, and keyword matches on the Stanford University Web [12]. Terveen and Hill use similar techniques to discover related "clans" of Web pages [30]. (See the press articles archived at http://www.cs.berkeley.edu/~soumen/focus/.)…”
Section: Related Work
confidence: 99%
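The crawl ordering schemes mentioned above all reduce to visiting the frontier in priority order under some importance metric. A minimal sketch, assuming a generic `score` function standing in for any of the metrics (backlink degree, prestige, keyword match) rather than the paper's exact formulas:

```python
import heapq

def crawl_ordered(seeds, fetch, score, limit=100):
    """Visit URLs highest-score-first. `fetch(url)` returns the
    page's outlinks; `score(url)` is any importance metric
    (backlink count, PageRank, keyword match) -- an illustrative
    assumption, not a specific metric from the paper."""
    # heapq is a min-heap, so negate scores for max-first order.
    frontier = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = {u for _, u in frontier}
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for out in fetch(url):
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-score(out), out))
    return visited
```

With backlink count as the score, a page linked from two places is fetched before one linked from a single place, which is the basic behavior the ordering-scheme comparison evaluates.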
“…In other studies [6,7], they propose efficient policies to improve the freshness of web pages. In [9], they propose a crawl strategy to download the most important pages first based on different metrics (e.g., similarity between pages and queries, rank of a page, etc.). The research of Castillo et al [5] goes in the same direction.…”
Section: Related Work
confidence: 99%
“…We start by describing related strategies considered in this work: Relevance [9] downloads the most important pages (i.e., based on PageRank) first, in a fixed order. Frequency [7] selects pages to be archived according to their frequency of change, estimated by the Poisson model [8].…”
Section: Pattern-Based Web Crawling
confidence: 99%
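The Poisson change model referenced here admits a simple closed-form rate estimator. A minimal sketch, assuming `polls` equally spaced visits of which `changes_detected` found the page changed (the exact estimator in [8] differs in its bias corrections):

```python
import math

def poisson_change_rate(polls, changes_detected, interval):
    """Estimate a page's change rate (changes per unit time) from
    `polls` visits spaced `interval` apart, where a change was
    detected on `changes_detected` of them.

    Under a Poisson change process, the probability of no change
    within one interval is exp(-rate * interval), so the fraction
    of unchanged polls (n - x)/n yields
    rate = -ln((n - x)/n) / interval.
    """
    n, x = polls, changes_detected
    if x >= n:
        raise ValueError("estimator undefined when every poll saw a change")
    return -math.log((n - x) / n) / interval
```

A crawler can then revisit pages in decreasing order of the estimated rate, which is what frequency-driven archiving strategies do.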
“…The effect of exploiting other hypertext features, such as segmenting the Document Object Model (DOM) tag-trees that characterise a web document, together with a fine-grained topic distillation technique that combines this information with HITS, is studied in [20]. Keyword-sensitive crawling strategies such as URL string analysis and other location metrics are investigated in [21]. An intelligent crawler that can adapt its link-extraction strategy online through a self-learning mechanism is discussed in [22].…”
Section: Related Work in Focused Crawling
confidence: 99%
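The URL string analysis and location metrics mentioned in [21] can be illustrated with a toy scoring function. A minimal sketch, where both the keyword bonus and the path-depth penalty weights are assumptions for illustration, not values from the cited work:

```python
def url_location_score(url, keywords):
    """Score a URL before fetching it, using only the URL string:
    keyword occurrences in the URL plus a location metric that
    prefers shallow paths (fewer slashes, closer to the site root).
    The weights 2.0 and 0.5 are illustrative assumptions."""
    u = url.lower()
    keyword_hits = sum(1 for k in keywords if k in u)
    # Strip the scheme, then count path separators as a depth proxy.
    depth = u.split("://", 1)[-1].count("/")
    return 2.0 * keyword_hits - 0.5 * depth
```

A keyword-sensitive crawler would enqueue extracted links ordered by such a score, so on-topic, shallow URLs are fetched before deep, unrelated ones.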