1994
DOI: 10.1016/0169-7552(94)90132-5

Information retrieval in the World-Wide Web: Making client-based searching feasible

Cited by 153 publications (62 citation statements)
References 1 publication

“…Crawlers and agents have grown more sophisticated [11]. To our knowledge the earliest example of using a query to direct a limited Web crawl is the Fish Search system [14]. Similar results are reported for the WebCrawler [11, chapter 4], Shark Search [17], and by Chen et al. [10].…”
Section: Related Work (supporting)
confidence: 60%
“…The main challenges in focused crawling relate to the prioritization of URLs not yet visited, which may be based on similarity measures [24,26], hyperlink distance-based limits [30,31], or combinations of text and hyperlink analysis with Latent Semantic Indexing (LSI) [32]. Machine learning approaches, including naïve Bayes classifiers [25,33], Hidden Markov Models [34], reinforcement learning [35], genetic algorithms [36], and neural networks [37], have also been applied to prioritize the unvisited URLs.…”
Section: Focused and Deep-web Crawling (mentioning)
confidence: 99%
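The prioritization described in the excerpt above can be made concrete with a small sketch. The following is a minimal best-first crawler frontier, not the method of any of the cited papers: unvisited URLs sit in a priority queue keyed on an estimated relevance score, here an illustrative bag-of-words cosine similarity between anchor text, the parent page, and the topic query. The `fetch_page` and `extract_links` callables, the 0.5/0.5 weighting, and the scoring function are all assumptions made for illustration.

```python
import heapq
import itertools
from collections import Counter
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two strings (toy relevance measure)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def focused_crawl(seeds, topic_query, fetch_page, extract_links, max_pages=100):
    """Best-first crawl: always expand the unvisited URL with the highest
    estimated relevance to `topic_query`. `fetch_page(url)` -> page text and
    `extract_links(url, text)` -> list of (link_url, anchor_text) pairs are
    assumed to be supplied by the caller."""
    counter = itertools.count()  # tie-breaker so equal scores never compare URLs
    frontier = [(-1.0, next(counter), url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    visited, results = set(), []

    while frontier and len(results) < max_pages:
        neg_score, _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text = fetch_page(url)
        page_score = cosine_similarity(text, topic_query)
        results.append((url, page_score))
        for link_url, anchor_text in extract_links(url, text):
            if link_url not in visited:
                # Prioritise by anchor-text similarity, boosted by the parent page's score.
                est = 0.5 * cosine_similarity(anchor_text, topic_query) + 0.5 * page_score
                heapq.heappush(frontier, (-est, next(counter), link_url))
    return results
```

The machine-learning variants mentioned in the excerpt (naïve Bayes, HMMs, reinforcement learning, and so on) would replace the hand-crafted `est` expression with a learned estimate of a link's expected payoff; the queue-driven structure stays the same.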
“…In some early works on the subject of focused collection of data from the web, web crawling was simulated by a group of fish migrating on the Web [3]. In the so-called fish search, each URL corresponds to a fish whose survivability is dependent on visited page relevance and remote server speed.…”
Section: Related Work (mentioning)
confidence: 99%
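As a rough illustration of the fish metaphor in the excerpt above, the sketch below treats each frontier URL as a fish carrying a remaining energy (depth) that is replenished when a relevant page is found and decays otherwise. The original system's penalty for slow remote servers is omitted, and the constants, `is_relevant`, `fetch_page`, and `extract_links` are illustrative assumptions rather than the algorithm as published.

```python
from collections import deque


def fish_search(seed_urls, is_relevant, fetch_page, extract_links,
                initial_depth=3, width=5, max_pages=100):
    """Toy fish-search-style crawl: each URL carries a remaining 'depth'
    (the fish's energy). Fish spawned from a relevant page regain full depth;
    fish spawned from irrelevant pages lose energy and eventually die off.
    `is_relevant(text)` -> bool, `fetch_page(url)` -> page text, and
    `extract_links(url, text)` -> list of URLs are assumed to be supplied
    by the caller; the constants are illustrative, not the published values."""
    frontier = deque((url, initial_depth) for url in seed_urls)
    visited, hits = set(), []

    while frontier and len(visited) < max_pages:
        url, depth = frontier.popleft()
        if url in visited or depth <= 0:
            continue  # dead fish or already-explored page
        visited.add(url)
        text = fetch_page(url)
        relevant = is_relevant(text)
        if relevant:
            hits.append(url)
        # Children of a relevant page start with full energy; otherwise energy decays.
        child_depth = initial_depth if relevant else depth - 1
        for link in extract_links(url, text)[:width]:  # limit offspring per fish
            if link not in visited:
                frontier.append((link, child_depth))
    return hits
```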
“…In the late 1990s, AltaVista's crawler, called Scooter, ran on an AlphaServer 4100 5/300 with four 533 MHz processors, 1.5 GB of memory, a 30 GB RAID disk, and 1 GB/s I/O bandwidth. In spite of these heroic efforts with high-end multiprocessors and clever crawling software, the largest crawls cover only 30-40% of the web, and refreshes take weeks to a month. The Web in many ways resembles a social network: links do not point to pages at random but reflect the page authors' idea of what other relevant or interesting pages exist.…”
Section: Introduction (mentioning)
confidence: 99%