2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
DOI: 10.1109/wiiat.2008.234

A Scalable Lightweight Distributed Crawler for Crawling with Limited Resources

Abstract: Web page crawlers are an essential component in a number of Web applications. The sheer size of the Internet can pose problems in the design of Web crawlers. All currently known crawlers implement approximations or have limitations so as to maximize the throughput of the crawl and, hence, maximize the number of pages that can be retrieved within a given time frame. This paper proposes a distributed crawling concept which is designed to avoid approximations, to limit the network overhead, and to run on relatively inexpensive hardware.
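The abstract is the only technical description reproduced on this page, so the following is a minimal Python sketch, not the authors' implementation, of the one design goal that is concrete here: avoiding approximations. A common reading (an assumption on our part) is that "approximations" refers to approximate URL-seen tests such as Bloom filters; the hypothetical Frontier class below instead keeps an exact visited set at the central node.

    # Minimal sketch (assumed design, not the paper's code): a crawl
    # frontier held at a central node that deduplicates URLs exactly,
    # rather than with an approximate structure such as a Bloom filter.
    from collections import deque

    class Frontier:
        """Exact-deduplication URL frontier; all names are hypothetical."""

        def __init__(self, seeds):
            self._queue = deque(seeds)
            self._seen = set(seeds)  # exact membership: no false positives

        def enqueue(self, url):
            # Admit each URL at most once, so no page is fetched twice and
            # no page is wrongly skipped (as an approximate filter might).
            if url not in self._seen:
                self._seen.add(url)
                self._queue.append(url)

        def next_url(self):
            return self._queue.popleft() if self._queue else None

    frontier = Frontier(["http://example.com/"])
    frontier.enqueue("http://example.com/about")
    frontier.enqueue("http://example.com/about")  # duplicate: ignored
    print(frontier.next_url())                    # -> http://example.com/

The trade-off this illustrates is memory for exactness: the set grows with the crawl, which is why throughput-oriented crawlers accept approximate filters and why a resource-limited design must bound its crawl scope instead.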

Cited by 10 publications (5 citation statements, published 2009–2021) | References 6 publications
“…Web crawlers are distributed over different systems, allowing them to operate independently. Distributed crawling techniques were introduced due to the inherent problems faced by centralized crawling solutions, such as reduced throughput of the crawl and link congestion [59]. Figure 7 denotes the general arrangement of components in a distributed crawling system.…”
Section: Distributed (mentioning) | confidence: 99%
“…Kc et al. [59] have introduced LiDi Crawl (which stands for Lightweight Distributed Crawler), a centralized crawling application with limited resources. It consists of a central node and several individual crawling components.…”
Section: Web Crawlers and Crawling Techniques (mentioning) | confidence: 99%
“…Numerous projects offer different strategies for workload distribution among crawling nodes. For example, the LiDi Crawl proposal offers two strategies for performing the crawling task: 1) distributed crawlers get a fixed set of URLs which they must work on; and 2) distributed crawlers get initial seed pages they must start from [10]. The first strategy gives more control (and workload) to the master node, while the second would give more decision power to the remote crawlers but requires an extra coordination mechanism to reduce duplicated crawling efforts from the different distributed (slave) crawlers.…”
Section: Related Research (mentioning) | confidence: 99%
“…As [1] presented, a strategy using a Bayesian object to build a vertical search engine can keep the user's interests focused on the topic. [7] implemented a topic spider system which used a vector space model to calculate the relationship of web pages and a modified shark-search strategy to determine the visiting order of hyperlinks waiting to be crawled. Building on [8], [9], and [10], which focused on improving crawling efficiency, [11] proposes a distributed crawling concept which is designed to avoid approximations, to limit the network overhead, and to run on relatively inexpensive hardware. [12] implements a TSVS (Time Sensitive Vertical Search Engine) prototype named Velocisaurus that focused on time-critical airfare discount information search to investigate the time-critical requirements for vertical search, and proposes a QTC (query triggered crawling) strategy to coordinate the crawling systems by real-time user queries.…”
Section: Literature Review (mentioning) | confidence: 99%