2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
DOI: 10.1109/wiiat.2008.234

A Scalable Lightweight Distributed Crawler for Crawling with Limited Resources

Abstract: Web page crawlers are an essential component in a number of Web applications. The sheer size of the Internet can pose problems in the design of Web crawlers. All currently known crawlers implement approximations or have limitations so as to maximize the throughput of the crawl and, hence, maximize the number of pages that can be retrieved within a given time frame. This paper proposes a distributed crawling concept which is designed to avoid approximations, to limit the network overhead, and to run on relatively inexpensive hardware.
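The abstract is the only technical description reproduced on this page, so the following is a minimal Python sketch, not the authors' implementation, of the one design goal that is concrete here: avoiding approximations. A common reading (an assumption on our part) is that "approximations" refers to approximate URL-seen tests such as Bloom filters; the hypothetical Frontier class below instead keeps an exact visited set at the central node.

    # Minimal sketch (assumed design, not the paper's code): a crawl
    # frontier held at a central node that deduplicates URLs exactly,
    # rather than with an approximate structure such as a Bloom filter.
    from collections import deque

    class Frontier:
        """Exact-deduplication URL frontier; all names are hypothetical."""

        def __init__(self, seeds):
            self._queue = deque(seeds)
            self._seen = set(seeds)  # exact membership: no false positives

        def enqueue(self, url):
            # Admit each URL at most once, so no page is fetched twice and
            # no page is wrongly skipped (as an approximate filter might).
            if url not in self._seen:
                self._seen.add(url)
                self._queue.append(url)

        def next_url(self):
            return self._queue.popleft() if self._queue else None

    frontier = Frontier(["http://example.com/"])
    frontier.enqueue("http://example.com/about")
    frontier.enqueue("http://example.com/about")  # duplicate: ignored
    print(frontier.next_url())                    # -> http://example.com/

The trade-off this illustrates is memory for exactness: the set grows with the crawl, which is why throughput-oriented crawlers accept approximate filters and why a resource-limited design must bound its crawl scope instead.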

Cited by 10 publications (5 citation statements, published 2009–2021) | References 6 publications
“…Web crawlers are distributed over different systems, allowing them to operate independently. Distributed crawling techniques were introduced due to the inherent problems faced by centralized crawling solutions, such as reduced throughput of the crawl and link congestion [59]. Figure 7 denotes the general arrangement of components in a distributed crawling system.…”
Section: Distributed (mentioning) | confidence: 99%
“…Kc et al. [59] have introduced LiDi Crawl (which stands for Lightweight Distributed Crawler), a centralized crawling application with limited resources. It consists of a central node and several individual crawling components.…”
Section: Web Crawlers and Crawling Techniques (mentioning) | confidence: 99%
“…Numerous projects offer different strategies for workload distribution among crawling nodes. For example, the LiDi Crawl proposal offers two strategies for performing the crawling task: 1) distributed crawlers get a fixed set of URLs which they must work on; and 2) distributed crawlers get initial seed pages they must start from [10]. The first strategy gives more control (and workload) to the master node, while the second would give more decision power to the remote crawlers but requires an extra coordination mechanism to reduce duplicated crawling efforts from the different distributed (slave) crawlers.…”
Section: Related Research (mentioning) | confidence: 99%
“…As [1] presented, a strategy using a Bayesian object to build a vertical search engine can keep the user's interests focused on the topic. [7] implemented a topic spider system which used a vector space model to calculate the relationship of web pages and a modified shark-search strategy to determine the visiting order of hyperlinks waiting to be crawled. Building on [8], [9], and [10], which focused on improving crawling efficiency, [11] proposes a distributed crawling concept which is designed to avoid approximations, to limit the network overhead, and to run on relatively inexpensive hardware. [12] implements a TSVS (Time Sensitive Vertical Search Engine) prototype named Velocisaurus that focused on time-critical airfare discount information search to investigate the time-critical requirements for vertical search, and proposes a QTC (query triggered crawling) strategy to coordinate the crawling systems by real-time user queries.…”
Section: Literature Review (mentioning) | confidence: 99%