2004
DOI: 10.1002/spe.587

UbiCrawler: a scalable fully distributed Web crawler

Abstract: We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and more in general the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some limitations of the Ja…
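
The abstract names consistent hashing as the basis of the assignment function that partitions hosts among crawling agents. The following is a minimal sketch of that general technique, not UbiCrawler's actual implementation: the class name `ConsistentHashRing`, the agent names, the use of SHA-1, and the replica count of 100 are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit ring position (SHA-1 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical host-to-agent assignment using a consistent-hash ring."""

    def __init__(self, agents, replicas=100):
        self._replicas = replicas  # ring points per agent (assumed value)
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> agent name
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        # Place several pseudo-random points per agent to even out the load.
        for i in range(self._replicas):
            point = _hash(f"{agent}#{i}")
            if point not in self._owner:
                bisect.insort(self._points, point)
                self._owner[point] = agent

    def remove(self, agent):
        # Drop only this agent's points; every other point stays in place.
        self._points = [p for p in self._points if self._owner[p] != agent]
        self._owner = {p: a for p, a in self._owner.items() if a != agent}

    def assign(self, host):
        # A host belongs to the first ring point after its own hash (wrapping).
        if not self._points:
            raise ValueError("no agents available")
        i = bisect.bisect_right(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[i]]
```

The property that matters for graceful degradation is that removing an agent reassigns only the hosts that agent owned; every other host keeps its previous owner, so surviving agents do not have to exchange state.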

Cited by 401 publications (247 citation statements)
References: 15 publications
“…[3] first proposed a series of basic concepts on parallel crawling and distributed crawling, including classification methods, evaluation metrics and so on. After 2003, several systems were proposed: UbiCrawler [6] is the first crawling system announced to be deployed on WAN. The consistent hashing [7] method it adopts guarantees the load balancing among crawlers.…”
Section: Related Work | mentioning
confidence: 99%
“…The ordinates are calculated through Eq. (6). N represents the number of Web hosts crawled by the system; D_i represents the total data size downloaded from host i; T_i represents the total download time of host i which is fixed to 15 minutes.…”
Section: Rtt-best (Rtt-top-1) | mentioning
confidence: 99%
“…The execution time (ms) of each Web host is mapped to steps by dividing the execution time by 2·10^4. For example, assuming that for one Web page the RTT = 100 ms, DLT = 100 ms, PWT = 100 ms, the total download time (execution time) of the Web host (or piece) with 10000 pages is (100+100+100)·10000 = 3·10^6. Then the execution steps is 3·10^6 / (2·10^4) = 150.…”
Section: Simulation Setups | mentioning
confidence: 99%
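
The arithmetic in the statement above can be checked with a short sketch. The function name `execution_steps` and its parameter names are hypothetical; only the per-page times, the page count, and the divisor 2·10^4 come from the quoted text.

```python
def execution_steps(rtt_ms, dlt_ms, pwt_ms, pages, ms_per_step=2 * 10**4):
    """Convert a host's total execution time (ms) into simulation steps."""
    # Total time is the per-page cost (RTT + DLT + PWT) times the page count.
    total_ms = (rtt_ms + dlt_ms + pwt_ms) * pages
    return total_ms / ms_per_step


# The quoted example: 100 ms each for RTT, DLT and PWT, over 10000 pages.
print(execution_steps(100, 100, 100, 10_000))  # 150.0
```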