2010 14th Panhellenic Conference on Informatics
DOI: 10.1109/pci.2010.47

Parallel Collection of Live Data Using Hadoop

Abstract: Hadoop is a fault-tolerant Java framework that supports data distribution and process parallelization on commodity hardware. Exploiting the scalability it provides and the independence of task execution, we combined Hadoop with crawling techniques to implement various applications that deal with large amounts of data. Our experiments show that Hadoop is a very useful and trustworthy tool for creating distributed programs that perform better in terms of computational efficiency.
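The abstract describes combining Hadoop's MapReduce-style parallelization with crawling. As a rough illustration of the pattern involved (not the paper's actual implementation, which runs on a Hadoop cluster), the sketch below simulates the two phases in-process over hypothetical crawled pages: the map phase emits (word, 1) pairs from each page's text, and the reduce phase aggregates counts across all map outputs.

```python
from collections import defaultdict

# Minimal in-process simulation of the MapReduce pattern Hadoop provides.
# In a real deployment each map task runs on a separate node against a
# split of the crawled input; Hadoop handles distribution and failures.

def map_phase(pages):
    """Map: emit (word, 1) pairs from each crawled page's text."""
    for url, text in pages:
        for word in text.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts per word across all map outputs."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Hypothetical crawled input (URL, extracted text) pairs.
pages = [
    ("http://example.com/a", "hadoop scales with commodity hardware"),
    ("http://example.com/b", "hadoop tolerates hardware failures"),
]
result = reduce_phase(map_phase(pages))
print(result["hadoop"])  # -> 2, the word appears once on each page
```

Because map tasks are independent, crawling and parsing can proceed in parallel across the cluster, which is the scalability property the abstract relies on.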

Cited by 8 publications (6 citation statements)
References 6 publications
“…The authors in a previous study used Hadoop for collecting large amounts of live data. They explained how combining Hadoop with crawling programs could improve the efficiency of Big Data computations.…”
Section: Related Work
confidence: 99%
“…The authors in a previous study [11] … with the number of machines in the cluster. The peculiarity of Hadoop is that it can handle different types of data stored in any kind of infrastructure.…”
Section: Related Work
confidence: 99%
“…Users can make full use of the cluster's high-speed computing and powerful storage capacity without needing to know the underlying details of the distributed framework. It is obvious that Hadoop provides a solution to the problem of massive data storage and processing [3]. Figure 2.…”
Section: Data Analysis Based On Micro Service Framework
confidence: 99%
“…It hides the "messy" details of parallelization, allowing even inexperienced programmers to easily utilize the resources of a large distributed system. Although Hadoop is written in Java, Hadoop Streaming allows jobs to be implemented in any programming language [10]. HDFS is a highly fault-tolerant distributed file system designed to run on low-cost commodity hardware.…”
Section: Hadoop Distributed File System (HDFS)
confidence: 99%
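The statement above notes that Hadoop Streaming accepts jobs in any language, because Streaming only requires a mapper and reducer that read lines from stdin and write tab-separated key/value records to stdout. The following is a minimal word-count sketch of that contract in Python; the input strings are illustrative, and the in-process chaining at the end stands in for Hadoop's own sort-and-shuffle between the two stages.

```python
def mapper(lines):
    """Streaming mapper: read raw input lines, emit 'word<TAB>1' records."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: Hadoop sorts mapper output by key, so counts
    for the same word arrive contiguously and can be summed in one pass."""
    current, total = None, 0
    for line in lines:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(n)
    if current is not None:
        yield f"{current}\t{total}"

# In a real job Hadoop pipes HDFS input splits into the mapper over stdin
# and the sorted mapper output into the reducer; here we chain them directly.
mapped = list(mapper(["hadoop runs anywhere", "hadoop scales"]))
reduced = list(reducer(sorted(mapped)))
print(reduced)
```

Because the protocol is just line-oriented stdin/stdout, the same two stages could equally be shell, C, or Perl executables passed to the streaming jar via its `-mapper` and `-reducer` options.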