Summary
Big Data has become one of the major areas of research for cloud service providers, due to the vast amount of data produced every day and the inability of traditional algorithms and technologies to handle it. Big Data, with its defining characteristics of volume, velocity, and variety (the 3Vs), requires efficient technologies for real-time processing. To process and analyze this vast amount of data, powerful tools such as Hadoop and Spark are widely used; both follow the principles of parallel computing. The challenge is to determine which Big Data tool is better suited to a given processing context. In this paper, we present and discuss a performance comparison between two popular Big Data frameworks deployed on virtual machines. Hadoop MapReduce and Apache Spark both process vast amounts of data in parallel and distributed mode on large clusters, and both are well suited to Big Data processing. We also present execution results for Apache Hadoop on Amazon EC2, a major cloud computing environment. To compare the performance of the two frameworks, we use the HiBench benchmark suite, an experimental approach for measuring the performance of a computer system. The comparison is based on three criteria: execution time, throughput, and speedup. We run the WordCount workload with different data sizes to obtain more accurate results. Our experimental results show that the performance of these frameworks varies significantly with the use case. From our results we conclude that Spark is more efficient than Hadoop at handling large amounts of data in most cases. However, Spark requires more memory, since it loads the data to be processed into memory and keeps it cached for a while, much like a standard database. The choice therefore depends on the required performance level and the available memory.
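The WordCount workload used in the comparison counts word occurrences over a text corpus using the map/reduce pattern. As an illustration only (plain Python rather than the actual Hadoop or Spark implementation, with hypothetical input and metric values), a minimal sketch of the pattern and of the throughput criterion is:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word, as in MapReduce WordCount.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reducer: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big clusters", "data processing"]
result = reduce_phase(map_phase(lines))

# Throughput, one of the three comparison criteria: bytes processed per
# second of execution time (the numbers below are hypothetical).
data_size_bytes = 1_000_000
execution_time_s = 2.0
throughput = data_size_bytes / execution_time_s
```

In Hadoop, the two phases run as separate mapper and reducer tasks with a disk-backed shuffle in between; in Spark, the same logic is typically expressed as `flatMap`/`reduceByKey` over an in-memory RDD, which is what lets Spark avoid intermediate disk writes at the cost of the higher memory allocation noted above.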
Cloud computing is one of the most widespread platforms for executing tasks on virtual machines as processing elements. However, various issues must be addressed before it can be used efficiently for workflow applications. One of the fundamental issues in cloud computing is task scheduling. Optimal task scheduling in cloud computing is an NP-complete optimization problem, and many algorithms have been proposed to solve it. However, existing algorithms fail either to meet the user's Quality of Service (QoS) requirements, such as minimizing the makespan and satisfying budget constraints, or to incorporate basic principles of cloud computing such as elasticity and heterogeneity of computing resources. Among these algorithms, the Heterogeneous Earliest Finish Time (HEFT) heuristic is known to give good results in a short time for task scheduling in heterogeneous systems. The HEFT algorithm generally yields good task execution times, but its drawback is the lack of load balancing. In this paper, an Enhanced Heterogeneous Earliest Finish Time (E-HEFT) algorithm under a user-specified financial constraint is proposed to achieve a well-balanced load across the virtual machines while minimizing the makespan of a given workflow application. To evaluate its performance, we compare our algorithm with several existing scheduling algorithms. Experimental results show that our algorithm outperforms the others, reducing the makespan and improving the load balance among virtual machines.
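To make the HEFT baseline concrete, the sketch below implements a simplified, non-insertion variant in plain Python: tasks are prioritized by upward rank, then each task is greedily placed on the VM giving the earliest finish time. All task names and cost values are hypothetical, and this is the classic HEFT heuristic, not the E-HEFT enhancement proposed in the paper:

```python
def upward_rank(task, succ, w_mean, c_mean, memo):
    # rank_u(t) = mean execution cost of t
    #           + max over successors s of (mean comm cost + rank_u(s))
    if task in memo:
        return memo[task]
    best = 0.0
    for s in succ.get(task, []):
        best = max(best, c_mean.get((task, s), 0.0)
                   + upward_rank(s, succ, w_mean, c_mean, memo))
    memo[task] = w_mean[task] + best
    return memo[task]

def heft_schedule(tasks, succ, w, comm):
    """HEFT-style list scheduling (simplified, non-insertion variant).

    tasks : task ids
    succ  : task -> list of successor tasks (the workflow DAG)
    w     : task -> list of execution times, one entry per VM
    comm  : (t1, t2) -> data transfer time when placed on different VMs
    """
    n_vms = len(next(iter(w.values())))
    preds = {t: [] for t in tasks}
    for t, ss in succ.items():
        for s in ss:
            preds[s].append(t)
    # Priority phase: rank tasks by upward rank, highest first.
    w_mean = {t: sum(w[t]) / n_vms for t in tasks}
    ranks = {}
    for t in tasks:
        upward_rank(t, succ, w_mean, comm, ranks)
    order = sorted(tasks, key=lambda t: -ranks[t])
    # Placement phase: assign each task to the VM with the earliest finish time.
    avail = [0.0] * n_vms          # time at which each VM becomes free
    aft, placed = {}, {}           # actual finish time and VM of each task
    for t in order:
        best_vm, best_eft = 0, float("inf")
        for vm in range(n_vms):
            ready = max((aft[p] + (comm.get((p, t), 0.0) if placed[p] != vm else 0.0)
                         for p in preds[t]), default=0.0)
            eft = max(avail[vm], ready) + w[t][vm]
            if eft < best_eft:
                best_vm, best_eft = vm, eft
        avail[best_vm] = best_eft
        aft[t], placed[t] = best_eft, best_vm
    return placed, max(aft.values())

# Hypothetical 3-task workflow scheduled on 2 VMs.
tasks = ["A", "B", "C"]
succ = {"A": ["B", "C"]}
w = {"A": [2, 3], "B": [3, 1], "C": [2, 2]}
comm = {("A", "B"): 1, ("A", "C"): 1}
placed, makespan = heft_schedule(tasks, succ, w, comm)
```

The placement phase above minimizes finish time only, so a fast VM can end up with most of the tasks; that is exactly the load-imbalance drawback the E-HEFT enhancement is designed to address.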