An Enhanced Data-Locality-Aware Task Scheduling Algorithm for Hadoop Applications

Choi, Dongjoo; Jeon, Myunghoon; Kim, Namgi; Lee, Byoung-Dai

doi:10.1109/jsyst.2017.2764481

Cited by 16 publications

(10 citation statements)

References 12 publications


“…The advantage of the improved fair scheduling scheme is its efficiency in producing throughput for datasets of variable size; however, the disadvantages are that long jobs can slow the algorithm and cause overloading issues at a node. Authors in [29] proposed a data locality-aware enhanced task scheduling algorithm was proposed to improve the job completion time when an input split consists of multiple data blocks that are distributed and stored in different nodes, this data location method fails to cope with the degradation in processing performance due to the increased frequency of data block copying. To solve this issue, authors have proposed a task scheduling algorithm by defining a method to classify data locality taking into account the location of all data blocks that comprise an input split, categorizing tasks based on the defined method, and sequentially assigning tasks according to a given priority.…”

Section: Problem Statement (mentioning)

confidence: 99%
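
The classification step described in the statement above can be sketched roughly as follows. This is a minimal illustration only, not the algorithm from the cited paper: the three locality classes, the tie-breaking rule, and the names `Split`, `local_fraction`, `locality_class`, and `pick_task` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """An input split made of several blocks (hypothetical structure)."""
    task_id: str
    # One entry per data block: the set of nodes holding a replica of it.
    block_nodes: tuple


def local_fraction(split, node):
    """Fraction of the split's blocks that have a replica on `node`."""
    local = sum(1 for replicas in split.block_nodes if node in replicas)
    return local / len(split.block_nodes)


def locality_class(split, node):
    """Assumed classes: 0 = all blocks local, 1 = some local, 2 = none."""
    frac = local_fraction(split, node)
    if frac == 1.0:
        return 0
    if frac > 0.0:
        return 1
    return 2


def pick_task(pending, node):
    """Pick the pending split with the best locality class for `node`.

    Ties are broken by the larger local fraction, then by queue order.
    """
    if not pending:
        return None
    return min(
        pending,
        key=lambda s: (locality_class(s, node), -local_fraction(s, node)),
    )


pending = [
    Split("t1", (frozenset({"n2"}), frozenset({"n3"}))),
    Split("t2", (frozenset({"n1"}), frozenset({"n2"}))),
    Split("t3", (frozenset({"n1"}), frozenset({"n1", "n3"}))),
]
chosen = pick_task(pending, "n1")
```

Here `chosen` is the split whose blocks are all available on `n1`, so no block copy would be needed before the task starts. A split with only some local blocks is preferred over one with none.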

“…5, Improved Hadoop displays better performance than Hadoop because the improved ACO is used to schedule client jobs. Next, the proposed approach is compared with state-of-the-art approaches, including DGNS [9] and iShuffle [29].…”

Section: Effectiveness of the Proposed Approach Versus Those in Previous Work (mentioning)

confidence: 99%

“…Figure 7 shows the response time results when the number of tasks is varied from 1000 to 5000. In [29], the authors evaluate iShuffle operations in a multiuser Hadoop environment and run multiple jobs. Furthermore, they used the modified Hadoop fair scheduler to support two different MapReduce jobs at a time.…”

Section: Effectiveness of the Proposed Approach Versus Those in Previous Work (mentioning)

confidence: 99%


A Multi-Optimization Technique for Improvement of Hadoop Performance with a Dynamic Job Execution Method Based on Artificial Neural Network

et al. 2020


The improvement of Hadoop performance has received considerable attention from researchers in cloud computing fields. Most studies have focused on improving the performance of a Hadoop cluster. Notably, various parameters are required to configure Hadoop and must be adjusted to improve performance. This paper proposes a mechanism to improve Hadoop, schedule jobs, and allocate and utilize resources. Specifically, we present an improved ant colony optimization method to schedule jobs according to the job size and the time expected for execution. Priority is given to the job with the minimum data size and minimum response time. The resource usage and running jobs by data node are predicted using an artificial neural network, and job activity and resource usage are monitored using the resource manager. Moreover, we enhance the Hadoop Name node performance by adding an aggregator node to the default HDFS framework architecture. The changes involve four entities: the name node, secondary name node, aggregator nodes, and data nodes, where the aggregator node is responsible for assigning the jobs among the data nodes, and the Name node keeps track of only the aggregator nodes. We test the overall scheme on Amazon EC2 and S3, and show the results of throughput and CPU response time for different data sizes. Finally, we show that the proposed approach shows significant improvement compared to native Hadoop and other approaches.


“…As an instance, an offline scheduling algorithm based on graph models was proposed by Selvitopi et al [25], which correctly encodes the interactions between map and reduce tasks. Choi et al [26] addressed a problem in which a map split consisted of multiple data blocks distributed and stored in different nodes. Two data-locality-aware task scheduling algorithms were proposed by Beaumont et al [27], which optimized makespan.…”

Section: Related Work (mentioning)

confidence: 99%

HybSMRP: a hybrid scheduling algorithm in Hadoop MapReduce framework

et al. 2019


Introduction: Distributed and parallel processing is one of the best intelligent ways to store and compute big data [1]. Most definitions characterize big data by the 3Vs: the extreme volume of data, the wide variety of data types and the velocity at which the data must be processed. MapReduce [2] is a programming model for big data processing. MapReduce programs are intrinsically parallel [3,4]. MapReduce executes the programs in two phases, map and reduce, so that each phase is defined by a function called mapper and reducer. A MapReduce framework consists of a master and multiple slaves. The master is responsible for the management of the framework, including user interaction, job queue organization and task scheduling. Each slave has a fixed number of map and reduce slots to perform tasks. The job scheduler located in the master assigns tasks according to the number of free task slots

Abstract: Due to the advent of new technologies, devices, and communication tools such as social networking sites, the amount of data produced by mankind is growing rapidly every year. Big data is a collection of large datasets that cannot be processed using traditional computing techniques. MapReduce has been introduced to solve large data computational problems. It is specifically designed to run on commodity hardware, and it depends on divide-and-conquer principles. Nowadays, the focus of researchers has shifted towards Hadoop MapReduce. One of the most outstanding characteristics of MapReduce is data locality-aware scheduling. A data locality-aware scheduler is a further efficient solution to optimize one or a set of performance metrics such as data locality, energy consumption and job completion time. Similar to all situations, time and scheduling are the most important aspects of the MapReduce framework. Therefore, many scheduling algorithms have been proposed in the past decades. The main ideas of these algorithms are increasing the data locality rate and decreasing the response and completion time. In this paper, a new hybrid scheduling algorithm is proposed, which uses dynamic priority and localization ID techniques and focuses on increasing the data locality rate and decreasing completion time. The proposed algorithm was evaluated and compared with the Hadoop default schedulers (FIFO, Fair) by running concurrent workloads consisting of Wordcount and Terasort benchmarks. The experimental results show that the proposed algorithm is faster than FIFO and Fair scheduling, achieves a higher data locality rate and avoids wasting resources.


“…Selvitopi et al [44] proposed an offline scheduling algorithm based on graph and hypergraph models, which correctly encoded the interactions between map and reduce tasks. Choi et al [45] aimed at a problem where an input split consisted of multiple data blocks that were distributed and stored in different nodes. Beaumont et al [46] proposed two data-locality-aware task scheduling algorithms that optimized makespan and communication, respectively, and theoretically studied their performance.…”

Section: Related Work (mentioning)

confidence: 99%

DynDL: Scheduling Data-Locality-Aware Tasks with Dynamic Data Transfer Cost for Multicore-Server-Based Big Data Clusters

Jin, Zhou, et al. 2018

Applied Sciences


Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. This problem is exacerbated in multicore server-based clusters, where multiple tasks running on the same server compete for the server’s network bandwidth. Existing approaches solve this problem by scheduling computational tasks near the input data and considering the server’s free time, data placements, and data transfer costs. However, such approaches usually set identical values for data transfer costs, even though a multicore server’s data transfer cost increases with the number of data-remote tasks. Eventually, this hampers data-processing time, by minimizing it ineffectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although DynDL is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and generates optimal results for DynDL’s specific uses. Using a series of simulations and real-world executions, we show that our algorithms are 30% better than algorithms that do not consider dynamic data transfer costs in terms of data-processing time. Moreover, they can adaptively adjust data localities based on the server’s free time, data placement, and network bandwidth, and schedule tens of thousands of tasks within subseconds or seconds.
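
To make the idea of a non-decreasing data transfer cost concrete, here is a small sketch, assuming a linear cost function and a simple greedy placement rule. It is not DynDL's online or offline algorithm; `transfer_cost`, `greedy_schedule`, and the parameters `base`, `step`, and `proc_time` are hypothetical names and values chosen for illustration.

```python
def transfer_cost(k, base=1.0, step=0.5):
    """Assumed cost of the k-th data-remote task on one server.

    The cost is non-decreasing in k: each extra remote task on the same
    server competes for that server's network bandwidth.
    """
    return base + step * (k - 1)


def greedy_schedule(tasks, servers, proc_time=1.0):
    """Place each task on the server with the earliest estimated finish.

    tasks:   list of (task_id, set of servers holding the input data)
    servers: non-empty list of server names
    Returns (placement dict, makespan).
    """
    free_at = {s: 0.0 for s in servers}       # time each server is free
    remote_count = {s: 0 for s in servers}    # remote tasks per server
    placement = {}
    for task_id, holders in tasks:
        best_server, best_finish = None, float("inf")
        for s in servers:
            # Data-local tasks pay no transfer cost in this sketch.
            cost = 0.0 if s in holders else transfer_cost(remote_count[s] + 1)
            finish = free_at[s] + proc_time + cost
            if finish < best_finish:
                best_server, best_finish = s, finish
        if best_server not in holders:
            remote_count[best_server] += 1
        free_at[best_server] = best_finish
        placement[task_id] = best_server
    return placement, max(free_at.values())


tasks = [("t1", {"a"}), ("t2", {"a"}), ("t3", {"a"})]
placement, makespan = greedy_schedule(tasks, ["a", "b"])
```

With all three inputs stored on server `a`, the first two tasks run locally there and the third is sent to `b` once waiting for `a` would take longer than paying the transfer cost. A fixed cost per remote task would ignore the growing penalty that `transfer_cost` models.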
