Efficiently scheduling MapReduce tasks is one of the major challenges facing MapReduce frameworks. Many algorithms have been introduced to tackle this issue, and most of them focus on the data locality property when scheduling tasks. Data locality, however, may lower physical resource utilization and increase power consumption in non-virtualized clusters. Virtualized clusters provide a viable solution that supports both data locality and better utilization of cluster resources. In this paper, we evaluate the major MapReduce scheduling algorithms, such as FIFO, Matchmaking, Delay, and multithreading locality (MTL), on virtualized infrastructure. Two major factors are used to test the evaluated algorithms: simulation time and energy consumption. The evaluated schedulers are compared, and the results show that the MTL scheduler outperforms the other existing schedulers. We also present a comparison study between virtualized and non-virtualized clusters for MapReduce task scheduling.

Big data has been defined as 'datasets whose size are beyond the ability of typical database software tools to capture, store, manage and analyze.' New technologies are needed to extract value from such datasets; the processed data can then be used in fields such as artificial intelligence, data mining, health care, and social networks. International Business Machines (IBM) researchers [4] characterized big data with the 3Vs: variety, volume, and velocity. Variety refers to the multiple types and formats in which big data is generated, such as digits, texts, audio, video, and log files. The second characteristic is the huge volume of big data, which can reach hundreds or thousands of terabytes. The third characteristic is velocity: processing and analysis must be performed fast enough to extract value from the data within an appropriate time.
These characteristics drive the development of new methodologies for dealing with such huge amounts of data, giving rise to the term 'big data management'. Big data operations are widely used in many technologies, for example, cloud computing, distributed systems, data warehouses, Hadoop, and MapReduce. MapReduce is one of the technologies utilized to handle such big data. It is a software framework introduced by Google for processing large amounts of data in a parallel manner [5]. It provides a set of features such as user-defined functions, automatic parallelization and distribution, fault tolerance, and high availability through data replication. MapReduce works in two phases: the map phase and the reduce phase. In the map phase, a dedicated node called the master node takes the input, divides it into smaller data splits, and assigns them to worker nodes. A worker node may perform the same splitting operation, leading to a hierarchical tree structure. Each worker node processes its assigned splits and sends the results back to the master node. The reduce phase then begins...
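The two phases described above can be illustrated with the classic word-count example. The sketch below is a minimal single-process illustration of the map/shuffle/reduce flow, not the distributed framework itself; the function names (`map_phase`, `reduce_phase`, `run_job`) are illustrative, and the "master" and "workers" are simulated by plain function calls:

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Map: each worker emits (word, 1) pairs for its input split.
    return [(word, 1) for word in split.split()]

def reduce_phase(word, counts):
    # Reduce: aggregate all intermediate values for one key.
    return (word, sum(counts))

def run_job(splits):
    # The master assigns each split to a worker; here we map sequentially.
    intermediate = chain.from_iterable(map_phase(s) for s in splits)
    # Shuffle: group intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # Reduce each group and collect the final results at the master.
    return dict(reduce_phase(w, c) for w, c in groups.items())

print(run_job(["the map phase", "the reduce phase"]))
# → {'the': 2, 'map': 1, 'phase': 2, 'reduce': 1}
```

In a real deployment the map and reduce calls run on different worker nodes, and the shuffle step moves intermediate pairs across the network, which is exactly why data locality matters to the schedulers evaluated in this paper.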