Estimating the progress of MapReduce pipelines

Morton, Kristi; Friesen, Abram L.; Balazinska, Magdalena; Grossman, Dan

doi:10.1109/icde.2010.5447919

Cited by 93 publications

(81 citation statements)

References 9 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…The elapsed time of the running task can be evaluated through its actual execution speed to provide more accurate estimate. Existing methods break map or reduce task into pipelines and sum the elapsed time of every pipeline as the task estimate [17] [12]. In our experiment, we adopts the finish time estimate method of [17], which estimates the time left for a task based on the progress score provided by Hadoop, as (1 − ProgreeScore)/ProgressRate, and the ProgressRate=ProgressScore/elapsed time t.…”

Section: Estimating the Progressmentioning

confidence: 99%

“…[17] provides a method to estimate the progress of a MapReduce task, however, there are also several challenges to estimate the progress of MapReduce jobs and MapReduce DAGs. Parallax [12] estimates the progress of queries translated into sequences of MapReduce jobs. It breaks a MapReduce job into pipelines, which are groups of interconnected operators that execute simultaneously.…”

Section: Related Workmentioning

confidence: 99%

See 1 more Smart Citation

Halt or Continue: Estimating Progress of Queries in the Cloud

Shi

Meng

Liu

2012

Database Systems for Advanced Applications

View full text Add to dashboard Cite

Abstract. With cloud-based data management gaining more ground by day, the problem of estimating the progress of MapReduce queries in the cloud is of paramount importance. This problem is challenging to solve for two reasons: i) cloud is typically a large-scale heterogeneous environment, which requires progress estimation to tailor to non-uniform hardware characteristics, and ii) cloud is often built with cheap and commodity hardware that is prone to fail, so our estimation should be able to dynamically adjust. These two challenges were largely unaddressed in previous work. In this paper, we propose PEQC, a Progress Estimator of Queries composed of MapReduce jobs in the Cloud. Our work is able to apply to a heterogeneous setting and provides a dynamically update mechanism to repair the network when failure occurs. We experimentally validate our techniques on a heterogeneous cluster and results show that PEQC outperforms the state of the art.

show abstract

Section: Estimating the Progressmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Halt or Continue: Estimating Progress of Queries in the Cloud

Shi

Meng

Liu

2012

Database Systems for Advanced Applications

View full text Add to dashboard Cite

show abstract

“…While there are several frameworks that generate pipeline MapReduce applications, few works focus on optimizing the actual execution of this type of applications. In [11], the authors propose a tool for estimating the progress of MapReduce pipelines generated by Pig queries. The Hadoop Online Prototype (HOP) [7] is a modified version of the Hadoop MapReduce framework that supports online aggregation, allowing users to get snapshots from a job as it is being computed.…”

Section: Pipeline Mapreduce Applications: Overview and Related Workmentioning

confidence: 99%

On-the-Fly Task Execution for Speeding Up Pipelined MapReduce

Moise

Antoniu

Bougé

2012

Euro-Par 2012 Parallel Processing

View full text Add to dashboard Cite

Abstract. The MapReduce programming model is widely acclaimed as a key solution to designing data-intensive applications. However, many of the computations that fit this model cannot be expressed as a single MapReduce execution, but require a more complex design. Such applications consisting of multiple jobs chained into a long-running execution are called pipeline MapReduce applications. Standard MapReduce frameworks are not optimized for the specific requirements of pipeline applications, yielding performance issues. In order to optimize the execution on pipelined MapReduce, we propose a mechanism for creating map tasks along the pipeline, as soon as their input data becomes available. We implemented our approach in the Hadoop MapReduce framework. The benefits of our dynamic task scheduling are twofold: reducing job-completion time and increasing cluster utilization by involving more resources in the computation. Experimental evaluation performed on the Grid'5000 testbed, shows that our approach delivers performance gains between 9% and 32%.Keywords: MapReduce, pipeline MapReduce applications, intermediate data management, task scheduling, Hadoop, HDFS IntroductionThe MapReduce abstraction has revolutionized the data-intensive community and has rapidly spread to various research and production areas. Google introduced MapReduce [8] as a solution to the need to process datasets up to multiple terabytes in size on a daily basis. The goal of the MapReduce programming model is to provide an abstraction that enables users to perform computations on large amounts of data.The MapReduce abstraction is inspired by the "map" and "reduce" primitives commonly used in functional programming. When designing an application using the MapReduce paradigm, the user has to specify two functions: map and reduce that are executed in parallel on multiple machines. Applications that can be modeled by the means of MapReduce, mostly consist of two computations: the "map" step, that applies a filter on the input data, selecting only the data that satisfies a given condition, and the "reduce" step, that collects and aggregates all the data produced by the first phase. The MapReduce model exposes a simple interface, that can be easily manipulated by users without any experience with parallel and distributed systems. However, the interface is versatile enough so that it can be employed to suit a wide range of data-intensive applications. These are the main reasons for which MapReduce has known an increasing popularity ever since it was introduced.An open-source implementation of Google's abstraction was provided by Yahoo! through the Hadoop [5] project. This framework is considered the reference MapReduce implementation and is currently heavily used for various purposes and on several infrastructures. The MapReduce paradigm has also been adopted by the cloud computing community as a support to those cloud-based applications that are data-intensive. Cloud providers support MapReduce computations so as to take advantage of the huge processi...

show abstract

“…However it uses high resource when constructing virtual machines and results in wasting allocated resources when they are not activated. The cloud platform gives a way when startups select the platforms to deploy their development and operational environment [1], [2] When prototyping a distributed application like MapReduce, a developer needs to ensure that the application execution corresponds to the specification while its performance is not impacted by the number of nodes or by some failure scenarios [1]- [3]. Indeed, MapReduce relies on successive computing-communication steps that, if not coordinated with care, lead to performance bottlenecks and a poor scalability.…”

Section: Introductionmentioning

confidence: 99%