Stage Aware Performance Modeling of DAG Based in Memory Analytic Platforms

Algorithms and Architectures for Parallel Processing

2016

Self Cite

Recent years have seen a steep rise in data generation worldwide, with the development and widespread adoption of several software projects targeting the Big Data paradigm. Many companies currently engage in Big Data analytics as part of their core business activities; nonetheless, there are no tools and techniques to support the design of the underlying hardware configuration backing such systems. In particular, the focus in this report is set on Cloud-deployed clusters, which represent a cost-effective alternative to on-premises installations. We propose a novel tool implementing a battery of optimization and prediction techniques integrated so as to efficiently assess several alternative resource configurations, in order to determine the minimum-cost cluster deployment satisfying Quality of Service constraints. Further, the experimental campaign conducted on real systems shows the validity and relevance of the proposed method.

Section: Introduction (mentioning)

confidence: 99%

D-SPACE4Cloud: A Design Tool for Big Data Applications

Ciavotta

Gianniti

Algorithms and Architectures for Parallel Processing

2016

Self Cite

“…Because of this, predicting the execution time of Hadoop jobs is usually done empirically through experimentation, requiring a costly setup [15]. An alternative is to develop models for predicting performance.…”

Section: Introduction (mentioning)

confidence: 99%

Modeling Performance of Hadoop Applications: A Journey from Queueing Networks to Stochastic Well Formed Nets

Algorithms and Architectures for Parallel Processing

Bernardi

Gianniti

et al. 2016

Self Cite

Nowadays, many enterprises commit to the extraction of actionable knowledge from huge datasets as part of their core business activities. Applications belong to very different domains such as fraud detection or one-to-one marketing, and encompass business analytics and support to decision making in both private and public sectors. In these scenarios, a central place is held by the MapReduce framework and in particular its open source implementation, Apache Hadoop. In such environments, new challenges arise in the area of job performance prediction, with the need to provide Service Level Agreement guarantees to the end-user and to avoid waste of computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well formed nets, able to estimate job performance under a number of scenarios of interest, including also unreliable resources. The accuracy of our models is evaluated by considering the TPC-DS industry benchmark running experiments on Amazon EC2 and the CINECA Italian supercomputing center. The results have shown that the average accuracy we can achieve is in the range 9-14%.

Acknowledgments: This work has received funding from the European Union Horizon 2020 research and innovation program under grant agreement No. 644869 (DICE). Experimental data are available as open data at https://zenodo.org/record/58847#.V5i0wmXA45Q.

“…This idea of profiling is also used in [16,17]. Task durations are obtained from execution logs, and on average, each query runs 20 times.…”

Section: Composite DAG Model (mentioning)

confidence: 99%

“…Other more recent models, presented in this area are simulation-based models for which analysis is time consuming and less scalable [13,14]. Methods based on machine learning are good for interpolation, but suffer from low generality and insight [15,16,17]. Moreover, machine learning needs costly cluster setup to study historical logs of past executions.…”

Section: Introduction (mentioning)

confidence: 99%

Analytical composite performance models for Big Data applications

Karimian-Aliabadi

Journal of Network and Computer Applications

Entezari‐Maleki

et al. 2019

Self Cite

In the era of Big Data, in which the digital industry is facing massive growth in data size and the development of data-intensive software, more and more companies are moving to new frameworks and paradigms capable of handling data at scale. The MapReduce (MR) paradigm and its implementation framework, Hadoop, are among the most frequently cited, and form the basis for later and more advanced frameworks like Tez and Spark. Accurate prediction of the execution time of a Big Data application helps improve design-time decisions, reduces over-allocation charges, and assists budget management. In this regard, we propose analytical models based on Stochastic Activity Networks (SANs) to accurately model the execution of MR, Tez and Spark applications in Hadoop environments governed by the YARN Capacity Scheduler. We evaluate the accuracy of the proposed models over the TPC-DS industry benchmark across different configurations. Results obtained by numerically solving analytical SAN models show an average error of 6% in estimating the execution time of an application compared to the data gathered from experiments; moreover, the model evaluation time is lower than the simulation time of state-of-the-art solutions.