2021 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata52589.2021.9671275
On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds

Abstract: With the growing amount of data, data processing workloads and the management of their resource usage become increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users increasingly execute their respective workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed that either quickly profile towards a good configuration or determine one based on data from previous runs. Still, …

Cited by 9 publications (8 citation statements)
References 24 publications
“…In several of our prior works [16,31,33,32,20], we discussed the idea of exploiting similarities between different jobs and their executions, cultivating runtime data in a collaborative manner among numerous users and thereby improving the prediction capabilities of individual users. This includes decentralized system architectures for sharing context-aware runtime metrics, as well as similarity matching between jobs.…”
Section: Results Overview
Citation type: mentioning (confidence: 99%)
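The statement above refers to similarity matching between jobs based on shared, context-aware runtime metrics. A minimal sketch of what such matching could look like, assuming jobs are represented as simple resource-usage feature vectors compared by cosine similarity (the feature names, values, and similarity measure are illustrative assumptions, not the authors' concrete method):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two job feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_jobs(target: np.ndarray, history: dict, k: int = 3):
    """Return the k historical jobs whose runtime profiles are closest to target."""
    scored = [(name, cosine_similarity(target, vec)) for name, vec in history.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Hypothetical runtime-metric vectors: [CPU utilization, memory (GB), shuffle (GB), input (GB)]
history = {
    "wordcount-run-1": np.array([0.62, 12.0, 4.1, 50.0]),
    "pagerank-run-7":  np.array([0.85, 48.0, 30.5, 20.0]),
    "sort-run-3":      np.array([0.55, 16.0, 60.0, 60.0]),
}
new_job = np.array([0.60, 14.0, 5.0, 55.0])

print(most_similar_jobs(new_job, history, k=2))
```

In a collaborative setting, the `history` collection would be populated with runtime metrics shared by many users, so a new job can borrow runtime predictions from its closest historical matches.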
“…Both approaches have in common that they need at least comprehensive knowledge about execution times of all tasks on all available nodes. However, these values are not available in advance but must be determined either by asking users for estimates [18,22,23], by analyzing historical traces [35,36,42], or by using some form of online learning [43,45]. Lotaru aims to estimate the runtime for all task-node pairs in a cluster to enable the use of existing scheduling methods in real-world systems.…”
Section: Scheduling Workflow Tasks Onto Heterogeneous Clusters
Citation type: mentioning (confidence: 99%)
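The cited scheduling approaches assume a runtime estimate for every task-node pair. A minimal sketch of how such a task-node runtime matrix could be derived from historical traces, assuming a measured baseline runtime per task and a per-node speed factor with linear scaling (task names, node names, and the scaling rule are illustrative assumptions, not Lotaru's actual estimation method):

```python
# Hypothetical baseline runtimes (seconds) measured on a reference node,
# e.g. extracted from historical traces of earlier workflow executions.
baseline_runtime = {"fastqc": 120.0, "align": 900.0, "sort": 300.0}

# Hypothetical per-node speed factors; > 1.0 means faster than the reference node.
node_speed = {"node-a": 1.0, "node-b": 1.6, "node-c": 0.7}

def estimate_runtimes(baseline: dict, speed: dict) -> dict:
    """Estimate the runtime of every task on every node by linear scaling."""
    return {
        (task, node): seconds / factor
        for task, seconds in baseline.items()
        for node, factor in speed.items()
    }

# The resulting task-node matrix is the input that heterogeneity-aware schedulers consume.
for (task, node), runtime in sorted(estimate_runtimes(baseline_runtime, node_speed).items()):
    print(f"{task:>7} on {node}: {runtime:7.1f} s")
```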
“…Some approaches use runtime data to predict the job's scale-out and runtime behavior. This data is gained either from dedicated profiling or previous full executions [7], [25]- [31]. The models can then be used to predict the execution performance for different cluster configurations, and the most resource-efficient one will be chosen.…”
Section: A. Approaches Based On Historical Performance Data
Citation type: mentioning (confidence: 99%)
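The quoted passage describes fitting models to runtime data from profiling or previous full executions and using them to choose a resource-efficient cluster configuration. A minimal sketch of this idea, assuming a simple parametric scale-out model fitted by least squares and a runtime deadline as the selection criterion (the model form, data points, and deadline are illustrative assumptions, not the exact models of the cited works):

```python
import numpy as np

# Hypothetical (scale-out, runtime in seconds) observations from previous runs or profiling.
scale_outs = np.array([2.0, 4.0, 8.0, 16.0])
runtimes = np.array([1900.0, 1050.0, 640.0, 460.0])

# Fit a simple parametric model runtime(n) = a + b/n + c*log(n) by least squares.
X = np.column_stack([np.ones_like(scale_outs), 1.0 / scale_outs, np.log(scale_outs)])
coef, *_ = np.linalg.lstsq(X, runtimes, rcond=None)

def predict_runtime(n: float) -> float:
    """Predicted runtime for a cluster with n nodes."""
    return float(coef[0] + coef[1] / n + coef[2] * np.log(n))

# Pick the smallest scale-out whose predicted runtime still meets a (hypothetical) deadline,
# i.e. the most resource-efficient configuration that satisfies the target.
deadline_s = 700.0
feasible = [n for n in range(2, 33) if predict_runtime(n) <= deadline_s]
print("smallest sufficient scale-out:", min(feasible) if feasible else "none found")
```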