Automatic Caching Decision for Scientific Dataflow Execution in Apache Spark

Gottin, Vinícius M.; Pacheco, Edward; Dias, Jonas; Ciarlini, Angelo E. M.; Costa, Bruno; Vieira, Wagner; Souto, Yania Molina; Pires, Paulo F.; Porto, Fábio; Rittmeyer, João Guilherme Nobre

doi:10.1145/3206333.3206339

Cited by 10 publications

(7 citation statements)

References 11 publications

“…S-CACHE [28] automatically makes a sub-optimal caching decision by analyzing the application's execution flow and cost model, implemented in Apache Spark. It calculates the computational cost of individual caching decisions by considering the dataset's computation cost, cache writes cost, and cache read cost.…”

Section: Related Work (mentioning)

confidence: 99%

CCA: Cost-Capacity-Aware Caching for In-Memory Data Analytics Frameworks

Park

Jeong

Han

2021

Sensors

To process data from IoT and wearable devices, analysis tasks are often offloaded to the cloud. As the volume of sensed data keeps growing, optimizing data analytics frameworks is critical to the performance of processing that data. A key approach to speeding up data analytics frameworks in the cloud is caching intermediate data that is used repeatedly in iterative computations. Existing analytics engines implement caching in various ways: some use run-time mechanisms with dynamic profiling, and others rely on programmers to decide which data to cache. Even though caching has long been investigated in computer systems research, recent data analytics frameworks still leave room for optimization. Because sophisticated caching must consider complex execution contexts such as cache capacity, the size of the data to cache, and the victims to evict, a general solution often does not exist for data analytics frameworks. In this paper, we propose an application-specific cost-capacity-aware caching scheme for in-memory data analytics frameworks. We use a cost model, built from multiple representative inputs, and an execution flow analysis, extracted from the DAG schedule, to select primary caching candidates among the intermediate data. Once the caching candidates are determined, the optimal caching is selected automatically during execution, so programmers no longer need to decide manually which intermediate data to cache. We implemented our scheme in Apache Spark and evaluated it experimentally on HiBench benchmarks. Compared to the caching decisions in the original benchmarks, our scheme increases performance by 27% with sufficient cache memory and by 11% with insufficient cache memory.

“…We obtained 19 papers [20‐22, 32, 75‐89]. These papers were published in five distinct journal publications, 13 conferences/workshops and one PhD thesis.…”

Section: Related Work (mentioning)

confidence: 99%

“…Considering the work by Gottin et al. [84], an automatic pre‐computing strategy computes an optimal combination of cache operations given a dataflow definition and a simple operation cost model for a Spark dataflow, under memory constraints. The work is orthogonal to the one presented in this article, as the latter obtains performance improvements that are independent of code changes.…”

Section: Related Work (mentioning)

confidence: 99%

“…Although SWfMS are widely used nowadays in the scientific community, they may present a long learning curve associated with running simulations and guaranteeing reproducibility of results. Thus, big data frameworks are attracting more and more attention from academia for modeling and executing scientific workflows [17, 18]. There are several big data frameworks available for use such as Hadoop (MapReduce processing framework), Storm (stream processing framework), Samza (stream processing framework), Flink (stream processing framework that can also handle batch tasks), and Spark (batch processing framework with stream processing capabilities).…”

Section: Introduction (mentioning)

confidence: 99%

Towards optimizing the execution of spark scientific workflows using machine learning‐based parameter tuning

Oliveira

Porto

Boeres

et al. 2020

Concurrency and Computation

Summary: In the last few years, Apache Spark has become the de facto standard framework for big data systems in both industry and academic projects. Spark is used to execute compute‐ and data‐intensive workflows in distinct areas like biology and astronomy. Although Spark is an easy‐to‐install framework, it has more than one hundred parameters to be set, besides the domain‐specific parameters of each workflow. To execute Spark‐based workflows efficiently, the user therefore has to fine‐tune a myriad of Spark and workflow parameters (e.g., the partitioning strategy, the average size of a DNA sequence, etc.). This configuration task cannot be performed manually in a trial‐and‐error manner, since doing so is tedious and error‐prone. This article proposes an approach that focuses on generating interpretable predictive machine learning models (i.e., decision trees) and then extracting useful rules (i.e., patterns) from these models that can be applied to configure the parameters of future executions of the workflow and of Spark for nonexpert users. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the approach introduced here reduced the number of parameters to be configured by identifying, in the predictive model, the most relevant domain‐specific ones related to workflow performance.

“…The problem of this approach is that it is static, i.e., they do not consider automatic caching. Gottin et al [15] propose an algorithm that finds an optimized cache decision plan for a dataflow execution in Apache Spark. The approach is based on a cost model that uses provenance data, and tries the possible combinations of caching selection in order to select the best one.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient Execution of Scientific Workflows in the Cloud Through Adaptive Caching

Heidsieck

Oliveira

Pacitti

et al. 2020

Transactions on Large-Scale Data- And Knowledge-Centered Systems XLIV

Many scientific experiments are now carried on using scientific workflows, which are becoming more and more data-intensive and complex. We consider the efficient execution of such workflows in the cloud. Since it is common for workflow users to reuse other workflows or data generated by other workflows, a promising approach for efficient workflow execution is to cache intermediate data and exploit it to avoid task re-execution. In this paper, we propose an adaptive caching solution for data-intensive workflows in the cloud. Our solution is based on a new scientific workflow management architecture that automatically manages the storage and reuse of intermediate data and adapts to the variations in task execution times and output data size. We evaluated our solution by implementing it in the OpenAlea system and performing extensive experiments on real data with a data-intensive application in plant phenotyping. The results show that adaptive caching can yield major performance gains, e.g., up to a factor of 3.5 with 6 workflow re-executions.
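The citation statements above describe the approach of Gottin et al. as a cost model over computation, cache-write, and cache-read costs, searched over the possible caching combinations under memory constraints. The following is only a minimal sketch of that general idea, not the authors' implementation: the dataset names, the cost and size figures, and the functions `plan_cost` and `best_plan` are all hypothetical, and the exhaustive enumeration is a plain brute-force search chosen for clarity.

```python
from itertools import combinations

# Hypothetical dataflow: each dataset lists its parents plus illustrative
# costs (arbitrary time units) and an in-memory size (arbitrary units).
DATASETS = {
    "raw":      {"parents": [],          "compute": 10.0, "size": 4, "write": 2.0, "read": 1.0},
    "cleaned":  {"parents": ["raw"],     "compute": 6.0,  "size": 3, "write": 1.5, "read": 0.5},
    "features": {"parents": ["cleaned"], "compute": 8.0,  "size": 2, "write": 1.0, "read": 0.5},
}

# Hypothetical sequence of actions; each one requests a target dataset.
ACTIONS = ["features", "features", "cleaned"]


def plan_cost(cached, datasets, actions):
    """Estimated total cost of running all actions with `cached` marked for caching."""
    materialized = set()  # datasets already written to the cache

    def materialize(name):
        d = datasets[name]
        if name in materialized:
            return d["read"]  # served from the cache
        # Otherwise recompute the dataset from its parents.
        cost = d["compute"] + sum(materialize(p) for p in d["parents"])
        if name in cached:
            cost += d["write"]  # pay the cache write once
            materialized.add(name)
        return cost

    return sum(materialize(a) for a in actions)


def best_plan(datasets, actions, capacity):
    """Try every caching combination that fits in `capacity`; return (cost, set)."""
    best = (plan_cost(frozenset(), datasets, actions), frozenset())
    names = sorted(datasets)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if sum(datasets[n]["size"] for n in subset) > capacity:
                continue  # violates the memory constraint
            cost = plan_cost(frozenset(subset), datasets, actions)
            if cost < best[0]:
                best = (cost, frozenset(subset))
    return best


if __name__ == "__main__":
    print(best_plan(DATASETS, ACTIONS, capacity=5))
```

With these illustrative numbers, running every action without caching costs 64.0 units, while a capacity of 5 selects `cleaned` and `features` for an estimated cost of 27.5. The number of combinations grows exponentially with the number of datasets, so a sketch like this is only practical for small dataflows.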