2015 IEEE International Congress on Big Data
DOI: 10.1109/bigdatacongress.2015.64

Machine Learning-Based Configuration Parameter Tuning on Hadoop System

Cited by 40 publications (13 citation statements); references 16 publications.
“…Machine learning techniques have been applied to explore complex configuration spaces to find near optimal settings without considering constraints on operating behavior [5,53,60,68]. Some approaches employ ML to meet resource constraints in dynamic environments [9].…”
Section: Motivation (mentioning)
confidence: 99%
“…Machine learning frameworks Many learning approaches have been proposed for predicting an optimal configuration within a complicated configuration space [8-10, 31, 40, 53, 65]. Machine learning has even been applied to further improve existing heuristic autotuners, like Starfish [21], by using learning models to direct the search for optimal configurations [5,60]. Perhaps the most closely related learning works are those based on reinforcement learning (RL) [51].…”
Section: Related Work (mentioning)
confidence: 99%
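To make the learning-directed search these works describe concrete, the sketch below fits a regression model to previously measured (configuration, runtime) pairs and uses it to rank unseen candidate configurations. It is a minimal illustration, not the method of the tuner this paper proposes nor of the works cited above; the Hadoop parameter names, value ranges, and the synthetic measure_runtime stub are assumptions made only for the sake of a runnable example.

```python
# Minimal sketch of model-directed configuration search.
# Parameter names, value ranges, and the runtime stub are illustrative only.
import random
from sklearn.ensemble import RandomForestRegressor

PARAM_SPACE = {
    "mapreduce.task.io.sort.mb": [100, 200, 400, 800],
    "mapreduce.job.reduces": [4, 8, 16, 32],
    "mapreduce.map.memory.mb": [1024, 2048, 4096],
}

def random_config():
    return {k: random.choice(v) for k, v in PARAM_SPACE.items()}

def to_vector(cfg):
    return [cfg[k] for k in sorted(PARAM_SPACE)]

def measure_runtime(cfg):
    # Stand-in for actually running and timing the job; a real tuner would
    # execute the workload on the cluster with this configuration.
    return (800.0 / cfg["mapreduce.task.io.sort.mb"]
            + 64.0 / cfg["mapreduce.job.reduces"]
            + 4096.0 / cfg["mapreduce.map.memory.mb"]
            + random.gauss(0, 0.1))

# 1) Measure a small sample of configurations.
history = [random_config() for _ in range(30)]
runtimes = [measure_runtime(c) for c in history]

# 2) Fit a surrogate model that predicts runtime from a configuration vector.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit([to_vector(c) for c in history], runtimes)

# 3) Let the model rank many cheap-to-generate candidates; run only the best.
candidates = [random_config() for _ in range(1000)]
best = min(candidates, key=lambda c: float(model.predict([to_vector(c)])[0]))
print("predicted-best configuration:", best)
```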
“…Another work [25] proposes a tree-based regression approach consisting of a prediction and an optimization phase. The former one estimates the execution time of a MapReduce job by building three prediction models.…”
Section: Batch Processing Systems (mentioning)
confidence: 99%
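The excerpt does not say which three prediction models the cited tree-based approach [25] builds, so the sketch below simply assumes one regression tree per job phase (map, shuffle, reduce) and sums their predictions to estimate total execution time. The job features and training data are synthetic placeholders.

```python
# Hedged sketch of tree-based execution-time prediction for a MapReduce job:
# one regression tree per assumed phase, summed into a total-time estimate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic training data: features are (input size in GB, map slots,
# reduce slots); targets are per-phase durations in seconds.
X = rng.uniform([1, 2, 1], [100, 64, 32], size=(200, 3))
y_map = 30 * X[:, 0] / X[:, 1] + rng.normal(0, 1, 200)
y_shuffle = 5 * X[:, 0] / X[:, 2] + rng.normal(0, 1, 200)
y_reduce = 12 * X[:, 0] / X[:, 2] + rng.normal(0, 1, 200)

# The "three prediction models": one tree per phase (an assumption here).
phase_models = {
    "map": DecisionTreeRegressor(max_depth=6).fit(X, y_map),
    "shuffle": DecisionTreeRegressor(max_depth=6).fit(X, y_shuffle),
    "reduce": DecisionTreeRegressor(max_depth=6).fit(X, y_reduce),
}

def predict_job_time(input_gb, map_slots, reduce_slots):
    """Estimate total execution time as the sum of predicted phase times."""
    x = [[input_gb, map_slots, reduce_slots]]
    return sum(float(m.predict(x)[0]) for m in phase_models.values())

print("predicted runtime (s):", round(predict_job_time(50, 16, 8), 1))
```

A second, optimization phase would then search the configuration space for settings that minimize this predicted time, much like the surrogate-guided search sketched above.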
“…For the experiments, we will present our cluster performance based on MapReduce and Spark using the HiBench suite [23]. In particular, we have selected two HiBench workloads out of thirteen standard workloads to represent the two types of jobs, namely WordCount (aggregation job) [32] and TeraSort (shuffle job) [33], with large datasets. We selected these two workloads because their complex characteristics let us study how efficiently each analyzes cluster performance by correlating MapReduce and Spark behavior with combinations of parameter groups.…”
Section: Cluster Architecture (mentioning)
confidence: 99%
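As a rough illustration of correlating parameter groups with the behavior of the two workload types, the sketch below computes, per workload, the correlation between two configuration parameters and measured runtimes. The parameter names and the runtime table are synthetic placeholders, not data from the cited experiments.

```python
# Illustrative only: per-workload correlation between configuration parameters
# and runtime for an aggregation job (WordCount) and a shuffle job (TeraSort).
# All numbers below are synthetic placeholders, not measured cluster results.
import pandas as pd

runs = pd.DataFrame({
    "workload":     ["wordcount"] * 4 + ["terasort"] * 4,
    "io.sort.mb":   [100, 200, 400, 800, 100, 200, 400, 800],
    "reduce.tasks": [8, 8, 16, 16, 8, 8, 16, 16],
    "runtime_s":    [310, 290, 250, 240, 620, 560, 430, 400],
})

# Which parameter group tracks runtime more closely differs between the
# aggregation-heavy and shuffle-heavy workloads.
for name, group in runs.groupby("workload"):
    corr = group[["io.sort.mb", "reduce.tasks", "runtime_s"]].corr()["runtime_s"]
    print(name, corr.drop("runtime_s").round(2).to_dict())
```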