Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018
DOI: 10.1145/3173162.3173187
Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing

Cited by 67 publications (63 citation statements)
References 35 publications
“…A straightforward method [12]-[14], [24], [32]-[34] to solve the configuration parameter optimization problem is to first construct an offline prediction model and then apply search algorithms online to find the optimal configuration based on this prediction model. For instance, Xiong et al. [24] utilize an ensemble learning algorithm to build the performance-prediction model and leverage a genetic algorithm to search for the optimal configuration parameters for HBase.…”
Section: A. Prediction Model-based Methods
confidence: 99%
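The two-phase approach described in this excerpt can be sketched in a few lines. This is a minimal illustration, not the cited work's method: the parameter names and the synthetic runtime function are invented, and a nearest-neighbour lookup stands in for the ensemble learner used by Xiong et al.

```python
import random

# Hypothetical 2-parameter config space (names are illustrative,
# not the actual parameters tuned in the cited work).
SPACE = {
    "executor_memory_gb": list(range(1, 17)),
    "shuffle_partitions": list(range(50, 451, 50)),
}

def true_runtime(cfg):
    # Stand-in for a real benchmark run; the tuner never calls this
    # online, it is only used to label offline training samples.
    m, p = cfg["executor_memory_gb"], cfg["shuffle_partitions"]
    return (m - 8) ** 2 + ((p - 200) / 50) ** 2

# --- offline phase: run sampled configs once, record runtimes ---
random.seed(0)
samples = []
for _ in range(60):
    cfg = {k: random.choice(v) for k, v in SPACE.items()}
    samples.append((cfg, true_runtime(cfg)))

def predict(cfg):
    # Toy surrogate: runtime of the nearest sampled config
    # (an ensemble regressor would replace this in practice).
    return min(samples,
               key=lambda s: sum((s[0][k] - cfg[k]) ** 2 for k in SPACE))[1]

# --- online phase: genetic search over the surrogate, no real runs ---
def mutate(cfg):
    child = dict(cfg)
    k = random.choice(list(SPACE))
    child[k] = random.choice(SPACE[k])
    return child

def crossover(a, b):
    return {k: random.choice((a[k], b[k])) for k in SPACE}

pop = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(20)]
for _ in range(30):
    pop.sort(key=predict)          # rank by predicted runtime
    elite = pop[:5]
    pop = elite + [mutate(crossover(random.choice(elite),
                                    random.choice(elite)))
                   for _ in range(15)]

best = min(pop, key=predict)
print(best)
```

The key property this sketch shows is that the expensive benchmark runs are confined to the offline sampling phase; the genetic search only queries the cheap surrogate.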
“…If that were the case, amortization could be done over the lifetime of a cluster rather than for individual workloads. The main difficulties posed for training such a model are: 1) hundreds of executions are needed to build it [26]; 2) difficulties in adapting to dynamic resource allocation in the cluster; 3) the high diversity of the workloads makes it harder to build a single cost model of a good accuracy [7]; 4) the high dimensionality of the search space: one dimension per configuration parameter. Complex data processing frameworks such as Spark commonly have 20-60 parameters that are relevant for tuning, and our experimental evaluation with system-wide models yields results around 40% worse than optimal.…”
Section: Tuning Cost Amortization
confidence: 99%
“…git
$ cd tuneful-code
$ mvn clean package
$ /usr/lib/spark/bin/spark-submit
Table 3. We selected those parameters as they cover a wide range of Spark's internal aspects (memory, processing, shuffle and network aspects) and represent a superset of the ones used in the related work [26,27], with approximately 2 × 10^40 possible configurations in total (this represents the size of the search space).…”
Section: Appendix A.1 Experiments Reproducibility
confidence: 99%
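A search-space size like 2 × 10^40 is simply the product of the per-parameter domain sizes. The arithmetic can be checked with illustrative domain sizes (these are invented for the sketch, not the actual parameter list from the cited appendix):

```python
from math import prod, log10

# Hypothetical domain sizes for 30 tunable parameters: a few with
# distinctive ranges plus 26 with ~20 discrete settings each.
domain_sizes = [16, 9, 40, 100] + [20] * 26

total = prod(domain_sizes)  # exhaustive search would need this many runs
print(f"search space ~ 10^{log10(total):.0f} configurations")
```

With roughly 30 parameters of 10-100 settings each, the product lands around 10^40, which is why exhaustive enumeration is ruled out and model-guided search is needed.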