Distributed data stream processing systems (DSPSs) such as Storm, Flink, and Spark Streaming are now routinely used to process continuous data streams in (near) real time. However, achieving the low latency and high throughput demanded by today's streaming applications can be a daunting task, especially since the performance of DSPSs depends heavily on a large number of system parameters that control load balancing, degree of parallelism, buffer sizes, and various other aspects of system execution. This tutorial offers a comprehensive review of the state-of-the-art automatic performance tuning approaches that have been proposed in recent years. The approaches are organized into five main categories based on their methodologies and features: cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. The categories will be analyzed in depth and compared with each other, exposing their strengths and weaknesses. Finally, we will identify several open research problems and challenges related to automatic performance tuning for DSPSs.
This paper presents DITIS, a simulator for distributed and tiered file-based storage systems. In particular, DITIS can model a distributed storage system with up to three levels of storage tiers and up to three additional levels of caches. Each tier and cache can be configured with a different number and type of storage media devices (e.g., HDD, SSD, NVRAM, DRAM), each with its own performance characteristics. The simulator uses the provided characteristics in fine-grained performance cost models (which are distinct for each device type) to compute the duration of each I/O request processed on each tier. At the same time, DITIS simulates the overall flow of requests through the different layers and storage nodes of the system using numerous pluggable policies that control every aspect of execution, ranging from request routing and data redundancy to cache and tiering strategies. To perform the simulation, DITIS adopts an extended version of the Actor Model, in which key components of the system exchange asynchronous messages with each other, much like a real distributed multi-threaded system. The ability to simulate the execution of a workload in such an accurate and realistic way brings multiple benefits for its users, since DITIS can be used to better understand the behavior of the underlying file system as well as to evaluate different storage setups and policies.
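To make the idea of a per-device performance cost model concrete, the following is a minimal sketch and not DITIS's actual model. It assumes the simplest possible form, a fixed per-request latency plus a size-dependent transfer time. The class `DeviceModel`, the function `request_duration`, and all device figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceModel:
    """Hypothetical cost model for one device type (not taken from DITIS)."""
    name: str
    latency_s: float       # assumed fixed per-request overhead, in seconds
    bandwidth_bps: float   # assumed sustained bandwidth, in bytes per second

    def io_duration(self, size_bytes: int) -> float:
        # Duration = fixed latency + time to transfer the payload.
        return self.latency_s + size_bytes / self.bandwidth_bps


# Illustrative figures only; real devices and DITIS's inputs will differ.
HDD = DeviceModel("HDD", latency_s=0.008, bandwidth_bps=150e6)
SSD = DeviceModel("SSD", latency_s=0.0001, bandwidth_bps=500e6)


def request_duration(size_bytes: int, cache_hit: bool,
                     cache: DeviceModel, tier: DeviceModel) -> float:
    """Serve the request from the cache device on a hit, else from the tier."""
    device = cache if cache_hit else tier
    return device.io_duration(size_bytes)


if __name__ == "__main__":
    one_mb = 1_000_000
    print("cache hit :", request_duration(one_mb, True, SSD, HDD))
    print("cache miss:", request_duration(one_mb, False, SSD, HDD))
```

A simulator built this way would keep one such model per device type and let the routing, caching, and tiering policies decide which device serves each request. How DITIS structures its own models is not specified in the abstract.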
The growing need to identify patterns in data and automate decisions based on them in near real time has stimulated the development of new machine learning (ML) applications that process continuous data streams. However, deploying ML applications over distributed stream processing engines (DSPEs) such as Apache Spark Streaming is a complex procedure that requires extensive tuning along two dimensions. First, DSPEs have a plethora of system configuration parameters, such as degree of parallelism and memory buffer sizes, that have a direct impact on application throughput and/or latency and need to be optimized. Second, ML models have their own set of hyperparameters that require tuning, as they can significantly affect the overall prediction accuracy of the trained model. These two forms of tuning have been studied extensively in the literature, but only in isolation from each other. This manuscript presents a comprehensive experimental study that combines system configuration and hyperparameter tuning of ML applications over DSPEs. The experimental results reveal unexpected and complex interactions between the choices of system configurations and hyperparameters, and their impact on both application and model performance. These insights motivate the need for new combined system and ML model tuning approaches, and open up new research directions in the field of self-managing distributed stream processing systems.
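As a rough illustration of what tuning both dimensions together means, the sketch below enumerates a joint search space as the Cartesian product of a system configuration grid and a hyperparameter grid. It is an assumption-laden example, not the study's methodology: the parameter names (`parallelism`, `buffer_size_mb`, `learning_rate`, `regularization`), their values, and the helper functions `grid` and `joint_grid` are all hypothetical.

```python
from itertools import product


def grid(space):
    """Expand {name: [values, ...]} into a list of concrete configurations."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]


def joint_grid(system_space, model_space):
    """Pair every system configuration with every hyperparameter setting."""
    return [(s, m) for s in grid(system_space) for m in grid(model_space)]


# Hypothetical parameter names and values, for illustration only.
system_space = {"parallelism": [2, 4, 8], "buffer_size_mb": [64, 128]}
model_space = {"learning_rate": [0.01, 0.1], "regularization": [0.0, 0.001]}

candidates = joint_grid(system_space, model_space)

if __name__ == "__main__":
    print(len(candidates), "joint configurations")
    print(candidates[0])
```

Tuning each dimension in isolation would explore 6 and 4 configurations separately, whereas the joint space has 6 × 4 = 24. Only the joint space can expose interactions between a system setting and a hyperparameter, at the cost of a search space that grows multiplicatively.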