Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks

Marcu, Ovidiu-Cristian; Costan, Alexandru; Antoniu, Gabriel; Pérez, María S.

doi:10.1109/cluster.2016.22

Cited by 64 publications

(50 citation statements)

References 17 publications

“…The BigDataBench [8] suite contains 19 scenarios covering a broad range of applications and diverse data sets. Marcu et al. [9] performed an extensive analysis of the differences between Apache Spark and Apache Flink on iterative workloads. The above benchmarks either adopt batch processing systems and metrics used in batch processing systems or apply the batch-based metrics on SDPSs.…”

Section: Related Work (mentioning)

confidence: 99%

Benchmarking Distributed Stream Data Processing Systems

Karimov, Rabl, Katsifodimos, et al. 2018

2018 IEEE 34th International Conference on Data Engineering (ICDE)

The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear gap of detailed analyses of the systems' performance characteristics. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate the performance of three widely used SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, which are the basic type of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use-cases inspired by the online gaming industry. The contribution of our work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we carefully separate the system under test and driver, in order to correctly represent the open world model of typical stream processing deployments and can, therefore, measure system performance under realistic conditions. Third, we build the first benchmarking framework to define and test the sustainable performance of streaming systems. Our detailed evaluation highlights the individual characteristics and use-cases of each system.

“…Apache Hadoop has been used in various big data processing fields but cannot meet the real-time computing tasks and requirements [44,45]. Apache Storm only supports stream processing [46], Apache Spark simulates stream processing based on batch processing [47], and Apache Flink is entirely based on stream processing and simulates batch processing through stream processing [48]. Apache Flink can implement both stream processing and batch processing via a single solution, which can help prevent duplication of codes during development.…”

Section: Batch and Stream Computing (mentioning)

confidence: 99%

Microservice-Based Platform for Space Situational Awareness Data Analytics

Lan et al. 2020

International Journal of Aerospace Engineering

The development, deployment, and maintenance of the current space situational awareness (SSA) information system have become increasingly complex. However, researchers cannot flexibly and conveniently apply the research results to practical applications due to the lack of basic research platforms for SSA. Inspired by X as a Service (XaaS), we propose the microservice-based platform for SSA data analytics to provide a scaffold-like platform for researchers. Based on microservice, the architecture for this platform is proposed to meet the requirements of flexible development and loosely coupled deployment. To facilitate the use of the platform, the hybrid data service layer is established to provide basic data for research and the functional service layer is designed to provide services for clients and applications. Due to the massive data processing requirements, the data analysis architecture and processing model, which can easily integrate various user-defined algorithms and significantly improve the computational efficiency, are proposed based on the Lambda architecture. To verify the platform's effectiveness, two cases are established and implemented. The results show that this platform can provide a convenient, flexible, and efficient platform for the requirements of algorithm integration, experiment, and data display from users and researchers.

“…We mention that most of the above presented surveys are limited in terms of both the evaluated features of Big Data frameworks and the number of considered frameworks. For example, in [64], only stream processing frameworks are considered while in [16] [54] [24] [40], only batch processing frameworks are considered. We highlight that our experimental survey differs from the above presented works by the fact that it compares the studied frameworks in the case of both batch and stream processing.…”

Section: Related Work (mentioning)

confidence: 99%

An experimental survey on big data frameworks

Inoubli, Aridhi, Mezni, et al. 2018

Future Generation Computer Systems

Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a buzzword referring to the processing of massive volumes of (unstructured) data. Recently proposed frameworks for Big Data applications help to store, analyze and process the data. In this paper, we discuss the challenges of Big Data and we survey existing Big Data frameworks. We also present an experimental evaluation and a comparative study of the most popular Big Data frameworks with several representative batch and iterative workloads. This survey is concluded with a presentation of best practices related to the use of studied frameworks in several application domains such as machine learning, graph processing and real-world applications.
