More and more use cases require fast, accurate, and reliable processing of large volumes of data. To do this, a distributed stream processing framework is needed which can distribute the load over several machines. In this work, we study and benchmark the scalability of stream processing jobs in four popular frameworks: Flink, Kafka Streams, Spark Streaming, and Structured Streaming. Besides that, we determine the factors that influence the performance and efficiency of scaling processing jobs with distinct characteristics. We evaluate horizontal, as well as vertical scalability. Our results show how the scaling efficiency is impacted by many factors including the initial cluster layout and direction of scaling, the pipeline design, the framework design, resource allocation, and data characteristics. Finally, we give some recommendations on how practitioners should undertake to scale their clusters.
Distributed stream processing frameworks have gained widespread adoption in the last decade because they abstract away the complexity of parallel processing. One of their key features is built-in fault tolerance. In this work, we dive deeper into the implementation, performance, and efficiency of this critical feature for four state-of-the-art frameworks. We include the established Spark Streaming and Flink frameworks and the more novel Spark Structured Streaming and Kafka Streams frameworks. We test the behavior under different types of faults and settings: master failure with and without high-availability setups, driver failures for Spark frameworks, worker failure with or without exactly-once semantics, application and task failures. We highlight differences in behavior during these failures on several aspects, e.g., whether there is an outage, downtime, recovery time, data loss, duplicate processing, accuracy, and the cost and behavior of different message delivery guarantees. Our results highlight the impact of framework design on the speed of fault recovery and explain how different use cases may benefit from different approaches. Due to their task-based scheduling approach, the Spark frameworks can recover quickly and in most cases without necessitating an application restart. Kafka Streams has the shortest downtime, while Flink can offer end-to-end exactly-once semantics at a low cost.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.