2015
DOI: 10.14778/2824032.2824063

Building a replicated logging system with Apache Kafka

Abstract: Apache Kafka is a scalable publish-subscribe messaging system whose core architecture is a distributed commit log. It was originally built at LinkedIn as its centralized event-pipelining platform for online data integration tasks. Over the past years of developing and operating Kafka, we have extended its log-structured architecture into a replicated logging backbone for a much wider range of applications in distributed environments. In this abstract, we talk about our design and engineering experience to replicate …
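The "distributed commit log" the abstract centers on can be illustrated with a minimal sketch. The names here (`CommitLog`, `append`, `read_from`) are illustrative only, not Kafka's actual API; the point is the append-only, offset-addressed structure that consumers and follower replicas read sequentially.

```python
class CommitLog:
    """An append-only sequence of records addressed by integer offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Each append is assigned the next monotonically increasing offset.
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        # Consumers (or follower replicas) read sequentially from an offset,
        # so replication reduces to replaying the log from a known position.
        return self._records[offset:]


log = CommitLog()
assert log.append("event-a") == 0
assert log.append("event-b") == 1
assert log.read_from(1) == ["event-b"]
```

Because every record has a stable offset, a replica that falls behind can always catch up by re-reading from its last applied position, which is what makes the log a natural replication backbone.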

Cited by 125 publications (51 citation statements)
References 2 publications
“…We communicate with the engine using the Pub/Sub messaging service. 9 Specifically, we deploy a topology composed of four operators: (1) a Pub/Sub subscriber that reads elements from an input topic; (2) a window operator; (3) a reducer that concatenates the content of each window into an output string; (4) the Pub/Sub publisher that writes the results of the reducer on an output topic. We submit elements by publishing them on the input topic and we read the results from the output topic.…”
Section: Google Cloud Dataflow
confidence: 99%
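The four-operator topology described in this citation statement can be simulated in plain Python. Here topics are modelled as lists and windows are fixed-size; a real deployment would use the Pub/Sub client library and a streaming engine, so this is a sketch of the dataflow only.

```python
def run_topology(input_topic, window_size):
    """Simulate: (1) subscriber -> (2) window -> (3) reducer -> (4) publisher."""
    output_topic = []
    window = []
    for element in input_topic:          # (1) subscriber reads each element
        window.append(element)
        if len(window) == window_size:   # (2) window operator closes a window
            result = "".join(window)     # (3) reducer concatenates the window
            output_topic.append(result)  # (4) publisher writes to the output topic
            window = []
    return output_topic


assert run_topology(["a", "b", "c", "d"], 2) == ["ab", "cd"]
```

Elements are "published" by appending to `input_topic`, and results are "read" from `output_topic`, mirroring the experiment's setup.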
“…In the presence of out-of-order elements that alter the values of some results produced in the past, the engine retracts the previous output from the mutable dataset and substitutes it with the newly computed values. This is the case for the Kafka Streams system [9].…”
Section: Management Of Out-of-order Elements
confidence: 99%
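The retraction semantics described above can be sketched with a per-window running sum: when a late element changes a window's already-emitted result, the engine withdraws the old value and emits the corrected one. The `("+", …)` / `("-", …)` change tags are illustrative, not any engine's actual wire format.

```python
from collections import defaultdict


class RetractingSum:
    """Per-window sum that retracts stale results when late elements arrive."""

    def __init__(self):
        self.totals = defaultdict(int)

    def process(self, window, value):
        changes = []
        if self.totals[window]:
            # Retract the result previously emitted for this window.
            changes.append(("-", window, self.totals[window]))
        self.totals[window] += value
        changes.append(("+", window, self.totals[window]))
        return changes


engine = RetractingSum()
assert engine.process(0, 5) == [("+", 0, 5)]
# A late element for window 0 arrives: the old result is retracted,
# and the corrected result is emitted in its place.
assert engine.process(0, 3) == [("-", 0, 5), ("+", 0, 8)]
```

Downstream consumers apply the "-" change to remove the stale value from their mutable dataset before applying the "+" replacement, exactly the substitution the statement describes.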
“…32 Work similar to our approach is a replicated logging system built on publish-subscribe messaging middleware. 33 Although replicated logging is similar to our CSMs, the difference is that our approach is based on stream processing rather than event-… The elements in CSM are explained as follows.…”
Section: Stream-based Data Replication
confidence: 99%
“…The real-time analysis system uses Flume [4] to monitor /usr/local/data/flume_sources/data-1 for newly generated data; each log record is collected in real time and saved in the Kafka message system, from which it is consumed by the Storm system. Meanwhile, the consumption position is recorded in the ZooKeeper cluster, which means that even if Kafka goes down, the last recorded position can be found after a restart and consumption can resume from the Kafka broker [5]. Because consuming and recording the position are not atomic (whether the order is consume-before-record or record-before-consume), some data loss or repeated consumption can occur when Kafka goes down, or when a similar failure happens after a message is consumed but before its position is recorded in ZooKeeper.…”
Section: Real-time Data Writing To Linux
confidence: 99%
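The non-atomicity problem this statement raises can be made concrete: the order in which a consumer processes a message and records its offset (e.g. in ZooKeeper) determines the failure mode. A sketch, with the crash simulated at a fixed record and recovery restarting from the last committed offset:

```python
def consume_with_crash(records, commit_first):
    """Consume records, crashing once between the two non-atomic steps at
    index 1, then restarting from the last committed offset."""
    processed, committed = [], 0
    crashed = False
    while True:
        try:
            for i in range(committed, len(records)):
                if commit_first:
                    committed = i + 1          # record offset BEFORE consuming
                    if i == 1 and not crashed:
                        crashed = True
                        raise RuntimeError()   # crash: offset saved, message never processed
                processed.append(records[i])
                if not commit_first:
                    if i == 1 and not crashed:
                        crashed = True
                        raise RuntimeError()   # crash: message processed, offset never saved
                    committed = i + 1          # record offset AFTER consuming
            return processed
        except RuntimeError:
            continue                           # restart from last committed offset


msgs = ["m0", "m1", "m2"]
# Record-before-consume: the in-flight message is lost (at-most-once).
assert consume_with_crash(msgs, commit_first=True) == ["m0", "m2"]
# Consume-before-record: nothing is lost, but the message is reprocessed
# after the restart (at-least-once).
assert consume_with_crash(msgs, commit_first=False) == ["m0", "m1", "m1", "m2"]
```

Neither ordering alone gives exactly-once delivery, which is why the statement observes that "few data loss or repeat consumption problems will occur" under crashes; removing both failure modes requires making processing and offset commit atomic or idempotent.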