The emergence of data stream applications has posed a number of new challenges to existing infrastructures, processing engines and programming models. In this sense, high-level interfaces, which encapsulate algorithmic aspects in pattern-based constructions, have considerably reduced the development and parallelization effort for this type of application. An example of a parallel pattern interface is GrPPI, a generic high-level C++ library that acts as a layer between developers and existing parallel programming frameworks, such as C++ threads, OpenMP and Intel TBB. In this paper, we complement the basic patterns supported by GrPPI with the new stream operators Split-Join and Window, and with the advanced parallel patterns Stream-Pool, Windowed-Farm and Stream-Iterator, for the aforementioned back ends. Thanks to these new stream operators, complex compositions of streaming patterns can be expressed. The collection of advanced patterns, in turn, allows users to tackle domain-specific applications, ranging from evolutionary to real-time computing, in which compositions of basic patterns cannot fully mimic the algorithmic behavior of the original sequential codes. The experimental evaluation of the new advanced patterns and stream operators on a set of domain-specific use cases, using different back ends and pattern-specific parameters, reports remarkable performance gains with respect to the sequential versions. Additionally, we demonstrate the benefits of the GrPPI pattern interface from the usability, flexibility and maintainability points of view.
Introduction

With the rise of big data, services and instruments such as mobile devices, social media and sensor networks are constantly producing huge amounts of data [22]. This extreme growth of on-line generated data poses profound challenges to existing processing engines, programming models and infrastructures. In this sense, existing big data application models, such as MapReduce, have become popular for batch processing. Nevertheless, these models cannot fulfill the strict low-latency and high-throughput requirements demanded by data stream processing (DaSP) applications. The DaSP paradigm, in contrast, is better suited to these real-time requirements [8]. Basically, the DaSP model considers that the data are neither fully available nor stored on disk or in memory; rather, they arrive continuously from one or more streams. The idea behind this paradigm is that data have to be processed as soon as they are received.