Abstract: [Background] Nowadays, data volume and velocity are growing massively in many types of systems. This introduces new requirements for infrastructure and applications that must handle streams of data with low latency and high throughput. Testing applications that process such data streams has become a significant challenge for engineers. Companies are adopting different approaches to dealing with this issue: some have developed their own testing solutions, while others have adopted a combination of existing tes…
“…Conducting semi-structured expert interviews is a well-established procedure, originally from psychology and social science, and proven for the exploratory examination of requirements in both the contexts of software engineering [20,14] and ML [3,7]. Compared to fully structured interviews, there is some flexibility regarding the course of a conversation, which allows the experts to express their very own perception without being primed by overly explicit questions.…”
Section: Semi-structured Interviews
confidence: 99%
“…Within the ML community, there is a broad understanding of what ML workflows look like in reality. Numerous practical blog posts and textbooks aim to make these abstract conceptions tangible (e.g., [22,20,5,1]). Of course, the depictions vary in their details.…”
Machine Learning (ML) is ubiquitously on the advance. Like many domains, Earth Observation (EO) also increasingly relies on ML applications, where ML methods are applied to process vast amounts of heterogeneous and continuous data streams to answer socially and environmentally relevant questions. However, developing such ML-based EO systems remains challenging: development processes and the employed workflows are often barely structured and poorly reported. The application of ML methods and techniques is considered opaque, and this lack of transparency contradicts the responsible development of ML-based EO applications. To improve this situation, a better understanding of the current practices and engineering-related challenges in developing ML-based EO applications is required. In this paper, we report observations from an exploratory study in which five experts shared their views on ML engineering in semi-structured interviews. We analysed these interviews with coding techniques commonly applied in empirical software engineering. The interviews provide informative insights into the practical development of ML applications and reveal several engineering challenges. In addition, interviewees participated in a novel workflow sketching task, which provided a tangible reflection of implicit processes. Overall, the results confirm a gap between theoretical conceptions and real practices in ML development, even though workflows were sketched abstractly, in textbook-like form. The results pave the way for a large-scale investigation of requirements for ML engineering in EO.
“…Test activities are essential for quality assurance of DSP programs and for improving their dependability. This importance is fully emphasized in a recent empirical study [11]. Software testing is a significant cost during the software development life cycle, and there is a strong demand for efficient test solutions that reduce this expense.…”
Section: Introduction
confidence: 99%
“…As discussed in [13], the common industry practice for testing big data programs is to run them locally with randomly sampled data. An empirical study presented by Vianna et al. [11] demonstrates that difficulty in generating test data is one of the most frequent problems when designing DSP programs. In practice, test data for DSP programs mainly comes from three sources [11], i.e., replaying historical data, mirroring real-time production data, and randomly generating synthetic data.…”
Section: Introduction
confidence: 99%
“…An empirical study presented by Vianna et al. [11] demonstrates that difficulty in generating test data is one of the most frequent problems when designing DSP programs. In practice, test data for DSP programs mainly comes from three sources [11], i.e., replaying historical data, mirroring real-time production data, and randomly generating synthetic data. However, real data, including historical and real-time production data, may be privacy-sensitive, which hinders developers from accessing it conveniently.…”
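The third source above, randomly generated synthetic data, can be illustrated with a minimal sketch. The function name and record shape are hypothetical, not taken from the cited study; the sketch simply produces timestamped records and perturbs their arrival order, mimicking the source-data reordering that DSP programs must tolerate:

```python
import random

def synthetic_stream(n, max_delay=3.0, seed=42):
    """Generate n synthetic (event_time, value) records, then return them
    in a possibly out-of-order arrival sequence: each record's arrival is
    delayed by a random amount up to max_delay time units."""
    rng = random.Random(seed)
    events = [(t, rng.randint(0, 100)) for t in range(n)]
    # Sorting by perturbed timestamps simulates network/source reordering.
    return sorted(events, key=lambda e: e[0] + rng.uniform(0.0, max_delay))

stream = synthetic_stream(10)
print(stream)  # arrival order may differ from event-time order
```

Because the generator avoids real historical or production data entirely, it sidesteps the privacy concerns mentioned above, at the cost of realism.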
Adoption of distributed stream processing (DSP) systems such as Apache Flink for real-time big data processing is increasing. However, DSP programs are prone to bugs, especially when a programmer neglects some DSP features (e.g., source data reordering), which motivates the development of approaches for testing and verification. In this paper, we focus on the test data generation problem for DSP programs. Currently, no approach generates test data for DSP programs with both high path coverage and coverage of different stream reordering situations. We present a novel solution, SPOT (i.e., Stream Processing Program Test), to achieve these two goals simultaneously. First, SPOT generates a set of individual test data representing each path of a DSP program through symbolic execution. Then, SPOT composes these independent data into various time series data (a.k.a. streams) with diverse reorderings. Finally, a test can be performed by continuously feeding these streams to the DSP program. To automatically support symbolic analysis, we also developed JPF-Flink, a JPF (i.e., Java Pathfinder) extension that coordinates the execution of Flink programs. We present four case studies to illustrate that: (1) SPOT can support symbolic analysis for the commonly used DSP operators; (2) test data generated by SPOT can achieve high JDU (i.e., Joint Dataflow and UDF) path coverage more efficiently than two recent DSP testing approaches; (3) test data generated by SPOT can trigger software failures more easily than those two approaches; and (4) the data randomly generated by those two techniques is highly skewed in terms of stream reordering, as measured by an entropy metric, whereas test data from SPOT is evenly distributed.
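The entropy metric in point (4) can be sketched as follows. The exact definition used by SPOT is not given here, so this is an illustrative Shannon entropy over the distribution of distinct arrival orders observed in a batch of test streams; the function name is an assumption:

```python
import math
from collections import Counter

def ordering_entropy(streams):
    """Shannon entropy (in bits) over the distribution of distinct
    arrival orders. Higher entropy = reorderings covered more evenly."""
    counts = Counter(tuple(s) for s in streams)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Skewed coverage: one ordering dominates (as with random generation).
skewed = [(1, 2, 3)] * 9 + [(2, 1, 3)]
# Even coverage: both orderings equally represented (SPOT's goal).
even = [(1, 2, 3)] * 5 + [(2, 1, 3)] * 5
print(ordering_entropy(skewed))  # lower entropy
print(ordering_entropy(even))    # maximal entropy for two orderings: 1.0
```

Under this reading, a generator whose streams spread uniformly over the possible reorderings maximizes the metric, which matches the abstract's claim that SPOT's test data is "even" while randomly generated data is skewed.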