Abstract: [Background] Nowadays, data volume and velocity are growing massively in many types of systems. This introduces new requirements for infrastructure and applications that must handle streams of data with low latency and high throughput. Testing applications that process such data streams has become a significant challenge for engineers. Companies are adopting different approaches to dealing with this issue: some have developed their own testing solutions, while others have adopted a combination of existing tes…
“…Conducting semi-structured expert interviews is a well-established procedure, originally from psychology and social science, and proven for the exploratory examination of requirements in both the contexts of software engineering [20,14] and ML [3,7]. Compared to fully structured interviews, there is some flexibility regarding the course of a conversation, which allows the experts to express their very own perception without being primed by overly explicit questions.…”
Section: Semi-structured Interviews
confidence: 99%
“…Within the ML community, there is a broad understanding of what ML workflows look like in reality. Numerous practical blog posts and textbooks aim to make these abstract conceptions tangible (e.g., [22,20,5,1]). Of course, the depictions vary in their details.…”
Machine Learning (ML) is ubiquitously on the advance. Like many domains, Earth Observation (EO) also increasingly relies on ML applications, where ML methods are applied to process vast amounts of heterogeneous and continuous data streams to answer socially and environmentally relevant questions. However, developing such ML-based EO systems remains challenging: development processes and the employed workflows are often barely structured and poorly reported. The application of ML methods and techniques is considered opaque, and this lack of transparency contradicts the responsible development of ML-based EO applications. To improve this situation, a better understanding of the current practices and engineering-related challenges in developing ML-based EO applications is required. In this paper, we report observations from an exploratory study in which five experts shared their views on ML engineering in semi-structured interviews. We analysed these interviews with coding techniques commonly applied in empirical software engineering. The interviews provide informative insights into the practical development of ML applications and reveal several engineering challenges. In addition, interviewees participated in a novel workflow sketching task, which provided a tangible reflection of implicit processes. Overall, the results confirm a gap between theoretical conceptions and real practices in ML development, even though workflows were sketched abstractly, in textbook-like form. The results pave the way for a large-scale investigation of requirements for ML engineering in EO.
“…Test activities are essential for quality assurance of DSP programs and for improving their dependability. This importance is fully emphasized in a recent empirical study [11]. Software testing is a significant cost during the software development life cycle, and there is a strong demand for efficient test solutions that reduce this expense.…”
Section: Introduction
confidence: 99%
“…As discussed in [13], the common industry practice for testing big data programs is to run them locally with randomly sampled data. An empirical study presented by Vianna et al. [11] demonstrates that difficulty in generating test data is one of the most frequent problems when designing DSP programs. In practice, test data for DSP programs mainly comes from three sources [11], i.e., replaying historical data, mirroring real-time production data, and randomly generating synthetic data.…”
Section: Introduction
confidence: 99%
“…An empirical study presented by Vianna et al. [11] demonstrates that difficulty in generating test data is one of the most frequent problems when designing DSP programs. In practice, test data for DSP programs mainly comes from three sources [11], i.e., replaying historical data, mirroring real-time production data, and randomly generating synthetic data. However, real data, including historical and real-time production data, may be privacy-sensitive, which hinders developers from accessing it conveniently.…”
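The third source above, randomly generated synthetic data, can be illustrated with a minimal sketch. The function name and record shape are hypothetical, not taken from the cited study; the sketch simply produces timestamped records and perturbs their arrival order, mimicking the source-data reordering that DSP programs must tolerate:

```python
import random

def synthetic_stream(n, max_delay=3.0, seed=42):
    """Generate n synthetic (event_time, value) records, then return them
    in a possibly out-of-order arrival sequence: each record's arrival is
    delayed by a random amount up to max_delay time units."""
    rng = random.Random(seed)
    events = [(t, rng.randint(0, 100)) for t in range(n)]
    # Sorting by perturbed timestamps simulates network/source reordering.
    return sorted(events, key=lambda e: e[0] + rng.uniform(0.0, max_delay))

stream = synthetic_stream(10)
print(stream)  # arrival order may differ from event-time order
```

Because the generator avoids real historical or production data entirely, it sidesteps the privacy concerns mentioned above, at the cost of realism.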
Adoption of distributed stream processing (DSP) systems such as Apache Flink for real-time big data processing is increasing. However, DSP programs are prone to bugs, especially when a programmer neglects some DSP features (e.g., source data reordering), which motivates the development of approaches for testing and verification. In this paper, we focus on the test data generation problem for DSP programs. Currently, no approach generates test data for DSP programs with both high path coverage and coverage of different stream reordering situations. We present a novel solution, SPOT (i.e., Stream Processing Program Test), to achieve these two goals simultaneously. First, SPOT generates a set of individual test data representing each path of a DSP program through symbolic execution. Then, SPOT composes these independent data into various time series data (a.k.a. streams) with diverse reorderings. Finally, a test can be performed by continuously feeding these streams to the DSP program. To automatically support symbolic analysis, we also developed JPF-Flink, a JPF (i.e., Java Pathfinder) extension that coordinates the execution of Flink programs. We present four case studies to illustrate that: (1) SPOT can support symbolic analysis for the commonly used DSP operators; (2) test data generated by SPOT can achieve high JDU (i.e., Joint Dataflow and UDF) path coverage more efficiently than two recent DSP testing approaches; (3) test data generated by SPOT can trigger software failures more easily than those two approaches; and (4) the data randomly generated by those two techniques is highly skewed in terms of stream reordering, as measured by an entropy metric, whereas test data from SPOT is evenly distributed.
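The entropy metric in point (4) can be sketched as follows. The exact definition used by SPOT is not given here, so this is an illustrative Shannon entropy over the distribution of distinct arrival orders observed in a batch of test streams; the function name is an assumption:

```python
import math
from collections import Counter

def ordering_entropy(streams):
    """Shannon entropy (in bits) over the distribution of distinct
    arrival orders. Higher entropy = reorderings covered more evenly."""
    counts = Counter(tuple(s) for s in streams)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Skewed coverage: one ordering dominates (as with random generation).
skewed = [(1, 2, 3)] * 9 + [(2, 1, 3)]
# Even coverage: both orderings equally represented (SPOT's goal).
even = [(1, 2, 3)] * 5 + [(2, 1, 3)] * 5
print(ordering_entropy(skewed))  # lower entropy
print(ordering_entropy(even))    # maximal entropy for two orderings: 1.0
```

Under this reading, a generator whose streams spread uniformly over the possible reorderings maximizes the metric, which matches the abstract's claim that SPOT's test data is "even" while randomly generated data is skewed.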