This paper proposes a mutation testing approach for big data processing programs that follow a data flow model, such as those implemented on top of Apache Spark. Mutation testing is a fault-based technique that relies on fault simulation: programs are modified to create faulty versions called mutants. Mutants are created by mutation operators that simulate specific, well-identified faults. A testing process must be able to signal the faults in mutants and thereby prevent ill behaviours in a program. We propose a set of mutation operators designed for Spark programs, which are characterized by a data flow and data processing operations. These operators model changes in the data flow and in the operations to simulate faults that take Spark program characteristics into account. We performed manual experiments to evaluate the proposed mutation operators in terms of cost and effectiveness, and show that they can contribute to the testing process and to the construction of reliable Spark programs.
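To make the idea concrete, the sketch below illustrates what a data-flow mutation operator does, using plain Python lists in place of Spark datasets. The pipeline and the two operators (`mutant_drop_filter`, `mutant_negate_predicate`) are hypothetical stand-ins for illustration, not the paper's actual operator set.

```python
# A minimal data-flow program: filter a dataset, then map over it.
def pipeline(data, transform, predicate):
    return [transform(x) for x in data if predicate(x)]

# Mutant: the filter stage is deleted from the data flow.
def mutant_drop_filter(data, transform, predicate):
    return [transform(x) for x in data]

# Mutant: the filter predicate is negated.
def mutant_negate_predicate(data, transform, predicate):
    return [transform(x) for x in data if not predicate(x)]

# A test "kills" a mutant when the mutant's output differs from the
# original program's output on that test input.
data = [1, 2, 3, 4]
double_evens = (lambda x: x * 10, lambda x: x % 2 == 0)
original = pipeline(data, *double_evens)            # [20, 40]
assert mutant_drop_filter(data, *double_evens) != original
assert mutant_negate_predicate(data, *double_evens) != original
```

A test set is adequate for these two mutants when, as above, at least one input distinguishes each mutant from the original program.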
We propose a new model for data processing programs. Our model generalizes the data flow programming style implemented by systems such as Apache Spark, DryadLINQ, Apache Beam and Apache Flink. The model uses directed acyclic graphs (DAGs) to represent the main aspects of data flow-based systems, namely operations over data (filtering, aggregation, join) and program execution defined by data dependences between operations. We use Monoid Algebra to model operations over distributed, partitioned datasets and Petri Nets to represent the data/control flow. This allows the specification of a data processing program to be agnostic of the target Big Data processing system. Our model has been used to design mutation testing operators for big data processing programs. These operators have been implemented in the testing environment TRANSMUT-Spark.
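The DAG view described above can be sketched as follows: nodes are operations, edges are data dependences, and the program executes in topological order. This toy example mirrors only the DAG aspect of the model, not its Petri-net or Monoid-Algebra formalization; the node names and data are invented.

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on (data dependences)
dag = {
    "load": set(),
    "filter": {"load"},
    "aggregate": {"filter"},
    "join": {"filter", "load"},
}

# Each operation receives the outputs of the nodes it depends on.
operations = {
    "load": lambda deps: [1, 2, 3, 4],
    "filter": lambda deps: [x for x in deps["load"] if x % 2 == 0],
    "aggregate": lambda deps: [sum(deps["filter"])],
    "join": lambda deps: [(x, x in deps["filter"]) for x in deps["load"]],
}

# Execute the data flow in a dependency-respecting order.
results = {}
for node in TopologicalSorter(dag).static_order():
    results[node] = operations[node]({d: results[d] for d in dag[node]})
```

Because the graph, not the host system, defines execution order, the same specification could be mapped onto any of the engines named above.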
Background: BETA (B Based Testing Approach) is a tool-supported approach to generate test cases from B-method specifications through the application of input space partitioning and logical coverage criteria. The BETA tool automates the whole process, from the design of abstract test cases to the generation of executable test scripts. Methods: In this paper, we present an empirical study that was performed to evaluate and contribute to the development of BETA. The study evaluated the applicability of BETA to problems with different characteristics and used code coverage and mutation analysis techniques to measure the quality of the generated test cases. The study was carried out in several rounds, and the results of each round were used as a reference for improving the approach and its supporting tool. Results: These case studies were relevant not only for evaluating BETA but also for assessing how different features affect the usability of the approach and the quality of the test cases, and for comparing the quality of test cases generated using different coverage criteria.
Conclusions: The results of this study showed that (1) the test scenarios BETA generates for the different criteria follow theoretical expectations in terms of criteria subsumption; (2) BETA's implementation of the logical criteria generates test sets that are more efficient, with respect to code and mutation coverage, than those of the input space partitioning criteria; (3) it is important to go beyond the strict requirements of the criteria by adding some additional variation (randomization) of the input data; and (4) the algorithms designed to combine test requirements into test cases need to deal carefully with infeasible (logically unsatisfiable) combinations.
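As a rough illustration of input space partitioning and of finding (3), the sketch below splits an integer input's domain into blocks and draws one test value per block, randomizing within each block rather than always picking a fixed boundary value. The partition names and blocks are invented for illustration; this is not BETA's actual algorithm.

```python
import random

# Partition the domain of an integer input into disjoint blocks.
partitions = {
    "negative": range(-100, 0),
    "zero": range(0, 1),
    "positive": range(1, 101),
}

def one_test_per_block(partitions, rng):
    """Satisfy each-block coverage with a randomized representative,
    instead of reusing the same boundary value every run."""
    return {name: rng.choice(list(block)) for name, block in partitions.items()}

rng = random.Random(42)  # seeded for reproducible test suites
tests = one_test_per_block(partitions, rng)
```

Varying the representative across runs exercises more of each block over time, which is the kind of extra variation finding (3) recommends.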
This paper proposes TRANSMUT-SPARK, a tool for automating mutation testing of Big Data processing code in Spark programs. Apache Spark is an engine for Big Data analytics and processing that hides the inherent complexity of parallel Big Data programming. Nonetheless, programmers must cleverly combine Spark built-in functions within their programs and guide the engine to use the right data management strategies, to exploit the computational resources required by Big Data processing and avoid substantial production losses. Many programming details in Spark data processing code are error-prone and must be tested correctly and automatically. This paper explores the application of mutation testing to Spark programs: a fault-based testing technique that relies on fault simulation to evaluate and design test sets. TRANSMUT-SPARK tests Spark programs by automating the most laborious steps and fully executing the mutation testing process. The paper describes how TRANSMUT-SPARK automates the mutant generation, test execution, and adequacy analysis phases of mutation testing. It also discusses the results of experiments to validate the tool, along with its scope and limitations.
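The adequacy-analysis phase mentioned above can be sketched as the classic mutation-score loop: run every test against every mutant and count the mutants that some test kills. The program and mutants below are toy stand-ins, not TRANSMUT-SPARK's actual mutant set.

```python
# Original program under test: sum the positive elements.
def original(xs):
    return sum(x for x in xs if x > 0)

mutants = [
    lambda xs: sum(x for x in xs if x >= 0),  # relational-operator mutant:
                                              # equivalent here, since zeros
                                              # add nothing to a sum
    lambda xs: sum(x for x in xs if x < 0),   # negated-predicate mutant
    lambda xs: sum(xs),                       # filter-deletion mutant
]

tests = [[1, -1, 2], [0, 5], [-3, -4]]

def mutation_score(original, mutants, tests):
    """Fraction of mutants killed by at least one test."""
    killed = sum(any(m(t) != original(t) for t in tests) for m in mutants)
    return killed / len(mutants)

score = mutation_score(original, mutants, tests)  # 2/3: one mutant survives
```

The surviving first mutant illustrates the equivalent-mutant problem that makes adequacy analysis laborious and worth automating.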