2016
DOI: 10.1007/s11219-016-9339-1

Separating passing and failing test executions by clustering anomalies

Abstract: Developments in the automation of test data generation have greatly improved the efficiency of the software testing process, but the so-called oracle problem (deciding the pass or fail outcome of a test execution) is still primarily an expensive and error-prone manual activity. We present an approach to automatically detect passing and failing executions using cluster-based anomaly detection on dynamic execution data based on, firstly, just a system's input/output pairs and, secondly, amalgamations of input/output pairs…
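To make the clustering idea concrete, here is a minimal sketch of the general approach the abstract describes: encode each execution's input/output pair as a feature vector, cluster the vectors, and flag members of smaller-than-average clusters as suspected failures. The numeric encoding, the choice of k-means, the cluster count k, and the synthetic data are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch: cluster-based anomaly detection over test executions.
# All concrete choices below (features, k-means, k, threshold) are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic "executions": each row encodes one test's input/output pair as a
# numeric feature vector. Most executions behave alike; a few are anomalous.
normal = rng.normal(loc=0.0, scale=1.0, size=(95, 4))
anomalous = rng.normal(loc=6.0, scale=0.5, size=(5, 4))
executions = np.vstack([normal, anomalous])

# Cluster the executions; k is a tunable assumption here.
k = 8
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(executions)

# Flag executions that fall into smaller-than-average clusters as suspected
# failures, mirroring the "small clusters are failure-rich" observation.
sizes = np.bincount(labels, minlength=k)
avg_size = executions.shape[0] / k
suspect = np.flatnonzero(sizes[labels] < avg_size)
print(f"{len(suspect)} executions flagged for manual inspection: {suspect}")
```

Under this scheme a tester only manually checks the flagged executions, which is the labour saving the approach aims for.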

Cited by 22 publications (33 citation statements)
References 21 publications
“…In an earlier study [2] we explored a range of clustering algorithms using either just test inputs and outputs, or inputs, outputs and execution traces, and found that small (less than average sized) clusters contained more than 60% of failures (and often a substantially higher proportion). Moreover, as well as having a higher failure density they also contained a spread of failures in the cases where there were multiple faults in the programs.…”
Section: Overview of a Test Classification Strategy (mentioning)
confidence: 99%
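The small-cluster observation quoted above suggests a simple check when ground-truth verdicts are available: measure what share of all failures lands in below-average-sized clusters. A rough sketch, with invented cluster assignments and verdicts standing in for real execution data:

```python
# Sketch: validate the "small clusters hold most failures" observation.
# Cluster ids and pass/fail verdicts below are made-up stand-ins.
import numpy as np

cluster_of = np.array([0]*40 + [1]*35 + [2]*15 + [3]*6 + [4]*4)  # per-test cluster id
failed = np.zeros(100, dtype=bool)
failed[[10, 78, 92, 93, 96, 97, 98, 99]] = True                  # hypothetical failures

k = cluster_of.max() + 1
sizes = np.bincount(cluster_of, minlength=k)
small = sizes < sizes.mean()            # "less than average sized" clusters

in_small = small[cluster_of]            # per-test flag: lies in a small cluster
share = failed[in_small].sum() / failed.sum()
print(f"Small clusters: {np.flatnonzero(small)}")
print(f"Share of all failures captured by small clusters: {share:.0%}")
```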
“…In this case derived oracles are commonly used to decrease the number of tests to manually examine or to ease the validation. For example, existing tests can be used to generate more meaningful tests [25], similarity between executions can be used to pinpoint suspicious asserts [32], or clustering techniques can be used to group potentially faulty tests [1]. Moreover, if there are multiple versions of the implementation (e.g., regression testing [47] or different implementations for the same specification [29]), tests generated from one version could be executed on the other one.…”
Section: Related Work (mentioning)
confidence: 99%
“…Thus the question that motivated, and served as a basis of, our research is the following: How do developers perform in using the tests generated from code to detect faults and decide whether the implementation is correct?¹ This question is mainly motivated by the fact that the actual fault-finding capability of white-box test generator tools could be much lower than already reported. (¹ Note that if a test generated from a faulty implementation encodes a fault but passes, then the test can be considered faulty as well. Therefore, classifying the tests as faulty or correct could reveal a faulty implementation.)…”
Section: Introduction (mentioning)
confidence: 99%
“…Previous work by the authors has explored the use of machine learning techniques to support the automatic classification of test outcomes as either passing or failing, thereby providing a form of test oracle [1], [2], [3], but their relative performance, strengths, and weaknesses have not been statistically analysed and compared with existing techniques. The aim of this study is to investigate and extensively evaluate these approaches to test oracle construction in terms of effectiveness when they are applied to medium-sized subject systems.…”
Section: Introduction (mentioning)
confidence: 99%
“…The aim of this study is to investigate and extensively evaluate these approaches to test oracle construction in terms of effectiveness when they are applied to medium-sized subject systems. The empirical evaluation in this paper can be summarised as follows: (1) statistical verification is applied to two different sets of experimental results (in the first experiment, the input to the machine learning techniques consisted of just the test case inputs along with their associated outputs, and the second experiment extended this by adding to the input/output pairs their corresponding execution traces); (2) new results are presented that evaluate the effectiveness of our machine learning techniques by calculating the accuracy, recall, and false positive rate; (3) a comparison between existing techniques from the specification mining domain (the data invariant detector Daikon [4]) and machine learning techniques is reported (Daikon was selected because it was the most effective oracle from a set of dynamic analysis techniques explored in a previous study [8]). The study is useful for testers because they need to be able to assess the features offered by these oracles, and also for the developers of oracle-based approaches to further understand the strengths and weaknesses of these different techniques and how they can be developed.…”
Section: Introduction (mentioning)
confidence: 99%
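For reference, the three effectiveness measures named in this excerpt are straightforward to compute from a confusion matrix once each test's predicted and actual verdicts are known, treating "fail" as the positive class. A small sketch with invented verdict vectors:

```python
# Sketch: accuracy, recall, and false positive rate for a pass/fail
# classifier, with "fail" as the positive class. Verdicts are invented.
import numpy as np

actual_fail    = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0], dtype=bool)
predicted_fail = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0], dtype=bool)

tp = np.sum(predicted_fail & actual_fail)    # failures correctly flagged
tn = np.sum(~predicted_fail & ~actual_fail)  # passes correctly cleared
fp = np.sum(predicted_fail & ~actual_fail)   # passes wrongly flagged
fn = np.sum(~predicted_fail & actual_fail)   # failures missed

accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall agreement
recall   = tp / (tp + fn)                    # share of failures caught
fpr      = fp / (fp + tn)                    # share of passes misreported
print(f"accuracy={accuracy:.2f} recall={recall:.2f} FPR={fpr:.2f}")
```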