A Replication Study on the Usability of Code Vocabulary in Predicting Flaky Tests

Haben, Guillaume; Habchi, Sarra; Papadakis, Mike; Cordy, Maxime; Traon, Yves Le

doi:10.1109/msr52588.2021.00034

Cited by 25 publications

(11 citation statements)

References 25 publications


“…To ensure the generalisability of our results, it would have been preferable to include more flaky tests in our experiments. Nonetheless, the datasets of flaky tests are generally limited in size due to the elusiveness of flakiness [61], [14], [13]. Moreover, as explained in Section II, the requirements of this study limited the set of candidates considerably.…”

Section: Threats To Validity (mentioning)

confidence: 99%

“…Given the adverse effects of test flakiness, engineers and researchers aim at developing detection techniques that can predict whether a test is potentially flaky. These approaches rely on a number of runs and re-runs, such as IDFLAKIES [10] and SHAKER [11], coverage analysis like DEFLAKER [12], or static and dynamic test features [13], [14], [15], [16], [17], [18], [19]. Evaluated on open-source projects, these approaches showed promising detection accuracy and considerably decreased the amount of time and resources needed to detect flaky tests.…”

Section: Introduction (mentioning)

confidence: 99%


What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness

Habchi, Haben, Sohn, et al. 2022

Preprint

Self Cite


Flaky tests are defined as tests that manifest non-deterministic behaviour by passing and failing intermittently for the same version of the code. These tests cripple continuous integration with false alerts that waste developers' time and break their trust in regression testing. To mitigate the effects of flakiness, both researchers and industrial experts have proposed strategies and tools to detect and isolate flaky tests. However, flaky tests are rarely fixed, as developers struggle to localise and understand their causes. Additionally, developers working with large codebases often need to know the sources of non-determinism to preserve code quality, i.e., to avoid introducing technical debt linked with non-deterministic behaviour and to avoid introducing new flaky tests. To aid with these tasks, we propose re-targeting Fault Localisation techniques to the flaky component localisation problem, i.e., pinpointing program classes that cause the non-deterministic behaviour of flaky tests. In particular, we employ Spectrum-Based Fault Localisation (SBFL), a coverage-based fault localisation technique commonly adopted for its simplicity and effectiveness. We also utilise other data sources, such as change history and static code metrics, to further improve the localisation. Our results show that augmenting SBFL with change and code metrics ranks flaky classes in the top-1 and top-5 suggestions in 26% and 47% of the cases, respectively. Overall, we successfully reduced the average number of classes inspected to locate the first flaky class to 19% of the total number of classes covered by flaky tests. Our results also show that localisation methods are effective in major flakiness categories, such as concurrency and asynchronous waits, indicating their general ability to identify flaky components.
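To make the SBFL idea in this abstract concrete, here is a minimal sketch in Python. It rests on assumptions the abstract does not state: Ochiai stands in as a common SBFL formula (the paper's choice is not given here), flaky tests play the role of failing tests and stable tests the role of passing tests, and the names `ochiai`, `rank_classes`, `coverage`, and `flaky_tests` are hypothetical.

```python
import math


def ochiai(flaky_cov, stable_cov, total_flaky):
    """Ochiai suspiciousness of one class (assumed formula, see lead-in).

    flaky_cov:   number of flaky tests that cover the class
    stable_cov:  number of stable tests that cover the class
    total_flaky: total number of flaky tests
    """
    denom = math.sqrt(total_flaky * (flaky_cov + stable_cov))
    return flaky_cov / denom if denom else 0.0


def rank_classes(coverage, flaky_tests):
    """Rank classes from most to least suspicious.

    coverage:    dict mapping test name -> set of covered class names
    flaky_tests: set of test names known to be flaky
    """
    total_flaky = len(flaky_tests)
    classes = set().union(*coverage.values()) if coverage else set()
    scores = {}
    for cls in classes:
        flaky_cov = sum(
            1 for t, cov in coverage.items() if cls in cov and t in flaky_tests
        )
        stable_cov = sum(
            1 for t, cov in coverage.items() if cls in cov and t not in flaky_tests
        )
        scores[cls] = ochiai(flaky_cov, stable_cov, total_flaky)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


if __name__ == "__main__":
    coverage = {"t1": {"A", "B"}, "t2": {"B"}, "t3": {"B", "C"}}
    print(rank_classes(coverage, flaky_tests={"t1"}))
```

A class covered only by flaky tests gets the highest score, while a class covered mostly by stable tests sinks in the ranking. The change-history and code-metric augmentation described in the abstract is not modelled in this sketch.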

“…Following previous studies on flakiness prediction [12,23] finding that Random Forest yields the best performances in flakiness classification tasks, we rely on this model for our classification as well. Selecting the model that yields the best performance is not in the scope of our study.…”

Section: Failure Classifier (mentioning)

confidence: 99%

“…This line of work has gained a lot of momentum lately as models achieved higher performances. Several works were carried out to replicate those studies and ensure their validity in different contexts [4,12]. More recently, FlakeFlagger [1] has been introduced as another model using an extended set of features retrieved from the code under test and test smells.…”

Section: Related Work (mentioning)

confidence: 99%

“…Hence, other tools such as DeFlaker [2] and iDFlakies [17] were designed to detect flaky tests with a minimal number of reruns. Recently, several approaches relied on machine-learning to predict flaky tests based on code vocabulary, code coverage, and dynamic features [1,12,23], allowing flakiness detection without reruns. Nevertheless, all these studies focus on distinguishing flaky tests from reliable tests and do not address the distinction between false alerts (i.e., flaky failures) and legitimate test failures (i.e., real regressions in the code).…”

Section: Introduction (mentioning)

confidence: 99%
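As a rough illustration of what "code vocabulary" features can look like, the sketch below turns test source code into bag-of-words count vectors. It is an assumption-laden example rather than the preprocessing used in the cited studies: the camelCase splitting rule and the names `tokenize` and `vectorize` are hypothetical, and the resulting vectors would still need to be fed to a classifier such as the Random Forest mentioned above.

```python
import re
from collections import Counter


def tokenize(test_body):
    """Split source text into lowercase word tokens.

    Identifiers are extracted first, then split on camelCase boundaries,
    so `assertEquals` yields `assert` and `equals`.
    """
    tokens = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", test_body):
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", ident)
        tokens.extend(p.lower() for p in parts)
    return tokens


def vectorize(test_bodies):
    """Build a shared vocabulary and one token-count row per test."""
    counts = [Counter(tokenize(body)) for body in test_bodies]
    vocab = sorted(set().union(*counts)) if counts else []
    rows = [[c[token] for token in vocab] for c in counts]
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = vectorize(["Thread.sleep(1);", "assertTrue(x);"])
    print(vocab)
    print(rows)
```

Because the features come from static text alone, no test execution or rerun is needed to compute them, which is the property the citing statement highlights.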


Discerning Legitimate Failures From False Alerts: A Study of Chromium's Continuous Integration

Haben, Habchi, Papadakis, et al. 2021

Preprint

Self Cite


Flakiness is a major concern in software testing. Flaky tests pass and fail for the same version of a program and mislead developers, who spend time and resources investigating test failures only to discover that they are false alerts. In practice, the de facto approach to address this concern is to rerun failing tests, hoping that they will pass and thereby reveal themselves as false alerts. Nonetheless, completely filtering out false alerts may require a disproportionate number of reruns, and thus incurs significant costs in both computation and time. As an alternative to reruns, we propose Fair, a novel lightweight approach that classifies test failures into false alerts and legitimate failures. Fair relies on a classifier and a set of features from the failures and test artefacts. To build and evaluate our machine learning classifier, we use the continuous integration of the Chromium project. In particular, we collect the properties and artefacts of more than 1 million test failures from 2,000 builds. Our results show that Fair can accurately distinguish legitimate failures from false alerts, with an MCC of up to 95%. Moreover, by studying different test categories (GUI, integration, and unit tests), we show that Fair classifies failures accurately even when the number of failures is limited. Finally, we compare the costs of our approach to reruns and show that Fair could save up to 20 minutes of computation time per build.
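The abstract reports performance as MCC (Matthews correlation coefficient). For readers unfamiliar with the metric, the sketch below computes it from confusion-matrix counts using the standard definition. The function name `mcc` and the convention of returning 0.0 when the denominator is zero are choices made for this example and are not taken from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction). Returns 0.0 when the denominator is zero,
    a common convention adopted here as an assumption.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(mcc(tp=90, tn=85, fp=15, fn=10))
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which makes it a common choice when the two classes (here, false alerts and legitimate failures) are imbalanced.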


Static test flakiness prediction: How Far Can We Go?

2022


Test flakiness is a phenomenon occurring when a test case is non-deterministic and exhibits both passing and failing behavior when run against the same code. In recent years, the problem has been closely investigated by researchers and practitioners, who have all shown its relevance in practice. The software engineering research community has been working toward defining approaches for detecting and addressing test flakiness. Despite being quite accurate, most of these approaches rely on expensive dynamic steps, e.g., the computation of code coverage information. Consequently, they might suffer from scalability issues that could preclude their practical use. This limitation has recently been targeted through machine learning solutions that predict the flakiness of tests using various features, such as source code vocabulary or a mixture of static and dynamic metrics computed on individual snapshots of the system. In this paper, we aim to take a step forward and predict test flakiness using only static metrics. We propose a large-scale experiment on 70 Java projects coming from the iDFlakies and FlakeFlagger datasets. First, we statistically assess the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells, analyzing both their individual and combined effects. Based on the results achieved, we experiment with a machine learning approach that predicts test flakiness solely based on static features, comparing it with two state-of-the-art approaches. The key results of the study show that the static approach performs comparably to the baselines. In addition, we found that the characteristics of the production code might impact the performance of the flaky test prediction models.
