FlakeFlagger: Predicting Flakiness Without Rerunning Tests

Alshammari, Abdulrahman; Morris, Christopher J.; Hilton, Michael; Bell, Jonathan

doi:10.1109/icse-companion52605.2021.00081

Cited by 12 publications

(68 citation statements)

References 36 publications


“…Indeed, while reruns can affirm that a failure is due to flakiness by manifesting a test pass and fail for the same version, they do not allow us to affirm that a failure is legitimate. A previous study showed that up to 10,000 reruns can be required to discover flaky tests that have a low flake rate [1]. Hence, a legitimate failure in the case of Chromium can still be a false alert (flaky failure) that was not rerun enough to manifest.…”

Section: Test History (mentioning)

confidence: 99%
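The rerun figure quoted above can be made concrete with a small calculation. The sketch below is not taken from the cited study; it assumes that each rerun fails independently with a constant flake rate, in which case the probability of seeing at least one failure in n runs is 1 - (1 - p)^n. The function name `reruns_needed` and the example rates are hypothetical.

```python
import math


def reruns_needed(flake_rate: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - flake_rate)**n >= confidence.

    Assumption (not from the cited papers): every rerun is independent
    and fails with the same probability `flake_rate`.
    """
    if not (0.0 < flake_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("flake_rate and confidence must lie strictly between 0 and 1")
    # Solve (1 - p)^n <= 1 - c for n, then round up to a whole run count.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - flake_rate))


# Hypothetical example: a test that flakes once in a thousand runs.
runs_for_rare_flake = reruns_needed(flake_rate=0.001, confidence=0.99)
print(runs_for_rare_flake)
```

Under these assumptions, a test with a 0.1% flake rate needs several thousand reruns before a flaky failure is likely to show up, which is consistent in order of magnitude with the rerun counts the quote mentions.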

“…Each failure in our dataset is denoted as an n-dimensional feature vector $X = (x_1, \ldots, x_n)$ where $x_i$ represents one feature. $y \in \{0, 1\}$ indicates if the failure is from the false alert class (0) or from the legitimate failure class (1). Once all vectors are created, we randomly split our dataset by including 80% of it in the training set and 20% in the test set, conserving the class ratio in each subset (stratified).…”

Section: Failure Classifier (mentioning)

confidence: 99%
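The stratified 80/20 split described in the quote above can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation; the function name `stratified_split`, the seed, and the toy label list are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test while keeping the class ratio.

    `labels` holds one class label per sample (0 = false alert,
    1 = legitimate failure in the quoted setup). Returns two sorted
    lists of indices: (train_indices, test_indices).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical toy data: 80 false alerts and 20 legitimate failures.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the split is done per class, both subsets keep the original 4:1 ratio between the two classes, which matters when one class is much rarer than the other.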

“…Studies often re-execute test suites a large number of times, 100 times [23], 400 times [11] or even 10,000 times [1] and are still able to uncover unseen flaky tests. Hence, other tools such as DeFlaker [2] and iDFlakies [17] were designed to detect flaky tests with a minimal number of reruns.…”

Section: Introduction (mentioning)

confidence: 99%

“…Hence, other tools such as DeFlaker [2] and iDFlakies [17] were designed to detect flaky tests with a minimal number of reruns. Recently, several approaches relied on machine-learning to predict flaky tests based on code vocabulary, code coverage, and dynamic features [1,12,23], allowing flakiness detection without reruns. Nevertheless, all these studies focus on distinguishing flaky tests from reliable tests and do not address the distinction between false alerts (i.e., flaky failures) and legitimate test failures (i.e., real regressions in the code).…”

Section: Introduction (mentioning)

confidence: 99%


Discerning Legitimate Failures From False Alerts: A Study of Chromium's Continuous Integration

Haben, Habchi, Papadakis, et al. 2021

Preprint


Flakiness is a major concern in software testing. Flaky tests pass and fail for the same version of a program and mislead developers who spend time and resources investigating test failures only to discover that they are false alerts. In practice, the de facto approach to address this concern is to rerun failing tests in the hope that they pass and thus manifest as false alerts. Nonetheless, completely filtering out false alerts may require a disproportionate number of reruns, and thus incurs significant costs in both computation and time. As an alternative to reruns, we propose Fair, a novel lightweight approach that classifies test failures into false alerts and legitimate failures. Fair relies on a classifier and a set of features from the failures and test artefacts. To build and evaluate our machine learning classifier, we use the continuous integration of the Chromium project. In particular, we collect the properties and artefacts of more than 1 million test failures from 2,000 builds. Our results show that Fair can accurately distinguish legitimate failures from false alerts, with an MCC up to 95%. Moreover, by studying different test categories: GUI, integration and unit tests, we show that Fair classifies failures accurately even when the number of failures is limited. Finally, we compare the costs of our approach to reruns and show that Fair could save up to 20 minutes of computation time per build.
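The abstract reports its results as MCC (Matthews correlation coefficient). As a reminder of what that metric measures, the sketch below computes it from the four cells of a binary confusion matrix using the standard formula. The function name `mcc` and the example counts are hypothetical and are not results from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns a value in [-1, 1]; 1 is perfect prediction, 0 is no better
    than chance. By common convention, returns 0.0 when any marginal
    total is zero and the formula would divide by zero.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts, for illustration only.
example_score = mcc(tp=90, tn=95, fp=5, fn=10)
print(round(example_score, 3))
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when one class (for example, legitimate failures) is much rarer than the other.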



“…Other studies investigated tools and techniques that could help developers to cope with test flakiness. Automated tools, such as DeFlaker [11], iDFlakies [12], and FlakeFlagger [13] have been developed in order to detect flaky tests with a minimum number of test runs or re-runs. Unfortunately, these advances offer only partial solutions to the problem and may not fit well within the development systems and organisation constraints.…”

Section: Introduction (mentioning)

confidence: 99%

A Qualitative Study on the Sources, Impacts, and Mitigation Strategies of Flaky Tests

Habchi, Haben, Papadakis, et al. 2021

Preprint


Test flakiness forms a major testing concern. Flaky tests manifest non-deterministic outcomes that cripple continuous integration and lead developers to investigate false alerts. Industrial reports indicate that on a large scale, the accrual of flaky tests breaks the trust in test suites and entails significant computational cost. To alleviate this, practitioners are constrained to identify flaky tests and investigate their impact. To shed light on such mitigation mechanisms, we interview 14 practitioners with the aim to identify (i) the sources of flakiness within the testing ecosystem, (ii) the impacts of flakiness, (iii) the measures adopted by practitioners when addressing flakiness, and (iv) the automation opportunities for these measures. Our analysis shows that, besides the tests and code, flakiness stems from interactions between the system components, the testing infrastructure, and external factors. We also highlight the impact of flakiness on testing practices and product quality and show that the adoption of guidelines together with a stable infrastructure are key measures in mitigating the problem.

• RQ1: Where can we locate flakiness? Goal: Differently from previous studies [6]-[10], which focused on the root causes of flakiness, e.g., concurrency and timeouts, we aim to identify where flakiness stems within the different components of the development ecosystem, e.g., test, code under test, and infrastructure. This localisation is necessary to guide both detection and fixing approaches. Results: In addition to tests, flakiness stems from the poor orchestration between the system components, the testing infrastructure, and external factors, e.g., OS and firmware. Studies should consider and leverage these factors when addressing flaky tests and not focus solely on the test and code under test.

• RQ2: How do practitioners perceive the impact of flakiness? Goal: This question is commonly discussed in industrial reports and research studies. In this paper, we examine it through direct discussions with practitioners. The aim is to understand the impact of flakiness on the development workflow and practices. Results: Besides dissipating development time and hindering the continuous integration (CI), flakiness affects the testing practices and leads to a degradation of the system quality. We also shed light on the pernicious consequences of system flakiness, i.e., buggy or non-deterministic features that are falsely labelled as flaky tests.

• RQ3: How do practitioners address flaky tests? Goal: This question aims at identifying and understanding the measures taken by practitioners to address flakiness before and after it manifests in the CI. Results: The prevention of test flakiness is performed by building stable infrastructures and enforcing guidelines, whereas the detection still relies mainly on reruns and manual inspection. Our results also highlight monitoring and logging tasks, which are commonly dismissed in research, yet they are key to most of the mitigation measures taken by practitioners.

• RQ4: Ho...


Static test flakiness prediction: How Far Can We Go?

2022


Test flakiness is a phenomenon occurring when a test case is non-deterministic and exhibits both a passing and failing behavior when run against the same code. Over the last few years, the problem has been closely investigated by researchers and practitioners, who all have shown its relevance in practice. The software engineering research community has been working toward defining approaches for detecting and addressing test flakiness. Despite being quite accurate, most of these approaches rely on expensive dynamic steps, e.g., the computation of code coverage information. Consequently, they might suffer from scalability issues that possibly preclude their practical use. This limitation has been recently targeted through machine learning solutions that could predict the flakiness of tests using various features, like source code vocabulary or a mixture of static and dynamic metrics computed on individual snapshots of the system. In this paper, we aim to perform a step forward and predict test flakiness only using static metrics. We propose a large-scale experiment on 70 Java projects coming from the iDFlakies and FlakeFlagger datasets. First, we statistically assess the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells, analyzing both their individual and combined effects. Based on the results achieved, we experiment with a machine learning approach that predicts test flakiness solely based on static features, comparing it with two state-of-the-art approaches. The key results of the study show that the static approach has performance comparable to that of the baselines. In addition, we found that the characteristics of the production code might impact the performance of the flaky test prediction models.

