Parry, O. orcid.org/0000-0002-0917-1274, Kapfhammer, G. orcid.org/0000-0002-7706-2299, Hilton, M. orcid.org/0000-0001-9195-6902 et al. (1 more author) (2025) Systemic flakiness: an empirical analysis of co-occurring flaky test failures. In: Ali Babar, M., Tosun, A., Wagner, S. and Stray, V., (eds.) Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering. 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 25), 17-20 Jun 2025, Istanbul, Turkiye. Association for Computing Machinery, pp. 476-487. ISBN: 9798400713859.
Abstract
Flaky tests produce inconsistent outcomes without code changes, creating major challenges for software developers. An industrial case study reported that developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250. This paper reveals that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness. This result suggests that developers can reduce test repair costs by addressing shared root causes, enabling them to fix multiple flaky tests at once rather than tackling them individually. This study represents an inflection point by challenging the deep-seated assumption that flaky test failures are isolated occurrences. We used an established dataset of 10,000 test suite runs from 24 Java projects on GitHub, spanning domains from data orchestration to job scheduling. Using a data set that contains 810 flaky tests, we performed a mixed-method empirical analysis of co-occurring flaky test failures, revealing that systemic flakiness is significant and widespread. We ran agglomerative clustering of flaky tests based on their failure co-occurrence, showing that 75% of flaky tests across all projects belong to a cluster, with a mean cluster size of 13.5 flaky tests. Instead of requiring 10,000 test suite runs to identify systemic flakiness, this paper demonstrates a lightweight alternative by training machine learning models based on static test case distance measures. Through manual inspection of stack traces, conducted independently by the paper’s four authors and resolved through negotiated agreement, we identified intermittent networking issues and instabilities in external dependencies as the predominant causes of systemic flakiness in the chosen open-source projects.
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | Parry, O.; Kapfhammer, G.; Hilton, M. et al. (1 more author) |
| Editors: | Ali Babar, M.; Tosun, A.; Wagner, S.; Stray, V. |
| Copyright, Publisher and Additional Information: | © 2025 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0 |
| Keywords: | Software Testing; Flaky Tests; Systemic Flakiness |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Funding Information: | Funder: Engineering and Physical Sciences Research Council; Grant number: EP/X024539/1 |
| Date Deposited: | 07 Jan 2026 14:55 |
| Last Modified: | 08 Jan 2026 10:07 |
| Status: | Published |
| Publisher: | Association for Computing Machinery |
| Refereed: | Yes |
| Identification Number: | 10.1145/3756681.3756945 |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:236252 |
Download
Filename: 3756681.3756945.pdf
Licence: CC-BY 4.0
