Systemic flakiness: an empirical analysis of co-occurring flaky test failures

Parry, O. orcid.org/0000-0002-0917-1274, Kapfhammer, G. orcid.org/0000-0002-7706-2299, Hilton, M. orcid.org/0000-0001-9195-6902 et al. (1 more author) (2025) Systemic flakiness: an empirical analysis of co-occurring flaky test failures. In: Ali Babar, M., Tosun, A., Wagner, S. and Stray, V., (eds.) Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering. 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 25), 17-20 Jun 2025, Istanbul, Turkiye. Association for Computing Machinery, pp. 476-487. ISBN: 9798400713859.

Abstract

Flaky tests produce inconsistent outcomes without code changes, creating major challenges for software developers. An industrial case study reported that developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250. This paper reveals that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness. This result suggests that developers can reduce test repair costs by addressing shared root causes, enabling them to fix multiple flaky tests at once rather than tackling them individually. This study represents an inflection point by challenging the deep-seated assumption that flaky test failures are isolated occurrences. We used an established dataset of 10,000 test suite runs from 24 Java projects on GitHub, spanning domains from data orchestration to job scheduling. Using this dataset, which contains 810 flaky tests, we performed a mixed-method empirical analysis of co-occurring flaky test failures, revealing that systemic flakiness is significant and widespread. We ran agglomerative clustering of flaky tests based on their failure co-occurrence, showing that 75% of flaky tests across all projects belong to a cluster, with a mean cluster size of 13.5 flaky tests. Instead of requiring 10,000 test suite runs to identify systemic flakiness, this paper demonstrates a lightweight alternative by training machine learning models based on static test case distance measures. Through manual inspection of stack traces, conducted independently by the paper’s four authors and resolved through negotiated agreement, we identified intermittent networking issues and instabilities in external dependencies as the predominant causes of systemic flakiness in the chosen open-source projects.
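
To illustrate the idea of clustering flaky tests by failure co-occurrence, here is a minimal sketch. It is not the authors' implementation. It assumes each flaky test is represented by the set of test suite run IDs in which it failed, and it assumes Jaccard distance, average linkage, and a cut-off threshold of 0.5. The abstract does not specify any of these choices. The test names and run IDs are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of failing run IDs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_flaky_tests(failures, threshold=0.5):
    """Average-linkage agglomerative clustering (assumed linkage and threshold).

    failures maps a test name to the set of run IDs in which it failed.
    Merging stops once the closest pair of clusters is farther apart
    than `threshold`.
    """
    clusters = [[name] for name in sorted(failures)]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        dists = [jaccard_distance(failures[x], failures[y]) for x in c1 for y in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (i, j), best = min(
            (
                ((i, j), linkage(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical failure data: test name -> run IDs in which the test failed.
failures = {
    "testConnect": {3, 17, 42, 88},
    "testUpload": {3, 17, 42},
    "testDownload": {17, 42, 88},
    "testParseDate": {5},
}

clusters = cluster_flaky_tests(failures)
# Clusters with more than one member are candidates for a shared root cause.
systemic = [c for c in clusters if len(c) > 1]
print(systemic)
```

In this toy data the three networking-style tests fail in largely the same runs and end up in one cluster, while `testParseDate` stays on its own. The pairwise search is quadratic per merge step, which is fine for a sketch but would need a library implementation for hundreds of tests.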

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Parry, O. https://orcid.org/0000-0002-0917-1274; Kapfhammer, G. https://orcid.org/0000-0002-7706-2299; Hilton, M. https://orcid.org/0000-0001-9195-6902; McMinn, P. https://orcid.org/0000-0001-9137-7433
Editors:	Ali Babar, M.; Tosun, A.; Wagner, S.; Stray, V.
Copyright, Publisher and Additional Information:	© 2025 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0
Keywords:	Software Testing; Flaky Tests; Systemic Flakiness
Dates:	Published (online): 24 December 2025; Published: 17 June 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:	Funder: ENGINEERING AND PHYSICAL SCIENCE RESEARCH COUNCIL; Grant number: EP/X024539/1
Date Deposited:	07 Jan 2026 14:55
Last Modified:	08 Jan 2026 10:07
Status:	Published
Publisher:	Association for Computing Machinery
Refereed:	Yes
Identification Number:	10.1145/3756681.3756945
Related URLs:	Conference
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:236252

Download

Published Version

Filename: 3756681.3756945.pdf

Licence: CC-BY 4.0
