Evaluating features for machine learning detection of order- and non-order-dependent flaky tests

Parry, O., Kapfhammer, G.M., Hilton, M. et al. (1 more author) (2022) Evaluating features for machine learning detection of order- and non-order-dependent flaky tests. In: Proceedings of 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). 2022 IEEE Conference on Software Testing, Verification and Validation (ICST), 04-14 Apr 2022, Valencia, Spain. Institute of Electrical and Electronics Engineers (IEEE), pp. 93-104. ISBN: 9781665466806. ISSN: 2159-4848. EISSN: 2771-3091.

Abstract

Flaky tests are test cases that can pass or fail without code changes. They often waste the time of software developers and obstruct the use of continuous integration. Previous work has presented several automated techniques for detecting flaky tests, though many involve repeated test executions and a lot of source code instrumentation and thus may be both intrusive and expensive. While this motivates researchers to evaluate machine learning models for detecting flaky tests, prior work on the features used to encode a test case is limited. Without further study of this topic, machine learning models cannot perform to their full potential in this domain. Previous studies also exclude a specific, yet prevalent and problematic, category of flaky tests: order-dependent (OD) flaky tests. This means that prior research only addresses part of the challenge of detecting flaky tests with machine learning. Closing this knowledge gap, this paper presents a new feature set for encoding tests, called Flake16. Using 54 distinct pipelines of data preprocessing, data balancing, and machine learning models for detecting both non-order-dependent (NOD) and OD flaky tests, this paper compares Flake16 to another well-established feature set. To assess the new feature set's effectiveness, this paper's experiments use the test suites of 26 Python projects, consisting of over 67,000 tests. Along with identifying the most impactful metrics for using machine learning to detect both types of flaky test, the empirical study shows how Flake16 is better than prior work, including (1) a 13% increase in overall F1 score when detecting NOD flaky tests and (2) a 17% increase in overall F1 score when detecting OD flaky tests.
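
The results above are reported as F1 scores. As a reminder of what that metric measures, the following is a minimal sketch of F1 for a binary flaky/non-flaky classifier, assuming the standard definition as the harmonic mean of precision and recall. The function name and the counts are hypothetical and are not taken from the paper or its evaluation code.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score: the harmonic mean of precision and recall."""
    predicted_flaky = true_positives + false_positives
    actually_flaky = true_positives + false_negatives
    # With no predicted or no actual positives, precision or recall is
    # undefined; this sketch reports 0.0 in that case (an assumption).
    if predicted_flaky == 0 or actually_flaky == 0:
        return 0.0
    precision = true_positives / predicted_flaky
    recall = true_positives / actually_flaky
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper):
# 30 flaky tests correctly flagged, 10 false alarms, 20 flaky tests missed.
print(f1_score(true_positives=30, false_positives=10, false_negatives=20))
```

Because F1 ignores true negatives, it suits data in which flaky tests are a small minority of all tests, where plain accuracy would look high even for a model that never flags anything.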

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Parry, O.; Kapfhammer, G.M.; Hilton, M.; McMinn, P. https://orcid.org/0000-0001-9137-7433
Copyright, Publisher and Additional Information:	© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Reproduced in accordance with the publisher's self-archiving policy.
Keywords:	Software Testing; Flaky Tests; Machine Learning
Dates:	Published (online): 8 June 2022; Published: 8 June 2022
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:	Funder: ENGINEERING AND PHYSICAL SCIENCE RESEARCH COUNCIL; Grant number: EP/T015764/1
Depositing User:	Symplectic Sheffield
Date Deposited:	06 Aug 2025 14:43
Last Modified:	06 Aug 2025 14:43
Status:	Published
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Refereed:	Yes
Identification Number:	10.1109/icst53961.2022.00021
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:230092
