Walkinshaw, N. and Minku, L. (2018) Are 20% of files responsible for 80% of defects? In: Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 11-12 Oct 2018, Oulu, Finland. ACM ISBN 978-1-4503-5823-1
Abstract
Background: Over the past two decades, a mixture of anecdote from industry and empirical studies from academia has suggested that the 80:20 rule (otherwise known as the Pareto Principle) applies to the relationship between source code files and the number of defects in the system: a small minority of files (roughly 20%) are responsible for a majority of defects (roughly 80%).
Aims: This paper aims to establish how widespread the phenomenon is by analysing 100 systems (previous studies have focussed on between one and three systems), with the goal of determining whether and under what circumstances this relationship holds, and whether the key files can be readily identified from basic metrics.
Method: We devised a search criterion to identify defect fixes from commit messages and used this to analyse 100 active GitHub repositories, spanning a variety of languages and domains. We then studied the relationship between files, basic metrics (churn and LOC), and defect fixes.
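As a rough illustration of the kind of analysis described in the Method, the sketch below flags commits as defect fixes by matching keywords in the commit message and then counts fixes per file. The keyword pattern and the names `FIX_PATTERN`, `is_defect_fix`, and `fixes_per_file` are assumptions made for this example; the abstract does not state the actual search criterion used in the paper.

```python
import re
from collections import Counter

# Hypothetical keyword pattern; NOT the criterion used in the paper,
# which the abstract does not specify.
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|defect)\b", re.IGNORECASE)


def is_defect_fix(message: str) -> bool:
    """Return True if the commit message matches the assumed fix keywords."""
    return FIX_PATTERN.search(message) is not None


def fixes_per_file(commits):
    """Count, for each file, how many fix commits touched it.

    `commits` is an iterable of (message, files) pairs, where `files`
    lists the paths changed by that commit.
    """
    counts = Counter()
    for message, files in commits:
        if is_defect_fix(message):
            # A file is counted once per fix commit, even if listed twice.
            counts.update(set(files))
    return counts


example_commits = [
    ("Fix null pointer in parser", ["parser.py", "lexer.py"]),
    ("Add export feature", ["parser.py"]),
    ("bug: off-by-one in tokenizer", ["parser.py"]),
]
example_counts = fixes_per_file(example_commits)
```

Keyword matching of this kind is a heuristic: it can miss fixes whose messages avoid the chosen words and can flag commits that mention them for other reasons.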
Results: We found that the Pareto principle does hold, but only if defects that incur fixes to multiple files count as multiple defects. When we investigated multi-file fixes, we found that key files (belonging to the top 20%) are commonly fixed alongside other much less frequently-fixed files. We found LOC to be poorly correlated with defect proneness. Code churn was a more reliable indicator, but only for extremely high values of churn.
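The counting convention behind these results can be made concrete with a small sketch. It assumes each fix is represented as the list of files it touched, counts a multi-file fix once for every file involved, and then measures (a) the share of those counts that falls on the most-fixed 20% of files and (b) how often a fix that touches one of those files also touches a file outside that set. The function names, the tie-breaking rule, and the toy data are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import Counter


def top_file_share(fixes, all_files, fraction=0.2):
    """Return (top_files, share) for the most-fixed `fraction` of files.

    A fix touching several files counts once per file, i.e. multi-file
    fixes are treated as multiple defects.
    """
    counts = Counter({f: 0 for f in all_files})
    for files in fixes:
        counts.update(set(files))
    # Rank by fix count, breaking ties by file name for determinism.
    ranked = sorted(counts, key=lambda f: (-counts[f], f))
    k = max(1, math.ceil(fraction * len(ranked)))
    top = set(ranked[:k])
    total = sum(counts.values())
    share = sum(counts[f] for f in top) / total if total else 0.0
    return top, share


def mixed_fix_ratio(fixes, top):
    """Fraction of fixes touching a top file that also touch a non-top file."""
    touching = [set(files) for files in fixes if set(files) & top]
    if not touching:
        return 0.0
    mixed = [files for files in touching if files - top]
    return len(mixed) / len(touching)


all_files = ["a", "b", "c", "d", "e"]
fixes = [["a", "b"], ["a"], ["a", "c"], ["a"]]
top, share = top_file_share(fixes, all_files)
ratio = mixed_fix_ratio(fixes, top)
```

In this toy data, file `a` alone forms the top 20% and accounts for four of the six per-file fix counts, while half of the fixes that touch `a` also touch a less frequently fixed file.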
Conclusions: It is difficult to reliably identify the "most fixed" 20% of files from basic metrics. However, even if they could be reliably predicted, focussing on them would probably be misguided. Although fixes will naturally involve files that are often involved in other fixes too, they also tend to include other less frequently-fixed files.
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | Walkinshaw, N.; Minku, L. |
| Copyright, Publisher and Additional Information: | © 2018 Association for Computing Machinery. This is an author-produced version of a paper subsequently published in Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. Uploaded in accordance with the publisher's self-archiving policy. |
| Keywords: | Defect distribution; Pareto principle; Survey |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 04 Dec 2018 15:32 |
| Last Modified: | 05 Dec 2018 09:05 |
| Published Version: | https://doi.org/10.1145/3239235.3239244 |
| Status: | Published |
| Publisher: | ACM |
| Refereed: | Yes |
| Identification Number: | 10.1145/3239235.3239244 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:139558 |