Sarfraz, A., Birnbaum, A., Dolan, F. et al. (3 more authors) (2025) A robust unsupervised method for outlier set detection. Knowledge-Based Systems. 114274. ISSN: 0950-7051
Abstract
This paper proposes a robust method that identifies sets of points that collectively deviate from typical patterns in a dataset, which it calls “outlier sets”, while excluding individual points from detection. This new methodology, Outlier Set Two-step Identification (OSTI), employs a two-step approach to detect and label these outlier sets. First, it uses Gaussian Mixture Models for probabilistic clustering, identifying candidate outlier sets as clusters whose weights fall below a hyperparameter threshold. Second, OSTI measures the inter-cluster Mahalanobis distance between each candidate outlier set’s centroid and the overall dataset mean. OSTI then tests the null hypothesis that this distance does not significantly differ from its theoretical chi-square distribution, enabling the formal detection of outlier sets. We test OSTI systematically on 8,000 synthetic 2D datasets across various inlier configurations and thousands of possible outlier set characteristics. Results show OSTI robustly and consistently detects outlier sets with an average F1 score of 0.92 and an average purity (the degree to which the identified outlier sets correspond to those generated synthetically, i.e., our ground truth) of 98.58%. We also compare OSTI with state-of-the-art outlier detection methods to illuminate how OSTI fills a gap as a tool for the exclusive detection of outlier sets.
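The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian mixture has already been fitted, so the component weights and centroids are given as inputs, and it assumes the squared Mahalanobis distance between a centroid and the dataset mean is compared against a chi-square distribution with 2 degrees of freedom (2D data), whose upper-tail probability has the closed form exp(-x/2). The function names, the weight threshold of 0.1, the significance level of 0.05, and all numeric values are hypothetical and chosen only for illustration; the paper's exact test may differ.

```python
import math


def squared_mahalanobis_2d(x, mu, cov):
    """Squared Mahalanobis distance of 2D point x from mu.

    cov is (sxx, sxy, syy), the entries of a symmetric 2x2 covariance matrix.
    """
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Uses the closed-form inverse of a 2x2 symmetric matrix.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def chi2_sf_2dof(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def flag_outlier_sets(weights, centroids, data_mean, data_cov,
                      weight_threshold=0.1, alpha=0.05):
    """Return indices of mixture components flagged as outlier sets.

    Step 1 (assumed form): keep components whose weight is below the threshold.
    Step 2 (assumed form): flag a candidate when the chi-square tail probability
    of its squared Mahalanobis distance to the dataset mean is below alpha.
    """
    flagged = []
    for k, (w, c) in enumerate(zip(weights, centroids)):
        if w >= weight_threshold:
            continue  # large cluster: not a candidate outlier set
        d2 = squared_mahalanobis_2d(c, data_mean, data_cov)
        if chi2_sf_2dof(d2) < alpha:
            flagged.append(k)
    return flagged


# Hypothetical fitted mixture: one dominant cluster and two small ones.
weights = [0.90, 0.06, 0.04]
centroids = [(0.0, 0.0), (0.5, 0.5), (4.0, 4.0)]
data_mean = (0.0, 0.0)
data_cov = (1.0, 0.0, 1.0)  # identity covariance, for illustration only

print(flag_outlier_sets(weights, centroids, data_mean, data_cov))  # prints [2]
```

In this example the second component is small enough to be a candidate but lies close to the dataset mean, so only the third component is flagged.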
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | |
Copyright, Publisher and Additional Information: | © 2025 The Authors. Except as otherwise noted, this author-accepted version of a journal article published in Knowledge-Based Systems is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
Keywords: | Outlier sets; Outlier Set Two-step Identification (OSTI); Gaussian mixture models; Inter-cluster Mahalanobis distance |
Dates: | |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > School of Mechanical, Aerospace and Civil Engineering |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 15 Aug 2025 11:43 |
Last Modified: | 28 Aug 2025 12:36 |
Status: | Published online |
Publisher: | Elsevier |
Refereed: | Yes |
Identification Number: | 10.1016/j.knosys.2025.114274 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:230400 |