Predicting retracted research: a dataset and machine learning approaches

Abstract

Background: Retractions undermine the scientific record’s reliability and can lead to the continued propagation of flawed research. This study aimed to (1) create a dataset aggregating retraction information with bibliographic metadata, (2) train and evaluate various machine learning approaches to predict article retractions, and (3) assess each feature’s contribution to feature-based classifier performance using ablation studies.

Methods: An open-access dataset was developed by combining information from the Retraction Watch database and the OpenAlex API. Using a case-controlled design, retracted research articles were paired with non-retracted articles published in the same period. Traditional feature-based classifiers and models leveraging contextual language representations were then trained and evaluated. Model performance was assessed using accuracy, precision, recall, and the F1-score.
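As an illustration of the evaluation metrics named above, the following minimal sketch computes accuracy, precision, recall, and the F1-score separately for the retracted and non-retracted classes. The function name `binary_metrics`, the label encoding (1 = retracted, 0 = non-retracted), and the toy labels are assumptions made for this example; they are not taken from the study's code or data.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, positive):
    """Return accuracy, precision, recall and F1 for the class `positive`.

    Hypothetical helper for illustration; not the study's evaluation code.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Count each (true label, predicted label) pair once.
    pairs = Counter(zip(y_true, y_pred))
    tp = sum(n for (t, p), n in pairs.items() if t == positive and p == positive)
    fp = sum(n for (t, p), n in pairs.items() if t != positive and p == positive)
    fn = sum(n for (t, p), n in pairs.items() if t == positive and p != positive)
    correct = sum(n for (t, p), n in pairs.items() if t == p)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (invented): 1 = retracted, 0 = non-retracted.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

retracted = binary_metrics(y_true, y_pred, positive=1)
non_retracted = binary_metrics(y_true, y_pred, positive=0)

for name, scores in (("retracted", retracted), ("non-retracted", non_retracted)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Accuracy is the same whichever class is treated as positive, whereas precision, recall, and F1 differ by class. This is why a precision for identifying retracted articles and a precision for identifying non-retracted articles can be reported as separate figures.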

Results: The Llama 3.2 base model achieved the highest overall accuracy. The Random Forest classifier achieved a precision of 0.687 for identifying non-retracted articles, while the Llama 3.2 base model reached a precision of 0.683 for identifying retracted articles. Traditional feature-based classifiers generally outperformed most contextual language models, except for the Llama 3.2 base model, which showed competitive performance across several metrics.

Conclusions: Although no single model excelled across all metrics, our findings indicate that machine learning techniques can effectively support the identification of retracted research. These results provide a foundation for developing automated tools to assist publishers and reviewers in detecting potentially problematic publications. Further research should focus on refining these models and investigating additional features to improve predictive performance.

Metadata

Item Type:	Article
Authors/Creators:	Fletcher, A.H.A. and Stevenson, M. https://orcid.org/0000-0002-9483-6006
Copyright, Publisher and Additional Information:	© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Keywords:	Retraction prediction; Machine Learning; Scientific Publishing
Dates:	Submitted: 31 January 2025; Accepted: 14 May 2025; Published (online): 11 June 2025; Published: 11 June 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	21 May 2025 15:32
Last Modified:	16 Jun 2025 10:39
Status:	Published
Publisher:	BMC
Refereed:	Yes
Identification Number:	10.1186/s41073-025-00168-w
Related URLs:	Dataset
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:226914