Hu, Z., Cui, J. and Lin, A. (2023) Identifying potentially excellent publications using a citation-based machine learning approach. Information Processing & Management, 60 (3). 103323. ISSN 0306-4573
Abstract
Excellent research papers are vital to advances in science and technology. Thus, identifying potentially excellent papers (PEPs) early and recognizing their value is high on the research agenda. This study used a set of 5 static and 8 time-dependent citation features to compare six machine learning methods and determine which performs best at identifying PEPs. The study modelled Random Forest, LightGBM, Naive Bayes, Support Vector Machine, Neural Network, and TabNet to identify PEPs in the artificial intelligence (AI) field. Highly cited papers were defined using top 1% and top 5% thresholds, and the data were collected from the Web of Science®. Bibliometric and citation data from 485,041 research articles, proceedings papers, and reviews published in AI between 1990 and 2010 were collected initially. After screening and processing, the final dataset consists of 96,169 papers for the training and test sets. The findings suggest that the time-dependent citation features are more important than the static features, and that citation peak features are more significant than the citation features in identifying PEPs. The findings also demonstrate the effect of the threshold (e.g., top 1% versus top 5%) on machine learning outcomes; the study therefore argues that the threshold should be chosen carefully. LightGBM and Random Forest both performed well under the given conditions and achieved the same accuracy and recall scores. Nevertheless, on other indicators, such as F1 and cross-entropy loss, LightGBM performed better. The study concluded that LightGBM was the best-performing model for identifying PEPs. The paper identifies its contributions and recommends future research.
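As an illustration only, the sketch below shows how a top-percentile threshold turns citation counts into "highly cited" labels, and how F1 and cross-entropy loss are computed for a binary classifier. It is not the authors' pipeline, which the abstract does not describe in detail. The function names (`top_percent_labels`, `f1_score`, `cross_entropy`) and the toy citation counts are hypothetical.

```python
import math


def top_percent_labels(citations, percent):
    """Label papers in the top `percent` (by citation count) as 1, others as 0.

    Assumption: papers tied with the cutoff count are also labelled 1,
    so the positive class can be slightly larger than `percent`.
    """
    n_top = max(1, math.ceil(len(citations) * percent / 100))
    # The citation count of the n_top-th most cited paper acts as the cutoff.
    cutoff = sorted(citations, reverse=True)[n_top - 1]
    return [1 if c >= cutoff else 0 for c in citations]


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


if __name__ == "__main__":
    toy_citations = list(range(100))  # hypothetical counts, not real data
    for percent in (1, 5):
        labels = top_percent_labels(toy_citations, percent)
        print(f"top {percent}%: {sum(labels)} positive papers out of {len(labels)}")
```

The choice of threshold changes how many papers fall in the positive class, and therefore how imbalanced the classification problem is. This is one reason a metric such as F1 or cross-entropy loss can separate two models that tie on accuracy.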
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Hu, Z.; Cui, J.; Lin, A. |
Copyright, Publisher and Additional Information: | © 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
Keywords: | Machine learning; Artificial intelligence; Excellent papers; Highly cited papers; Sleeping beauty; Citation-based measures; Citation peak; Neural network; LightGBM; TabNet |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 12 Apr 2023 10:24 |
Last Modified: | 12 Apr 2023 10:24 |
Status: | Published |
Publisher: | Elsevier |
Refereed: | Yes |
Identification Number: | 10.1016/j.ipm.2023.103323 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:197983 |