Thelwall, M. orcid.org/0000-0001-6065-205X (Accepted: 2025) Research quality evaluation by AI in the era of large language models: advantages, disadvantages, and systemic effects. Scientometrics. ISSN 0138-9130 (In Press)
Abstract
Artificial Intelligence (AI) technologies like ChatGPT now threaten to displace bibliometrics as the primary generators of research quality indicators. They are already used in at least one research quality evaluation system, and evidence suggests that they are used informally by many peer reviewers. Since using bibliometrics to support research evaluation continues to be controversial, this article reviews the corresponding advantages and disadvantages of AI-generated quality scores. From a technical perspective, generative AI based on Large Language Models (LLMs) equals or surpasses bibliometrics in most important dimensions, including accuracy (mostly higher correlations with human scores) and coverage (more fields, more recent years), and may reflect more research quality dimensions. Like bibliometrics, however, current LLMs do not “measure” research quality. On the clearly negative side, LLM biases are currently unknown for research evaluation, and LLM scores are less transparent than citation counts. From a systemic perspective, the key issue is how introducing LLM-based indicators into research evaluation will change the behaviour of researchers. Whilst bibliometrics encourage some authors to target journals with high impact factors or to try to write highly cited work, LLM-based indicators may push them towards writing misleading abstracts and overselling their work in the hope of impressing the AI. Moreover, if AI-generated journal indicators replace impact factors, journals would be encouraged to allow authors to oversell their work in abstracts, threatening the integrity of the academic record.
Metadata
| Field | Value |
|---|---|
| Item Type | Article |
| Authors/Creators | Thelwall, M. |
| Copyright, Publisher and Additional Information | © Akadémiai Kiadó Zrt 2025. |
| Keywords | Research evaluation; ChatGPT; Large Language Models; Research ethics |
| Dates | Accepted: 2025 |
| Institution | The University of Sheffield |
| Academic Units | The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
| Funding Information | Funder: UK Research and Innovation; Grant number: UKRI1079 |
| Depositing User | Symplectic Sheffield |
| Date Deposited | 17 Jun 2025 10:59 |
| Last Modified | 17 Jun 2025 10:59 |
| Status | In Press |
| Publisher | Springer |
| Refereed | Yes |
| Open Archives Initiative ID (OAI ID) | oai:eprints.whiterose.ac.uk:227637 |
Download
Filename: systemic effects of AI use in research evaluation1_preprint.pdf