Can ChatGPT evaluate research environments? Evidence from REF2021

Abstract

UK academic departments are evaluated partly on the statements that they write about the value of their research environments for the Research Excellence Framework (REF) periodic assessments. These statements mix qualitative narratives and quantitative data, typically requiring time-consuming and difficult expert judgements to assess. This article investigates whether Large Language Models (LLMs) can support the process or validate the results, using the UK REF2021 unit-level environment statements as a test case. Based on prompts mimicking the REF guidelines, ChatGPT-4o mini scores correlated positively with expert scores in almost all 34 (field-based) Units of Assessment (UoAs). ChatGPT’s scores had moderate to strong positive Spearman correlations with REF expert scores in 32 out of 34 UoAs: 14 UoAs above 0.7 and a further 13 between 0.6 and 0.7. Only two UoAs had weak or no significant associations (Classics and Clinical Medicine). From further tests for UoA34, multiple LLMs had significant positive correlations with REF2021 environment scores (all p < .001), with ChatGPT-5 performing best (r = 0.81; ρ = 0.82), followed by ChatGPT-4o mini (r = 0.68; ρ = 0.67) and Gemini Flash 2.5 (r = 0.67; ρ = 0.69). If LLM-generated scores for environment statements are used in future to help reduce workload, support more consistent interpretation, and complement human review, where acceptable, then caution must be exercised because of the potential for biases, inaccuracy in some cases, and unwanted systemic effects. Even the strong correlations found here seem unlikely to be judged close enough to expert scores to fully delegate the assessment task to LLMs.

Metadata

Item Type:	Article
Authors/Creators:	Kousha, K.; Thelwall, M. (https://orcid.org/0000-0001-6065-205X); Gadd, E.
Copyright, Publisher and Additional Information:	© 2026 The Authors. Except as otherwise noted, this author-accepted version of a journal article published in Scientometrics is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/
Keywords:	Large Language Models (LLMs); ChatGPT; research environment statement; Research Excellence Framework (REF); research evaluation; AI-assisted assessment
Dates:	Submitted: 16 December 2025; Accepted: 12 April 2026; Published (online): 25 April 2026; Published: 25 April 2026
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Social Sciences (Sheffield) > School of Information, Journalism and Communication
Funding Information:	Funder: UK Research and Innovation; Grant number: UKRI1079
Date Deposited:	17 Apr 2026 14:02
Last Modified:	27 Apr 2026 07:39
Status:	Published online
Publisher:	Springer
Refereed:	Yes
Identification Number:	10.1007/s11192-026-05633-x
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:240033
