Scoring structured academic documents with large language models: impact case studies

Abstract

Purpose

Academic documents require expert time to evaluate, and Large Language Models (LLMs) might support this through score or decision predictions. For confidential structured academic texts, such as grants and Impact Case Studies (ICSs), medium-sized LLMs can be run offline without expensive computing infrastructures, enhancing security.

Design/methodology/approach

This study evaluates for the first time how well medium-sized LLMs can score structured academic documents using the UK Research Excellence Framework (REF) 2021 ICSs, and whether LLMs can guess scores from individual sections. We obtained score estimates from five recent popular LLMs (DeepSeek R1 32B, Qwen 3 32B, Magistral Small 24B, Gemma 3 27B, and Llama 4 Scout 27B) across 6,010 REF 2021 ICSs, correlating the scores with a proxy quality rating (departmental average score).

Findings

Scoring the full texts was only moderately effective (in terms of correlations with the proxy quality rating), and Llama 4 failed to score most of the longest ICSs. Surprisingly, all LLMs except Magistral could guess ICS scores at a statistically significantly better than random level from each of the individual component sections (summary, underpinning research, references, details of the impacts, and sources to support the impact). A logical two-stage approach mimicking the human reviewer instructions did not outperform focusing on impact alone. The best strategy was to use Gemma 3 to score the summary and details of the impact sections combined, averaging the scores from five repetitions. This gave the highest Spearman correlation (0.37) with departmental average proxy quality scores (0.55 when correlating at the department level).
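The evaluation described above (averaging five repeated scores per ICS, then correlating with the departmental average proxy at both ICS and department level) can be sketched as follows. This is an illustrative sketch only, not the authors' code: the toy data, department labels, and function names are all hypothetical, and Spearman's rho is computed here as the Pearson correlation of tie-averaged ranks.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data (invented for illustration):
# (department, five repeated LLM scores for one ICS, departmental proxy score)
ics = [
    ("A", [3, 3, 4, 3, 3], 3.4),
    ("A", [4, 4, 4, 3, 4], 3.4),
    ("B", [2, 3, 2, 2, 3], 2.6),
    ("B", [3, 2, 3, 3, 3], 2.6),
    ("C", [3, 4, 3, 3, 3], 3.0),
    ("C", [2, 2, 3, 2, 2], 3.0),
]

# ICS level: mean of the five repetitions vs. the departmental proxy.
llm_mean = [mean(scores) for _, scores, _ in ics]
proxy = [p for _, _, p in ics]
ics_level_rho = spearman(llm_mean, proxy)

# Department level: average the per-ICS means within each department first.
by_dept = defaultdict(list)
for (dept, _, _), m in zip(ics, llm_mean):
    by_dept[dept].append(m)
dept_proxy = {dept: p for dept, _, p in ics}
depts = sorted(by_dept)
dept_level_rho = spearman(
    [mean(by_dept[d]) for d in depts],
    [dept_proxy[d] for d in depts],
)

print(f"ICS-level rho: {ics_level_rho:.2f}")
print(f"Department-level rho: {dept_level_rho:.2f}")
```

Because every ICS in a department shares the same proxy value, the ICS-level correlation contains many ties, and aggregating to department level removes within-department noise, which is one plausible reason for department-level correlations being higher.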

Practical implications

Medium-sized LLMs can be used to score structured academic documents to support research assessments.

Research limitations

This study uses a single large case study with a public, albeit obscured, gold standard.

Originality/value

This study improves on the state of the art despite the additional restrictions, using a much cheaper and potentially private open-weights LLM approach.

Metadata

Item Type:	Article
Authors/Creators:	Thelwall, M. (https://orcid.org/0000-0001-6065-205X); Kousha, K.; He, G.
Copyright, Publisher and Additional Information:	© 2026 the author(s), published by De Gruyter on behalf of the Chinese Academy of Sciences. This work is licensed under the Creative Commons Attribution 4.0 International License. (https://creativecommons.org/licenses/by/4.0/)
Keywords:	large language models; impact case studies; grant evaluation; research evaluation; research excellence framework (REF)
Dates:	Accepted: 28 May 2026; Published (online): 15 June 2026; Published: 15 June 2026
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Social Sciences (Sheffield) > School of Information, Journalism and Communication
Funding Information:	Funder: UK Research and Innovation; Grant number: UKRI1079
Date Deposited:	09 Jun 2026 09:00
Last Modified:	15 Jun 2026 12:16
Published Version:	https://www.degruyterbrill.com/document/doi/10.151...
Status:	Published online
Publisher:	Sciendo
Refereed:	Yes
Identification Number:	10.1515/jdis-2025-0465
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:241531

Download

Published Version

Filename: 10.1515_jdis-2025-0465.pdf

Licence: CC-BY 4.0
