Thelwall, M. orcid.org/0000-0001-6065-205X, Kousha, K. and He, G. (2026) Scoring structured academic documents with large language models: impact case studies. Journal of Data and Information Science. ISSN: 2096-157X
Abstract
Purpose
Academic documents require expert time to evaluate, and Large Language Models (LLMs) might support this through score or decision predictions. For confidential structured academic texts, such as grants and Impact Case Studies (ICSs), medium-sized LLMs can be run offline without expensive computing infrastructures, enhancing security.
Design/methodology/approach
This study evaluates for the first time how well medium-sized LLMs can score structured academic documents using the UK Research Excellence Framework (REF) 2021 ICSs, and whether LLMs can guess scores from individual sections. We obtained score estimates from five recent popular LLMs (DeepSeek R1 32B, Qwen 3 32B, Magistral Small 24B, Gemma 3 27B, and Llama 4 Scout 27B) across 6,010 REF 2021 ICSs, correlating the scores with a proxy quality rating (departmental average score).
Findings
Scoring the full texts was only moderately effective (in terms of correlations with the proxy quality rating), and Llama 4 failed to score most of the longest ICSs. Surprisingly, all LLMs except Magistral could guess ICS scores at statistically significantly better than random levels from each of the individual component sections (summary, underpinning research, references, details of the impact, and sources to support the impact). A logical two-stage approach mimicking the human reviewer instructions did not outperform focusing on impact alone. The best strategy was to score the summary and details of the impact sections combined with Gemma 3, repeating the scoring five times and averaging the results. This gave the highest Spearman correlation (0.37) with departmental average proxy quality scores (0.55 for department-level correlations).
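The evaluation step behind these figures (average repeated LLM scores for each ICS, then rank-correlate the averages with the proxy quality rating) can be illustrated with a minimal sketch. It is not the authors' code: the ICS identifiers, the scores, and the proxy values below are invented for illustration, and the helper names `average_ranks`, `pearson`, and `spearman` are hypothetical. Spearman correlation is computed here as the Pearson correlation of tie-averaged ranks.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: five repeated LLM scores per ICS (invented values).
llm_scores = {
    "ics_a": [3, 3, 4, 3, 3],
    "ics_b": [2, 3, 2, 2, 3],
    "ics_c": [4, 4, 4, 3, 4],
    "ics_d": [3, 2, 3, 3, 3],
}
# Hypothetical proxy quality rating (departmental average score) per ICS.
proxy = {"ics_a": 3.1, "ics_b": 2.6, "ics_c": 3.4, "ics_d": 3.3}

ids = sorted(llm_scores)
averaged = [mean(llm_scores[i]) for i in ids]  # average the five runs
rho = spearman(averaged, [proxy[i] for i in ids])
print(f"Spearman correlation: {rho:.2f}")
```

With real data, a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written helpers; the standard-library version is used here only to keep the sketch self-contained.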
Practical implications
Medium-sized LLMs can be used to score structured academic documents to support research assessments.
Research limitations
This study uses a single large case study with a public, albeit obscured, gold standard.
Originality/value
This study improves on the state of the art despite the additional restrictions, and does so with a much cheaper and potentially private open-weights LLM approach.
Metadata
| Item Type: | Article |
|---|---|
| Authors/Creators: | Thelwall, M. (orcid.org/0000-0001-6065-205X); Kousha, K.; He, G. |
| Copyright, Publisher and Additional Information: | © 2026 the author(s), published by De Gruyter on behalf of the Chinese Academy of Sciences. This work is licensed under the Creative Commons Attribution 4.0 International License. (https://creativecommons.org/licenses/by/4.0/) |
| Keywords: | large language models; impact case studies; grant evaluation; research evaluation; research excellence framework (REF) |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Social Sciences (Sheffield) > School of Information, Journalism and Communication |
| Funding Information: | Funder: UK RESEARCH AND INNOVATION; Grant number: UKRI1079 |
| Date Deposited: | 09 Jun 2026 09:00 |
| Last Modified: | 15 Jun 2026 12:16 |
| Published Version: | https://www.degruyterbrill.com/document/doi/10.151... |
| Status: | Published online |
| Publisher: | Sciendo |
| Refereed: | Yes |
| Identification Number: | 10.1515/jdis-2025-0465/html |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:241531 |
Download
Filename: 10.1515_jdis-2025-0465.pdf
Licence: CC-BY 4.0
