Kelsall, J., Tan, X., Bergin, A. et al. (7 more authors) (2025) A rapid evidence review of evaluation techniques for large language models in legal use cases: trends, gaps, and recommendations for future research. AI & Society.
Abstract
The legal profession faces mounting pressures, including case backlogs and limited access to legal services. Large language models (LLMs), such as OpenAI’s GPT series, have been touted as potential solutions, promising to streamline tasks such as legal drafting, summarisation, analysis, and advice. Proponents argue these models can enhance efficiency, accuracy, and access to justice. However, significant risks remain. LLMs are prone to bias, factual hallucinations, and opaque reasoning processes, which can have severe consequences in high-stakes legal contexts. For responsible use in law, legal use cases must be accurately operationalised into LLM tasks that are sensitive to legal settings, as must the evaluation metrics used to assess LLMs performing those tasks. This paper presents a rapid literature review of LLM research in legal contexts since ChatGPT-4’s release in March 2023. We examine how legal tasks are operationalised for LLMs and what evaluation metrics are used, with a focus on how these align—or fail to align—with real-world legal practice. We argue that existing studies often overlook the institutional, organisational, and professional contexts in which these tools would be deployed, an oversight that limits the practical relevance of current evaluations. We propose directions for more contextually grounded research and responsible deployment strategies.
Metadata
| Item Type: | Article |
|---|---|
| Authors/Creators: | |
| Copyright, Publisher and Additional Information: | © 2025 The Authors. This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
| Keywords: | AI and Law; Legal AI; AI benchmarking; AI Review; AI Metrics; Evaluation |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Arts and Humanities (Sheffield) > School of Law |
| Funding Information: | RESPONSIBLE AI UK (EP/Y009800/1); RESPONSIBLE AI UK / RAI UK (unspecified) |
| Date Deposited: | 24 Nov 2025 12:05 |
| Last Modified: | 24 Nov 2025 12:05 |
| Published Version: | https://doi.org/10.1007/s00146-025-02741-9 |
| Status: | Published online |
| Publisher: | Springer Verlag |
| Refereed: | Yes |
| Identification Number: | 10.1007/s00146-025-02741-9 |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:234787 |
