A rapid evidence review of evaluation techniques for large language models in legal use cases: trends, gaps, and recommendations for future research

Abstract

The legal profession faces mounting pressures, including case backlogs and limited access to legal services. Large language models (LLMs), such as OpenAI’s GPT series, have been touted as potential solutions, promising to streamline tasks such as legal drafting, summarisation, analysis, and advice. Proponents argue these models can enhance efficiency, accuracy, and access to justice. However, significant risks remain. LLMs are prone to bias, factual hallucinations, and opaque reasoning processes, which can have severe consequences in high-stakes legal contexts. For responsible use in law, legal use cases must be accurately operationalised into LLM tasks that are sensitive to legal settings, and the metrics used to evaluate LLMs performing those tasks must be equally sensitive to those settings. This paper presents a rapid literature review of LLM research in legal contexts since GPT-4’s release in March 2023. We examine how legal tasks are operationalised for LLMs and what evaluation metrics are used, with a focus on how these align, or fail to align, with real-world legal practice. We argue that existing studies often overlook the institutional, organisational, and professional contexts in which these tools would be deployed. This oversight limits the practical relevance of current evaluations. We propose directions for more contextually grounded research and responsible deployment strategies.

Metadata

Item Type:	Article
Authors/Creators:	Kelsall, J.; Tan, X.; Bergin, A.; Chen, J. (ORCID: https://orcid.org/0000-0002-1970-6762); Waheed, M.; Sorell, T.; Procter, R.; Liakata, M.; Chim, J.; Chi, S.
Copyright, Publisher and Additional Information:	© 2025 The Authors. This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Keywords:	AI and Law; Legal AI; AI benchmarking; AI Review; AI Metrics; Evaluation
Dates:	Accepted: 10 November 2025; Published (online): 21 November 2025; Published: 21 November 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Arts and Humanities (Sheffield) > School of Law
Funding Information:	RESPONSIBLE AI UK (grant number EP/Y009800/1); RESPONSIBLE AI UK / RAI UK (grant number unspecified)
Date Deposited:	24 Nov 2025 12:05
Last Modified:	24 Nov 2025 12:05
Published Version:	https://doi.org/10.1007/s00146-025-02741-9
Status:	Published online
Publisher:	Springer Verlag
Refereed:	Yes
Identification Number:	10.1007/s00146-025-02741-9
Related URLs:	Dataset
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:234787
