Large language model-based multiagent collaboration for abstract screening toward automated systematic reviews

Abstract

Objective Systematic reviews (SRs) are essential for evidence-based practice but remain labor-intensive, especially during abstract screening. This study evaluates whether collaboration among multiple large language models (multi-LLM) can improve the efficiency and reduce the cost of abstract screening.

Methods Abstract screening was framed as a question-answering (QA) task using cost-effective LLMs. Three multi-LLM collaboration strategies were evaluated, including majority voting by averaging opinions of peers, multi-agent debate (MAD) for answer refinement, and LLM-based adjudication against answers of individual QA baselines. These strategies were evaluated on 28 SRs of the CLEF eHealth 2019 Technology-Assisted Review benchmark using standard performance metrics such as Mean Average Precision (MAP) and Work Saved over Sampling at 95% recall (WSS@95%).
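As a rough illustration of the evaluation described above, the following Python sketch averages per-abstract relevance scores from several agents, ranks the abstracts, and computes average precision and WSS@95% for one review. It is not the authors' implementation: the function names, the assumption that each agent returns a score in [0, 1], and the toy data are all hypothetical, and WSS@95% is computed with the common definition (N - k)/N - (1 - 0.95), where k is the number of abstracts screened before 95% of the relevant ones have been found.

```python
import math


def average_scores(agent_scores):
    """Average per-abstract scores across agents (assumed to lie in [0, 1])."""
    n_agents = len(agent_scores)
    n_docs = len(agent_scores[0])
    return [sum(a[i] for a in agent_scores) / n_agents for i in range(n_docs)]


def rank_by_score(scores):
    """Return abstract indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def average_precision(ranking, labels):
    """Average precision for one review; MAP is the mean of this over reviews."""
    hits, total = 0, 0.0
    for pos, doc in enumerate(ranking, start=1):
        if labels[doc]:
            hits += 1
            total += hits / pos
    return total / hits if hits else 0.0


def wss_at_recall(ranking, labels, recall=0.95):
    """Work Saved over Sampling at the given recall level."""
    n = len(ranking)
    n_rel = sum(labels)
    if n_rel == 0:
        raise ValueError("WSS is undefined without relevant abstracts")
    # Small epsilon guards against float error in recall * n_rel.
    needed = math.ceil(recall * n_rel - 1e-9)
    found = 0
    for pos, doc in enumerate(ranking, start=1):
        found += labels[doc]
        if found >= needed:
            # Abstracts after position `pos` never need to be screened.
            return (n - pos) / n - (1 - recall)
    return -(1 - recall)


# Hypothetical toy data: three agents scoring four abstracts.
agent_scores = [
    [0.9, 0.2, 0.6, 0.1],
    [0.7, 0.4, 0.8, 0.3],
    [0.8, 0.1, 0.3, 0.2],
]
labels = [1, 0, 1, 0]  # 1 = relevant according to the reference standard

ranking = rank_by_score(average_scores(agent_scores))
ap = average_precision(ranking, labels)
wss = wss_at_recall(ranking, labels)
```

In this toy case both relevant abstracts are ranked first, so a reviewer could stop after screening half of the list. Real reviews in the benchmark contain far more abstracts and far fewer relevant ones.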

Results Multi-LLM collaboration significantly outperformed QA baselines. Majority voting was overall the best strategy, achieving the highest MAP of 0.462 and 0.341 on subsets of SRs about clinical intervention and diagnostic technology assessment, respectively, with WSS@95% of 0.606 and 0.680, enabling in theory up to 68% workload reduction at 95% recall of all relevant studies. MAD improved weaker models most. Our own adjudicator-as-a-ranker method was the second strongest approach, surpassing adjudicator-as-a-judge, but at a significantly higher cost than majority voting and debating.

Conclusion Multi-LLM collaboration substantially improves abstract screening efficiency, and the success lies in model diversity. Making the best use of diversity, majority voting stands out in terms of both excellent performance and low cost compared to adjudication. Despite context-dependent gains and diminishing model diversity, MAD is still a cost-effective strategy and a potential direction of further research.

Metadata

Item Type:	Article
Authors/Creators:	Akinseloyin, O.; Jiang, X. (https://orcid.org/0000-0003-4255-5445); Palade, V.
Copyright, Publisher and Additional Information:	© The Author(s) 2026. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Keywords:	Systematic Review; Abstract Screening; Large Language Model; Ensemble; Multi-Agent System
Dates:	Accepted: 19 January 2026; Published (online): 4 February 2026; Published: 4 February 2026
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Social Sciences (Sheffield) > School of Information, Journalism and Communication
Date Deposited:	12 Feb 2026 15:38
Last Modified:	11 Mar 2026 09:40
Status:	Published
Publisher:	Oxford University Press (OUP)
Refereed:	Yes
Identification Number:	10.1093/biomethods/bpag006
Related URLs:	Dataset Dataset
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:237902

Download

Published Version

Filename: bpag006.pdf

Licence: CC-BY-NC 4.0

