Rahmani, H.A. orcid.org/0000-0002-2779-4942, Wang, X. orcid.org/0000-0001-5936-9919, Yilmaz, E. orcid.org/0000-0003-4734-4532 et al. (3 more authors) (2025) SynDL: A large-scale synthetic test collection for passage retrieval. In: Companion Proceedings of the ACM on Web Conference 2025. WWW '25: The ACM Web Conference 2025, 28 Apr - 02 May 2025, Sydney NSW, Australia. ACM, pp. 781-784. ISBN 9798400713316
Abstract
Large-scale test collections play a crucial role in Information Retrieval (IR) research. However, according to the Cranfield paradigm and research into publicly available datasets, existing information retrieval studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments, a time-intensive and expensive process. Recent studies have shown the strong capability of Large Language Models (LLMs) in producing reliable relevance judgments with human-level accuracy but at a greatly reduced cost. In this paper, to address the lack of a large-scale ad-hoc document retrieval dataset, we extend the TREC Deep Learning Track (DL) test collection with additional synthetic labels generated by a language model, enabling researchers to test and evaluate their search systems at a large scale. Specifically, the test collection includes more than 1,900 test queries from previous years of the track. We compare system evaluation against human labels from past years and find that our synthetically created large-scale test collection can lead to highly correlated system rankings.
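To make the final claim concrete, the following is a minimal sketch of how agreement between two system rankings could be measured. The abstract does not say which correlation measure was used; Kendall's tau is assumed here because it is a common choice for comparing system orderings. The system names and scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by system name."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant when both label sets order it the same way.
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores per system (illustrative values only).
human = {"bm25": 0.48, "dense": 0.62, "rerank": 0.71, "hybrid": 0.66}
synthetic = {"bm25": 0.51, "dense": 0.68, "rerank": 0.74, "hybrid": 0.67}

tau = kendall_tau(human, synthetic)
print(f"Kendall's tau = {tau:.2f}")
```

A value near 1 means that the two label sets order the systems almost identically; in this toy data only one pair of systems swaps places.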
Metadata
Item Type: | Proceedings Paper |
---|---|
Copyright, Publisher and Additional Information: | © 2025 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/) |
Keywords: | Synthetic Data Generation; Large Language Model; Test Collection |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 02 Jul 2025 11:16 |
Last Modified: | 02 Jul 2025 11:16 |
Status: | Published |
Publisher: | ACM |
Refereed: | Yes |
Identification Number: | 10.1145/3701716.3715311 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:228621 |