Paul, S., Majumdar, S. orcid.org/0000-0003-3935-4087, Shah, R. et al. (9 more authors) (2025) Overview of the “Information Retrieval in Software Engineering” (IRSE) track at Forum for Information Retrieval 2024. In: Ganguly, D., Sanyal, D. K., Majumder, P., Majumdar, S. and Gangopadhyay, S., (eds.) FIRE '24: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation. The 16th Annual Meeting of the Forum for Information Retrieval Evaluation, 12-15 Dec 2024, Gandhinagar, India. Association for Computing Machinery, New York, NY, pp. 18-21. ISBN: 979-8-4007-1318-7.
Abstract
The “Information Retrieval in Software Engineering” (IRSE) track aims to devise solutions for the automated evaluation of code comments within a machine learning framework, with labels generated by both humans and large language models. The track offered two tasks this year: i) a comment usefulness prediction task, and ii) a code quality estimation task.
The comment classification task involves classifying comments as either useful or not useful. The dataset comprises 9,048 pairs of code comments and surrounding code snippets drawn from open-source C-based projects on GitHub, together with an additional dataset generated by participating teams using large language models. In total, 12 teams from various universities contributed experiments. These were assessed through quantitative metrics, primarily the F1-score, and qualitative evaluations based on the features developed, the supervised learning models employed, and their respective hyper-parameters. Notably, labels generated by large language models introduce bias into the prediction model but lead to less over-fitted results.
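As an illustrative sketch (not the track's official evaluation script), the primary quantitative metric, the F1-score over the binary useful / not-useful labels, can be computed as:

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, treating 1 as the 'useful' class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if 2 * tp + fp + fn == 0:
        return 0.0  # no positive labels or predictions at all
    return 2 * tp / (2 * tp + fp + fn)

# Toy example with hypothetical labels (1 = useful, 0 = not useful).
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

The labels and values above are made up for illustration; the track's submissions would compute this over the full test split.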
The code quality estimation sub-track was introduced this year. Given a problem description and a list of large language model (LLM)-generated code solutions, the objective is to automatically estimate the functional correctness of each generated solution. For evaluation, the problem-solution pairs are ranked by their estimated probabilities of functional correctness, and the quality of this ranking is reported using standard ranking performance measures.
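Average precision is one such standard ranking measure; a minimal sketch of how a ranking induced by estimated correctness probabilities might be scored against binary ground truth (both inputs below are hypothetical, not drawn from the track's data):

```python
def average_precision(scores, correct):
    """Average precision of the ranking induced by descending scores.

    scores  : estimated probabilities of functional correctness
    correct : 1 if the generated solution is actually correct, else 0
    """
    ranked = sorted(zip(scores, correct), key=lambda x: -x[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy example: three candidate solutions for one problem.
print(average_precision([0.9, 0.8, 0.3], [1, 0, 1]))  # ≈ 0.8333
```

A better estimator pushes correct solutions toward the top of the ranking, raising the score toward 1.0.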
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Copyright, Publisher and Additional Information: | Copyright © 2024 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License. |
| Keywords: | Large Language Models, Comment Usefulness Prediction, Code Quality Estimation |
| Institution: | The University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
| Date Deposited: | 05 Feb 2026 15:20 |
| Last Modified: | 05 Feb 2026 15:20 |
| Published Version: | https://dl.acm.org/doi/10.1145/3734947.3735667 |
| Status: | Published |
| Publisher: | Association for Computing Machinery |
| Identification Number: | 10.1145/3734947.3735667 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:237529 |
Download
Filename: Overview of the “Information Retrieval in Software Engineering” (IRSE).pdf
Licence: CC-BY 4.0

CORE (COnnecting REpositories)