SecureMind: A Framework for Benchmarking Large Language Models in Memory Bug Detection and Repair

Wang, H., Jacob, D., Kelly, D. et al. (3 more authors) (2025) SecureMind: A Framework for Benchmarking Large Language Models in Memory Bug Detection and Repair. In: ISMM '25: Proceedings of the 2025 ACM SIGPLAN International Symposium on Memory Management. 2025 ACM SIGPLAN International Symposium on Memory Management (ISMM 2025), 17 Jun 2025, Seoul, South Korea. Association for Computing Machinery, pp. 27-40. ISBN: 979-8-4007-1610-2/25/06.

Abstract

Large language models (LLMs) hold great promise for automating software vulnerability detection and repair, but ensuring their correctness remains a challenge. While recent work has developed benchmarks for evaluating LLMs in bug detection and repair, existing studies rely on hand-crafted datasets that quickly become outdated. Moreover, systematic evaluation of advanced reasoning-based LLMs using chain-of-thought prompting for software security is lacking. We introduce SecureMind, an open-source framework for evaluating LLMs in vulnerability detection and repair, focusing on memory-related vulnerabilities. SecureMind provides a user-friendly Python interface for defining test plans, which automates data retrieval, preparation, and benchmarking across a wide range of metrics. Using SecureMind, we assess 10 representative LLMs, including 7 state-of-the-art reasoning models, on 16K test samples spanning 8 Common Weakness Enumeration (CWE) types related to memory safety violations. Our findings highlight the strengths and limitations of current LLMs in handling memory-related vulnerabilities.

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Wang, H.; Jacob, D.; Kelly, D.; Elkhatib, Y.; Singer, J.; Wang, Z. (https://orcid.org/0000-0001-6157-0662)
Copyright, Publisher and Additional Information:	© 2025 Copyright held by the owner/author(s). This is an open access conference paper under the terms of the Creative Commons Attribution License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited.
Keywords:	Software bug detection, Bug repair, Large language models
Dates:	Accepted: 3 May 2025; Published: 13 June 2025
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Funding Information:	EPSRC (Engineering and Physical Sciences Research Council), grant numbers EP/X018202/1 and EP/X037304/1
Date Deposited:	16 May 2025 12:50
Last Modified:	12 Aug 2025 10:25
Status:	Published
Publisher:	Association for Computing Machinery
Identification Number:	10.1145/3735950.3735954
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:226674

Download

Published Version

Filename: 3735950.3735954.pdf

Licence: CC-BY 4.0