Detecting Code Vulnerabilities by Learning from Large-Scale Open Source Repositories

Abstract

Machine learning methods are widely used to identify common, repeatedly occurring bugs and code vulnerabilities. The performance of a machine-learned model is bounded by the quality and quantity of training data and by the model’s ability to extract and capture the essential information of the problem domain. Unfortunately, there is a shortage of high-quality samples for training code vulnerability detection models, and existing machine learning methods are inadequate at capturing code vulnerability patterns.

We present Developer, a novel learning framework for building code vulnerability detection models. To address the data scarcity challenge, Developer automatically gathers training samples from open-source projects and applies constraint rules to filter noisy data out of the collected samples. The collected data provides many real-world vulnerable code samples that complement those available in standard vulnerability databases. To build an effective detection model, Developer employs a convolutional neural network architecture with attention mechanisms to extract a code representation from the program abstract syntax tree. The extracted representation is then fed to a downstream network, a bidirectional long short-term memory (LSTM) architecture, to predict whether the target code contains a vulnerability. We apply Developer to identify vulnerabilities at the program source-code level. Our evaluation shows that Developer outperforms state-of-the-art methods, uncovering more vulnerabilities with a lower false-positive rate.
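The data-collection step described above (gather candidate samples, then apply constraint rules to discard noisy ones) can be illustrated with a minimal sketch. The abstract does not state the actual rules, so everything here is an assumption made for illustration: the `Commit` record, the keyword pattern, and the size thresholds are hypothetical and are not taken from Developer itself.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword pattern: the abstract does not list the real rules.
SECURITY_PATTERN = re.compile(
    r"\b(CVE-\d{4}-\d{4,}|vulnerab\w*|overflow|use[- ]after[- ]free)\b",
    re.IGNORECASE,
)


@dataclass
class Commit:
    """Hypothetical record for one commit mined from an open-source project."""
    message: str
    files_changed: int
    lines_changed: int


def passes_constraints(commit, max_files=3, max_lines=50):
    """Return True if the commit looks like a small, security-related fix.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    mentions_security = SECURITY_PATTERN.search(commit.message) is not None
    is_small = (
        commit.files_changed <= max_files
        and commit.lines_changed <= max_lines
    )
    return mentions_security and is_small


def filter_commits(commits):
    """Keep only the commits that satisfy every constraint rule."""
    return [c for c in commits if passes_constraints(c)]


if __name__ == "__main__":
    candidates = [
        Commit("Fix buffer overflow in parser", 1, 12),
        Commit("Update README", 1, 3),
        Commit("Fix CVE-2020-1234 and refactor everything", 40, 2000),
    ]
    for commit in filter_commits(candidates):
        print(commit.message)
```

In this sketch a large commit is rejected even when its message mentions a CVE, on the assumption that a sprawling change makes it hard to isolate the vulnerable code; a real pipeline would likely combine several such rules.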

Metadata

Item Type:	Article
Authors/Creators:	Xu, R; Tang, Z; Ye, G; Wang, H; Ke, X; Fang, D; Wang, Z https://orcid.org/0000-0001-6157-0662
Copyright, Publisher and Additional Information:	© 2022 Elsevier Ltd. This is an author produced version of an article published in Journal of Information Security and Applications. Uploaded in accordance with the publisher's self-archiving policy.
Keywords:	Code vulnerability detection; Deep learning; Attention mechanism; Software vulnerability
Dates:	Published: September 2022; Published (online): 9 August 2022; Accepted: 20 July 2022
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	25 Jul 2022 11:47
Last Modified:	09 Aug 2023 00:13
Status:	Published
Publisher:	Elsevier
Identification Number:	10.1016/j.jisa.2022.103293
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:189375

Download

Accepted Version

Filename: DEVELOPER.pdf

Licence: CC-BY-NC-ND 4.0
