Xu, R, Tang, Z, Ye, G et al. (4 more authors) (2022) Detecting Code Vulnerabilities by Learning from Large-Scale Open Source Repositories. Journal of Information Security and Applications, 69. 103293. ISSN 2214-2126
Abstract
Machine learning methods are widely used to identify common, repeatedly occurring bugs and code vulnerabilities. The performance of a machine-learned model is bounded by the quality and quantity of its training data and by the model’s capability to extract and capture the essential information of the problem domain. Unfortunately, there is a shortage of high-quality samples for training code vulnerability detection models, and existing machine learning methods are inadequate at capturing code vulnerability patterns.
We present Developer, a novel learning framework for building code vulnerability detection models. To address the data scarcity challenge, Developer automatically gathers training samples from open-source projects and applies constraint rules to filter out noisy samples, improving the quality of the collected data. The collected data provides many real-world vulnerable code samples that complement those available in standard vulnerability databases. To build an effective code vulnerability detection model, Developer employs a convolutional neural network architecture with attention mechanisms to extract a code representation from the program's abstract syntax tree. The extracted representation is then fed to a downstream network, a bidirectional long short-term memory architecture, which predicts whether the target code contains a vulnerability. We apply Developer to identify vulnerabilities at the program source-code level. Our evaluation shows that Developer outperforms state-of-the-art methods by uncovering more vulnerabilities with a lower false-positive rate.
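As a rough illustration of the input side of such a pipeline, the sketch below flattens a program's abstract syntax tree into a fixed-length sequence of integer ids, the kind of input a convolutional or recurrent network could consume. It is not the authors' implementation: the abstract does not name a parser, a target language, or an encoding scheme, so Python's standard `ast` module stands in for the parser, and the helper names `ast_node_sequence`, `build_vocab`, and `encode` are hypothetical.

```python
import ast


def ast_node_sequence(source):
    """Return AST node-type names in depth-first, pre-order traversal order."""
    tree = ast.parse(source)
    sequence = []

    def visit(node):
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


def build_vocab(sequences):
    """Map each node-type name to an integer id (assumed scheme: 0 = pad, 1 = unknown)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for name in seq:
            vocab.setdefault(name, len(vocab))
    return vocab


def encode(seq, vocab, max_len):
    """Truncate or right-pad a node-type sequence to max_len integer ids."""
    ids = [vocab.get(name, vocab["<unk>"]) for name in seq[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    snippet = "def copy(buf, n):\n    return buf[:n]\n"
    seq = ast_node_sequence(snippet)
    vocab = build_vocab([seq])
    print(seq)
    print(encode(seq, vocab, max_len=16))
```

In a full system, the encoded ids would be passed through an embedding layer before reaching the feature extractor and the sequence classifier; those parts are omitted here because the abstract gives no layer-level detail.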
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Xu, R; Tang, Z; Ye, G; et al. (4 more authors) |
Copyright, Publisher and Additional Information: | © 2022 Elsevier Ltd. This is an author produced version of an article published in Journal of Information Security and Applications. Uploaded in accordance with the publisher's self-archiving policy. |
Keywords: | Code vulnerability detection; Deep learning; Attention mechanism; Software vulnerability |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 25 Jul 2022 11:47 |
Last Modified: | 09 Aug 2023 00:13 |
Status: | Published |
Publisher: | Elsevier |
Identification Number: | 10.1016/j.jisa.2022.103293 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:189375 |