Combining Graph-based Learning with Automated Data Collection for Code Vulnerability Detection

Abstract

This paper presents FUNDED (Flow-sensitive vUlNerability coDE Detection), a novel learning framework for building vulnerability detection models. FUNDED leverages advances in graph neural networks (GNNs) to develop a novel graph-based learning method that captures and reasons about a program's control, data, and call dependencies. Unlike prior work that treats the program as a sequence or an untyped graph, FUNDED learns and operates on a graph representation of the program source code, in which individual statements are connected to other statements through relational edges. By capturing the program's syntax, semantics and flows, FUNDED finds a better code representation for the downstream software vulnerability detection task. To provide sufficient training data for building an effective deep learning model, we combine probabilistic learning and statistical assessments to automatically gather high-quality training samples from open-source projects. This provides many real-life vulnerable code samples to complement the limited number available in standard vulnerability databases. We apply FUNDED to identify software vulnerabilities at the function level from program source code. We evaluate FUNDED on large real-world datasets with programs written in C, Java, Swift and PHP, and compare it against six state-of-the-art code vulnerability detection models. Experimental results show that FUNDED significantly outperforms alternative approaches across evaluation settings.
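
As a rough illustration of the graph representation described in the abstract, the following Python sketch stores statements as nodes and keeps a separate edge list per relation type. It is only a minimal sketch under stated assumptions: the class `ProgramGraph`, the helper `build_example`, the three edge labels, and the toy C statements are all hypothetical and are not taken from the FUNDED implementation, whose actual edge set and data structures are not given on this page.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Relation types named in the abstract. The real system may use a different
# or larger set; this tuple is an assumption made for illustration.
EDGE_TYPES = ("control", "data", "call")


@dataclass
class ProgramGraph:
    """Hypothetical typed graph: nodes are statements, edges carry a relation type."""
    statements: List[str] = field(default_factory=list)
    edges: Dict[str, List[Tuple[int, int]]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def add_statement(self, text: str) -> int:
        """Append a statement node and return its index."""
        self.statements.append(text)
        return len(self.statements) - 1

    def add_edge(self, kind: str, src: int, dst: int) -> None:
        """Add a directed edge of the given relation type."""
        if kind not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {kind}")
        self.edges[kind].append((src, dst))

    def neighbors(self, node: int, kind: str) -> List[int]:
        """Return the targets reachable from `node` over one relation type."""
        return [dst for src, dst in self.edges[kind] if src == node]


def build_example() -> ProgramGraph:
    """Build a toy graph for a small, made-up C fragment."""
    g = ProgramGraph()
    s0 = g.add_statement("char buf[8];")
    s1 = g.add_statement("if (n < 8)")
    s2 = g.add_statement("memcpy(buf, src, n);")
    s3 = g.add_statement("log_copy(n);")
    s4 = g.add_statement("void log_copy(int n)")
    g.add_edge("control", s1, s2)  # the branch guards the copy
    g.add_edge("data", s0, s2)     # buf is defined, then used by the copy
    g.add_edge("call", s3, s4)     # call site to callee definition
    return g
```

Keeping one edge list per relation type is what distinguishes this from an untyped graph: a learning model can treat control, data, and call neighbours differently rather than merging them into a single adjacency structure.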

Metadata

Item Type:	Article
Authors/Creators:	Wang, H; Ye, G; Tang, Z; Tan, SH; Huang, S; Fang, D; Feng, Y; Bian, L; Wang, Z https://orcid.org/0000-0001-6157-0662
Copyright, Publisher and Additional Information:	© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Keywords:	Software vulnerability, code vulnerability detection, deep learning, deep graph neural networks
Dates:	Published: 14 December 2020; Published (online): 14 December 2020; Accepted: 18 November 2020
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	01 Dec 2020 15:49
Last Modified:	16 Apr 2021 04:31
Status:	Published
Publisher:	Institute of Electrical and Electronics Engineers
Identification Number:	10.1109/TIFS.2020.3044773
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:168594
