To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference

Qin, Q, Ren, J, Yu, J et al. (6 more authors) (2019) To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference. In: 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), 11-13 Dec 2018, Melbourne, Australia. IEEE, pp. 729-736. ISBN: 978-1-7281-1141-4 ISSN: 2158-9178

Abstract

Recent advances in deep neural networks (DNNs) make them attractive for embedded systems. However, DNN inference can take a long time on resource-constrained computing devices. Model compression techniques can address the computation cost of deep inference on embedded devices. These techniques are highly attractive because they rely neither on specialized hardware nor on computation offloading, which is often infeasible due to privacy concerns or high latency. However, it remains unclear how model compression techniques perform across a wide range of DNNs. To design efficient embedded deep learning solutions, we need to understand their behavior. This work develops a quantitative approach to characterize model compression techniques on a representative embedded deep learning architecture, the NVIDIA Jetson TX2. We perform extensive experiments on 11 influential neural network architectures from the image classification and natural language processing domains. We experimentally show how two mainstream compression techniques, data quantization and pruning, perform on these network architectures, and what the implications of compression are for model storage size, inference time, energy consumption and performance metrics. We demonstrate that there are opportunities to achieve fast deep inference on embedded systems, but one must carefully choose the compression settings. Our results provide insights on when and how to apply model compression techniques, and guidelines for designing efficient embedded deep learning systems.

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Qin, Q; Ren, J; Yu, J; Wang, H; Gao, L; Zheng, J; Feng, Y; Fang, J; Wang, Z (https://orcid.org/0000-0001-6157-0662)
Copyright, Publisher and Additional Information:	© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Keywords:	Deep learning; embedded systems; parallelism; energy efficiency; deep inference
Dates:	Published (online): 21 March 2019; Published: 21 March 2019
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	24 Jun 2020 12:55
Last Modified:	24 Jun 2020 12:55
Status:	Published
Publisher:	IEEE
Identification Number:	10.1109/bdcloud.2018.00110
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:162250
