Optimizing Depthwise Separable Convolution Operations on GPUs

Abstract

The depthwise separable convolution is widely used to reduce the computation overhead of multi-channel 2D convolutions. Existing implementations of depthwise separable convolutions target model training with large batch sizes, where many samples are processed at once. Such approaches are inadequate for small-batch model training and for the typical model inference scenario, where the model takes in only a few samples at a time. This paper aims to bridge this gap by optimizing depthwise separable convolutions for the GPU architecture. We achieve this by designing two novel algorithms that improve the column and row reuse of convolution operations and thereby reduce the number of memory operations. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads, improving GPU utilization and hiding memory access latency. We apply our approach to two GPU platforms, NVIDIA RTX 2080Ti and NVIDIA Jetson AGX Xavier, and two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2× (up to 3×) performance improvement over cuDNN.
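As a rough illustration of why the depthwise separable form reduces computation, the sketch below counts multiply-accumulate operations for a standard convolution and for a depthwise pass followed by a pointwise (1×1) pass, and gives plain reference versions of both passes. It is not the GPU algorithm described in the paper: all function names are hypothetical, and stride 1 with no padding is assumed.

```python
def standard_conv_macs(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulates for a standard multi-channel 2D convolution."""
    return h_out * w_out * c_in * c_out * k * k


def depthwise_separable_macs(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulates for a depthwise pass plus a 1x1 pointwise pass."""
    depthwise = h_out * w_out * c_in * k * k   # one k x k filter per input channel
    pointwise = h_out * w_out * c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise


def depthwise_conv2d(x, filters):
    """Reference depthwise convolution (assumed: stride 1, no padding).

    x:       list of C channels, each an H x W list of lists
    filters: list of C filters, each a K x K list of lists
    """
    out = []
    for channel, filt in zip(x, filters):
        k = len(filt)
        h_out = len(channel) - k + 1
        w_out = len(channel[0]) - k + 1
        out.append([
            [
                sum(channel[i + di][j + dj] * filt[di][dj]
                    for di in range(k) for dj in range(k))
                for j in range(w_out)
            ]
            for i in range(h_out)
        ])
    return out


def pointwise_conv2d(x, weights):
    """Reference pointwise (1x1) convolution.

    weights: C_out rows, each holding C_in per-channel weights.
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wc * x[c][i][j] for c, wc in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]
```

Under these assumptions the ratio of the two operation counts is 1/C_out + 1/K², so the saving grows with both the number of output channels and the filter size.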

Metadata

Item Type:	Article
Authors/Creators:	Lu, G; Zhang, W; Wang, Z https://orcid.org/0000-0001-6157-0662
Copyright, Publisher and Additional Information:	© 2021, IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Keywords:	Performance Optimization, Convolution, Depthwise, Pointwise, Memory Optimization, GPU Utilization
Dates:	Published (online): 28 May 2021; Accepted: 24 May 2021
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	07 Jun 2021 09:43
Last Modified:	07 Jun 2021 09:45
Status:	Published online
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Identification Number:	10.1109/tpds.2021.3084813
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:174797
