Lin, L, Qiu, S, Yu, Z et al. (5 more authors) (2022) AIACC-Training: Optimizing Distributed Deep Learning Training through Multi-streamed and Concurrent Gradient Communications. In: Proceedings of the 42nd IEEE International Conference on Distributed Computing Systems (ICDCS), 10-13 Jul 2022, Bologna, Italy. IEEE, pp. 853-863. ISBN 978-1-6654-7178-7
Abstract
There is a growing interest in training deep neural networks (DNNs) in a GPU cloud environment. This is typically achieved by running parallel training workers on multiple GPUs across computing nodes. Under such a setup, the communication overhead is often responsible for long training times and poor scalability. This paper presents AIACC-Training, a unified communication framework designed for the distributed training of DNNs in a GPU cloud environment. AIACC-Training permits a training worker to participate in multiple gradient communication operations simultaneously to improve network bandwidth utilization and reduce communication latency. It employs auto-tuning techniques to dynamically determine the right communication parameters based on the input DNN workloads and the underlying network infrastructure. AIACC-Training has been deployed to production at Alibaba GPU Cloud, with 3000+ GPUs executing AIACC-Training-optimized code at any time. Experiments performed on representative DNN workloads show that AIACC-Training outperforms existing solutions, improving the training throughput and scalability by a large margin.
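The core idea the abstract describes — letting one training worker drive several gradient communication operations at once — can be illustrated with a minimal sketch. This is not the AIACC-Training implementation; it only models the concurrency pattern: per-layer gradients are fused into buckets, and each bucket is reduced on one of several concurrent "communication streams" (modelled here with a thread pool), so that no single transfer serializes the others. The bucket size and stream count are the kind of parameters the paper's auto-tuner would choose; the values below are arbitrary assumptions.

```python
# Hypothetical sketch of multi-streamed gradient communication.
# "allreduce" here is a local stand-in for a real collective
# (e.g. a ring all-reduce): it sums each layer's gradient replicas.
from concurrent.futures import ThreadPoolExecutor

NUM_STREAMS = 4   # number of concurrent communication streams (assumption)
BUCKET_SIZE = 3   # per-layer gradients fused per bucket (assumption)

def bucketize(grads, bucket_size):
    """Fuse the per-layer gradient list into communication buckets."""
    return [grads[i:i + bucket_size] for i in range(0, len(grads), bucket_size)]

def allreduce(bucket):
    """Stand-in for one collective over one bucket: sum-reduce each
    layer's gradient across the (simulated) workers' replicas."""
    return [sum(replicas) for replicas in bucket]

def communicate(grads, num_streams=NUM_STREAMS, bucket_size=BUCKET_SIZE):
    """Reduce all gradient buckets, running up to num_streams of the
    collectives concurrently instead of one after another."""
    buckets = bucketize(grads, bucket_size)
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        reduced = list(pool.map(allreduce, buckets))
    return [g for bucket in reduced for g in bucket]

# Two workers' gradients for six layers (each "gradient" is a scalar here).
worker_grads = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]
print(communicate(worker_grads))  # each layer's gradient summed across workers
```

With real network transfers, overlapping the buckets in this way keeps the link busy while any one collective is waiting on remote peers, which is the bandwidth-utilization benefit the abstract claims.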
Metadata
| Field | Value |
|---|---|
| Item Type | Proceedings Paper |
| Authors/Creators | Lin, L; Qiu, S; Yu, Z; et al. (5 more authors) |
| Copyright, Publisher and Additional Information | © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. |
| Keywords | Distributed deep learning; Model training; Communication optimization |
| Institution | The University of Leeds |
| Academic Units | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
| Funding Information | Alibaba DAMO Academy (grant number: not known) |
| Depositing User | Symplectic Publications |
| Date Deposited | 09 May 2022 14:43 |
| Last Modified | 31 Jul 2023 15:47 |
| Status | Published |
| Publisher | IEEE |
| Identification Number | 10.1109/ICDCS54860.2022.00087 |
| Open Archives Initiative ID (OAI ID) | oai:eprints.whiterose.ac.uk:186564 |