Lin, L, Qiu, S, Yu, Z et al. (5 more authors) (2022) AIACC-Training: Optimizing Distributed Deep Learning Training through Multi-streamed and Concurrent Gradient Communications. In: Proceedings of the 42nd IEEE International Conference on Distributed Computing Systems (ICDCS), 10-13 Jul 2022, Bologna, Italy. IEEE, pp. 853-863. ISBN 978-1-6654-7178-7
Abstract
There is a growing interest in training deep neural networks (DNNs) in a GPU cloud environment. This is typically achieved by running parallel training workers on multiple GPUs across computing nodes. Under such a setup, the communication overhead is often responsible for long training times and poor scalability. This paper presents AIACC-Training, a unified communication framework designed for the distributed training of DNNs in a GPU cloud environment. AIACC-Training permits a training worker to participate in multiple gradient communication operations simultaneously to improve network bandwidth utilization and reduce communication latency. It employs auto-tuning techniques to dynamically determine the right communication parameters based on the input DNN workloads and the underlying network infrastructure. AIACC-Training has been deployed to production at Alibaba GPU Cloud, with 3000+ GPUs executing AIACC-Training-optimized code at any time. Experiments performed on representative DNN workloads show that AIACC-Training outperforms existing solutions, improving the training throughput and scalability by a large margin.
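The core idea the abstract describes — letting one training worker drive several gradient communication operations at once — can be illustrated with a minimal sketch. This is not the AIACC-Training implementation; it only models the concurrency pattern: per-layer gradients are fused into buckets, and each bucket is reduced on one of several concurrent "communication streams" (modelled here with a thread pool), so that no single transfer serializes the others. The bucket size and stream count are the kind of parameters the paper's auto-tuner would choose; the values below are arbitrary assumptions.

```python
# Hypothetical sketch of multi-streamed gradient communication.
# "allreduce" here is a local stand-in for a real collective
# (e.g. a ring all-reduce): it sums each layer's gradient replicas.
from concurrent.futures import ThreadPoolExecutor

NUM_STREAMS = 4   # number of concurrent communication streams (assumption)
BUCKET_SIZE = 3   # per-layer gradients fused per bucket (assumption)

def bucketize(grads, bucket_size):
    """Fuse the per-layer gradient list into communication buckets."""
    return [grads[i:i + bucket_size] for i in range(0, len(grads), bucket_size)]

def allreduce(bucket):
    """Stand-in for one collective over one bucket: sum-reduce each
    layer's gradient across the (simulated) workers' replicas."""
    return [sum(replicas) for replicas in bucket]

def communicate(grads, num_streams=NUM_STREAMS, bucket_size=BUCKET_SIZE):
    """Reduce all gradient buckets, running up to num_streams of the
    collectives concurrently instead of one after another."""
    buckets = bucketize(grads, bucket_size)
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        reduced = list(pool.map(allreduce, buckets))
    return [g for bucket in reduced for g in bucket]

# Two workers' gradients for six layers (each "gradient" is a scalar here).
worker_grads = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]
print(communicate(worker_grads))  # each layer's gradient summed across workers
```

With real network transfers, overlapping the buckets in this way keeps the link busy while any one collective is waiting on remote peers, which is the bandwidth-utilization benefit the abstract claims.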
Metadata
| Field | Value |
|---|---|
| Item Type | Proceedings Paper |
| Authors/Creators | Lin, L; Qiu, S; Yu, Z; et al. (5 more authors) |
| Copyright, Publisher and Additional Information | © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. |
| Keywords | Distributed deep learning; Model training; Communication optimization |
| Institution | The University of Leeds |
| Academic Units | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
| Funding Information | Alibaba DAMO Academy (grant number: not known) |
| Depositing User | Symplectic Publications |
| Date Deposited | 09 May 2022 14:43 |
| Last Modified | 31 Jul 2023 15:47 |
| Status | Published |
| Publisher | IEEE |
| Identification Number | 10.1109/ICDCS54860.2022.00087 |
| Open Archives Initiative ID (OAI ID) | oai:eprints.whiterose.ac.uk:186564 |