Dai, Y, Dong, Y, Lu, K et al. (5 more authors) (2023) Towards Scalable Resource Management for Supercomputers. In: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. SC22: The International Conference for High Performance Computing, Networking, Storage, and Analysis, 13-18 Nov 2022, Dallas, Texas, USA. IEEE ISBN 978-1-6654-5444-5
Abstract
Today's supercomputers offer massive computation resources to execute a large number of user jobs. Effectively managing such large-scale hardware parallelism and workloads is essential for supercomputers. However, existing HPC resource management (RM) systems fail to capitalize on the hardware parallelism by following a centralized design used decades ago. They give poor scalability and inefficient performance on today's supercomputers, which will worsen in exascale computing. We present ESlurm, a better RM for supercomputers. As a departure from existing HPC RMs, ESlurm implements a distributed communication structure. It employs a new communication tree strategy and uses job runtime estimation to improve communications and job scheduling efficiency. ESlurm is deployed into production in a real supercomputer. We evaluate ESlurm on up to 20K nodes. Compared to state-of-the-art RM solutions, ESlurm exhibits better scalability, significantly reducing the resource usage of master nodes and improving data transfer and job scheduling efficiency by a large margin.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Dai, Y, Dong, Y, Lu, K et al. (5 more authors) |
Copyright, Publisher and Additional Information: | © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Uploaded in accordance with the publisher's self-archiving policy. |
Keywords: | Resource management; Exascale computing; Scheduling |
Dates: |
|
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Funding Information: | Funder: Royal Society; Grant number: IEC\NSFC\191465 |
Depositing User: | Symplectic Publications |
Date Deposited: | 01 Sep 2022 10:19 |
Last Modified: | 28 Jul 2023 12:32 |
Status: | Published |
Publisher: | IEEE |
Identification Number: | 10.1109/SC41404.2022.00029 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:190405 |