Ouyang, X, Wang, C, Yang, R et al. (3 more authors) (2018) ML-NA: A Machine Learning Based Node Performance Analyzer Utilizing Straggler Statistics. In: 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS). ICPADS 2017, 15-17 Dec 2017, Shenzhen, China. IEEE, pp. 73-80. ISBN 978-1-5386-2129-5
Abstract
Current Cloud clusters often consist of heterogeneous machine nodes, which can trigger performance challenges such as the task straggler problem, whereby a small subset of parallel tasks runs abnormally slower than its sibling tasks. The straggler problem extends job response time and degrades system throughput. Poorly performing nodes are more likely to produce stragglers and can undermine the effectiveness of straggler mitigation. For example, speculative execution, the dominant mechanism for straggler alleviation, creates redundant task replicas on other machine nodes as soon as a straggler is detected. When speculative copies are assigned to poorly performing nodes, they are less likely to catch up with the stragglers than replicas run on fast nodes. Because performance heterogeneity is caused not only by variations in static attributes such as physical capacity, but also by fluctuations in dynamic characteristics such as contention level, analyzing node performance is important yet challenging. In this paper we develop ML-NA, a Machine Learning based Node performance Analyzer. By leveraging historical execution logs of parallel tasks, ML-NA classifies cluster nodes into different categories and predicts their performance in the near future, providing a scheduling guide to improve speculation effectiveness and minimize task straggler generation. We consider MapReduce as a representative framework for our analysis, and use the published OpenCloud trace as a case study to train and evaluate our model. Results show that ML-NA can predict node performance categories with an average accuracy of up to 92.86%.
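As a rough illustration of how straggler statistics from task logs could be turned into node categories, the following minimal sketch computes a per-node straggler rate and assigns a coarse label. It is not the ML-NA model described in the abstract. The log layout, the 1.5x-median straggler threshold, the 0.5 rate cutoff, and all names (`TASK_LOG`, `straggler_rates`, `categorize`) are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import median

# Hypothetical log records: (job_id, node_id, task_duration_seconds).
TASK_LOG = [
    ("job1", "nodeA", 10.0),
    ("job1", "nodeB", 11.0),
    ("job1", "nodeC", 30.0),
    ("job2", "nodeA", 20.0),
    ("job2", "nodeB", 22.0),
    ("job2", "nodeC", 21.0),
    ("job3", "nodeA", 5.0),
    ("job3", "nodeB", 6.0),
    ("job3", "nodeC", 12.0),
]

# Assumed threshold: a task counts as a straggler if it takes more than
# 1.5x the median duration of the tasks in the same job.
STRAGGLER_FACTOR = 1.5


def straggler_rates(log, factor=STRAGGLER_FACTOR):
    """Return, for each node, the fraction of its tasks that were stragglers."""
    durations_by_job = defaultdict(list)
    for job_id, _node_id, duration in log:
        durations_by_job[job_id].append(duration)
    job_median = {job: median(ds) for job, ds in durations_by_job.items()}

    total = defaultdict(int)
    slow = defaultdict(int)
    for job_id, node_id, duration in log:
        total[node_id] += 1
        if duration > factor * job_median[job_id]:
            slow[node_id] += 1
    return {node: slow[node] / total[node] for node in total}


def categorize(rates, slow_cutoff=0.5):
    """Label a node 'slow' if its straggler rate reaches the assumed cutoff."""
    return {
        node: ("slow" if rate >= slow_cutoff else "fast")
        for node, rate in rates.items()
    }


rates = straggler_rates(TASK_LOG)
labels = categorize(rates)
```

Labels of this kind could serve as training targets for a classifier, or as a hint to place speculative copies on nodes currently labelled fast. The thresholds here are placeholders and would need tuning against real trace data.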
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: |
|
Copyright, Publisher and Additional Information: | This is an author produced version of a paper accepted for publication in Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Uploaded in accordance with the publisher’s self-archiving policy. |
Keywords: | Node Performance; Straggler Problem; Machine Learning; Prediction |
Dates: |
|
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 26 Sep 2017 15:36 |
Last Modified: | 13 Jul 2018 13:24 |
Status: | Published |
Publisher: | IEEE |
Identification Number: | 10.1109/ICPADS.2017.00021 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:121681 |