Garraghan, PM, Ouyang, X, Townend, P et al. (1 more author) (2015) Timely Long Tail Identification through Agent Based Monitoring and Analytics. In: Real-Time Distributed Computing (ISORC), 2015 IEEE 18th International Symposium on. IEEE 18th International Symposium on Real-Time Distributed Computing (ISORC), 13-17 Apr 2015, Auckland, New Zealand. Institute of Electrical and Electronics Engineers, 19-26.
Abstract
The increasing complexity and scale of distributed systems have resulted in emergent behavior that substantially affects overall system performance. A significant emergent property is the "Long Tail", whereby a small proportion of task stragglers significantly impacts job completion times. To mitigate such behavior, straggling tasks within the system need to be identified accurately and in a timely manner. However, current approaches focus on mitigation rather than identification, and typically identify stragglers too late in the execution lifecycle. This paper presents a method and tool to identify Long Tail behavior within distributed systems in a timely manner through a combination of online and offline analytics. This is achieved through historical analysis to profile and model task execution patterns, which then inform online analytic agents that monitor task execution at runtime. Furthermore, we provide an empirical analysis of two large-scale production Cloud data centers that demonstrates the challenge of data skew within modern distributed systems. This analysis shows that approximately 5% of task stragglers caused by data skew impact 50% of the total jobs for batch processes. Our results demonstrate that our approach is capable of identifying task stragglers less than 11% into their execution lifecycle with 98% accuracy, a significant improvement over current state-of-the-art practice that enables far more effective mitigation strategies in large-scale distributed systems worldwide.
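The two-stage idea in the abstract (an offline profile built from historical executions, then an online check while a task is still running) can be illustrated with a minimal sketch. This is not the authors' method or tool: the function names, the use of a per-job-type median as the profile, the linear extrapolation from progress, and the 1.5x threshold are all assumptions chosen only for illustration.

```python
from statistics import median

# Assumed threshold: a task is flagged when its projected duration exceeds
# 1.5x the historical median. This value is illustrative, not from the paper.
STRAGGLER_FACTOR = 1.5


def build_profile(history):
    """Offline step (hypothetical): median past task duration per job type, in seconds."""
    return {job_type: median(durations) for job_type, durations in history.items()}


def is_likely_straggler(profile, job_type, elapsed, progress, factor=STRAGGLER_FACTOR):
    """Online step (hypothetical): project total duration from progress so far
    and compare it with the profiled median for that job type."""
    if not 0 < progress <= 1:
        raise ValueError("progress must be in (0, 1]")
    projected = elapsed / progress  # naive linear extrapolation
    return projected > factor * profile[job_type]


# Hypothetical historical durations for one job type, including one past straggler.
history = {"batch": [100.0, 110.0, 95.0, 105.0, 400.0]}
profile = build_profile(history)

# A task that is 10% complete after 30 s projects to 300 s, above the threshold.
slow = is_likely_straggler(profile, "batch", elapsed=30.0, progress=0.10)
# A task that is 10% complete after 10 s projects to 100 s, below the threshold.
normal = is_likely_straggler(profile, "batch", elapsed=10.0, progress=0.10)
print(slow, normal)
```

Using the median rather than the mean keeps the profile from being skewed by past stragglers, which matters because the check is meant to fire early in a task's lifecycle, when only a small fraction of progress has been observed.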
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: |
|
Copyright, Publisher and Additional Information: | © 2015, IEEE. Uploaded in accordance with the publisher's self-archiving policy. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. |
Keywords: | Cloud computing; Data analysis; Distributed Systems; Long Tail; datacenter; Stragglers; Monitoring |
Dates: |
|
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Institute for Computational and Systems Science (Leeds) |
Funding Information: | Funder: EPSRC; Grant number: EP/F057644/1 |
Depositing User: | Symplectic Publications |
Date Deposited: | 06 Aug 2015 13:23 |
Last Modified: | 29 Jan 2018 10:53 |
Published Version: | http://dx.doi.org/10.1109/ISORC.2015.39 |
Status: | Published |
Publisher: | Institute of Electrical and Electronics Engineers |
Identification Number: | 10.1109/ISORC.2015.39 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:88442 |