Ouyang, X, Garraghan, P, Yang, R et al. (2 more authors) (2016) Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. In: Fast Abstracts in the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 28 Jun - 01 Jul 2016, Toulouse, France. DSN
Abstract
Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures due to the violation of specified timing constraints. Straggler-tolerant methods such as speculative execution provide limited effectiveness due to (i) a lack of precise knowledge of straggler root-causes and (ii) straggler identification occurring too late within a job lifecycle. This paper proposes a method to ascertain underlying straggler root-causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root-cause is high temporal resource contention. This result can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.
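The correlation analysis described in the abstract could be sketched roughly as follows. This is only an illustrative sketch, not the authors' method: the straggler rule (a task running longer than 1.5 times the job's median task duration), the function names `flag_stragglers` and `pearson`, and all sample values are assumptions introduced here.

```python
from math import sqrt
from statistics import mean, median


def flag_stragglers(durations, threshold=1.5):
    """Mark tasks whose duration exceeds threshold x the job's median duration.

    The 1.5x-median rule is an assumed definition, not taken from the paper.
    """
    cutoff = threshold * median(durations)
    return [d > cutoff for d in durations]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / (sx * sy)


# Hypothetical per-task data for one job: duration in seconds and the
# CPU utilisation of the hosting server while the task ran.
durations = [10.0, 11.0, 9.0, 30.0, 10.0, 25.0]
cpu_util = [0.40, 0.45, 0.35, 0.95, 0.50, 0.90]

flags = flag_stragglers(durations)
r = pearson([1.0 if f else 0.0 for f in flags], cpu_util)
print(f"stragglers: {sum(flags)} of {len(flags)}, correlation with CPU: {r:.2f}")
```

The same `pearson` helper could be applied to other candidate factors named in the abstract, such as task concurrency or server failure counts, to compare how strongly each one tracks straggler occurrence.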
Metadata
Item Type: | Proceedings Paper |
---|---|
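Authors/Creators: | Ouyang, X; Garraghan, P; Yang, R; et al. (2 more authors) |
Dates: | 2016 (conference held 28 Jun - 01 Jul 2016) |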
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Institute for Computational and Systems Science (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 07 Jun 2016 09:58 |
Last Modified: | 17 Jan 2018 04:01 |
Published Version: | https://hal.archives-ouvertes.fr/hal-01316515 |
Status: | Published |
Publisher: | DSN |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:100523 |