Ouyang, X, Garraghan, P, Yang, R et al. (2 more authors) (2016) Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. In: Fast Abstracts in the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 28 Jun - 01 Jul 2016, Toulouse, France. DSN
Abstract
Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures due to the violation of specified timing constraints. Straggler-tolerant methods such as speculative execution provide limited effectiveness due to (i) lack of precise straggler root-cause knowledge and (ii) straggler identification occurring too late within a job lifecycle. This paper proposes a method to ascertain underlying straggler root-causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root-cause is high temporal resource contention. The result can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.
Metadata
| Item Type: | Proceedings Paper | 
|---|---|
| Authors/Creators: | |
| Dates: | |
| Institution: | The University of Leeds | 
| Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Institute for Computational and Systems Science (Leeds) | 
| Depositing User: | Symplectic Publications | 
| Date Deposited: | 07 Jun 2016 09:58 | 
| Last Modified: | 17 Jan 2018 04:01 | 
| Published Version: | https://hal.archives-ouvertes.fr/hal-01316515 | 
| Status: | Published | 
| Publisher: | DSN | 
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:100523 | 