Ouyang, X, Wang, C and Xu, J orcid.org/0000-0002-4598-167X (2019) Mitigating stragglers to avoid QoS violation for time-critical applications through dynamic server blacklisting. Future Generation Computer Systems, 101. pp. 831-842. ISSN 0167-739X
Abstract
The straggler problem is one of the most challenging obstacles to rapid and predictable response times for applications in cluster infrastructures, leading to potential QoS violation and late-timing failure. Straggler tasks arise from causes such as resource contention and hardware heterogeneity, and become more severe as system scale and complexity increase. Speculative execution and blacklisting are the two major straggler-tolerant techniques, but each has its own limitations. The former creates a replica task to catch up with the identified straggler, but normally does not select among nodes when deciding where to launch the backup. Ignoring server performance lowers the speculation success rate. The latter typically relies on manual configuration, despite the fact that the ability of nodes to execute tasks effectively changes over time. In addition, misidentifying nodes as weak performers decreases system capacity. Combining these two techniques, we present DSB, a dynamic server blacklisting framework that takes into account both the historical and the current behavior of a server node to increase straggler mitigation effectiveness. Servers are ranked at each time interval according to their performance in fulfilling jobs rather than their physical capacities, and the worst-performing ones are temporarily blacklisted. As a result, no new tasks or replicas are assigned to those straggler-prone nodes within the following time window. DSB also provides an alternative API through which an adjustable top k worst nodes can be blacklisted according to the ranking. The optimal k is investigated as a trade-off between capacity loss and straggler mitigation efficiency. Results show that the DSB scheme is capable of increasing the successful speculation rate up to 89%. In addition, it can improve job completion time by up to 55.43% compared to the default speculator in the YARN platform. This helps to reduce the chance of QoS violation, which is particularly important for time-critical applications.
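The per-interval ranking and top-k blacklisting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the ranking metric, so the straggler rate per node, the exponential smoothing factor `alpha` used to blend historical and current behavior, and the names `DynamicBlacklist`, `end_of_window` and `eligible` are all assumptions made for illustration.

```python
class DynamicBlacklist:
    """Illustrative top-k blacklist, rebuilt at the end of every time window.

    Assumption: a node's score is its smoothed straggler rate
    (stragglers / tasks run). The paper's actual metric may differ.
    """

    def __init__(self, k=1, alpha=0.5):
        self.k = k              # number of worst nodes to blacklist
        self.alpha = alpha      # weight of the current window vs. history
        self.scores = {}        # node -> smoothed straggler rate
        self.blacklisted = set()

    def end_of_window(self, window):
        """Update scores from `window` (node -> (stragglers, total_tasks))
        and return the blacklist for the next window."""
        for node, (stragglers, total) in window.items():
            if total == 0:
                continue  # no evidence this window; keep the old score
            current = stragglers / total
            # First observation: history defaults to the current rate.
            previous = self.scores.get(node, current)
            self.scores[node] = (
                self.alpha * current + (1 - self.alpha) * previous
            )
        # Worst nodes first; ties broken by name for determinism.
        ranked = sorted(self.scores, key=lambda n: (-self.scores[n], n))
        # The blacklist is replaced, not extended, so it is temporary.
        self.blacklisted = set(ranked[: self.k])
        return self.blacklisted

    def eligible(self, nodes):
        """Nodes that may receive new tasks or speculative replicas."""
        return [n for n in nodes if n not in self.blacklisted]
```

Because the blacklist is recomputed from scratch each window, a node that recovers drops out of it once its smoothed score falls below that of another node. Raising `k` removes more straggler-prone nodes at the cost of cluster capacity, which is the trade-off the abstract refers to.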
Metadata
| Item Type: | Article |
|---|---|
| Authors/Creators: | |
| Copyright, Publisher and Additional Information: | © 2019, Elsevier B.V. All rights reserved. This is an author produced version of an article published in Future Generation Computer Systems. Uploaded in accordance with the publisher's self-archiving policy. |
| Keywords: | Time-critical application; Straggler; Speculation; Blacklist; Node performance |
| Dates: | |
| Institution: | The University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
| Depositing User: | Symplectic Publications |
| Date Deposited: | 18 Oct 2019 15:44 |
| Last Modified: | 11 Jul 2020 00:39 |
| Status: | Published |
| Publisher: | Elsevier |
| Identification Number: | 10.1016/j.future.2019.07.017 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:152310 |
Download
Filename: AAM Mitigating Stragglers - The Manuscript.pdf
Licence: CC-BY-NC-ND 4.0