Reliable Computing Service in Massive-scale Systems Through Rapid Low-cost Failover

Abstract

Large-scale distributed systems deployed as Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely adopted means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully without incurring heavy overhead when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types, such as timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71 percent additional CPU usage.

Metadata

Item Type:	Article
Authors/Creators:	Yang, R; Zhang, Y; Garraghan, PM; Feng, Y; Ouyang, J; Xu, J (https://orcid.org/0000-0002-4598-167X); Zhang, Z; Li, C
Copyright, Publisher and Additional Information:	© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
Keywords:	Cloud computing; Resource Management; Reliability; Services; Failover
Dates:	Published: November 2017; Published (online): 21 March 2016; Accepted: 3 March 2016
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	21 Jul 2016 13:18
Last Modified:	25 Jan 2018 16:03
Status:	Published
Publisher:	IEEE
Identification Number:	10.1109/TSC.2016.2544313
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:97026
