Abstract: The detection of failures is a fundamental issue for fault tolerance in distributed systems. Recently, many people have come to realize that failure detection ought to be provided as some form of generic service, similar to IP address lookup or time synchronization. However, this has not been successful so far; one of the reasons being the fact that classical failure detectors were not designed to satisfy several application requirements simultaneously. We present a novel abstraction, called accrual failure dete…
“…A fault tolerance service to check the cloud providers and other services status will be developed and evaluated. We also plan to use an adaptive fault monitoring algorithm, as proposed by [18,30] and [70], which are more adaptable to be used in a large-scale distributed environment. It is also important to include a security service and an SLA service in the federated platform.…”
Section: Discussion (mentioning)
confidence: 99%
“…There are extensive studies in the literature on failure detection systems [16,31,45,70]. On the other hand, few systems are designed to scale with a large number of nodes as those found on clouds.…”
Section: Fault Tolerance Service and High Availability (mentioning)
“…Each node periodically disseminates its status information to a number of randomly selected nodes and relays status information received from other nodes. This method is also used to detect and advertise node failures across the cluster [35].…”
Section: Background: the Systems We Target (mentioning)
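The dissemination scheme described in the statement above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the cited system's implementation: rounds are synchronous, "status information" is reduced to a heartbeat equal to the round number, and the names `gossip_round`, `suspected`, `fanout`, and `timeout` are hypothetical.

```python
import random


def gossip_round(views, alive, fanout, now, rng):
    """Run one synchronous gossip round (assumed model, see lead-in).

    views[n][m] is the freshest heartbeat of node m that node n knows about.
    """
    nodes = list(views)
    for node in nodes:
        if node not in alive:
            continue  # a crashed node sends nothing
        views[node][node] = now  # refresh this node's own status
        peers = [p for p in nodes if p != node]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            if peer not in alive:
                continue  # a message sent to a crashed node is lost
            # Relay everything this node knows; the peer keeps the freshest value.
            for known, beat in views[node].items():
                if beat > views[peer].get(known, -1):
                    views[peer][known] = beat


def suspected(views, observer, now, timeout):
    """Nodes whose freshest known heartbeat is older than `timeout` rounds."""
    return {n for n, beat in views[observer].items() if now - beat > timeout}


# Hypothetical five-node cluster; node "e" crashes after round 5.
rng = random.Random(0)
nodes = ["a", "b", "c", "d", "e"]
views = {n: {m: 0 for m in nodes} for n in nodes}
alive = set(nodes)
for t in range(1, 6):
    gossip_round(views, alive, fanout=2, now=t, rng=rng)
alive.discard("e")
for t in range(6, 30):
    gossip_round(views, alive, fanout=2, now=t, rng=rng)
result = suspected(views, "a", now=29, timeout=10)
```

Because no node can refresh the heartbeat of "e" after it crashes, every observer's record for "e" eventually exceeds the timeout. A fixed timeout is used here only for simplicity.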
Large-scale applications are ever-increasingly geo-distributed. Maintaining the highest possible data locality is crucial to ensure high performance of such applications. Dynamic replication addresses this problem by dynamically creating replicas of frequently accessed data close to the clients. This data is often stored in decentralized storage systems such as Dynamo or Voldemort, which offer support for mutable data. However, existing approaches to dynamic replication for such mutable data remain centralized, thus incompatible with these systems. In this paper we introduce a write-enabled dynamic replication scheme that leverages the decentralized architecture of such storage systems. We propose an algorithm enabling clients to locate tentatively the closest data replica without prior request to any metadata node. Large-scale experiments on various workloads show a read latency decrease of up to 42% compared to other state-of-the-art, caching-based solutions.
“…One of the key messages of this book is that it is important to distinguish between porting a code Table 1: Syntactical constructs used in several failure detector protocols. ϕ is the accrual failure detector discussed in (Hayashibara, 2004; Hayashibara et al., 2004). D is the eventually perfect failure detector of (Chandra & Toueg, 1996).…”
Section: Failure Detection Protocols In the Application Layer (mentioning)