The failure detector abstraction

Freiling, Felix C.; Guerraoui, Rachid; Kuznetsov, Petr

doi:10.1145/1883612.1883616

Cited by 43 publications

(27 citation statements)

References 109 publications


“…It is believed that the failure detector abstraction is a fundamental one and should sit as a first-class citizen of a distributed programming library. Additionally, failure detectors are important because of the possibility to classify problems in distributed computing [10].…”

Section: B Failure Detectors (mentioning)

confidence: 99%

Enhanced failure detection mechanism in MapReduce

Memishi

Pérez

Antoniu

2012

2012 International Conference on High Performance Computing & Simulation (HPCS)


Abstract: The popularity of the MapReduce programming model has increased interest in the research community in its improvement. Among other directions, fault tolerance, and concretely failure detection, seems to be a crucial issue that has not yet reached a satisfactory level. Motivated by this, I decided to devote my main research during this period to a prototype system architecture of the MapReduce framework with a new failure detection service, containing both an analytical (theoretical) part and an implementation part. I am confident that this work should lead the way for further contributions in detecting failures in NoSQL App frameworks, and in cloud storage systems in general.


“…The literature on failure detectors is rich; see for example the recent surveys [22,34]. We focus on the eventual strong ◇S failure detector that is known to be the weakest failure detector required to solve consensus [8,9] in message-passing systems, when the majority of the processes are non-crashed.…”

Section: Consensus (mentioning)

confidence: 99%

When consensus meets self-stabilization

Dolev

Kat

Schiller

2010

Journal of Computer and System Sciences


This paper presents a shared-memory self-stabilizing failure detector, asynchronous consensus and replicated state-machine algorithm suite, the components of which can be started in an arbitrary state and converge to act as a virtual state-machine. Self-stabilizing algorithms can cope with transient faults. Transient faults can alter the system state to an arbitrary state and hence, cause a temporary violation of the safety property of the consensus. Started in an arbitrary state, the long lived, memory bounded and self-stabilizing failure detector, asynchronous consensus, and replicated state-machine suite, presented in the paper, recovers to satisfy eventual safety and eventual liveness requirements. Several new techniques and paradigms are introduced. The bounded memory failure detector abstracts away synchronization assumptions using bounded heartbeat counters combined with a balance-unbalance mechanism. The practically infinite paradigm is introduced in the scope of self-stabilization, where an execution of, say, 2^64 sequential steps is regarded as (practically) infinite. Finally, we present the first self-stabilizing wait-free reset mechanism that ensures eventual safety and can be used to implement efficient self-stabilizing timestamps that are of independent interest.


“…However, given the FLP impossibility [4], i.e., consensus can not be solved deterministically in asynchronous distributed systems in which even a single process can fail by crashing, deploying high-available distributed systems on the Internet is a challenge. In order to circumvent the impossibility of solving consensus in asynchronous distributed systems, Chandra and Toueg introduced failure detectors based on timeouts [5][6][7].…”

Section: Introduction (mentioning)

confidence: 99%

A QoS-configurable failure detection service for internet applications

Turchetti

Duarte

Arantes

et al. 2016

J Internet Serv Appl


Unreliable failure detectors are a basic building block of reliable distributed systems. Failure detectors are used to monitor processes of any application and provide process state information. This work presents an Internet Failure Detector Service (IFDS) for processes running in the Internet on multiple autonomous systems. The failure detection service is adaptive, and can be easily integrated into applications that require configurable QoS guarantees. The service is based on monitors which are capable of providing global process state information through an SNMP MIB. Monitors at different networks communicate across the Internet using Web Services. The system was implemented and evaluated for monitored processes running both on a single LAN and on PlanetLab. Experimental results are presented, showing the performance of the detector, in particular the advantages of using the self-tuning strategies to address the requirements of multiple concurrent applications running in a dynamic environment.
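To make the idea of a timeout-based, adaptive failure detector concrete, the following is a minimal Python sketch. It is not the IFDS design or any algorithm from the papers listed here; it only illustrates the textbook pattern of suspecting a process whose heartbeat is overdue and lengthening that process's timeout whenever a suspicion turns out to be false. The class name, method names, and the parameters `initial_timeout` and `backoff` are all hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatFailureDetector:
    """Illustrative unreliable failure detector driven by heartbeats.

    All names and parameters are assumptions made for this sketch.
    """

    def __init__(self, processes, initial_timeout=1.0, backoff=0.5):
        # Per-process timeout, so each monitored process adapts independently.
        self.timeout = {p: initial_timeout for p in processes}
        # Time of the most recent heartbeat seen from each process.
        self.last_heartbeat = {p: 0.0 for p in processes}
        self.suspected = set()
        self.backoff = backoff

    def on_heartbeat(self, process, now):
        """Record a heartbeat received from `process` at time `now`."""
        self.last_heartbeat[process] = now
        if process in self.suspected:
            # The earlier suspicion was wrong: retract it and wait longer
            # next time before suspecting this process again.
            self.suspected.discard(process)
            self.timeout[process] += self.backoff

    def check(self, now):
        """Suspect every process whose heartbeat is overdue at time `now`."""
        for process, last in self.last_heartbeat.items():
            if now - last > self.timeout[process]:
                self.suspected.add(process)
        return set(self.suspected)
```

In a real deployment the caller would feed in readings from a monotonic clock and invoke `check` periodically; a crashed process stays suspected because its heartbeats stop, while a slow but live process is eventually given a timeout long enough to avoid further false suspicions.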
