Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. The MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms using Gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm. Detected failures are then disseminated to all the fault-free processes in the system, and consensus on the failures is detected using one of three consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that stochastic pinging detects all the failures in the system. In all the algorithms, the number of Gossip cycles needed to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage, and perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm that provides consistency guarantees even in very large and extreme-scale systems while remaining memory and bandwidth efficient.
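The dissemination step described in this abstract can be illustrated with a small simulation. The following is a minimal sketch of generic push gossip, not any of the paper's three algorithms: the function name `gossip_cycles`, the assumption that one alive process has already detected every failure, and the uniform random choice of gossip target are all illustrative choices that do not come from the source.

```python
import random


def gossip_cycles(n, failed, seed=0):
    """Count push-gossip cycles until every alive process knows `failed`.

    Hypothetical helper for illustration only.
    n      -- total number of processes, identified as 0..n-1
    failed -- set of failed process ids
    seed   -- seed for the random number generator (reproducible runs)
    """
    rng = random.Random(seed)
    alive = [p for p in range(n) if p not in failed]

    # Assumption: a single alive process has already detected all failures.
    known = {p: set() for p in alive}
    known[alive[0]] = set(failed)

    cycles = 0
    while any(known[p] != failed for p in alive):
        # Each informed process pushes its failure list to one random
        # alive process. Messages are collected first so that a cycle
        # only uses the knowledge available at its start.
        messages = []
        for p in alive:
            if known[p]:
                target = rng.choice(alive)
                messages.append((target, known[p]))
        for target, info in messages:
            known[target] = known[target] | info
        cycles += 1
    return cycles, known


if __name__ == "__main__":
    for size in (64, 256, 1024):
        cycles, _ = gossip_cycles(size, {3, 17})
        print(size, cycles)
```

Because each informed process contacts one peer per cycle, the number of informed processes can at most double in a cycle, which is the intuition behind the logarithmic growth in cycle count that the abstract reports. Running the loop at the bottom for a few system sizes gives a rough feel for that trend under these simplified assumptions.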
Consensus is one of the fundamental problems in multi-agent systems and distributed computing, in which agents or processing nodes are required to reach global agreement on some data value, decision, action, or synchronisation. In the absence of centralised coordination, achieving global consensus is challenging especially in dynamic and large-scale distributed systems with faulty processes. This paper presents a fully decentralised phase transition protocol to achieve global consensus on the convergence of an underlying information dissemination process. The proposed approach is based on Epidemic protocols, which are a randomised communication and computation paradigm and provide excellent scalability and fault-tolerant properties. The experimental analysis is based on simulations of a large-scale information dissemination process and the results show that global agreement can be achieved without deterministic and global communication patterns, such as those based on centralised coordination.
Refactoring is the process of changing the code of a software system so that its internal design is improved without altering its observable behavior. Method extraction is the process of separating a subset of a method's statements into another method and replacing their occurrence in the original method with a call to the new method. Method extraction is a classical technique for improving the modularity of a system and is used to extract methods from long procedural programs. It can also be used to extract aspects from object-oriented code. It thus makes the software easier to understand, maintain, and reuse. In the early days of method extraction, the programmer selected an arbitrary set of statements for extraction; the selection was later made more principled by specifying the variables of interest and separating the statements concerning them into a method. Thus, program slicing became part of method extraction. Many slicing algorithms exist in the literature; they first convert the program into an alternative representation and then apply correctness-preserving transformations to it to produce the slice and its complement. This process was identified as expensive, and an algorithm was proposed that acts directly on the source code. It statically analyzes the source code to produce the slice but fails to handle dynamic constructs such as aliasing and polymorphism effectively. To overcome this limitation, we propose a new slicing algorithm that dynamically analyzes source code to produce static slices. It exploits the behavior-preservation requirement of refactoring and uses the data collected during testing, which we perform prior to refactoring, for slicing. This algorithm is better suited to the refactoring domain.
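As a concrete picture of what method extraction means, the sketch below shows a small function before and after the statements that compute one variable of interest (`total`) are moved into their own function. It is a hand-written illustration, not output of the slicing algorithm proposed in the abstract; the names `report_before`, `report_after`, and `compute_total` are hypothetical.

```python
def report_before(prices, tax_rate):
    """Original long method: computing the total and formatting are mixed."""
    subtotal = 0.0
    for price in prices:
        subtotal += price
    tax = subtotal * tax_rate
    total = subtotal + tax
    return f"items={len(prices)} total={total:.2f}"


def compute_total(prices, tax_rate):
    """Extracted method: the statements that contribute to `total`."""
    subtotal = 0.0
    for price in prices:
        subtotal += price
    tax = subtotal * tax_rate
    return subtotal + tax


def report_after(prices, tax_rate):
    """Refactored method: the extracted statements become a single call."""
    total = compute_total(prices, tax_rate)
    return f"items={len(prices)} total={total:.2f}"


if __name__ == "__main__":
    # Behavior preservation: both versions must agree on the same inputs.
    sample = [10.0, 5.5, 4.5]
    print(report_before(sample, 0.1) == report_after(sample, 0.1))
```

The statements moved into `compute_total` are those that affect the chosen variable, which is the role a slice plays in the abstract. Comparing the two versions on the same inputs mirrors the requirement that refactoring leave observable behavior unchanged.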