Fault Tolerant Wide-Area Parallel Computing

Weissman, Jon

doi:10.1007/3-540-45591-4_168

Cited by 17 publications

(6 citation statements)

References 6 publications


“…Wide-area FL methods can be a possible solution [118] for this kind of scenario. In wide-area FL methods, a replica of each application/algorithm runs at different transmission substations, as shown in Figure 9 [119], to avoid overloading the available computation and communication resources of that particular station. Thus, fault can be located even with less number of devices installed at different end-terminals of transmission links.…”

Section: Wide-area FL Approach (mentioning)

confidence: 99%

A Review of Fault Diagnosing Methods in Power Transmission Systems

et al. 2020


Transient stability is important in power systems. Disturbances like faults need to be segregated to restore transient stability. A comprehensive review of fault diagnosing methods in the power transmission system is presented in this paper. Typically, voltage and current samples are deployed for analysis. Three tasks (fault detection, classification, and location) are presented separately to convey a more logical and comprehensive understanding of the concepts. Feature extraction and transformations with dimensionality reduction methods are discussed. Fault classification and location techniques largely use artificial intelligence (AI) and signal processing methods. After the discussion of overall methods and concepts, advancements and future aspects are discussed. Generalized strengths and weaknesses of different AI and machine learning-based algorithms are assessed. A comparison of different fault detection, classification, and location methods is also presented, considering features, inputs, complexity, system used, and results. This paper may serve as a guideline for researchers to understand different methods and techniques in this field.

… transmission system topologies can be minimized by using the interspersed sensors for the collection of voltage and current signals. The second limitation is the lack of computational capability and communication. Synchronized global positioning system (GPS) sampling and high-speed broadband communications for IEDs in power grids are proposed in [8]. These technical advancements assure a quick response to faulty scenarios and the effective functioning of online monitoring mechanisms based on sensor networks. The availability of high-performance computing solutions allows the implementation of methods with higher computational complexity [7].

Short circuit faults are more likely to appear in power systems (PS) than series faults, which are breaks in the current path. Shunt faults result in catastrophes and leave hazardous effects on PS. Short circuit faults can be divided into symmetrical and asymmetrical faults; a further classification for the three-phase system is presented in Figure 1 [11].


“…For more general models, including also data parallel programs, we have shown the notable case of ProActive: from the programming model properties, its designers have developed an optimized checkpointing protocol. Some researchers also focused on the data parallel model, but restricted to programs featuring consistency steps (see Section 1) and with the only goal of defining cost models [58].…”

Section: Discussion (mentioning)

confidence: 99%

“…In this way consistency could be guaranteed at those steps at the cost of periodic global synchronizations (e.g. see [58]). As described in the following section, in this paper we have chosen to avoid such period synchronizations to introduce fully asynchronous checkpointing protocols.…”

Section: Example Of Inconsistent States In Data Parallel Programs (mentioning)

confidence: 99%

Fault tolerance for data parallel programs

Bertolli, Vanneschi 2010

Concurrency and Computation


The main issues when supporting fault tolerance based on checkpointing and rollback recovery for High-Performance applications are related to the scalability of the introduced support, the possibility of analyzing the induced overhead and, in more general terms, the optimization of the trade-off between failure-free and recovery performance. In this paper we describe our contribution in fault tolerance for high-level structured parallelism models. We take a different viewpoint w.r.t. existing contributions, by introducing a methodology to derive interesting properties to support fault tolerance. We show how to apply this methodology to a general data parallel model, deriving useful properties to introduce a class of checkpointing protocols. Thanks to this methodology, this class of protocols is not affected by the described issues. We exemplify two checkpointing protocols and the related rollback recovery techniques. For each protocol we also derive cost models statically describing the failure-free performance, which can be used for performance tuning or to target some Quality of Service parameter. To assess the innovation of the results we analytically and experimentally compare the introduced protocols with two literature protocols. Results show that while the protocols introduced in this paper permit the definition of cost models and have good scalability, the literature protocols do not always have these properties.


“…The most common failure modes include machine faults in which hosts go down and get rebooted, and network faults, where links go down. Finding a single monolithic solution for fault tolerance that is acceptable to all user applications is unlikely [2]. Among the recovery models, there are strategies such as rollback recovery, in which you can go back to a previous correct state that has been previously saved.…”

Section: Introduction (mentioning)

confidence: 99%

IaaS Cloud as a virtual environment for experimentation in checkpoint analysis

León, Gomez-Sanchez, Franco et al. 2019

JC&ST


Cloud Computing offers computing resources on demand, allowing remote access to software, storage and data processing through the Internet. Infrastructure as a Service (IaaS) is a flexible space that can be used as an experimental environment, in which experiments can be carried out much as they would be in a real environment such as a cluster. Before making installations and changes in a production cluster, or selecting resources in the cloud, it is important to analyze the impact of the change. For this reason we propose using the cloud to carry out a preliminary viability study. In this paper, we examine the viability of using the cloud to analyze the behavior of the checkpoint as one of the fault tolerance strategies, establishing the differences that exist between the information generated in a real environment (cluster) and in a virtual environment (cloud). The results obtained show that, due to the variability of the cloud, the impact on the benefits cannot be analyzed. However, the cloud is suitable for extracting the spatial and temporal behavior pattern of the checkpoint. This pattern helps to characterize the checkpoint, to find the right configuration, and to develop methodologies and tools that simulate and predict the execution of the checkpoint in a real environment.
