To guarantee high availability, automation systems must be fault-tolerant. To this end, they must provide redundant solutions for the critical parts of the system. Classical fault tolerance patterns such as standby or N-modular redundancy provide system stability in the case of a fault. However, fault tolerance is subsequently degraded or, depending on the number of deployed replicas, often even unavailable until the system has been repaired. We introduce a combination of a component-based framework, redundancy patterns, and a runtime manager, which is able to provide fault tolerance, to detect host failures, and to trigger a reconfiguration of the system at runtime. This combined solution maintains system operation when a fault occurs and automatically restores fault tolerance. The proposed solution is validated using a case study of an industrial distributed automation system. The validation shows how our solution quickly restores fault tolerance without the need for operator intervention or immediate hardware replacement while limiting the impact on other applications.