Rollback and retry is a common approach used to achieve error recovery in datapaths that tolerate transient faults. In this approach, each segment of a computation is duplicated and the results are compared using fault-tolerant comparators. If the compared values are unequal, the segment is rolled back to the preceding correct state (rollback point) and retried from that state. We introduce early comparison and rollback strategies for use in such datapaths. These strategies utilize comparators during the computational portion of the segment and can initiate a rollback before the segment is completed. We illustrate through examples how these strategies can reduce hardware costs (number of comparators needed) and the delay in recovering from a transient fault compared to conventional strategies.
Index Terms: Fault tolerance, high-level synthesis, rollback and retry.
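The conventional scheme described in this abstract (duplicate a segment, compare the two results, roll back on a mismatch) can be sketched in software as follows. This is only an illustrative model, not the paper's datapath design: it compares results at the end of the segment and does not implement the early comparison strategies the paper introduces. The names `run_with_rollback` and `make_faulty_segment` are hypothetical, and the transient fault is simulated by flipping one bit on chosen calls.

```python
import copy


def run_with_rollback(segment, state, max_retries=3):
    """Run `segment` twice from the same rollback point and commit only if both copies agree.

    Returns the agreed result and the number of retries that were needed.
    """
    rollback_point = copy.deepcopy(state)  # preceding correct state
    for attempt in range(max_retries + 1):
        # Each copy starts from an untouched snapshot of the rollback point.
        primary = segment(copy.deepcopy(rollback_point))
        duplicate = segment(copy.deepcopy(rollback_point))
        if primary == duplicate:  # comparator: results match, so commit
            return primary, attempt
        # Mismatch: discard both results and retry from the rollback point.
    raise RuntimeError("results still disagree after all retries")


def make_faulty_segment(fault_on_calls):
    """Build a segment that corrupts its result on the given call numbers (hypothetical fault model)."""
    calls = {"n": 0}

    def segment(state):
        calls["n"] += 1
        result = {"acc": state["acc"] + state["x"] * 2}
        if calls["n"] in fault_on_calls:
            result["acc"] ^= 1  # simulated transient fault: flip the low bit
        return result

    return segment
```

With a fault injected on the first call only, the first comparison fails and the retry succeeds, so the caller sees the correct result after one rollback.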
We consider a primary-backup approach to provide fault-tolerance service under a model in which the clients play an active role when their service requests are not fulfilled. Each client maintains an ordered list of servers and sends its service requests to the first server in its list. If the server does not respond within a specified timeout period, the client retransmits the request to the next server in its list. Under this model, we construct protocols that tolerate crash failures, send-omission failures, and receive-omission failures. For each type of failure, our protocol is optimal with respect to the degree of replication. More precisely, our protocols tolerate up to f server failures using only f + 1 servers. In addition, these protocols tolerate an arbitrary number of client failures. Further, the protocols ensure that the service provided by the system is functionally equivalent to that provided by a single failure-free server.
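The client-side behavior described in this abstract (try the first server in an ordered list, move to the next on timeout) can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's protocol: servers are modeled as plain callables, a timeout is modeled by the hypothetical `ServerTimeout` exception, and the server-side coordination that keeps the replicas equivalent to a single failure-free server is not modeled at all.

```python
class ServerTimeout(Exception):
    """Hypothetical signal that a server did not respond within the timeout period."""


def send_with_failover(request, servers):
    """Send `request` to each server in list order until one replies.

    Returns the reply and the index of the server that produced it.
    """
    for index, server in enumerate(servers):
        try:
            return server(request), index
        except ServerTimeout:
            continue  # no response in time: retransmit to the next server in the list
    raise RuntimeError("no server in the list responded")


def crashed_server(request):
    """Stand-in for a crashed server: it never answers (assumption for illustration)."""
    raise ServerTimeout()


def healthy_server(request):
    """Stand-in for a working server that echoes the request."""
    return f"ok:{request}"
```

With f = 1 crashed server and f + 1 = 2 servers in the list, the call `send_with_failover("read x", [crashed_server, healthy_server])` is answered by the second server.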
Fault tolerance measures (such as fault detectability and fault locatability) of systems employing Algorithm-Based Fault Tolerance (ABFT) are determined by a binary relationship between fault patterns and error patterns. This relationship specifies whether a given fault pattern can induce a given error pattern. We develop a succinct and canonical representation of this relationship and present an efficient algorithm for obtaining this representation. We show that two ABFT systems have the same fault tolerance measures only when their canonical representations are identical.