To provide high dependability in a multithreaded system despite hardware faults, the system must detect and correct errors in its shared memory system. Recent
Introduction

Two trends motivate increased interest in fault tolerance for multithreaded shared-memory computer architectures. First, multithreaded systems (including traditional multiprocessors, chip multiprocessors, and simultaneously multithreaded processors) have come to dominate the commodity computing market. Second, the industrial roadmap [7] and recent research [17] forecast increases in hardware error rates due to decreasing transistor sizes and voltages. For example, smaller devices are more susceptible to having their charges disrupted by alpha particles or cosmic radiation [21].

Many researchers have developed effective fault tolerance measures for microprocessor cores, using techniques such as redundant multithreading [16,15,20] and DIVA [2]. However, to provide fault tolerance in a multithreaded system, the machine must also be able to detect and correct errors in its shared memory system, including errors in the cache coherence protocol. Whereas we can efficiently detect errors in data storage and transmission using error codes, it is far more difficult to ensure the correct execution of a complex, distributed coherence protocol with multiple interacting controllers.

To provide comprehensive, end-to-end error detection, recent research has explored online (dynamic) checking of cache coherence. A coherence checker can either operate stand-alone [5,4] or as an integral part of an online memory consistency checker [12,13] that also detects errors in the interactions between the memory system and the processor cores. Once a coherence checker detects an error, the system can recover to a prefault state using one of several existing recovery mechanisms [19,14].

Coherence checking is a powerful error detection mechanism, but existing coherence checkers are costly to implement, introduce high interconnection network traffic overhead, and do not scale well to large systems.
These costs and limitations preclude their use in low-cost commodity systems.

In this work, we develop the Token Coherence Signature Checker (TCSC), which is a low-cost, scalable alternative to prior cache coherence checkers. It can be used by itself to detect memory system errors, or it can be used as part of a memory consistency checker [12,13]. With TCSC, every cache and memory controller maintains a signature that represents its recent history of cache coherence events. Periodically, these signatures are gathered at a verifier which determines if an error has occurred. The cost advantages of signature-based error detection come at the expense of an arbitrarily small (but non-zero) probability of undetected errors. This paper makes three main contributions:

• TCSC is the first signature-based scheme that completely checks cache coherence and can detect all types of coherence errors with arbitrarily high probability. The use of signatures significantly lowers hardware costs...