AS DEVICES SHRINK toward the nanometer scale, on-chip interconnects are becoming a critical bottleneck in meeting the performance and power consumption requirements of chip designs. Industry and academia recognize the interconnect problem as an important design constraint, and, consequently, researchers have proposed packet-based on-chip communication networks, known as networks on chips (NoCs), to address the challenges of increasing interconnect complexity [1-5]. NoC designs promise to deliver fast, reliable, energy-efficient communication between on-chip components. Because most application traffic is bursty in nature, packet-switched networks are suitable for NoCs [2,4,5].

Another effect of shrinking feature size is that power supply voltage and device Vt decrease, and wires become unreliable because they are increasingly susceptible to noise sources such as crosstalk, coupling noise, soft errors, and process variation [6]. Using aggressive voltage-scaling techniques to reduce a system's power consumption further increases the system's susceptibility to various noise sources. Providing resilience against such transient delay and logic errors is critical for proper system operation.

Error detection or correction mechanisms can protect the system from transient errors that occur in the communication subsystem. These schemes can use end-to-end flow control (network level) or switch-to-switch flow control (link level). In a simple retransmission scheme, the sender adds error detection codes (parity or cyclic redundancy check codes) to the original data, and the receiver checks the received data for correctness. If it detects an error, it requests the sender to retransmit the data. Alternatively, the sender can add error-correcting codes (such as Hamming codes) to the data, and the receiver can correct errors. Hybrid schemes with combined retransmission and error correction capabilities are also possible.
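
As a rough illustration of the two approaches just described, the following Python sketch contrasts a parity-based retransmission scheme with a Hamming(7,4) single-error-correcting code. It is a minimal sketch under stated assumptions, not the architecture evaluated in this article: the function names, the independent bit-flip link model, the retry limit, and the choice of a (7,4) code are all hypothetical and chosen only for brevity.

```python
import random


def parity_bit(bits):
    """Even parity: 1 if the number of 1s in bits is odd, else 0."""
    return sum(bits) % 2


def send_with_retransmission(data, flip_prob, rng, max_tries=16):
    """Error detection plus retransmission (illustrative only).

    The sender appends one parity bit; the receiver recomputes it and,
    on a mismatch, asks for the data again. Returns the accepted data
    and the number of transmissions used.
    """
    codeword = data + [parity_bit(data)]
    for attempt in range(1, max_tries + 1):
        # Assumed link model: each bit flips independently.
        received = [b ^ (rng.random() < flip_prob) for b in codeword]
        if parity_bit(received[:-1]) == received[-1]:
            return received[:-1], attempt  # receiver accepts the data
        # Mismatch: receiver requests a retransmission (loop repeats).
    raise RuntimeError("retry limit reached")


def hamming74_encode(d):
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    rng = random.Random(0)
    word = [1, 0, 1, 1]
    print(send_with_retransmission(word, 0.05, rng))
    damaged = hamming74_encode(word)
    damaged[5] ^= 1  # inject a single-bit error
    print(hamming74_decode(damaged))
```

The sketch also hints at the trade-off: the parity scheme adds one bit per word but spends extra transmissions when an error is detected, and it cannot detect an even number of flipped bits. The Hamming(7,4) code adds three bits to every four-bit word but repairs a single-bit error without any retransmission.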
Because the error detection/correction capability, area-power overhead, and performance of the various schemes differ, the choice of error recovery scheme for an application requires exploring multiple power-performance-reliability trade-offs.

In this article, we relate these three major design constraints to characterize efficient error recovery mechanisms for the NoC design environment. We explore error control mechanisms at the data link and network layers and present the schemes' architectural details. We investigate the energy efficiency, error protection efficiency, and performance impact of various error recovery mechanisms.

Our objective is twofold: First, we want to identify the major power overhead issues of various error recovery schemes, so that designers can create efficient mechanisms to address them. Second, we want to provide the
Analysis of Error Recovery Schemes for Networks on Chips

Editor's note: Error resiliency is a must for NoCs, but it must not incur undue costs, particularly in terms of energy consumption. Here, the authors present an authoritative discussion of the trade-offs involved in va...