As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this paper, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail: a single CMP router switch. To start, we develop a unified model of faults, based on the time-tested bathtub curve. Using this abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with online repair and recovery capabilities. Protection is considered at multiple levels, from the entire system down through arbitrary partitions of the design. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our experimental results show that designs are attainable that can tolerate a larger number of defects with less overhead than naïve triple-modular redundancy, using domain-specific techniques such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration.
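The bathtub-curve abstraction and the triple-modular-redundancy baseline mentioned above can be illustrated with a minimal sketch. This is not the fault model from the paper: the function names, the choice of Weibull terms, and every parameter value below are assumptions made purely for illustration. The hazard rate is modeled as the sum of a decreasing (infant mortality) term, a constant (random fault) term, and an increasing (wear-out) term. The redundancy formula assumes independent module failures and a perfect voter.

```python
def weibull_hazard(t, shape, scale):
    """Hazard rate of a Weibull distribution, valid for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t, infant=(0.5, 2.0), random_rate=0.01, wearout=(5.0, 10.0)):
    """Illustrative bathtub hazard rate at time t > 0.

    infant and wearout are (shape, scale) pairs; shape < 1 gives a falling
    rate, shape > 1 a rising one. All defaults are made-up assumptions.
    """
    return (
        weibull_hazard(t, *infant)
        + random_rate
        + weibull_hazard(t, *wearout)
    )


def tmr_reliability(r):
    """Probability that at least 2 of 3 independent modules work.

    r is the reliability of a single module; the voter is assumed perfect.
    """
    return 3 * r**2 - 2 * r**3
```

Under these assumptions, `tmr_reliability(r)` exceeds `r` only when `r` is above 0.5, which is one reason plain triplication is a costly baseline once individual modules become unreliable.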
The traditional approach to worst-case static-timing analysis is becoming unacceptably conservative due to an ever-increasing number of circuit and process effects. We propose a fundamentally different framework that aims to significantly improve the accuracy of timing predictions through fully probabilistic analysis of gate and path delays. We describe a bottom-up approach for constructing the joint probability density function of path delays, and present novel analytical and algorithmic methods for finding the full distribution of the maximum of a random path delay space with arbitrary path correlations.
General Terms: Algorithms
INTRODUCTION
Over the years it has been widely acknowledged that uncertainty about the true design and manufacturing conditions is a major cause of unnecessary over-design and the resulting underperformance of circuits [1] [2]. The sources of this uncertainty are manifold: the limitations of actual design practices, uncertainty about the environmental design characteristics (cross-talk noise, temperature and supply voltage variation), and the inherent variation of the underlying process parameters. With the advance of deep sub-micron technologies, process variability and, in particular, intra-chip variation, has been increasing. This is due to various processing and device physics factors such as random dopant placement in the channel, spatially correlated and proximity-caused Lgate variation, and interconnect metal thickness variation [2].
The emergence of intra-chip parameter variability as a dominant source of uncertainty and circuit degradation requires a new set of approaches to circuit timing analysis, whose role is to guarantee that the predicted maximum clock speed is as close as possible to the actual (silicon) timing behavior.
Industrial experience shows that the gap between the worst-case timing constraints predicted by the tools, and the final silicon performance is routinely greater than what can be tolerated and is sometimes as high as 30% [3].
What is wrong with the existing tools and approaches? Circuit-dependent parametric yield loss is predicted to become a key issue in nanometer silicon technologies [4]. The fundamental problem is that the standard timing techniques are incapable of accurately predicting parametric yield of a circuit due to their non-probabilistic formulation. One particular result of this failing is the well-known conservatism of traditional worst-case modeling techniques. We may distinguish at least two levels of conservatism. The first is the practice of defining the worst-case timing behavior of a cell by performing circuit analysis in SPICE that simultaneously sets all the device model parameters to their worst-case values. Several approaches to reduce this type of conservatism have been proposed [5]. When we move to the level of cell-based static timing analysis, an additional level of conservatism is created by the non-probabilistic delay computation of the traditional analysis. This conservatism is relatively new but ...
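The notion of a full distribution of the maximum path delay under path correlations, as described in the abstract above, can be illustrated numerically. The sketch below uses plain Monte Carlo sampling, which is a stand-in and not the analytical method the paper proposes. It assumes Gaussian path delays with identical mean and standard deviation and a single pairwise correlation `rho` produced by one shared variation component. All function names and parameter values are hypothetical.

```python
import random
import statistics


def sample_path_delays(rng, n_paths, mean, sigma, rho):
    """Draw one set of equally correlated Gaussian path delays.

    Each delay mixes a component shared by all paths (weight sqrt(rho))
    with a path-specific component (weight sqrt(1 - rho)), which gives
    pairwise correlation rho. Assumes 0 <= rho <= 1.
    """
    shared = rng.gauss(0.0, 1.0)
    w_shared = rho ** 0.5
    w_local = (1.0 - rho) ** 0.5
    return [
        mean + sigma * (w_shared * shared + w_local * rng.gauss(0.0, 1.0))
        for _ in range(n_paths)
    ]


def max_delay_samples(n_trials, n_paths, mean, sigma, rho, seed=0):
    """Empirical samples of the circuit delay, i.e. the max over all paths."""
    rng = random.Random(seed)
    return [
        max(sample_path_delays(rng, n_paths, mean, sigma, rho))
        for _ in range(n_trials)
    ]


def quantile(samples, q):
    """Simple empirical quantile (nearest rank), for 0 <= q <= 1."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


if __name__ == "__main__":
    # Compare independent, partially correlated, and fully correlated paths.
    for rho in (0.0, 0.5, 1.0):
        s = max_delay_samples(2000, 20, mean=100.0, sigma=5.0, rho=rho)
        print(rho, statistics.mean(s), quantile(s, 0.99))
```

With these assumptions, the mean of the maximum falls toward the single-path mean as `rho` approaches 1, because fully correlated paths behave like one path. The distribution of the maximum therefore depends on the correlation structure and not only on per-path statistics.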