Abstract: We propose a fault tolerance method for torus NoCs capable of increasing the yield with minimal performance overhead. The proposed approach consists of detecting and diagnosing interconnect faults using BIST structures and activating alternative paths for the faulty links. Experimental results show that the dynamic routing finds alternative fault-free paths for 95% of the diagnosed faults (stuck-at and pairwise shorts within a single link or between any two links).
“…The system test is modeled at the transaction level in [18] in order to facilitate test design space exploration, as well as the validation of test strategies and schedules. Interconnect faults in Torus NoCs are detected and diagnosed using BIST structures in [19]. Afterwards, the NoC is repaired by activating alternative paths for faulty links.…”
Abstract: As the complexity and size of Systems-on-Chip (SoCs) grow, debugging becomes a bottleneck in designing IC products. In this paper, we present an approach for online debug of NoC-based multiprocessor SoCs. Our approach utilizes monitors and filters implemented in hardware. Monitors and filters observe and filter transactions at run-time. They are connected to a Debug Unit (DU). Transaction-based programmable Finite State Machines (FSMs) in the DU check assertions online to validate the correct relation of transactions at run-time. The experimental results show the efficiency and performance of our approach.
“…Failure detection and diagnosis have been extensively explored in other works and hence, we assume that an appropriate detection mechanism like the ones used in [6,7] detects the faulty links, and then, stores the fault information in the configuration register of each router. In the proposed method, since the local fault information is enough, there is no need to exchange this information between adjacent routers.…”
“…However, they use virtual channels and/or memory tables to avoid deadlock in the network, which are normally synonymous with area overhead and power consumption. In [3] the authors propose a partially adaptive routing strategy to cope with faulty links based on the minimal change in the XY path. Consequently, virtual channels and tables are not used, and the technique in [3] has a smaller area overhead.…”
“…In [3] the authors propose a partially adaptive routing strategy to cope with faulty links based on the minimal change in the XY path. Consequently, virtual channels and tables are not used, and the technique in [3] has a smaller area overhead. However, because the routing is only partially adaptive, it is not always possible to find an alternative fault-free path, especially in the presence of multiple faults.…”
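To make the idea in the snippet above concrete, here is a minimal sketch of XY routing that deviates from the XY path only when the preferred output port is marked faulty. It is not the algorithm of [3]. It assumes a 2D mesh without wrap-around links, purely local fault information (a set of faulty output ports per router), and minimal hops only, and it ignores deadlock avoidance entirely. The names `next_port`, `route`, and the port labels `E`, `W`, `N`, `S` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Coord = Tuple[int, int]
# Hypothetical fault map: router coordinate -> output ports marked faulty.
FaultMap = Dict[Coord, Set[str]]

# Coordinate change produced by leaving a router through each port.
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def next_port(cur: Coord, dst: Coord, faulty_ports: Set[str]) -> Optional[str]:
    """Pick an output port using only the local fault information.

    The X direction is tried first (plain XY routing). If that port is
    faulty, the productive Y direction is used as the minimal deviation.
    Returns None when no productive fault-free port exists.
    """
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    candidates = []
    if dx != 0:
        candidates.append("E" if dx > 0 else "W")
    if dy != 0:
        candidates.append("N" if dy > 0 else "S")
    for port in candidates:
        if port not in faulty_ports:
            return port
    return None


def route(src: Coord, dst: Coord, faulty: FaultMap) -> Optional[List[Coord]]:
    """Return the list of routers visited, or None if routing gets stuck."""
    path = [src]
    cur = src
    while cur != dst:
        port = next_port(cur, dst, faulty.get(cur, set()))
        if port is None:
            return None  # partially adaptive: no fault-free path found
        sx, sy = STEP[port]
        cur = (cur[0] + sx, cur[1] + sy)
        path.append(cur)
    return path


if __name__ == "__main__":
    print(route((0, 0), (2, 1), {}))
    print(route((0, 0), (2, 1), {(0, 0): {"E"}}))
    print(route((0, 0), (2, 0), {(0, 0): {"E"}}))
```

Because every hop in this sketch reduces the distance to the destination, a fault on the only productive port leaves the packet with no route. This mirrors the limitation noted in the snippet: a partially adaptive scheme cannot always find an alternative path, especially with multiple faults.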
A strategy to handle multiple defects in the NoC links with almost no impact on the communication delay is presented. The fault-tolerant method can guarantee the functionality of the NoC with multiple defects in any link, and with multiple faulty links. The proposed technique uses information from the test phase to map the application and to configure fault-tolerant features along the NoC links. Results from an application remapped in the NoC show that the communication delay is almost unaffected, with minimal impact and overhead when compared to a fault-free system. We also show that our proposal has a variable impact on performance, while a traditional fault-tolerant solution such as Hamming code has a constant impact. In addition, our proposal can save between 15% and 100% of the energy when compared to Hamming code.

The fault model considers a set of interlink and intralink faults. Several simulations have been run for some fault scenarios, and we show the average communication delay for each one. To show performance results, we analyze the impact of each fault case in the 3x4 NoC as shown below:

Case I. Original router without faulty links. The communication time is the best because the router runs at 885 MHz and fault tolerance techniques are not used.

Case II. ARDS router without faulty links. In this case adaptive routing and data splitting are not used, and the communication delay is very similar to that of the original router, since it uses only one multiplexer to bypass the DS block.

Case III. ARDS router with intralink faults affecting RR_links, using only the adaptive routing strategy. We inserted 34 intralink faults (except in torus links), one in each RR_link, to evaluate the average communication delay in this case. As in Case II, the communication delay is almost unaffected because the DS block is not used.