The HaL SPARC64™ processor, the first 64-bit SPARC-V9 architecture implementation, uses several techniques to ensure a high degree of system reliability, error detection, and error recovery. The CPU of the multi-chip module processor has a superscalar, speculative issue unit and an out-of-order execution datapath. These two processor components complicate the maintenance of precise state in the event of errors. By exploiting the SPARC-V9 architectural features and the micro-architecture for speculative execution, SPARC64 maintains precise state in the event of exceptions and errors, logs and reports errors, and facilitates error detection during full system bring-up. This paper presents details of error detection and handling in the CPU, the cache system, and the Memory Management Unit (MMU). The HaL R1 system also implements a fault-secure memory system design. The memory system corrects all single-bit errors, detects double-bit errors, detects single address line failures, and detects all single dynamic RAM (DRAM) chip failures. Certain debug features that are useful during system bring-up have also been added to the system.

1: Introduction

Design philosophy and design trade-offs are strongly affected by the overall system goals. Historically, high reliability and availability goals were the domain of military, industrial, aerospace, mainframe, and communications applications [1]. Recently, reliability and availability goals have also assumed importance in microprocessor-based workstation environments. For instance, reliability studies from the late 1970s (reported in [1]) on systems such as the B5500, Univac 1108, IBM Dual 370/165, PDP-10, and CRAY-1 indicate a mean-time-to-crash on the order of 10 to 15 hours, which translates (considering the performance of machines during the 1970s) to about 2 × 10^11 instructions executed between failures.
On today's high-performance workstations (approximately 2 × 10^8 instructions per second), this would translate to a mean-time-to-crash of almost 17 minutes. The fact that current workstations typically run for far longer than 17 minutes before crashing points not only to the use of robust design techniques and technology, but also to an increased emphasis on reliable design methods. The most visible aspect of this is the use of error correcting codes in the DRAM-based memory of most commercial workstations.

Although the specification of reliability requirements (in terms of mean-time-to-failure) or availability requirements (average system down time, which is a function of mean-time-to-failure and mean-time-to-repair) can be made precise, translating these requirements into precise design decisions and design trade-offs is not straightforward. The problem is compounded by the difficulty of proving that the design decisions and trade-offs actually meet the specified reliability requirements. This is a recognized problem. However, it is possible to include error-checking mechanisms that help designers arrive at appropriate error detection and recovery techniques...
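
The arithmetic behind the 17-minute figure, and the dependence of availability on mean-time-to-failure and mean-time-to-repair, can be checked with a short sketch. The values 2 × 10^11 instructions between failures and 2 × 10^8 instructions per second are treated here as assumptions: they are the round figures that reproduce the stated result. The repair time is purely hypothetical, and the steady-state formula MTTF / (MTTF + MTTR) is the standard textbook definition rather than anything specific to the HaL design.

```python
def mean_time_to_crash_seconds(instructions_between_failures: float,
                               instructions_per_second: float) -> float:
    """Time between crashes if the failure rate per instruction is unchanged."""
    return instructions_between_failures / instructions_per_second


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Assumed round figures (see the paragraph above).
INSTRUCTIONS_BETWEEN_FAILURES = 2e11   # 1970s systems, 10 to 15 hours between crashes
WORKSTATION_RATE = 2e8                 # instructions per second on a current workstation

mttc_minutes = mean_time_to_crash_seconds(
    INSTRUCTIONS_BETWEEN_FAILURES, WORKSTATION_RATE) / 60
print(f"mean-time-to-crash at workstation speed: {mttc_minutes:.1f} minutes")

# Execution rate implied for the 1970s systems by the 10 to 15 hour range.
for hours in (10, 15):
    implied_rate = INSTRUCTIONS_BETWEEN_FAILURES / (hours * 3600)
    print(f"{hours} h between crashes implies {implied_rate:.2e} instructions/s")

# Hypothetical repair time of 30 minutes against a 12.5 hour mean-time-to-failure.
example_availability = availability(mttf_hours=12.5, mttr_hours=0.5)
print(f"availability with MTTF=12.5 h, MTTR=0.5 h: {example_availability:.4f}")
```

The point of the exercise is that a failure rate per instruction that was tolerable on a 1970s machine becomes intolerable once the instruction rate rises by a couple of orders of magnitude, which is why the per-instruction reliability of the hardware has to improve along with its speed.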