Proceedings of the 24th Annual International Symposium on Computer Architecture 1997
DOI: 10.1145/264107.264141

Hardware fault containment in scalable shared-memory multiprocessors

Abstract: Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size. The pr…

Cited by 25 publications (13 citation statements)
References 21 publications
“…Using the formal definitions given in Section 3.3, the routing function, R, is defined as follows: 7 These routing subfunctions compose both the old and the new routing functions but provide different routing capabilities, depending on which routing function they compose. The routing subfunction R_{E_o=A_n} allows escape routing on the E_o=A_n channels when it composes the old routing function, but allows fully adaptive routing on those same channels when it composes the new routing function.…”
Section: The Optimized Fully Adaptive Double Scheme
Citation type: mentioning
confidence: 99%
“…Static reconfiguration, for example, consists of first stopping and draining all user traffic from the network before commencing and completing network-wide reconfiguration [4], [5], [6], [7]. With this approach, network drainage typically occurs by actively discarding all nondelivered packets not yet reaching their destination nodes.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Software techniques by themselves cannot provide hardware fault containment; they require the cooperation of the hardware, as described by Teodosiu et al [1997]. We assume that the hardware exhibits fail-stop behavior, which implies that after a fault the hardware stops working without generating erroneous results.…”
Section: Support For Hardware Fault Containment
Citation type: mentioning
confidence: 99%
“…A fault containment unit has to be self-sufficient; therefore, it cannot be smaller than a node because a node controller failure will render the entire node useless. Further details are beyond the scope of this paper; they are covered in Teodosiu et al [1997].…”
Section: Support For Hardware Fault Containment
Citation type: mentioning
confidence: 99%
“…In transitioning between the old and new routing functions during network reconfiguration, additional dependencies among network resources may be introduced, causing what is referred to as reconfiguration-induced deadlock. Current techniques typically handle this situation through static reconfiguration, meaning that application traffic is stopped and, usually, dropped from the network during the reconfiguration process (see, for example, [17], [18]). While this approach guarantees the prevention of reconfiguration-induced deadlock, it can lead to unacceptable packet latencies and dropping frequencies for many applications, like compute data-centers, high availability webservers, video-servers, and process control servers.…”
Section: Introduction
Citation type: mentioning
confidence: 99%