Hardware fault containment in scalable shared-memory multiprocessors

Teodosiu, Dan; Baxter, Joel; Govil, Kinshuk; Chapin, John; Rosenblum, Mendel; Horowitz, Mark

doi:10.1145/264107.264141

Cited by 25 publications

(13 citation statements)

References 21 publications

“…Using the formal definitions given in Section 3.3, the routing function, R, is defined as follows: [equation (7)] These routing subfunctions compose both the old and the new routing functions but provide different routing capabilities, depending on which routing function they compose. The routing subfunction R_{E_o=A_n} allows escape routing on the E_o=A_n channels when it composes the old routing function, but allows fully adaptive routing on those same channels when it composes the new routing function.…”

Section: The Optimized Fully Adaptive Double Scheme (mentioning)

confidence: 99%

Deadlock-free dynamic reconfiguration schemes for increased network dependability

Pinkston, Pang, Duato. 2003. IEEE Trans. Parallel Distrib. Syst.

Network-based parallel computing systems often require the ability to reconfigure the routing algorithm to reflect changes in network topology if and when voluntary or involuntary changes occur. The process of reconfiguring a network's routing capabilities may be very inefficient and/or deadlock-prone if not handled properly. In this paper, we propose efficient and deadlock-free dynamic reconfiguration schemes that are applicable to routing algorithms and networks which use wormhole, virtual cut-through, or store-and-forward switching, combined with hard link-level flow control. One requirement is that the network architecture use virtual channels or duplicate physical channels for deadlock-handling as well as performance purposes. The proposed schemes do not impede the injection, transmission, or delivery of user packets during the reconfiguration process. Instead, they provide uninterrupted service, increased availability/reliability, and improved overall quality-of-service support as compared to traditional techniques based on static reconfiguration.

“…Static reconfiguration, for example, consists of first stopping and draining all user traffic from the network before commencing and completing network-wide reconfiguration [4], [5], [6], [7]. With this approach, network drainage typically occurs by actively discarding all nondelivered packets not yet reaching their destination nodes.…”

Section: Introduction (mentioning)

confidence: 99%

Deadlock-free dynamic reconfiguration schemes for increased network dependability

Pinkston, Pang, Duato. 2003. IEEE Trans. Parallel Distrib. Syst.

“…Software techniques by themselves cannot provide hardware fault containment; they require the cooperation of the hardware, as described by Teodosiu et al. [1997]. We assume that the hardware exhibits fail-stop behavior, which implies that after a fault the hardware stops working without generating erroneous results.…”

Section: Support For Hardware Fault Containment (mentioning)

confidence: 99%

“…A fault containment unit has to be self-sufficient; therefore, it cannot be smaller than a node because a node controller failure will render the entire node useless. Further details are beyond the scope of this paper; they are covered in Teodosiu et al. [1997].…”

Section: Support For Hardware Fault Containment (mentioning)

confidence: 99%

Cellular disco

Govil, Teodosiu, Huang, et al. 2000. ACM Trans. Comput. Syst.

Self Cite

Despite the fact that large-scale shared-memory multiprocessors have been commercially available for several years, system software that fully utilizes all their features is still not available, mostly due to the complexity and cost of making the required changes to the operating system. A recently proposed approach, called Disco, substantially reduces this development cost by using a virtual machine monitor that leverages the existing operating system technology. In this paper we present a system called Cellular Disco that extends the Disco work to provide all the advantages of the hardware partitioning and scalable operating system approaches. We argue that Cellular Disco can achieve these benefits at only a small fraction of the development cost of modifying the operating system. Cellular Disco effectively turns a large-scale shared-memory multiprocessor into a virtual cluster that supports fault containment and heterogeneity, while avoiding operating system scalability bottlenecks. Yet at the same time, Cellular Disco preserves the benefits of a shared-memory multiprocessor by implementing dynamic, fine-grained resource sharing, and by allowing users to overcommit resources such as processors and memory. This hybrid approach requires a scalable resource manager that makes local decisions with limited information while still providing good global performance and fault containment. In this paper we describe our experience with a Cellular Disco prototype on a 32-processor SGI Origin 2000 system. We show that the execution time penalty for this approach is low, typically within 10% of the best available commercial operating system for most workloads, and that it can manage the CPU and memory resources of the machine significantly better than the hardware partitioning approach.

“…In transitioning between the old and new routing functions during network reconfiguration, additional dependencies among network resources may be introduced, causing what is referred to as reconfiguration-induced deadlock. Current techniques typically handle this situation through static reconfiguration, meaning that application traffic is stopped and, usually, dropped from the network during the reconfiguration process (see, for example, [17], [18]). While this approach guarantees the prevention of reconfiguration-induced deadlock, it can lead to unacceptable packet latencies and dropping frequencies for many applications, like compute data-centers, high availability webservers, video-servers, and process control servers.…”

Section: Introduction (mentioning)

confidence: 99%

A methodology for developing deadlock-free dynamic network reconfiguration processes. Part II

Lysne, Pinkston, Duato. 2005. IEEE Trans. Parallel Distrib. Syst.

Dynamic network reconfiguration is defined as the process of changing from one routing function to another while the network remains up and running. The main challenge is in avoiding deadlock anomalies while keeping restrictions on packet injection and forwarding minimal. Current approaches either require virtual channels in the network or they work only for a limited set of routing algorithms and/or fault patterns. In this paper, we present a methodology for devising deadlock-free and dynamic transitions between old and new routing functions that is consistent with newly proposed theory [1]. The methodology is independent of topology, can be applied to any deadlock-free routing function, and puts no restrictions on the routing function changes that can be supported. Furthermore, it does not require any virtual channels to guarantee deadlock freedom. This research is motivated by current trends toward using increasingly larger Internet and transaction processing servers based on clusters of PCs that have very high availability and dependability requirements, as well as other local, system, and storage area network-based computing systems.

Index Terms: Interconnection network, dynamic reconfiguration, deadlock-freedom methodology, system reliability and availability.
