Abstract-Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role for the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance.
In many aspects of human activity, there has been a continuous struggle between the forces of centralization and decentralization. Computing exhibits the same phenomenon; we have gone from mainframes to PCs and local networks in the past, and over the last decade we have seen a centralization and consolidation of services and applications in data centers and clouds. We position that a new shift is necessary. Technological advances such as powerful dedicated connection boxes deployed in most homes, high capacity mobile end-user devices and powerful wireless networks, along with growing user concerns about trust, privacy, and autonomy requires taking the control of computing applications, data, and services away from some central nodes (the "core") to the other logical extreme (the "edge") of the Internet. We also position that this development can help blurring the boundary between man and machine, and embrace social computing in which humans are part of the computation and decision making loop, resulting in a human-centered system design. We refer to this vision of human-centered edge-device based computing as Edge-centric Computing. We elaborate in this position paper on this vision and present the research challenges associated with its implementation.
Multiprocessor interconnection networks may reach congestion with high traffic loads, which prevents reaching the wished performance. Unfortunately, many of the mechanisms proposed in the literature for congestion control either suffer from a lack of robustness, being unable to work properly with different traffic patterns or message lengths, or detect congestion relying on global information that wastes some network bandwidth. This paper presents a family of mechanisms to avoid network congestion in wormhole networks. All of them need only local information, applying message throttling when it is required. The proposed mechanisms use different strategies to detect network congestion and also apply different corrective actions. The mechanisms are evaluated and compared for several network loads and topologies, noticeably improving network performance with high loads but without penalizing network behavior for low and medium traffic rates, where no congestion control is required.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.