Fail-in-Place Network Design: Interaction Between Topology, Routing Algorithm and Failures

Domke, Jens; Hoefler, Torsten; Matsuoka, Satoshi

doi:10.1109/sc.2014.54

Cited by 23 publications

(16 citation statements)

References 30 publications


“…Fig.2 shows the architecture of the proposed combination of OFS and OMNeT++. In general, we have followed similar ideas to those proposed in [10], [14], but we introduce several improvements. The main contributions of this methodology are the following:…”

Section: Methodology Description (mentioning)

confidence: 99%

“…Next, we use the IB tools included in OFS to obtain the information needed to build the network in the simulation tool. For the time being, we have put efforts in the development of a software layer which integrates OFS and the network simulators proposed in [8], [9] and [10]. The rest of this paper is organized as follows: Section II shows an overview of the InfiniBand architecture.…”

Section: Motivation (mentioning)

confidence: 99%

“…After writing NED files, the programmer has to populate with C++ code the modules defined in the NED files, in order to implement their functionality. Examples of OMNeT++-based simulation models for HPC interconnection networks have been used in [8], [9] and [10].…”

Section: The Omnet++ Framework (mentioning)

confidence: 99%


Combining OpenFabrics Software and Simulation Tools for Modeling InfiniBand-Based Interconnection Networks

Maglione-Mathey, Yébenes, Escudero-Sahuquillo et al. 2016

2016 2nd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB)


The design of interconnection networks is becoming extremely important for High-Performance Computing (HPC) systems in the Exascale Era. Design decisions like the selection of the network topology, routing algorithm, fault tolerance and/or congestion control are crucial for the network performance. Besides, interconnection network designers are also focused on creating middleware layers compatible with different network technologies, which make it possible for these technologies to interoperate. One example is the OpenFabrics Software (OFS) used in HPC for breakthrough applications that require high efficiency computing, wire-speed messaging, microsecond latencies and fast I/O for storage and file systems. OFS is compatible with several HPC interconnect technologies, like InfiniBand, iWarp or RoCE. One challenge in the design of new features for improving the interconnection network performance is to model in specific simulation tools the latency introduced by the OFS modules into the network traffic. In this paper, we present a work-in-progress methodology to combine the OFS middleware with OMNeT++-based simulation tools, so that we can use some of the OFS modules, like OpenSM or ibsim, combined with simulation tools. We also propose a set of tools for analyzing the properties of different network topologies. Future work will consist of modeling the functionality of other OFS modules in network simulators.


“…Then, it

Component          AFR      MTTF (hours)   Reliability
Network [12,17]    1.00%    876,000        4-nines
NIC [12,17]        1.00%    876,000        4-nines
DRAM [18]          39.5%    22,177         2-nines
CPU [18]           41.9%    20,906         2-nines
Server [17,39]     47.9%    18,304         2-nines

Table 2: Worst case scenario reliability data. The reliability is estimated over a period of 24 hours and expressed in the "nines" notation; the MTTF is expressed in hours.…”
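The relationship between the columns of the quoted table can be illustrated with a short sketch. It is not taken from the cited paper: it assumes that MTTF is the number of hours in a 365-day year divided by the AFR, and that failures follow an exponential model, so the reliability over a period t is exp(-t / MTTF). Under these assumptions the Network and NIC rows are reproduced exactly (an AFR of 1.00% gives 876,000 hours); the other rows match only approximately, possibly because the listed AFRs are rounded. The function names are hypothetical.

```python
import math

# Assumption: a 365-day year, i.e. 8760 hours.
HOURS_PER_YEAR = 365 * 24


def mttf_hours(afr: float) -> float:
    """MTTF in hours for an annualized failure rate given as a fraction.

    Assumption (not stated in the quoted text): MTTF = hours per year / AFR.
    """
    return HOURS_PER_YEAR / afr


def failure_probability(mttf: float, period_hours: float = 24.0) -> float:
    """Probability of at least one failure within the period.

    Assumption: exponential failure model, R(t) = exp(-t / MTTF).
    expm1 keeps precision when the probability is very small.
    """
    return -math.expm1(-period_hours / mttf)


def nines(mttf: float, period_hours: float = 24.0) -> int:
    """Number of leading nines of the reliability over the period."""
    return math.floor(-math.log10(failure_probability(mttf, period_hours)))


# MTTF values (hours) as listed in the quoted Table 2.
TABLE_MTTF = {
    "Network": 876_000,
    "NIC": 876_000,
    "DRAM": 22_177,
    "CPU": 20_906,
    "Server": 18_304,
}

for component, mttf in TABLE_MTTF.items():
    reliability = 1.0 - failure_probability(mttf)
    print(f"{component:8s} R(24h) = {reliability:.6f}  ({nines(mttf)}-nines)")
```

With the MTTF values from the table, this sketch yields 4 nines for the network and NIC and 2 nines for DRAM, CPU and server, which agrees with the Reliability column.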

Section: Dare: Safety and Liveness (mentioning)

confidence: 99%

“…Various sources provide failure data of systems and system components [12,18,31,36]. Yet, systems range from very reliable ones with AFRs per component below 0.2% [31] to relatively unreliable ones with component failure log events at an annual rate of more than 40% [18] (here we assume that a logged error impacted the function of the device).…”

Section: Fine-grained Failure Model (mentioning)

confidence: 99%

Dare

Poke, Hoefler 2015

Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Self Cite


The increasing amount of data that needs to be collected and analyzed requires large-scale datacenter architectures that are naturally more susceptible to faults of single components. One way to offer consistent services on such unreliable systems is to use replicated state machines (RSMs). Yet, traditional RSM protocols cannot deliver the needed latency and request rates for future large-scale systems. In this paper, we propose a new set of protocols based on Remote Direct Memory Access (RDMA) primitives. To assess these mechanisms, we use a strongly consistent key-value store; the evaluation shows that our simple protocols improve RSM performance by more than an order of magnitude. Furthermore, we show that RDMA introduces various new options, such as log access management. Our protocols enable operators to fully utilize the new capabilities of the quickly growing number of RDMA-capable datacenter networks.


The Implementation and Evaluation of High-Speed Link Monitoring Tool for Supercomputer

et al. 2019

Communications in Computer and Information Science
