Oscar H. Mondragon scite author profile

Summary

We present a detailed examination of time agreement characteristics for nodes within extreme-scale parallel computers. Using a software tool we introduce in this paper, we quantify attributes of clock skew among nodes in three representative high-performance computers sited at three national laboratories. Our measurements detail the statistical properties of time agreement among nodes and how time agreement drifts over typical application execution durations. We discuss the implications of our measurements and why the current state of the field is inadequate, and we propose strategies to address the observed shortcomings.

KEYWORDS

clock synchronization, large-scale systems, system software, time service

INTRODUCTION

The trend towards increasing node counts in high-performance computing (HPC) is motivating a move toward greater levels of concurrency in HPC systems. Today's software environment is now being called on to produce new solutions for emerging issues including managing system power, resilience, and performance characteristics. The distributed algorithms that underlie such services operate much more efficiently in the presence of tightly synchronized clocks. For example, tightly synchronized clocks benefit well-known gang scheduling techniques and complex consensus algorithms. To illustrate the point, such time synchronization enables more aggressive assumptions about communication and synchronization patterns, the removal of unnecessary locks, and a wide range of other applications. Clock-based techniques are already frequently deployed in cloud and data center distributed systems for precisely these reasons.

We examined the time synchronization on some of the world's fastest and most powerful machines. These leadership-class systems employ high-end hardware connected by an extremely low-latency, low-jitter interconnect in a carefully controlled environment, in contrast to widely distributed cloud-based systems based on commodity hardware and networks. Because of this, we assumed that these systems would have more stable, predictable hardware clocks, and close base time agreement using only standard time synchronization systems like Network Time Protocol (NTP). We did not believe that the complex hardware and software techniques used to provide time synchronization in wide-area systems would be necessary in leadership systems.

Our results demonstrate that the actual time uncertainty for leadership-class machines is often unexpectedly large, in some cases over 600 milliseconds despite network latencies of less than two microseconds. Building on this, we set out to thoroughly quantify the magnitude of the time synchronization challenge in leadership-class systems. This study shows that the current time protocol in use, NTP, is not suitable for providing the level of time synchronization necessary for important system software tasks such as coordinated scheduling. Based on this, we conclude

How I Learned to Stop Worrying and Love In Situ Analytics

Levy

Ferreira

Widener

et al. 2016

Understanding Performance Interference in Next-Generation HPC Systems

Mondragon

Bridges

Levy

et al. 2016

Fast and Precise: Parallel Processing of Vehicle Traffic Videos Using Big Data Analytics

Perafan-Villota

Mondragon

Mayor-Toro

2022

IEEE Trans. Intell. Transport. Syst.

Heterogeneity-Aware Data Placement in Hybrid Clouds

Marquez

González

Mondragon

2019

Quantifying Scheduling Challenges for Exascale System Software

Mondragon

Bridges

Jones

2015

The move towards high-performance computing (HPC) applications composed of coupled codes and the need to dramatically reduce data movement is leading to a reexamination of time-sharing vs. space-sharing in HPC systems. In this paper, we discuss and begin to quantify the performance impact of a move away from strict space-sharing of nodes for HPC applications. Specifically, we examine the potential performance cost of time-sharing nodes between application components, we determine whether a simple coordinated scheduling mechanism can address these problems, and we research how suitable simple constraint-based optimization techniques are for solving scheduling challenges in this regime. Our results demonstrate that current general-purpose HPC system software scheduling and resource allocation systems are subject to significant performance deficiencies, which we quantify for six representative applications. Based on these results, we discuss areas in which additional research is needed to meet the scheduling challenges of next-generation HPC systems.

Computational and Communication Infrastructure Challenges for Resilient Cloud Services

et al. 2022

Fault tolerance and the availability of applications, computing infrastructure, and communications systems during unexpected events are critical in cloud environments. The microservices architecture, and the technologies that it uses, should be able to maintain acceptable service levels in the face of adverse circumstances. In this paper, we discuss the challenges faced by cloud infrastructure in relation to providing resilience to applications. Based on this analysis, we present our approach for a software platform based on a microservices architecture, as well as the resilience mechanisms to mitigate the impact of infrastructure failures on the availability of applications. We demonstrate the capacity of our platform to provide resilience to analytics applications, minimizing service interruptions and keeping acceptable response times.

Scheduling In-Situ Analytics in Next-Generation Applications

Mondragon

Bridges

Levy

et al. 2016
