Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is non-trivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about what local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
Index Terms: parallel I/O; checkpoint-restart; immutable data; adaptive multilevel asynchronous I/O
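The local-write-plus-background-flush pattern described in this abstract can be sketched in a few lines. The following is a minimal, illustrative Python sketch, not VeloC's actual implementation: a checkpoint call blocks only for the fast local write, while a background thread drains a queue and copies checkpoints to slower external storage. All paths and names are hypothetical placeholders.

```python
import shutil
import threading
import queue
from pathlib import Path

# Hypothetical directories: LOCAL stands in for fast node-local storage,
# EXTERNAL for a slower parallel file system.
LOCAL = Path("/tmp/ckpt_local")
EXTERNAL = Path("/tmp/ckpt_external")
LOCAL.mkdir(parents=True, exist_ok=True)
EXTERNAL.mkdir(parents=True, exist_ok=True)

flush_queue: "queue.Queue" = queue.Queue()

def flusher() -> None:
    """Background thread: drain the queue, copying each checkpoint
    from local storage to external storage."""
    while True:
        path = flush_queue.get()
        if path is None:  # sentinel value: shut down the flusher
            break
        shutil.copy2(path, EXTERNAL / path.name)
        flush_queue.task_done()

def checkpoint(step: int, data: bytes) -> None:
    """Only the fast local write is synchronous; the slow
    flush to external storage is deferred to the flusher thread."""
    local_file = LOCAL / f"ckpt_{step}.bin"
    local_file.write_bytes(data)  # fast, blocking local write
    flush_queue.put(local_file)   # asynchronous external flush

t = threading.Thread(target=flusher, daemon=True)
t.start()
for step in range(3):
    checkpoint(step, b"application state %d" % step)
flush_queue.put(None)  # signal shutdown after the last checkpoint
t.join()
print(sorted(p.name for p in EXTERNAL.iterdir()))
```

In a real multilevel scheme the interesting decisions happen inside `checkpoint`: choosing which local device to write to based on monitored flush backlog, which is the adaptation problem the paper targets.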
As computing systems grow to exascale levels of performance, the smallest elements of a single processor can greatly affect the entire computer system (e.g., its power consumption). As future generations of processors are developed, simulation at the gate level is necessary to ensure that the necessary target performance benchmarks are met prior to fabrication. The most common simulation tools available today utilize either a single node or small clusters and as such create a bottleneck in the development process. This paper focuses on the massively parallel simulation of logic gate circuit models using supercomputer systems. This performance study leverages the OpenSPARC T2 processor design using Rensselaer's Optimistic Simulation System (ROSS). We conduct simulations of the crossbar component on both a 24-core SMP machine and an IBM Blue Gene/L. Using a single SMP core as the baseline, our performance experiments on 1024 cores of the Blue Gene/L demonstrate more than 131-times faster execution. Our results capitalize on the balanced compute and network power of the Blue Gene/L system.
The semantics of HPC storage systems are defined by the consistency models to which they abide. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict consistency model. The use of POSIX consistency imposes a performance penalty that becomes more significant as the scale of parallel file systems increases and the access time to storage devices, such as node-local solid-state storage devices, decreases. While some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and ambiguously as by-products of a particular implementation. In this work, we establish a connection between memory consistency models and storage consistency models and revisit the key design choices of storage consistency models from a high-level perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a comprehensive performance comparison of two relaxed consistency models on a range of commonly seen parallel I/O workloads, such as checkpoint/restart of scientific applications and random reads of deep learning applications. We demonstrate that for certain I/O scenarios, a weaker consistency model can significantly improve the I/O performance. For instance, for the small random reads typically found in deep learning applications, session consistency achieved a 5x improvement in I/O bandwidth compared to commit consistency, even at small scales.
The Exascale Computing Project (ECP) provides a unique opportunity to advance computational science and engineering (CSE) through an accelerated growth phase in extreme-scale computing. Central to the project is the development of next-generation applications and software technologies that can exploit emerging architectures for optimal performance and provide high-fidelity, multiphysics, multiscale capabilities. However, disruptive changes in computer architectures and the complexities of tackling new frontiers in extreme-scale modeling, simulation, and analysis present daunting challenges to the productivity of software developers and the sustainability of software artifacts. Members of the CSE community, especially at extreme scales but more broadly at all scales of computing, face an urgent need to improve developer productivity (positively impacting product quality, development time, and staffing resources) and software sustainability (reducing the cost of maintaining, sustaining, and evolving software capabilities). This report summarizes technical and cultural challenges in scientific software productivity and sustainability. We introduce work by the IDEAS project within ECP (called IDEAS-ECP, https://ideas-productivity.org) to foster and advance software productivity and sustainability for extreme-scale CSE, including partnerships with complementary groups. IDEAS goals are to qualitatively change the culture of extreme-scale computational science and to provide a foundation (through software productivity methodologies and an extreme-scale software ecosystem) that enables transformative and reliable next-generation predictive science and decision support. Work spans four synergistic strategies: (1) curating methodologies to improve software practices of individuals and teams, (2) incrementally and iteratively upgrading software practices, (3) establishing software communities, and (4) engaging in community outreach.
Because these issues are relevant throughout all scales of scientific computing, we aim for broad readership, and we hope that these experiences and resources may be useful in other contexts, as individuals and teams work within their own projects, institutions, and communities to advance software practices and overall productivity. Members of the IDEAS-ECP project serve as catalysts to address the challenges facing ECP teams by focusing on improving how teams conduct software efforts. A central activity is productivity and sustainability improvement planning (PSIP), a lightweight, iterative workflow in which teams identify their most urgent software bottlenecks and work to overcome them. We explain how teams are more productively tackling science goals through PSIP advances in areas such as software builds, testing, refactoring, and onboarding. As the ECP community works toward an extreme-scale scientific software ecosystem composed of high-quality, reusable software components and libraries, we are advancing methodologies to support Software Development Kits and to improve transparency and reproducibility.