For a full checkpoint on a large-scale HPC system, a huge memory context may have to be transferred over the network and saved to reliable storage. As such, the time taken to checkpoint becomes a critical issue that directly impacts the total execution time. Incremental checkpointing, a less intrusive method for reducing this wasted time, has therefore been gaining significant attention in the HPC community. In this paper, we build a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints, and we give a method for finding the number of those incremental checkpoints. Furthermore, most of the comparisons between the incremental checkpoint model and the full checkpoint model [19] on the same failure data set show that the total wasted time of the incremental checkpoint model is significantly smaller than that of the full checkpoint model.
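As a rough, illustrative waste model (a sketch only, not the notation or exact formulation of the model in this paper): let C_f be the cost of one full checkpoint, C_i the cost of one incremental checkpoint, and m the number of incremental checkpoints taken between two consecutive full checkpoints. The overhead added per full-checkpoint cycle is then

    W(m) = C_f + m * C_i + E[rework],

where E[rework] is the expected re-computation after a failure, measured back to the most recent checkpoint. Increasing m shortens the distance to the last saved state and so reduces E[rework], but adds m * C_i of checkpoint cost; when C_i is much smaller than C_f, an interior value of m minimizes W(m), which is the kind of trade-off the comparison above exploits.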
Today's increased computing speeds allow conventional sequential machines to effectively emulate associative computing techniques. Here is a parallel programming paradigm designed for a wide range of computing engines.

Associative computing evolved in an era when associative memories were both relatively new and, because they required a comparator at each bit of memory, relatively expensive. In the early 1970s, Goodyear Aerospace improved upon early associative processing techniques with its Staran SIMD (single instruction, multiple data) computer. Goodyear realized that the massively parallel search capability of bit-serial SIMDs could simulate associative searching, with the cost advantage of sharing the comparison logic (that is, the processing elements) over all the bits in an entire row of memory. This approach provided two additional benefits: the word widths could be very large (from 256 bits to 64 kilobits), and the data could be processed in situ using the same PEs. However, today's lower hardware costs and increased computing speeds allow parallel techniques to be effectively emulated on conventional sequential machines.

Accessing data by associative searching rather than by addresses, and processing data in memory, require a new programming style. One goal of our research is to develop a parallel programming paradigm that is suitable for many diverse applications, is efficient to write and execute, and can be used on a wide range of computing engines, from PCs and workstations to massively parallel supercomputers.

Our associative-computing (ASC) paradigm is an extension of the general associative processing techniques developed by Goodyear. We use two-dimensional tables as the basic data structure. Our paradigm has an efficient associative-based, dynamic memory-allocation mechanism that does not use pointers. It incorporates data parallelism at the base level, so that programmers do not have to specify low-level sequential tasks such as sorting, looping, and parallelization. Our paradigm supports all of the standard data-parallel and massively parallel computing algorithms. It combines numerical computation (such as convolution, matrix multiplication, and graphics) with nonnumerical computing (such as compilation, graph algorithms, rule-based systems, and language interpreters). This article focuses on the nonnumerical aspects of ASC.

The ASC model

The ASC model is the basis of a high-level associative-programming paradigm and language. As described in the sidebar, "Properties of the ASC model," the extended model provides a basis for algorithm development and analysis similar to the
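As a hypothetical illustration of the emulation idea above (a sketch under assumed data, not Goodyear's hardware or the ASC language itself), the following Python fragment mimics associative computing on a conventional sequential machine: rows of a two-dimensional table are selected by content rather than by address, and the responding rows are then updated in place.

# Hypothetical parts table; field names and values are illustrative only.
table = [
    {"part": "bolt", "qty": 40, "bin": 3},
    {"part": "nut",  "qty": 15, "bin": 7},
    {"part": "bolt", "qty":  5, "bin": 9},
]

# Associative search: build a "responder" mask by matching on content,
# not by computing an address.
responders = [row["part"] == "bolt" and row["qty"] < 10 for row in table]

# Data-parallel style update applied in situ to every responding row.
for row, hit in zip(table, responders):
    if hit:
        row["qty"] += 100  # restock the rows that responded to the search

print(table)  # the low-quantity bolt row now shows qty == 105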
The incremental checkpoint mechanism was introduced to reduce the high checkpoint overhead of regular (full) checkpointing, especially in high-performance computing systems. To gain an extra advantage from the incremental checkpoint technique, we propose an optimal checkpoint frequency function that globally minimizes the expected wasted time of the incremental checkpoint mechanism. We also derive the re-computing time coefficient used to approximate the re-computing time. Moreover, to reduce the complexity of the recovery state, full checkpoints are performed from time to time. In this paper we present an approach to evaluating the appropriate constant number of incremental checkpoints between two consecutive full checkpoints. Although the number of incremental checkpoints is constant, the checkpoint interval derived from the proposed model varies depending on the failure rate of the system. The checkpoint time is illustrated for a Weibull failure distribution and can easily be simplified to the exponential case.
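For intuition on the Weibull-to-exponential simplification (an illustrative note, not the paper's derivation): the Weibull hazard rate with shape parameter β and scale parameter η is

    λ(t) = (β/η) * (t/η)^(β-1).

When β = 1 this collapses to the constant rate λ(t) = 1/η, i.e., the exponential case, so a checkpoint-frequency function driven by the hazard yields equally spaced checkpoints. For β ≠ 1 the hazard changes over time, which is consistent with the statement above that the checkpoint interval varies with the system's failure rate even though the number of incremental checkpoints per full-checkpoint cycle stays constant.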
The rate of failures in HPC systems continues to increase as the number of components comprising the systems grows. System logs are one of the valuable information sources that can be used to analyze system failures and their root causes. However, system log files are usually too large and complex to analyze manually. Some existing log clustering tools seek to help analysts explore these logs; however, they fail to satisfy our needs with respect to scalability, usability, and quality of results. Thus, we have developed a log clustering tool that better addresses these needs. In this paper we present our novel approach and initial experimental results.