In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in time and space domain. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe spatial correlation. We further utilize the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hP), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hP in both offline prediction of failure by using the Los Alamos HPC traces and online prediction in an institute-wide clusters coalition environment. Experimental results show the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction during the time from
a b s t r a c tAn efficient resource allocation is a fundamental requirement in high performance computing (HPC) systems. Many projects are dedicated to large-scale distributed computing systems that have designed and developed resource allocation mechanisms with a variety of architectures and services. In our study, through analysis, a comprehensive survey for describing resource allocation in various HPCs is reported. The aim of the work is to aggregate under a joint framework, the existing solutions for HPC to provide a thorough analysis and characteristics of the resource management and allocation strategies. Resource allocation mechanisms and strategies play a vital role towards the performance improvement of all the HPCs classifications. Therefore, a comprehensive discussion of widely used resource allocation strategies deployed in HPC environment is required, which is one of the motivations of this survey. Moreover, we have classified the HPC systems into three broad categories, namely: (a) cluster, (b) grid, and (c) aforementioned systems are cataloged into pure software and hybrid/hardware solutions. The system classification is used to identify approaches followed by the implementation of existing resource allocation strategies that are widely presented in the literature.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.