With the growing scale of HPC applications, interruptions caused by hardware failures have become more frequent. The remarkable decrease in Mean Time Between Failures (MTBF) of current systems motivates research into suitable fault tolerance solutions. Message logging combined with uncoordinated checkpointing forms a scalable rollback-recovery solution. However, message logging techniques are usually responsible for most of the overhead during failure-free executions. Taking this into consideration, this paper proposes Hybrid Message Pessimistic Logging (HMPL), which combines the fast-recovery feature of pessimistic receiver-based message logging with the low failure-free overhead introduced by pessimistic sender-based message logging. The HMPL manages messages using a distributed controller and storage to avoid harming the system's scalability. Experiments show that the HMPL is able to reduce overhead by 34% during failure-free executions and by 20% in faulty executions when compared with a pessimistic receiver-based message logging approach. This research has been supported by the MINECO (MICINN) Spain under contracts TIN2011-24384 and TIN2014-53172-P.
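To illustrate the two logging roles the abstract describes being combined, the following is a minimal, hypothetical sketch (not the paper's actual implementation): the sender keeps full message payloads (sender-based logging, cheap during failure-free runs), while the receiver pessimistically persists only the small receive-order determinant before delivery, which enables deterministic replay after a failure.

```python
# Hypothetical sketch of hybrid pessimistic logging roles. All class and
# method names are illustrative assumptions, not the paper's API.

class Sender:
    def __init__(self, sid):
        self.sid = sid
        self.payload_log = {}            # sender-based log: seq -> payload

    def send(self, seq, payload):
        self.payload_log[seq] = payload  # log the payload before sending
        return (self.sid, seq, payload)  # stand-in for network delivery


class Receiver:
    def __init__(self):
        self.determinants = []           # receiver-side log: receive order only
        self.delivered = []

    def receive(self, msg):
        sid, seq, payload = msg
        # Pessimistic guarantee: persist the determinant *before* delivering,
        # so the receive order can be replayed exactly after a failure.
        self.determinants.append((sid, seq))
        self.delivered.append(payload)

    def replay(self, senders):
        # Recovery: re-fetch payloads from the sender logs in the logged order.
        return [senders[sid].payload_log[seq] for sid, seq in self.determinants]
```

The design point this sketch reflects is that the large data (payloads) stays distributed at the senders, while only tiny ordering records are logged synchronously at the receiver.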
Software performance anomaly detection is a major challenge in complex industrial cyber-physical systems. The automated comparison of runtime execution metrics against reference ones provides a potential solution. We introduce the concept of software passports, intended to act as a signature construct for the runtime performance behaviour of reference executions. Our software passport design is based on Extra-Functional Behaviour (EFB) metrics. Among such metrics, we focus especially on CPU time and on read and write communication event counts of different processes. We also elaborate the notion of phases for systems with repetitive tasks during their execution and its fundamental role in our software passports. We employ regression modelling of our collected data for comparative purposes. The comparison reveals inconsistencies between the execution at hand and the software passport, if present. Such inconsistencies are strong indicators of the presence of performance anomalies. Our design is capable of detecting synthetically introduced performance anomalies in real execution tracing data from a semiconductor photolithography machine.
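The regression-based comparison the abstract describes can be sketched as follows. This is a hypothetical, simplified illustration (names, metric choice, and the 20% tolerance are assumptions, not the paper's design): fit a linear model of one EFB metric against an event count over reference executions, then flag a new execution whose residual against the passport's prediction is large.

```python
# Illustrative sketch: a "software passport" as a fitted linear model of
# CPU time vs. communication event count, used to flag anomalous runs.

def fit_passport(event_counts, cpu_times):
    # Ordinary least squares for y = a*x + b over reference executions.
    n = len(event_counts)
    mx = sum(event_counts) / n
    my = sum(cpu_times) / n
    a = sum((x - mx) * (y - my) for x, y in zip(event_counts, cpu_times)) / \
        sum((x - mx) ** 2 for x in event_counts)
    b = my - a * mx
    return a, b

def is_anomalous(passport, events, cpu_time, tol=0.2):
    # Flag the run if the relative residual exceeds the tolerance.
    a, b = passport
    expected = a * events + b
    return abs(cpu_time - expected) > tol * abs(expected)

# Reference phase: CPU time grows roughly linearly with I/O events.
ref_events = [10, 20, 30, 40]
ref_cpu_ms = [21, 41, 59, 80]
pp = fit_passport(ref_events, ref_cpu_ms)

print(is_anomalous(pp, 25, 51))   # consistent with the passport -> False
print(is_anomalous(pp, 25, 90))   # large residual -> True
```

In the paper's setting the model would be built per phase of the repetitive execution, so that each phase's passport captures its own characteristic metric relationship.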
Next generation data centers will likely be based on the emerging paradigm of disaggregated function-blocks-as-a-unit, departing from the current state of mainboard-as-a-unit. Multiple functional blocks, or bricks, such as compute, memory and peripherals will be spread throughout the entire system and interconnected via one or multiple high speed networks. The amount of memory available will be very large, distributed among multiple bricks. This new architecture brings various benefits that are desirable in today's data centers, such as fine-grained technology upgrade cycles, fine-grained resource allocation, and access to a larger amount of memory and accelerators. This paper presents an analysis of the impact and benefits of memory disaggregation. One of the biggest challenges when analyzing these architectures is that memory accesses must be modeled correctly in order to obtain accurate results. However, modeling every memory access would generate an overhead so high that simulation becomes unfeasible for real data center applications. A model to represent and analyze memory disaggregation has been designed, and a statistics-based, queue-based full system simulator was developed to rapidly and accurately analyze application performance in disaggregated systems. With a mean error of 10%, simulation results pointed out that the network layers may introduce overheads that degrade applications' performance by up to 66%. Initial results also suggest that low memory access bandwidth may degrade applications' performance by up to 20%. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 687632 (dReDBox project) and TIN2015-65316-P - Computacion de Altas Prestaciones VII.
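A back-of-the-envelope sketch can show why the network layers dominate remote-memory cost in such architectures. The following is illustrative only (the latency numbers and the M/M/1 queue are assumptions, not the paper's model): the remote-memory link is approximated by an M/M/1 queue whose waiting time is added to every remote access.

```python
# Illustrative queuing-style estimate of disaggregated memory access time.
# All parameters are hypothetical; the paper's simulator is far more detailed.

def mm1_sojourn(service_ns, utilization):
    # Mean M/M/1 sojourn time: W = S / (1 - rho).
    assert 0 <= utilization < 1, "queue must be stable"
    return service_ns / (1.0 - utilization)

def avg_access_ns(remote_frac, local_ns, remote_ns, link_ns, utilization):
    # Weighted average of local accesses and remote accesses, where each
    # remote access pays the memory-brick latency plus network queuing delay.
    remote_cost = remote_ns + mm1_sojourn(link_ns, utilization)
    return (1 - remote_frac) * local_ns + remote_frac * remote_cost

# All-local baseline vs. a run with half its accesses going to a remote brick.
base = avg_access_ns(0.0, local_ns=100, remote_ns=150, link_ns=500, utilization=0.5)
disagg = avg_access_ns(0.5, local_ns=100, remote_ns=150, link_ns=500, utilization=0.5)
slowdown = disagg / base  # network queuing makes remote accesses dominate
```

Even with these toy numbers, the queuing delay on the link, not the raw memory latency, contributes most of the remote-access cost, which is consistent with the abstract's finding that network layers introduce the largest overheads.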