Reliability is a cumbersome problem in the evolution of High Performance Computing systems and data centers. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and end users have to discover them manually. Clearly, this approach does not scale to large supercomputers and facilities: automated methods to detect faults and unhealthy conditions are needed. Our method uses a type of neural network called an autoencoder, trained to learn the normal behavior of a real, in-production HPC system and deployed on the edge of each computing node. We obtain very good accuracy (between 90% and 95%) and also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the performance of the computing units.
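The detection principle in this abstract, learning normal behavior and flagging samples that the model reconstructs poorly, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a one-dimensional linear projection (equivalent to PCA, fitted by power iteration) stands in for the neural autoencoder, and the metric names, sample values, and the 1.5x threshold margin are all hypothetical.

```python
import math

def fit_linear_codec(samples, iterations=50):
    """Fit the mean and first principal direction of `samples` by power iteration.

    Stand-in for autoencoder training (an assumption of this sketch):
    encoding projects a sample onto one direction, decoding maps the
    1-D code back to the input space.
    """
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    centered = [[s[i] - mean[i] for i in range(dim)] for s in samples]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    w = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return mean, w

def reconstruction_error(sample, mean, w):
    """Euclidean distance between a sample and its encode/decode round trip."""
    centered = [x - m for x, m in zip(sample, mean)]
    code = sum(c * wi for c, wi in zip(centered, w))   # encode to 1-D
    reconstructed = [code * wi for wi in w]            # decode back
    return math.sqrt(sum((c - r) ** 2 for c, r in zip(centered, reconstructed)))

# Hypothetical "healthy" telemetry: (normalized CPU load, normalized power).
healthy = [(k / 19, 2 * k / 19 + 0.01 * (-1) ** k) for k in range(20)]
mean, direction = fit_linear_codec(healthy)

# Hypothetical threshold: 1.5x the worst error seen on healthy data.
threshold = 1.5 * max(reconstruction_error(s, mean, direction) for s in healthy)

def is_anomalous(sample):
    """Flag a sample whose reconstruction error exceeds the threshold."""
    return reconstruction_error(sample, mean, direction) > threshold
```

In this sketch only healthy samples are needed for fitting, which matches the idea of learning normal behavior; a sample such as a node drawing far more power than its load explains would be reconstructed poorly and flagged.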
Energy efficiency and datacentre automation are critical targets of the research and deployment agenda of CINECA and its research partners in the Energy Efficient System Laboratory of the University of Bologna and the Integrated System Laboratory in ETH Zurich. In this manuscript, we present the primary outcomes of the research conducted in this domain under the umbrella of several European, national and private funding schemes. These outcomes consist of: (i) the ExaMon scalable, flexible, holistic monitoring framework, which is capable of ingesting 70 GB/day of telemetry data from the entire CINECA datacentre and linking this data with machine learning and artificial intelligence techniques and tools; (ii) the exploitation of ExaMon to evaluate the viability of machine-learning-based job scheduling, power prediction and deep-learning-based anomaly detection of compute nodes; (iii) the viability of scalable, out-of-band and high-frequency power monitoring in compute nodes, leveraging low-cost and open-source embedded hardware and edge computing, namely DiG; (iv) finally, the viability of a run-time library, namely COUNTDOWN, that exploits communication regions in large-scale applications to reduce energy consumption without impairing execution time.
We report on a self-sustainable, wireless accelerometer-based system for wear detection in a band saw blade. Due to the combination of low power hardware design, thermal energy harvesting with a small thermoelectric generator (TEG), an ultra-low power wake-up radio, power management and the low complexity algorithm implemented, our solution works perpetually while also achieving high accuracy. The onboard algorithm processes sensor data, extracts features, performs the classification needed for the blade’s wear detection, and sends the report wirelessly. Experimental results in a real-world deployment scenario demonstrate that its accuracy is comparable to state-of-the-art algorithms executed on a PC and show the energy-neutrality of the solution using a small thermoelectric generator to harvest energy. The impact of various low-power techniques implemented on the node is analyzed, highlighting the benefits of onboard processing, the nano-power wake-up radio, and the combination of harvesting and low power design. Finally, accurate in-field energy intake measurements, coupled with simulations, demonstrate that the proposed approach is energy autonomous and can work perpetually.
In this paper we present D.A.V.I.D.E. (Development for an Added Value Infrastructure Designed in Europe), an innovative and energy-efficient High Performance Computing cluster designed by E4 Computer Engineering for PRACE (Partnership for Advanced Computing in Europe). D.A.V.I.D.E. is built using best-in-class components (IBM's POWER8-NVLink CPUs, NVIDIA TESLA P100 GPUs, Mellanox InfiniBand EDR 100 Gb/s networking) plus custom hardware and innovative system middleware. D.A.V.I.D.E. features (i) a dedicated power monitor interface, built around the BeagleBone Black Board, that allows high-frequency sampling directly from the power backplane and scalable integration with the internal node telemetry and system-level power management software; (ii) a custom-built chassis, based on the OpenRack form factor, and liquid cooling that allows the system to be used in modern, energy-efficient datacenters; (iii) software components designed to enable fine-grained power monitoring, power management (i.e. power capping and energy-aware job scheduling) and application power profiling, based on dedicated machine learning components. Software APIs are offered to developers and users to tune the computing node performance and power consumption around the application requirements. The first pilot system, which we will deploy at the beginning of 2017, will demonstrate key HPC applications from different fields ported and optimized for this innovative platform.
On the race toward exascale supercomputing systems are facing imall, power and energy consumption fueled by the end of Dennard's scaling start to show their impact on limiting supercomputers peak In this paper we present and describe a new methodology based power and aggregation of them for fast analysis and visualization. We propose a turn-key system which uses MQTT communication technology to measure and control power and performance. This methodology is shown as an integrated feature of the D.A.V.I.D.E. supercomputing machine.
Solutions for accurate and fine-grained monitoring are at the basis of the growth of future large-scale green high performance computing (HPC) infrastructures. The capability of these systems to adapt to specific application requirements relies on sensing and correlating several distributed physical parameters with application phases and states. Meeting such requirements thus enables better use of resources, higher throughput and higher energy efficiency. As the capability of drawing such correlations relies on synchronization across a network of nodes and measuring devices, the use of synchronization protocols becomes critical. Novel low-cost embedded devices are starting to include hardware support for network synchronization protocols to achieve high-resolution time accuracy. These devices are promising for monitoring physical parameters of HPC infrastructures. In this paper we evaluate how the performance of two widely used network synchronization protocols, namely the Network Time Protocol and IEEE 1588, scales on a state-of-the-art embedded platform, namely a BeagleBone Black Board.
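To make the synchronization step concrete, the sketch below shows the standard four-timestamp arithmetic that the Network Time Protocol uses to estimate a clock offset and round-trip delay; the IEEE 1588 delay request-response mechanism relies on the same kind of calculation. It assumes a symmetric network path, and the function name and the timestamp values are hypothetical, chosen only for illustration.

```python
def clock_offset_and_delay(t1, t2, t3, t4):
    """Estimate clock offset and round-trip delay from one request/response.

    t1: request sent       (client clock)
    t2: request received   (server clock)
    t3: response sent      (server clock)
    t4: response received  (client clock)

    Assumes the forward and return path delays are equal.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    round_trip_delay = (t4 - t1) - (t3 - t2)
    return offset, round_trip_delay

# Hypothetical exchange, in microseconds: the server clock runs 5000 us
# ahead of the client, each network hop takes 2000 us, and the server
# spends 1000 us processing the request.
offset_us, delay_us = clock_offset_and_delay(t1=0, t2=7000, t3=8000, t4=5000)
```

Any asymmetry between the two path directions appears directly as an error in the estimated offset, which is one reason hardware timestamping close to the network interface matters on embedded platforms.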
The increasing use of Internet-of-Things (IoT) devices for monitoring a wide spectrum of applications, along with the challenges of the "big data" streaming support they often require for data analysis, is driving increased attention to the emerging edge computing paradigm. In particular, smart approaches to manage and analyze data directly on the network edge are increasingly investigated, and Artificial Intelligence (AI) powered edge computing is envisaged to be a promising direction. In this paper, we focus on Data Centers (DCs) and Supercomputers (SCs), where a new generation of high-resolution monitoring systems is being deployed, opening new opportunities for analysis such as anomaly detection and security, but introducing new challenges for handling the vast amount of data they produce. In detail, we report on a novel lightweight and scalable approach to increase the security of DCs / SCs that involves AI-powered edge computing on high-resolution power consumption. The method, called pAElla, targets real-time Malware Detection (MD); it runs on an out-of-band IoT-based monitoring system for DCs / SCs and involves the Power Spectral Density of power measurements, along with AutoEncoders. Results are promising, with an F1-score close to 1 and False Alarm and Malware Miss rates close to 0%. We compare our method with State-of-the-Art (SoA) MD techniques and show that, in the context of DCs / SCs, pAElla can cover a wider range of malware, significantly outperforming SoA approaches in terms of accuracy. Moreover, we propose a methodology for online training suitable for DCs / SCs in production, and release an open dataset and code.
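The feature-extraction idea mentioned in this abstract, turning a window of power measurements into a Power Spectral Density before passing it to a detector, can be sketched as below. This is an illustrative one-sided periodogram computed with a direct discrete Fourier transform; it is not claimed to be the estimator used by pAElla, and the sample rate, the synthetic trace, and all names are hypothetical.

```python
import cmath
import math

def periodogram(samples, sample_rate_hz):
    """One-sided periodogram estimate of the power spectral density.

    The mean is removed first so the constant (DC) power level does not
    dominate the spectrum. Returns (frequencies in Hz, PSD values).
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        # Direct DFT coefficient for frequency bin k (O(n^2), fine for a sketch).
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        power = abs(coeff) ** 2 / (sample_rate_hz * n)
        if 0 < k < n / 2:
            power *= 2  # fold the negative-frequency half into this bin
        freqs.append(k * sample_rate_hz / n)
        psd.append(power)
    return freqs, psd

# Hypothetical power trace: 100 W baseline with an 8 Hz periodic component,
# sampled at 64 Hz for one second.
SAMPLE_RATE_HZ = 64
trace = [100 + 10 * math.sin(2 * math.pi * 8 * i / SAMPLE_RATE_HZ)
         for i in range(64)]
freqs, psd = periodogram(trace, SAMPLE_RATE_HZ)
peak_hz = freqs[psd.index(max(psd))]
```

In a pipeline of the kind the abstract describes, a vector such as `psd` would serve as the input to an autoencoder trained on malware-free operation, with poorly reconstructed spectra treated as suspicious.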