Multi-core devices are envisioned to support the development of next-generation safety-critical systems, enabling the on-chip integration of functions of different criticality. This integration provides multiple system-level potential benefits such as cost, size, power, and weight reduction. However, safety certification becomes a challenge and several fundamental safety technical requirements must be addressed, such as temporal and spatial independence, reliability, and diagnostic coverage. This survey provides a categorization and overview at different device abstraction levels (nanoscale, component, and device) of selected key research contributions that support the compliance with these fundamental safety requirements.
Deep learning technology has enabled the development of increasingly complex safety-related autonomous systems using high-performance computers, such as GPU, which provide the required high computing performance for the execution of parallel computing algorithms, such as matrix–matrix multiplications (a central computing element of deep learning software libraries). However, the safety certification of parallel computing software algorithms and GPU-based safety-related systems is a challenge to be addressed. For example, achieving the required fault-tolerance and diagnostic coverage for random hardware errors. This paper contributes with a safe matrix–matrix multiplication software implementation for GPUs with random hardware error-detection capabilities (permanent, transient) that can be used with different architectural patterns for fault-tolerance, and which serves as a foundation for the implementation of safe deep learning libraries for GPUs. The proposed contribution is complementary and can be combined with other techniques, such as algorithm-based fault tolerance. In particular, (i) we provide the high-performance matrix multiplication CUTLASS library with a catalog of diagnostic mechanisms to detect random hardware errors down to the arithmetic operation level; and (ii) we measure the performance impact incurred by the adoption of these mechanisms and their achievable diagnostic coverage with a set of representative matrix dimensions. To that end, we implement these algebraic operations, targeting CUDA cores with single instructions and multiple-thread math instructions in an NVIDIA Xavier NX GPU.
With the ever increasing industrial demand for bigger, faster and more efficient systems, a growing number of cores is integrated on a single chip. Additionally, their performance is further maximized by simultaneously executing as many processes as possible without regarding their criticality. Even safety critical domains like railway and avionics apply these paradigms under strict certification regulations. As the number of cores is continuously expanding, the importance of cost-effectiveness grows. One way to increase the cost-efficiency of such System on Chip (SoC) is to enhance the way the SoC handles its power resources. By increasing the power efficiency, the reliability of the SoC is raised because the lifetime of the battery lengthens. Secondly, by having less energy consumed, the emitted heat is reduced in the SoC which translates into fewer cooling devices. Though energy efficiency has been thoroughly researched, there is no application of those power saving methods in safety critical domains yet. The EU project SAFEPOWER 1 targets this research gap and aims to introduce certifiable methods to improve the power efficiency of mixed-criticality real-time systems (MCRTES). This article will introduce the requirements that a power efficient SoC has to meet and the challenges such a SoC has to overcome.
Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems such as autonomous driving systems. However, the integration of complex, parallel and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources contributes to several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices’ random hardware failures, systematic failures and independence of execution.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.