2018
DOI: 10.1007/978-3-319-99130-6_14
Efficient On-Line Error Detection and Mitigation for Deep Neural Network Accelerators

Cited by 33 publications (23 citation statements)
References 18 publications
“…In this work, we focus on soft errors that occur as bit-flips in the data path of a DNN accelerator. Our error model is in line with the work done in [7], [16], [42], [52]. One of the major causes of soft errors in modern hardware systems is strikes by high-energy particles, which cause the hardware to malfunction (for example, a bit flip).…”
Section: Error Injection Model (supporting)
confidence: 79%
“…Figure 1 shows an example of errors occurring in one variable with 8-bit depth, where the input (a single variable) is represented as a sequence of target bits and can be distorted at each bit position in a stochastic process. In our experiments, errors are injected into the output of every convolution layer in a DNN, similar to [27], [42], [43].…”
Section: Error Injection Model (mentioning)
confidence: 99%
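The bit-level stochastic distortion described in this citation statement can be sketched as follows. This is a hypothetical illustration, not the cited paper's implementation: the function name `inject_bit_flips`, the flip probability `p_flip`, and the example activation values are all assumptions.

```python
import random

def inject_bit_flips(value, bit_depth=8, p_flip=0.01, rng=random):
    """Stochastic bit-level error model (hypothetical sketch).

    Each of the `bit_depth` bits of an integer-encoded activation
    is flipped independently with probability `p_flip`.
    """
    for bit in range(bit_depth):
        if rng.random() < p_flip:
            value ^= (1 << bit)  # flip this bit via XOR
    return value

# Example: corrupt the (quantized) output of one convolution layer.
layer_output = [200, 13, 77, 5]  # hypothetical 8-bit activations
corrupted = [inject_bit_flips(v, p_flip=0.1) for v in layer_output]
```

In an experiment like the one quoted, such a function would be applied to the output tensor of every convolution layer before it feeds the next layer.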
“…The evaluations described in Section 5 highlight the fact that each individual performance evaluation technique is limited by a certain set of constraints and assumptions. By better understanding these, for example through techniques such as sensitivity analysis of feature maps (as described in our experiment), introspection methods [21,5], fault injection [24], and mutation testing [7], a combination of evidence may be assembled that provides a convincing argument that the performance requirements are met. Explicitly evaluating the machine learning approach and its performance evaluation measure against the set of claims defined in the assurance claim points leads to greater confidence that the performance requirements have been met.…”
Section: Extrapolation of Results (mentioning)
confidence: 99%
“…However, most of these mitigation techniques are based on redundancy, for example DMR (dual modular redundancy) [58] and TMR (triple modular redundancy) [35]. Redundancy-based approaches, although considered very effective in other application domains [19], are highly inefficient for DNN-based systems because of the compute-intensive nature of DNNs [48], and may incur significant area, power/energy, and performance overheads. Hence, a completely new set of resource-efficient reliability mechanisms is required for robust machine learning systems.…”
Section: Reliability Threats (mentioning)
confidence: 99%
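The TMR scheme mentioned above can be sketched generically: run the computation three times and take the majority result. This is a minimal illustration of classic triple modular redundancy, not the cited paper's mechanism; the function name `tmr_vote` and the replica outputs are assumptions.

```python
def tmr_vote(run):
    """Triple modular redundancy: execute `run` three times and
    return the majority result (generic sketch)."""
    a, b, c = run(), run(), run()
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: all three replicas disagree")

# Example: one faulty replica is outvoted by two correct ones.
results = iter([42, 42, 99])  # hypothetical replica outputs
print(tmr_vote(lambda: next(results)))  # -> 42
```

The roughly 3x compute cost of this voting scheme is exactly the overhead the quoted passage argues is prohibitive for compute-intensive DNN workloads.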