To reduce energy consumption, embedded systems can be operated at sub-nominal conditions (e.g., reduced voltage or a lowered eDRAM refresh rate) that may introduce bit errors in their memories. These errors can corrupt the stored values of CNN weights and activations, compromising accuracy. In this paper, we introduce Embedded Ensemble CNNs (E²CNNs), an architectural design methodology for conceiving ensembles of convolutional neural networks that are more robust to memory errors than a single-instance network. Ensembles of CNNs have previously been proposed to increase accuracy, at the cost of replicating similar or different architectures. Unfortunately, state-of-the-art ensembles are ill-suited to embedded systems, where memory and processing constraints limit the number of deployable models. Our proposed architecture overcomes that limitation by applying state-of-the-art compression methods to produce an ensemble with the same memory requirements as the original architecture, but with improved error robustness. As part of the E²CNNs design methodology, we then propose a heuristic that automates the design of the voter-based ensemble architecture, maximizing accuracy for the expected memory error rate while bounding the design effort. To evaluate the robustness of E²CNNs for different error types and densities, and their ability to achieve energy savings, we propose three error models that simulate the behavior of SRAM and eDRAM operating at sub-nominal conditions. Our results show that E²CNNs achieve energy savings of up to 80% for LeNet-5, 90% for AlexNet, 60% for GoogLeNet, 60% for MobileNet, and 60% for an optimized industrial CNN, while minimizing the impact on accuracy. Furthermore, the memory size can be decreased by up to 54% by reducing the number of ensemble members, with a smaller impact on the original accuracy than pruning alone achieves.
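To make the voter-based combination concrete, the sketch below shows how an ensemble of compressed members could be combined by majority vote at inference time. It is a minimal illustration assuming PyTorch models; `majority_vote` and `ensemble_predict` are hypothetical names, not the authors' actual E²CNNs implementation.

```python
# Minimal sketch of a voter-based ensemble, assuming PyTorch models.
# Function names are illustrative; not the authors' implementation.
import torch

def majority_vote(logits_list):
    """Each member votes for its top-1 class; return the per-sample mode."""
    votes = torch.stack([logits.argmax(dim=1) for logits in logits_list])
    return votes.mode(dim=0).values  # most frequent class label per sample

def ensemble_predict(members, x):
    """Run every compressed member on the same (possibly corrupted) input
    and combine their predictions by vote."""
    with torch.no_grad():
        return majority_vote([member(x) for member in members])
```

Because each member holds an independently corrupted copy of the weights, a bit error that misleads one member is likely to be outvoted by the others.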
Inference using Convolutional Neural Networks (CNNs) is resource and energy intensive. Therefore, executing CNNs on highly constrained edge devices demands careful co-optimization of algorithms and hardware. Addressing this challenge, in this paper we present a flexible In-Memory Computing (IMC) architecture and circuit, able to scale data representations to varying bitwidths at run-time, while ensuring a high level of parallelism and requiring little area. Moreover, we introduce a novel optimization heuristic, which tailors the quantization level of each CNN layer according to workload and robustness considerations. We investigate the performance, accuracy, and energy requirements of our co-design approach on CNNs of varying sizes, obtaining up to 76.2% increases in efficiency and up to 75.6% reductions in run-time with respect to fixed-bitwidth alternatives, with negligible accuracy degradation.
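For intuition on per-layer bitwidth tailoring, the sketch below applies uniform symmetric quantization with a different bit-width per layer. This is a generic software emulation, assuming NumPy tensors; the `layer_bits` assignment is a hypothetical placeholder for what the paper's optimization heuristic would produce, and it does not model the IMC circuit itself.

```python
# Generic uniform symmetric quantization, assuming NumPy tensors.
# The per-layer bitwidths are hypothetical placeholders for the
# assignment the optimization heuristic would produce.
import numpy as np

def fake_quantize(x, bits):
    """Quantize x to `bits` bits and dequantize back (simulated precision)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

layer_bits = {"conv1": 8, "conv2": 6, "fc": 4}  # hypothetical assignment
weights = {"conv1": np.random.randn(16, 3, 3, 3),
           "conv2": np.random.randn(32, 16, 3, 3),
           "fc": np.random.randn(10, 128)}
quantized = {name: fake_quantize(w, layer_bits[name])
             for name, w in weights.items()}
```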
Energy consumption is a significant obstacle to integrating deep learning into edge devices. Two common techniques to curb it are quantization, which reduces the size of the memories (static energy) and the number of accesses (dynamic energy), and voltage scaling. However, static random access memories (SRAMs) are prone to failures when operating at sub-nominal voltages, potentially introducing errors into computations. In this paper, we first analyze the resilience of artificial intelligence (AI) based methods for edge devices, in particular convolutional neural networks (CNNs), to SRAM errors when operating at reduced voltages. Then, we compare the relative energy savings introduced by quantization and voltage scaling, both separately and together. Our experiments with an industrial use case confirm that CNNs are quite resilient to bit errors in the model, particularly for fixed-point implementations (5.7% accuracy loss at an error rate of 0.0065 errors per bit). Quantization alone can lead to savings of up to 61.3% in the dynamic energy consumption of the memory subsystem, with an additional reduction of up to 11.0% introduced by voltage scaling; all at the price of a 13.6% loss in accuracy.
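A generic way to emulate such SRAM faults in software is to flip each stored bit independently with the measured error probability; the sketch below does this for an 8-bit fixed-point weight array. It is a simplified fault model, assuming NumPy and int8 storage, not the paper's exact error-injection methodology.

```python
# Generic bit-flip fault model for int8 fixed-point weights (NumPy).
# Simplified illustration; not the paper's exact error-injection setup.
import numpy as np

def inject_bit_errors(weights_q, p_bit, rng=None):
    """Flip each bit of an int8 weight array independently with
    probability p_bit (errors per bit), mimicking an SRAM operating
    at a sub-nominal voltage."""
    rng = np.random.default_rng() if rng is None else rng
    raw = weights_q.view(np.uint8).copy()
    flips = rng.random((raw.size, 8)) < p_bit           # which bits flip
    masks = (flips * (1 << np.arange(8))).sum(axis=1).astype(np.uint8)
    return (raw.reshape(-1) ^ masks).reshape(raw.shape).view(np.int8)

# Example at the error rate reported in the abstract:
q = np.random.randint(-128, 128, size=(64, 64), dtype=np.int8)
noisy = inject_bit_errors(q, p_bit=0.0065)
```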
The increasing size of Convolutional Neural Networks (CNNs) and the high computational workload required for inference pose major challenges for their deployment on resource-constrained edge devices. In this paper, we address these challenges by proposing a novel In-Memory Computing (IMC) architecture. Our IMC strategy efficiently performs arithmetic operations based on bitline computing, enabling a high degree of parallelism while reducing energy-costly data transfers. Moreover, it features a hybrid memory structure, in which a portion of each subarray, dedicated to storing CNN weights, is implemented as high-density, zero-standby-power Resistive RAM. Finally, it exploits an innovative method for storing quantized weights based on their value, named Weight Data Mapping (WDM), which further increases efficiency. Compared to state-of-the-art IMC alternatives, our solution provides up to 93% improvements in energy efficiency and up to 6x lower run-time when performing inference on the MobileNet and AlexNet neural networks.
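To illustrate the principle behind bitline computing, the sketch below emulates a bit-serial dot product: bit-planes of activations and weights are ANDed (the operation performed on the bitlines), set bits are counted, and each count is weighted by its combined bit significance. It is a behavioral sketch of the general technique for unsigned operands, not a model of the proposed circuit or of WDM.

```python
# Behavioral sketch of bitline-style bit-serial dot products (NumPy).
# Illustrates the general technique for unsigned operands; not a model
# of the proposed circuit or of Weight Data Mapping.
import numpy as np

def to_bitplanes(x, bits):
    """Decompose an unsigned integer vector into its bit-planes."""
    return [(x >> b) & 1 for b in range(bits)]

def bitline_dot(a_planes, w_planes):
    """AND each pair of bit-planes (the in-memory bitline operation),
    popcount the result, and weight it by the combined significance."""
    acc = 0
    for i, a in enumerate(a_planes):
        for j, w in enumerate(w_planes):
            acc += int(np.sum(a & w)) << (i + j)
    return acc

a = np.array([3, 5, 2], dtype=np.uint8)
w = np.array([1, 2, 4], dtype=np.uint8)
assert bitline_dot(to_bitplanes(a, 4), to_bitplanes(w, 4)) == 21  # 3+10+8
```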