A growing trend is to deploy deep learning algorithms in edge environments to mitigate the privacy and latency issues of cloud computing. Diverse edge deep learning accelerators have been devised to speed up the inference of deep learning algorithms on edge devices. These accelerators differ widely in power and performance characteristics, which makes it challenging to compare them efficiently and uniformly. In this paper, we introduce EDLAB, an end-to-end benchmark for evaluating the overall performance of edge deep learning accelerators. EDLAB consists of state-of-the-art deep learning models, a unified workload preprocessing and deployment framework, and a collection of comprehensive metrics. In addition, we propose parameterized models of the hardware performance bound so that EDLAB can identify the potential of the hardware and the hardware utilization of different deep learning applications. Finally, we employ EDLAB to benchmark three edge deep learning accelerators and analyze the benchmarking results. From this analysis we derive several observations that can guide the design of efficient deep learning applications.
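A common way to parameterize a hardware performance bound of this kind is a roofline-style model built from peak compute throughput and memory bandwidth. The sketch below illustrates the idea only; the function names, parameters, and numbers are illustrative assumptions, not EDLAB's actual API or measured data.

```python
# Hypothetical roofline-style performance bound, in the spirit of the
# parameterized hardware models described above; all names are illustrative.

def attainable_throughput(peak_ops_per_s: float,
                          mem_bandwidth_bytes_per_s: float,
                          arithmetic_intensity: float) -> float:
    """Upper bound on throughput (ops/s) for a workload with the given
    arithmetic intensity (ops per byte moved), per the roofline model."""
    return min(peak_ops_per_s, mem_bandwidth_bytes_per_s * arithmetic_intensity)

def hardware_utilization(measured_ops_per_s: float,
                         bound_ops_per_s: float) -> float:
    """Fraction of the modeled bound actually achieved by an application."""
    return measured_ops_per_s / bound_ops_per_s

# Example (made-up numbers): a 4 TOPS accelerator with 25 GB/s memory
# bandwidth running a layer with arithmetic intensity of 50 ops/byte.
bound = attainable_throughput(4e12, 25e9, 50.0)  # memory-bound: 1.25e12 ops/s
print(hardware_utilization(0.9e12, bound))       # 0.72 -> 72% utilization
```

Under such a model, a low utilization figure points to an application that is poorly matched to the accelerator rather than to a lack of raw hardware capability.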
Edge devices have been widely adopted to bring deep learning applications onto low-power embedded systems, mitigating the privacy and latency issues of accessing cloud servers. The increasing computational demand of complex neural network models, however, leads to high latency on edge devices with limited resources. Many application scenarios are real-time and impose strict latency constraints, while conventional neural network compression methods are not latency-oriented. In this work, we propose a novel training method for compact neural networks that reduces model latency on latency-critical edge systems. A latency predictor is introduced to guide and optimize this procedure. Coupled with the latency predictor, our method can guarantee the latency of a compact model with only one training process. Experimental results show that, compared to state-of-the-art model compression methods, our approach fits 'hard' latency constraints well, significantly reducing latency with only a mild accuracy drop. To satisfy a 34 ms latency constraint, we compact ResNet-50 with a 0.82% accuracy drop; for GoogLeNet, accuracy even increases by 0.3%.
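One common way to build such a latency predictor is a lookup table of per-layer latencies profiled on the target device, summed over a candidate architecture. The sketch below is a minimal illustration of that approach under this assumption; the table keys, entries, and 34 ms budget check are illustrative, not the paper's actual predictor.

```python
# Hypothetical lookup-table latency predictor of the kind that can guide
# latency-constrained compression; table contents are illustrative only.

# Measured on-device latency (ms) per layer configuration, keyed by
# (layer type, spatial size, channel count) -- populated by profiling
# each configuration once on the target edge device.
LATENCY_TABLE = {
    ("conv3x3", 56, 64): 1.8,
    ("conv3x3", 56, 32): 0.9,
    ("conv1x1", 56, 64): 0.5,
}

def predict_latency(layers) -> float:
    """Estimate end-to-end model latency as the sum of per-layer entries."""
    return sum(LATENCY_TABLE[layer] for layer in layers)

def meets_constraint(layers, budget_ms: float) -> bool:
    """Check a candidate compact architecture against a hard latency budget."""
    return predict_latency(layers) <= budget_ms

candidate = [("conv3x3", 56, 32), ("conv1x1", 56, 64)]
print(predict_latency(candidate))          # 1.4 (ms)
print(meets_constraint(candidate, 34.0))   # True
```

Because the predictor is cheap to evaluate, a training procedure can query it at every candidate compression decision, which is what makes a single-pass, latency-guaranteed training process feasible.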