The convolution layer is the key building block in many neural network designs. Most high-performance implementations of the convolution operation rely on GEMM (General Matrix Multiplication) to achieve high computational throughput on large workloads. In mobile environments, however, user experience demands low-latency inference on a single or small batch. This exposes two major problems in current GEMM-based solutions: 1) they must map the convolution operation onto GEMM, incurring overhead in both computation and memory; 2) they forfeit substantial data-reuse opportunities during this mapping, leading to under-utilization of the given hardware. Through an in-depth analysis of current GEMM-based solutions, we identify the root cause of these problems and propose mGEMM, a convolution solution that overcomes them without any loss of accuracy. mGEMM expands the structure of GEMM so that it can accommodate the convolution operation without overhead, whereas existing algorithms suffer inefficiencies in converting the convolution operation to a static GEMM algorithm. Our extensive evaluations across various neural networks and test devices show that mGEMM outperforms existing solutions in latency, memory overhead, and energy consumption. Running a real-world application, YoloV3-Tiny object detection, mGEMM achieves up to 1.29× and 1.58× speedups in total latency and convolution latency over the state-of-the-art, resulting in a 15.5% reduction in energy consumption while using only near-minimum heap memory.
CCS CONCEPTS: • Computing methodologies → Parallel algorithms.
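The mapping overhead the abstract refers to can be illustrated with the standard im2col-plus-GEMM approach (a generic sketch of the technique the paper critiques, not mGEMM itself; all shapes and names here are illustrative):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll a (C, H, W) input into a (C*kh*kw, out_h*out_w) patch matrix.

    Overlapping patches duplicate input pixels into multiple columns: this
    duplication is the extra memory and copy work of GEMM-based convolution.
    """
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, idx] = x[:, i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols

def conv2d_gemm(x, weights):
    """Convolution as one GEMM: (M, C*kh*kw) @ (C*kh*kw, out_h*out_w)."""
    m, c, kh, kw = weights.shape
    cols = im2col(x, kh, kw)
    out = weights.reshape(m, -1) @ cols
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return out.reshape(m, out_h, out_w)

x = np.random.rand(3, 8, 8).astype(np.float32)   # C=3, H=W=8
w = np.random.rand(4, 3, 3, 3).astype(np.float32) # M=4, 3x3 kernels
y = conv2d_gemm(x, w)
print(y.shape)                        # (4, 6, 6)
# The unrolled matrix holds every pixel once per overlapping patch:
print(im2col(x, 3, 3).size / x.size)  # 5.0625, i.e. >5x the input memory
```

The ratio grows toward kh·kw for larger inputs, which is why a small-batch mobile inference pays a disproportionate price for this mapping.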
DVFS (dynamic voltage and frequency scaling) is a system-level technique that adjusts the voltage and frequency levels of the CPU and GPU at runtime to balance energy efficiency against performance. DVFS has been studied for many years, but realizing a DVFS that performs ideally on mobile devices is still considered challenging for two main reasons: i) the optimal power-budget split between CPU and GPU on a power-constrained platform can only be defined by application performance, yet conventional DVFS implementations are mostly application-agnostic; ii) mobile platforms experience dynamic thermal environments for many reasons, such as mobility and how the device is held, yet conventional implementations do not adapt to such environmental changes. In this work, we propose zTT, a deep reinforcement learning-based frequency scaling technique. zTT learns the characteristics of the thermal environment and jointly scales CPU and GPU frequencies to maximize application performance in an energy-efficient manner while achieving zero thermal throttling. Our evaluations of zTT, implemented on the Google Pixel 3a and NVIDIA Jetson TX2 platforms with various applications, show that zTT adapts quickly to changing thermal environments, consistently delivering high application performance with energy efficiency. In a high-temperature environment where a rendering application under the default mobile DVFS fails to sustain the target frame rate, zTT succeeds while consuming 23.9% less power on average.
CCS CONCEPTS: • Human-centered computing → Ubiquitous and mobile computing systems and tools; • Software and its engineering → Power management.
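The core idea of learning a frequency-scaling policy under a thermal constraint can be sketched with tabular Q-learning on a toy simulator. This is not zTT's implementation (zTT uses deep RL on real hardware); the thermal model, frequency levels, and reward constants below are invented for illustration:

```python
import random

# Hypothetical frequency levels (GHz) and targets, for illustration only.
FREQS = [0.8, 1.2, 1.6, 2.0]
TARGET_FPS, TEMP_LIMIT = 30.0, 70.0

def step(temp, f):
    """Toy simulator: higher frequency gives more FPS but more heat."""
    fps = 20.0 * f                    # performance scales with frequency
    temp = 0.9 * temp + 4.0 * f       # first-order thermal model
    # Reward echoes zTT's objective: meet the FPS target, penalize
    # crossing the thermal limit (where throttling would kick in).
    reward = min(fps, TARGET_FPS) - (10.0 if temp > TEMP_LIMIT else 0.0)
    return temp, fps, reward

def train(episodes=500, alpha=0.3, gamma=0.9, eps=0.1):
    """Epsilon-greedy Q-learning over (temperature bucket, frequency)."""
    q = {}
    for _ in range(episodes):
        temp = 40.0
        for _ in range(50):
            s = int(temp // 10)       # discretize temperature into buckets
            if random.random() < eps:
                a = random.randrange(len(FREQS))
            else:
                a = max(range(len(FREQS)), key=lambda b: q.get((s, b), 0.0))
            temp, _, r = step(temp, FREQS[a])
            s2 = int(temp // 10)
            best = max(q.get((s2, b), 0.0) for b in range(len(FREQS)))
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + gamma * best - old)
    return q

q = train()
```

In this toy model the 1.6 GHz level is the sweet spot (it meets the FPS target with a steady-state temperature below the limit), and the learned Q-values come to prefer it over the lowest level.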
With the advent of mobile processors integrating CPU and GPU, high-performance tasks such as deep learning, gaming, and image processing now run on mobile devices. Fully exploiting these processors requires utilizing their processing capability as much as possible, which is challenging because mobile users are sensitive to battery consumption and device temperature. Many researchers have studied techniques for energy-efficient operation of mobile processors, mostly aimed at keeping temperature and power consumption below predefined thresholds. DVFS (Dynamic Voltage and Frequency Scaling) is a technique that reduces heat generation and power consumption in the circuit by adjusting CPU or GPU voltage-frequency levels at runtime. To maximize its benefits, many DVFS techniques have been developed for mobile processors. Still, it is challenging to implement a DVFS that performs ideally for mobile devices, and there are several reasons behind this difficulty.
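Why lowering voltage and frequency together saves so much power follows from the standard dynamic CMOS power model, P = C·V²·f. A minimal sketch with illustrative operating points (the voltages and frequencies below are not from any specific SoC):

```python
def dynamic_power(c_eff, voltage, freq):
    """Dynamic CMOS switching power: P = C * V^2 * f."""
    return c_eff * voltage ** 2 * freq

# Two hypothetical operating points; lower frequency permits lower voltage.
high = dynamic_power(1.0, 1.10, 2.0)   # 2.0 GHz at 1.10 V
low = dynamic_power(1.0, 0.85, 1.2)    # 1.2 GHz at 0.85 V
print(f"power reduction: {1 - low / high:.0%}")  # → power reduction: 64%
```

Because voltage enters quadratically, a 40% frequency cut yields roughly a 64% power cut in this example, which is the superlinear saving DVFS exploits.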