Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of GPU, Xeon Phi and FPGA

Giefers, Heiner; Staar, Peter; Bekas, Costas; Hagleitner, Christoph

doi:10.1109/ispass.2016.7482073

Cited by 24 publications

(18 citation statements)

References 39 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…However, their high performance comes at the cost of high power dissipation [2]. FPGAs offer opportunities for exploiting low-level fine-grained parallelism by customizing data paths to the requirements of a specific algorithm/application [3].…”

Section: Introductionmentioning

confidence: 99%

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

Qasaimeh

Denolf

et al. 2019

2019 IEEE International Conference on Embedded Software and Systems (ICESS)

134

View full text Add to dashboard Cite

Developing high performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g. multi-core CPUs, GPUs, and FPGAs), and their associated vendor optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To aid with determining which embedded platform is most suitable for their application, we conduct a comprehensive benchmark of the run-time performance and energy efficiency of a wide range of vision kernels. We discuss rationales for why a given underlying hardware architecture innately performs well or poorly based on the characteristics of a range of vision kernel categories. Specifically, our study is performed for three commonly used HW accelerators for embedded vision applications: ARM57 CPU, Jetson TX2 GPU and ZCU102 FPGA, using their vendor optimized vision libraries: OpenCV, VisionWorks and xfOpenCV. Our results show that the GPU achieves an energy/frame reduction ratio of 1.1-3.2× compared to the others for simple kernels. While for more complicated kernels and complete vision pipelines, the FPGA outperforms the others with energy/frame reduction ratios of 1.2-22.3×. It is also observed that the FPGA performs increasingly better as a vision application's pipeline complexity grows.

show abstract

Section: Introductionmentioning

confidence: 99%

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

Qasaimeh

Denolf

et al. 2019

2019 IEEE International Conference on Embedded Software and Systems (ICESS)

134

View full text Add to dashboard Cite

show abstract

“…We measured a vision kernel's dynamic power while excluding the static power required to power the rest of the platform. This better reflects the actual workload that is being deployed to the system since certainly for small kernels, the compute energy [4] (energy consumed for computation only) and data transfer energy are usually dominated by the static power. In the vision pipeline evaluation, we compared the performance of HW accelerators in terms of their energy delay products (EDP).…”

Section: Benchmarking Approachmentioning

confidence: 99%

“…However, their high performance comes at the cost of high power dissipation [3]. FPGAs offer opportunities for using low-level fine-grained parallelism by customizing processing/control units and data paths to the requirements of a specific algorithm or application [4].…”

Section: Introductionmentioning

confidence: 99%

“…Contributions. The main contributions of this paper are: (1) Benchmark representative vision kernels and complete pipelines using OpenCV, Visionworks and xfOpenCV libraries on the ARM57 CPU, Nvidia Jetson TX2 (GPU-accelerated) and Xilinx UltraScale (FPGA-accelerated), (2) Benchmark a set of five neural networks implementations using OpenCV DNN module, Nvidia TensorRT and Xilinx DPU, (3) Provide an insight into the reasons behind the observed run-time, power, and energy consumption performance for each evaluated platform and discuss rationales for why a given underlying hardware architecture innately performs well or poorly, and (4) Provide easily reproducible open-source benchmarking templates that only use publicly available vision libraries.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Benchmarking vision kernels and neural network inference accelerators on embedded platforms

Qasaimeh

Denolf

Khodamoradi

et al. 2021

Journal of Systems Architecture

View full text Add to dashboard Cite

Developing efficient embedded vision applications requires exploring various algorithmic optimization trade-offs and a broad spectrum of hardware architecture choices. This makes navigating the solution space and finding the design points with optimal performance trade-offs a challenge for developers. To help provide a fair baseline comparison, we conducted comprehensive benchmarks of accuracy, run-time, and energy efficiency of a wide range of vision kernels and neural networks on multiple embedded platforms: ARM57 CPU, Nvidia Jetson TX2 GPU and Xilinx ZCU102 FPGA. Each platform utilizes their optimized libraries for vision kernels (OpenCV, VisionWorks and xfOpenCV) and neural networks (OpenCV DNN, TensorRT and Xilinx DPU). For vision kernels, our results show that the GPU achieves an energy/ frame reduction ratio of 1.1-3.2 compared to the others for simple kernels. However, for more complicated kernels and complete vision pipelines, the FPGA outperforms the others with energy/frame reduction ratios of 1.2-22.3. For neural networks [Inception-v2 and ResNet-50, ResNet-18, Mobilenet-v2 and SqueezeNet], it shows that the FPGA achieves a speed up of [2.5, 2.1, 2.6, 2.9 and 2.5] and an EDP reduction ratio of [1.5, 1.1, 1.4, 2.4 and 1.7] compared to the GPU FP16 implementations, respectively.

show abstract

“…For the past years, Field-Programmable Gate Arrays (FPGAs) have been shown to be promising platforms to achieve computational FLOPS/Watt performance comparable to Graphics Processing Units (GPUs) while achieving higher energy efficiency (Mittal and Vetter 2014). Such energy efficiency makes FPGA promising platforms to accelerate the next generations of Neural Networks (Nurvitadhi et al 2017), Sparse Matrix Algebra (Giefers et al 2016), network applications (Nurvitadhi et al 2016), financial market applications (Schryver et al 2011), image processing (Fowers et al 2012), and data centres (Weerasinghe et al 2015). As a result, a common configuration is to have GPP acting as a host and FPGA as a hardware accelerator.…”

Section: Introductionmentioning

confidence: 99%

Fast Code Exploration for Pipeline Processing in FPGA Accelerators

Rosa¹

View full text Add to dashboard Cite

The increasing demand for energy efficient computing has endorsed the usage of Field-Programmable Gate Arrays to create hardware accelerators for large and complex codes. However, implementing such accelerators involve two complex decisions. The first one lies in deciding which code snippet is the best to create an accelerator, and the second one lies in how to implement the accelerator. When considering both decisions concomitantly, the problem becomes more complicated since the code snippet implementation affects the code snippet choice, creating a combined design space to be explored. As such, a fast design space exploration for the accelerators implementation is crucial to allow the exploration of different code snippets. However, such design space exploration suffers from several time-consuming tasks during the compilation and evaluation steps, making it not a viable option to the snippets exploration. In this work, we focus on the efficient implementation of pipelined hardware accelerators and present our contributions on speeding up the pipelines creation and their design space exploration. Towards loop pipelining, the proposed approaches achieve up to 100× speed-up when compared to the state-uf-the-art methods, leading to 164 hours saving in a full design space exploration with less than 1% impact in the final results quality. Towards design space exploration, the proposed methods achieve up to 9.5× speed-up, keeping less than 1% impact in the results quality.

show abstract

Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of GPU, Xeon Phi and FPGA

Cited by 24 publications

References 39 publications

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

Benchmarking vision kernels and neural network inference accelerators on embedded platforms

Fast Code Exploration for Pipeline Processing in FPGA Accelerators

Contact Info

Product

Resources

About