Luke Kljucaric scite author profile

As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum app acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine-learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, there are many apps that are not well represented by these standards that require different workloads, such as ML models and datasets, to achieve similar goals. Additionally, many apps, like real-time video processing, are focused on latency of computations rather than strictly on throughput. This research analyzes multiple compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency and maximum throughput for optical character recognition. Considering these models are comprised of fundamental neural-network operations yet architecturally different from each other, these models can stress devices in different, yet insightful, ways that generalizations of the performance of other models can be drawn from. Many devices featuring ML-specific hardware and optimizations are analyzed including Intel and AMD CPUs, Xilinx and Intel FPGAs, NVIDIA GPUs, and Google TPUs. Overall, ML-oriented hardware added to the Intel Xeon CPUs helps to boost throughput by 3.7 × and to reduce latency by up to 34.7 ×, which makes the latency of Intel Xeon CPUs competitive on more parallel models. The TPU devices were limited in terms of throughput due to large data-transfer times and not competitive in terms of latency. The FPGA frameworks showcase the lowest latency on the Xilinx Alveo U200 FPGA achieving 0.48 ms on AlexNet using Mipsology Zebra and 0.39 ms on GoogLeNet using Vitis-AI. Through their custom acceleration datapaths coupled with high-performance SRAM, the FPGAs are able to keep critical model data closer to processing elements for lower latency. The massively parallel and high-memory GPU devices with Tensor Core accelerators achieve the best throughput. The NVIDIA Tesla A100 GPU showcases the highest throughput at 42,513 and 52,484 images/second for AlexNet and GoogLeNet, respectively. 1

show abstract

Deep-Learning Inferencing with High-Performance Hardware Accelerators

Kljucaric

George

2019

View full text Add to dashboard Cite

In order to improve their performance-per-watt capabilities over general-purpose architectures, FPGAs are commonly employed to accelerate applications. With the exponential growth of available data, machine-learning apps have generated greater interest in order to better understand that data and increase autonomous processing. As FPGAs become more readily available through cloud services like Amazon Web Services F1 platform, it is worth studying the performance of accelerating machine-learning apps on FPGAs over traditional fixed-logic devices, like CPUs and GPUs. FPGA frameworks for accelerating convolutional neural networks, which are used in many machine-learning apps, have started emerging for accelerated-application development. This thesis aims to compare the performance of these emerging frameworks on two commonly-used convolutional neural networks, GoogLeNet and AlexNet. Specifically, handwritten Chinese character recognition is benchmarked across multiple currently available FPGA frameworks on Xilinx and Intel FPGAs and compared against multiple CPU and GPU architectures featured on AWS, Google's Cloud platform, the University of Pittsburgh's Center for Research Computing (CRC), and Intel's vLab Academic Cluster. All NVIDIA GPUs have proven to have the best performance over every other device in this study. The Zebra framework available for Xilinx FPGAs showed to have an average 8.3× and 9.3× better performance and efficiency, respectively, over the OpenVINO framework available for Intel FPGAs. Although the Zebra framework on the Xilinx VU9P showed better efficiency than the Pascal-based GPUs, the

show abstract

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis

Callanan

Kljucaric

George³

2021

View full text Add to dashboard Cite

Clustering Classification on FPGAs for Neuromorphic Feature Extraction

Kljucaric

George

2023

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Luke Kljucaric

Architectural Analysis of Deep Learning on Edge Accelerators

Deep Learning Inferencing with High-performance Hardware Accelerators

Deep-Learning Inferencing with High-Performance Hardware Accelerators

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis

Clustering Classification on FPGAs for Neuromorphic Feature Extraction

Contact Info

Product

Resources

About