2D image convolution is ubiquitous in image processing and computer vision problems such as feature extraction. Exploiting parallelism is a common strategy for accelerating convolution. Parallel processors keep getting faster, but algorithms such as image convolution remain memory-bound on parallel processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating image convolution. To reduce memory communication, we reorganize the convolution algorithm to prefetch image regions into registers, and we do more work per thread with fewer threads. To enable portability to future architectures, we implement a convolution autotuner that sweeps the design space of memory layouts and loop unrolling configurations. We focus on convolution with small filters (2x2 to 7x7), but our techniques can be extended to larger filter sizes. Depending on filter size, our speedups on two NVIDIA architectures range from 1.2x to 4.5x over state-of-the-art GPU libraries.

Index Terms: Convolution, parallel, GPU, autotuning
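To make the register-prefetching and thread-coarsening ideas concrete, the following is a minimal CUDA sketch of a valid (unpadded) convolution in which each thread computes a small strip of output pixels from an input window loaded into registers once. The kernel name, tile width, and constant-memory filter setup are illustrative assumptions for a 3x3 filter, not our tuned implementation.

    // Minimal sketch of register prefetching with thread coarsening,
    // assuming the 3x3 filter has been copied into constant memory
    // with cudaMemcpyToSymbol; names and tile sizes are illustrative.
    #include <cuda_runtime.h>

    #define K    3   // filter width and height (small, nonseparable)
    #define TILE 4   // output pixels computed per thread along x

    __constant__ float d_filter[K * K];

    __global__ void conv_regtile(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int width, int height)
    {
        // Each thread owns a 1 x TILE strip of output pixels.
        int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;
        int y  =  blockIdx.y * blockDim.y + threadIdx.y;
        int outW = width - K + 1, outH = height - K + 1;
        if (x0 >= outW || y >= outH) return;

        // Prefetch the K x (TILE + K - 1) input window into registers once;
        // full unrolling keeps the array in registers, not local memory.
        float win[K][TILE + K - 1];
        #pragma unroll
        for (int r = 0; r < K; ++r)
            #pragma unroll
            for (int c = 0; c < TILE + K - 1; ++c)
                win[r][c] = (x0 + c < width) ? in[(y + r) * width + (x0 + c)]
                                             : 0.0f;

        // Neighboring outputs reuse overlapping pixels already resident in
        // registers, reducing global-memory traffic per output pixel.
        #pragma unroll
        for (int t = 0; t < TILE; ++t) {
            if (x0 + t >= outW) return;
            float acc = 0.0f;
            #pragma unroll
            for (int r = 0; r < K; ++r)
                #pragma unroll
                for (int c = 0; c < K; ++c)
                    acc += win[r][t + c] * d_filter[r * K + c];
            out[y * outW + (x0 + t)] = acc;
        }
    }

Parameters such as TILE and the unroll factors are the kind of knobs an autotuner can sweep to fit a given architecture's register file and memory system.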
INTRODUCTION

Convolution is a key component in most algorithms for feature extraction, image segmentation, object tracking, and object recognition. In a recent "periodic table" of the fifteen most recurring computational patterns in the image processing and computer vision literature, convolution ranked as the most ubiquitous, followed by histogram accumulation, vector distance, and quadratic optimization [1]. Our work focuses on image convolution with small nonseparable filters (2x2 to 7x7), which are extremely common for edge detection, feature extraction [2], and difference of Gaussians [3].

The computer architecture community has developed many-threaded processors that offer tremendous boosts in peak FLOP/s over traditional single-core CPUs. However, improvements to memory bandwidth and latency have lagged behind the improvements to the processors themselves. As a result, the performance of convolution and other algorithms with low computational complexity tends to be limited by memory bandwidth, much like trying to drink a thick milkshake through a narrow straw.

Parallel processors keep getting faster, but algorithms like convolution remain memory-bound on these architectures. The solution is to redesign algorithms with the goal of minimizing communication among off-chip memory, on-chip shared memory, and registers. On a variety of parallel architectures, reducing and optimizing memory and interprocess communication has accelerated memory-bound problems in linear algebra [4] and graph traversal [5] by as much as an