Gangzhao Lu scite author profile

Depthwise separable convolution is common in convolutional neural networks (CNNs), where it is widely used to reduce the computational overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolution target model training with large batch sizes, where many samples are processed at once. Such approaches are inadequate for small-batch training and for the typical inference scenario, in which the model takes in only a few samples at a time. This paper aims to bridge that gap by optimizing depthwise separable convolutions for the GPU architecture. We design two novel algorithms that improve the column and row reuse of the convolution operation, reducing the number of memory operations performed along the width and height dimensions of the 2D convolution. Our approach employs a dynamic tile size scheme that adaptively distributes the computational data across GPU threads to improve GPU utilization and hide memory access latency. We apply our approach to two GPU platforms, an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and to two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2× (up to 3×) performance improvement over cuDNN. We show that, with a moderate batch size, our approach reduces the end-to-end training time of MobileNetV2 and EfficientNet-B0 by 9.7% and 7.3% on average, respectively, and reduces the end-to-end inference time of MobileNet and EfficientNet by 12.2% and 13.5%, respectively.
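
To make the claimed reduction in computation concrete, the following minimal sketch counts multiply-accumulate operations (MACs) for a standard 2D convolution and for its depthwise separable counterpart (a per-channel K×K depthwise step followed by a 1×1 pointwise step). It is not taken from the paper and says nothing about the authors' GPU algorithms; the function names and the example layer shape are hypothetical, chosen only for illustration.

```python
def standard_conv_macs(h_out, w_out, c_in, c_out, k):
    """MACs for a standard k x k convolution producing an h_out x w_out x c_out output."""
    return h_out * w_out * c_in * c_out * k * k


def depthwise_separable_macs(h_out, w_out, c_in, c_out, k):
    """MACs for a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h_out * w_out * c_in * k * k   # one k x k filter per input channel
    pointwise = h_out * w_out * c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, assumed for illustration only.
    shape = dict(h_out=56, w_out=56, c_in=64, c_out=128, k=3)
    std = standard_conv_macs(**shape)
    sep = depthwise_separable_macs(**shape)
    # The ratio works out to 1/c_out + 1/k**2 for any output size.
    print(f"standard: {std} MACs, separable: {sep} MACs, ratio: {sep / std:.4f}")
```

The ratio depends only on the kernel size and the number of output channels, which is why the saving grows with wider layers. Note that fewer arithmetic operations also means the operation tends to be bound by memory traffic, which is the aspect the abstract says the paper targets.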


Optimizing GPU Memory Transactions for Convolution Operations

Zhang, Wang, 2020

Performance modeling for MPI applications with low overhead fine-grained profiling

Zhang et al., 2019, Future Generation Computer Systems

Exploring large-scale small file storage for search engines

Zhang et al., 2015, J Supercomput



Gangzhao Lu

Fine-grained Powercap Allocation for Power-constrained Systems based on Multi-objective Machine Learning

Optimizing Depthwise Separable Convolution Operations on GPUs

