2022
DOI: 10.1109/tpds.2021.3084813

Optimizing Depthwise Separable Convolution Operations on GPUs

Abstract: The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs), and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes, where a large number of samples is processed at once. Such approaches are inadequate for small-batch-sized model training and the typical scenario of model inference, where the model takes in a few samples…
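For context, the operation the abstract refers to splits a standard convolution into a per-channel (depthwise) convolution followed by a 1×1 (pointwise) convolution. A minimal PyTorch sketch of that decomposition is shown below; it only illustrates the operation being optimized, not the paper's own GPU kernels, and the channel counts and spatial size are arbitrary example values.

```python
# Minimal sketch of a depthwise separable convolution (illustration only;
# channel counts and input size are arbitrary, not taken from the paper).
import torch
import torch.nn as nn

in_channels, out_channels, kernel_size = 32, 64, 3

# Depthwise stage: one k x k filter per input channel (groups == in_channels).
depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                      padding=kernel_size // 2, groups=in_channels, bias=False)

# Pointwise stage: 1 x 1 filters mix the channels into the output channels.
pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

x = torch.randn(1, in_channels, 56, 56)   # batch size 1, as in the inference case
y = pointwise(depthwise(x))
print(y.shape)                            # torch.Size([1, 64, 56, 56])
```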

Cited by 37 publications (10 citation statements)
References 40 publications
“…As explained in section 1, depthwise-separable convolutions (DWSConv) [7] are a common design choice for reducing the computational cost of DL models. However, because DWSConv involves far fewer floating point operations than standard 2D convolutions, its execution time on a GPU is dominated by the memory access latency [15]. To overcome this bottleneck, existing implementations of DWSConv try to accelerate execution by using large batch sizes.…”
Section: Analysis and Discussion
confidence: 99%
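The reduced floating point cost mentioned above follows from the usual multiply-accumulate counts: for an H × W output with C_in input channels, C_out output channels and a k × k kernel, a standard convolution costs H·W·C_in·C_out·k², while the depthwise stage costs H·W·C_in·k² and the pointwise stage H·W·C_in·C_out, a reduction of roughly 1/C_out + 1/k². A small check with illustrative (assumed) layer sizes:

```python
# Multiply-accumulate counts for one layer; the sizes are illustrative only.
H, W = 56, 56                 # output spatial size
C_in, C_out, k = 32, 64, 3    # channels and kernel size

standard  = H * W * C_in * C_out * k * k   # standard 2D convolution
depthwise = H * W * C_in * k * k           # depthwise stage
pointwise = H * W * C_in * C_out           # 1x1 pointwise stage
separable = depthwise + pointwise

# The ratio is ~1/C_out + 1/k**2 (about 0.13 here), so far fewer operations,
# which is why memory access latency dominates the GPU execution time.
print(standard, separable, separable / standard)
```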
“…However, the runtime is longer than that of the original PSMNet. As mentioned in [47,48], the reason may be that the cuDNN library does not fully support depthwise convolutions and pointwise convolutions. On GPU platforms using the cuDNN library, classic convolutions are better optimized for end-to-end training.…”
Section: Discussion
confidence: 99%
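A rough way to check an observation like this is to time a standard convolution against its depthwise separable counterpart on the target GPU. The result depends heavily on the hardware, the cuDNN version and the batch size, so the sketch below (with arbitrary layer sizes) only shows the measurement pattern, not an expected outcome.

```python
# Rough timing comparison of a standard vs. a depthwise separable convolution.
# Layer sizes are arbitrary; results vary with GPU, cuDNN version and batch size.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
C, k = 128, 3
standard = nn.Conv2d(C, C, k, padding=1, bias=False).to(device)
separable = nn.Sequential(
    nn.Conv2d(C, C, k, padding=1, groups=C, bias=False),  # depthwise
    nn.Conv2d(C, C, 1, bias=False),                        # pointwise
).to(device)
x = torch.randn(1, C, 128, 128, device=device)

def bench(module, iters=100):
    with torch.no_grad():
        for _ in range(10):                 # warm-up iterations
            module(x)
        if device == "cuda":
            torch.cuda.synchronize()        # wait for queued GPU work
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print("standard :", bench(standard))
print("separable:", bench(separable))
```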
“…The feature mixing operation in equation 2 with the following pointwise activation function (i.e., ReLU) may offer a sufficient rank of the feature with the efficient operation. However, the inverted bottleneck and the variants below usually need a large expansion ratio ρ > 1 to secure the expressiveness (Sandler et al., 2018; Howard et al., 2019; Tan & Le, 2019), so the actual speed is hampered by the grouped operation that requires more optimization on GPU (Gibson et al., 2020; Lu et al., 2021).…”
Section: Efficient Building Blocks
confidence: 99%
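For reference, the inverted bottleneck discussed in that statement expands the channels by a ratio ρ > 1 with a 1×1 convolution, applies a depthwise (grouped) convolution in the expanded space, and projects back down. A minimal MobileNetV2-style sketch follows; ρ = 4 and the other sizes are arbitrary example values, and normalization layers are omitted for brevity.

```python
# Sketch of an inverted bottleneck block with expansion ratio rho > 1
# (1x1 expand -> depthwise 3x3 -> 1x1 project). Sizes are illustrative;
# the grouped depthwise stage is the part whose GPU efficiency is at issue.
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    def __init__(self, channels: int, rho: int = 4):
        super().__init__()
        hidden = channels * rho
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),              # expand
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),                    # depthwise
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),              # project
        )

    def forward(self, x):
        return x + self.block(x)            # residual connection

x = torch.randn(1, 64, 32, 32)
print(InvertedBottleneck(64)(x).shape)      # torch.Size([1, 64, 32, 32])
```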