2020
DOI: 10.1145/3378176
Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor

Abstract: We present a systematic methodology for optimizing batched matrix multiplications on the SW26010 many-core processor of the Sunway TaihuLight supercomputer. Five surrogate algorithms and a machine learning-based algorithm selector are proposed to fully exploit the computing capability of the SW26010 and to cope with the sophisticated algorithmic characteristics of batched matrix multiplications. Experimental results show that the algorithm selector is able to adaptively choose the appropriate algorithm for various matrix sha…
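To make the operation being optimized concrete: a batched matrix multiplication computes an independent product C[i] = A[i] · B[i] for every matrix pair in a batch, which is why small-matrix workloads benefit from specialized kernels. The sketch below is a minimal plain-Python illustration of the semantics only; the function name and structure are illustrative and have no relation to the paper's SW26010 kernels.

```python
def batched_matmul(As, Bs):
    """Illustrative batched matrix multiply: C[i] = A[i] @ B[i].

    As, Bs: lists of row-major matrices (lists of lists) of matching
    inner dimensions per pair. Returns the list of products.
    """
    Cs = []
    for A, B in zip(As, Bs):
        k, n = len(B), len(B[0])  # inner dimension and output columns
        C = [[sum(A[r][t] * B[t][c] for t in range(k)) for c in range(n)]
             for r in range(len(A))]
        Cs.append(C)
    return Cs
```

On real hardware the point is that each product is too small to saturate the machine on its own, so an optimized kernel processes the whole batch cooperatively rather than looping as above.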

Cited by 10 publications (2 citation statements)
References 44 publications (71 reference statements)
“…We note that previous studies [31], [32], [33], [34], [35], [36], [37], [38] have exploited tiling and autotuning for convolution and GEMM operations. However, these prior methods are inadequate for pointwise convolutions on GPUs due to two main drawbacks: they do not consider SM utilization when choosing the optimal tile size and are not designed for pointwise convolutions with small inputs.…”
Section: Optimizing Pointwise Convolution
confidence: 97%
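The tiling the cited studies autotune can be sketched in a few lines. This is a generic loop-tiled GEMM, not any of the cited implementations; the tile size TS is a hypothetical tunable parameter that, per the citing authors' critique, should be chosen with SM utilization in mind on a GPU rather than by tile-local heuristics alone.

```python
def tiled_matmul(A, B, TS=2):
    """Loop-tiled GEMM sketch: C = A @ B with TS x TS x TS blocking.

    Tiling reorders the triple loop so each block of A, B, and C is
    reused while it is hot in fast memory; behavior is identical to
    the untiled product.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, TS):
        for j0 in range(0, n, TS):
            for t0 in range(0, k, TS):
                for i in range(i0, min(i0 + TS, m)):
                    for j in range(j0, min(j0 + TS, n)):
                        for t in range(t0, min(t0 + TS, k)):
                            C[i][j] += A[i][t] * B[t][j]
    return C
```

Autotuners search over TS (and loop order); the critique quoted above is that for small pointwise-convolution inputs, the best tile by this local measure can still leave most SMs idle.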
“…This rank-3 formulation would open up interesting opportunities from a computational standpoint, since one can draw ideas from the advances on tensor algebra taking place within the deep learning community, where rank-3 tensors are at the core of formulations. Of particular interest are algorithmic developments for batched matrix multiplication kernels [37,1,25,20] and hardware innovations such as tensor cores [26] and tensor processing units [21]. Since this is outside the scope of this work, we omit a full discussion on it and reserve it for a future work.…”
Section: Rank-2 Formulation Analysis
confidence: 99%
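The computational opportunity the authors allude to is that a rank-3 formulation maps directly onto batched or flattened GEMM primitives. A minimal sketch of one such mapping, under the assumption of a shared right-hand operand W (all names here are illustrative, not from the cited work): the contraction C[b,i,j] = Σ_t T[b,i,t] · W[t,j] can be flattened into a single large 2-D product and reshaped back.

```python
def rank3_as_gemm(T, W):
    """Flatten a rank-3 contraction C[b,i,j] = sum_t T[b,i,t] * W[t,j]
    into one 2-D GEMM of shape (b*m, k) x (k, n), then reshape to (b, m, n).
    """
    flat = [row for mat in T for row in mat]  # stack batch rows: (b*m) x k
    k, n = len(W), len(W[0])
    out = [[sum(r[t] * W[t][c] for t in range(k)) for c in range(n)]
           for r in flat]
    m = len(T[0])  # rows per batch entry, used to reshape back
    return [out[b * m:(b + 1) * m] for b in range(len(T))]
```

This flattening is one reason rank-3 formulations can inherit the throughput of highly tuned GEMM libraries and of hardware such as tensor cores.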