2016
DOI: 10.1007/s11227-015-1613-7
A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

Abstract: Current compilers cannot generate code that competes with hand-tuned code in efficiency, even for a simple kernel like matrix-matrix multiplication. A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting the scheduling parameter values is a very difficult and time-consuming task, since the parameter values depend on each other; this is why they are found by using search methods and empirical techniques. To overcome thi…

Cited by 18 publications (10 citation statements)
References 63 publications
“…We note that previous studies [31], [32], [33], [34], [35], [36], [37], [38] have exploited tiling and autotuning for convolution and GEMM operations. However, these prior methods are inadequate for pointwise convolutions on GPUs due to two main drawbacks: they do not consider SM utilization when choosing the optimal tile size and are not designed for pointwise convolutions with small inputs.…”
Section: Optimizing Pointwise Convolution
confidence: 98%
“…However, loop interchange and blocking exploit data reuse and achieve much better performance than the basic and transposed methods, as shown in Figure 2(c) and Figure 2(d), respectively. MMM speedup has been the major goal of many studies [8], [11]–[15] and is still an active topic today. BLAS [13], [16] is a set of basic linear algebra subprograms that provides a standard blocking method for matrix multiplication.…”
Section: Related Work
confidence: 99%
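The blocking (loop tiling) technique mentioned in the excerpt above can be sketched as follows. This is a minimal illustration, not code from the paper; the matrix size N and tile size TILE are arbitrary illustrative values (real tuning depends on the cache hierarchy, which is the paper's subject):

```c
#include <assert.h>
#include <string.h>

#define N 64      /* matrix dimension (illustrative) */
#define TILE 16   /* tile size; must divide N in this simple sketch */

/* Naive triple loop: C = A * B. The full row of A and column of B
   are streamed for every output element, so reuse in cache is poor. */
static void mmm_naive(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Tiled (blocked) version: iterate over TILE x TILE sub-blocks so the
   working set of the inner kernel (one tile from each matrix) can stay
   resident in cache while it is reused. */
static void mmm_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i][k]; /* scalar reused across j */
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

Both routines compute the same product; the tiled version only reorders the iteration space to improve temporal locality, which is why the excerpt groups it with loop interchange as a data-reuse optimization.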
“…Many researchers have worked on high-performance implementations of MMM [4], [6], [7]. Some implementations target CPU platforms [8], [9], while others target graphics processing unit (GPU) platforms. Several software optimization techniques apply to both CPU and GPU implementations, such as instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP) [8].…”
Section: Introduction
confidence: 99%
“…Many research works, as well as ATLAS [51] (one of the state-of-the-art high-performance libraries), apply loop tiling by taking into account only the cache size: the accumulated size of three rectangular tiles (one from each matrix) must be smaller than or equal to the cache size. However, the elements of these tiles are not stored in consecutive main memory locations (the elements of each tile sub-row lie in different main memory locations), so they do not occupy consecutive data cache locations; with a set-associative cache, the three tiles therefore cannot simultaneously fit in the data cache due to the cache modulo effect. Moreover, even if the tile elements are stored in consecutive main memory locations (a different data array layout), the three tiles still cannot simultaneously fit in the data cache if the cache is two-way associative or direct mapped [52], [53]. Thus, loop tiling is efficient only when cache size, cache associativity, and data array layouts are addressed together as one problem, not separately.…”
Section: Loop Tiling and Data Array Layouts
confidence: 99%
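The cache-size-only tile criterion this excerpt attributes to ATLAS-style tiling can be made concrete with a small sketch. The helper name and the 32 KiB cache size below are illustrative assumptions, not values from the paper; the point of the excerpt is precisely that this criterion alone is too optimistic:

```c
#include <assert.h>

/* Hypothetical helper (not from the paper): the largest square tile
   size T, in elements, such that three T x T double-precision tiles --
   one each from A, B, and C -- fit together in a cache of cache_bytes.
   It deliberately ignores cache associativity and data array layout,
   the factors the cited work argues must be modeled jointly. */
static int max_tile_cache_only(long cache_bytes) {
    int t = 0;
    while (3L * (t + 1) * (t + 1) * (long)sizeof(double) <= cache_bytes)
        t++;
    return t;
}
```

For a 32 KiB data cache this bound gives T = 36 (3 × 36² × 8 = 31104 bytes ≤ 32768). Per the excerpt, such a T may still thrash a set-associative or direct-mapped cache, because tile sub-rows map to scattered cache sets (the cache modulo effect).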