2012 IEEE 26th International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2012.11
A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers

Cited by 40 publications (22 citation statements). References 11 publications.
“…In Table 2, we empirically benchmark the bandwidth of the global memory and shared memory, again using benchmarks described in [10]. Our global memory bandwidth results are for memory accesses with unit stride: adjacent threads access adjacent global memory addresses.…”
Section: Benchmarking the Memory Hierarchy
confidence: 99%
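To illustrate the kind of measurement this quote refers to, here is a minimal CUDA sketch of a unit-stride global-memory bandwidth test. It is not the benchmark suite of [10]; the kernel name, array size, and timing setup are illustrative assumptions.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread copies one element; thread i touches element i, so
// adjacent threads access adjacent global addresses (unit stride).
__global__ void copy_unit_stride(const float* __restrict__ in,
                                 float* __restrict__ out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1 << 26;                 // assumed working-set size
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    dim3 block(256), grid((n + 255) / 256);
    cudaEventRecord(start);
    copy_unit_stride<<<grid, block>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write per element.
    double gb = 2.0 * n * sizeof(float) / 1e9;
    printf("Effective bandwidth: %.1f GB/s\n", gb / (ms / 1e3));

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

A strided variant (thread i touching element i*stride) would show the bandwidth penalty of uncoalesced access that the unit-stride figures avoid.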
“…Moreover, there are good reasons to believe that neither improved compiler technology nor autotuning will make any significant headway on this problem. This lack of coverage by current library infrastructure is especially alarming because of the number of applications from important fields that fit this profile, including deep learning [8], data mining [31], astrophysics [23], image and signal processing [4], [24], hydrodynamics [10], quantum chemistry [5], and computational fluid dynamics (CFD) and the resulting partial differential equations (PDEs) through direct and multifrontal solvers [42], to name a few. Dramatically better performance on these applications can be achieved by using software that can repetitively execute small matrix/tensor operations grouped together in "batches."…”
Section: Introduction
confidence: 99%
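The "batches" the quote describes map directly onto batched BLAS interfaces. The following hedged sketch groups many small dgemm operations into one cublasDgemmBatched call; the matrix size and batch count are assumptions for illustration, and error checking is elided.

#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 32, batch = 10000;          // assumed problem sizes
    const double alpha = 1.0, beta = 0.0;

    // One contiguous slab per operand, sliced into `batch` matrices.
    double *A, *B, *C;
    cudaMalloc(&A, (size_t)batch * m * m * sizeof(double));
    cudaMalloc(&B, (size_t)batch * m * m * sizeof(double));
    cudaMalloc(&C, (size_t)batch * m * m * sizeof(double));

    std::vector<const double*> hA(batch), hB(batch);
    std::vector<double*> hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = A + (size_t)i * m * m;
        hB[i] = B + (size_t)i * m * m;
        hC[i] = C + (size_t)i * m * m;
    }
    const double **dA; const double **dB; double **dC;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMalloc(&dB, batch * sizeof(double*));
    cudaMalloc(&dC, batch * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    // One launch performs all `batch` small GEMMs: C_i = A_i * B_i.
    cublasDgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, m, m, m,
                       &alpha, dA, m, dB, m, &beta, dC, m, batch);
    cublasDestroy(h);
    return 0;
}

Launching one 32x32 dgemm per kernel would leave the GPU almost idle; amortizing launch overhead across the whole batch is the point of the interface.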
“…Also, in combustion and astrophysics supernova applications [6], [7], [17], [23], [32], the study of thermonuclear reaction networks (the XNet package) requires the solution of many sparse linear systems of around 150 × 150. Furthermore, the need for batched routines can be illustrated in radar signal processing [4], where a batch of 200 × 200 QR decompositions is needed, as well as in hydrodynamic simulations [10], where thousands of matrix-matrix and matrix-vector (GEMV) products of matrices of around 100 × 100 are needed.…”
Section: Introduction
confidence: 99%
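For the many ~150 × 150 systems the quote mentions, one existing batched building block is the cuBLAS batched LU factorization. A hedged sketch follows; note the XNet systems are sparse, while cublasDgetrfBatched is dense, so this stands in only as an illustration of the batched-solver pattern, with assumed sizes.

#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 150, batch = 4096;          // assumed sizes
    double *slab;
    cudaMalloc(&slab, (size_t)batch * n * n * sizeof(double));

    // Device array of pointers, one per matrix in the batch.
    std::vector<double*> hA(batch);
    for (int i = 0; i < batch; ++i) hA[i] = slab + (size_t)i * n * n;
    double **dA;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    int *d_pivots, *d_infos;                  // per-matrix pivots and status
    cudaMalloc(&d_pivots, (size_t)batch * n * sizeof(int));
    cudaMalloc(&d_infos, batch * sizeof(int));

    cublasHandle_t h;
    cublasCreate(&h);
    // One call LU-factorizes all `batch` matrices in place.
    cublasDgetrfBatched(h, n, dA, n, d_pivots, d_infos, batch);
    // cublasDgetrsBatched(...) would then solve against the factors.
    cublasDestroy(h);
    return 0;
}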
“…In magnetic resonance imaging (MRI), billions of small 8x8 and 32x32 eigenvalue problems need to be solved. A batched 200x200 QR decomposition is also required in radar signal processing [3]. Hydrodynamic simulations need to compute thousands of matrix-matrix (dgemm) or matrix-vector (dgemv) products of matrices of well over 100x100 [6].…”
Section: Introduction
confidence: 99%
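The batched 200x200 QR decomposition this quote cites also has a direct cuBLAS counterpart. Below is a minimal hedged sketch using cublasDgeqrfBatched; the batch count and slab layout are assumptions, not taken from the cited radar application.

#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 200, batch = 1000;          // assumed batch count
    double *Aslab, *Tslab;
    cudaMalloc(&Aslab, (size_t)batch * m * m * sizeof(double));
    cudaMalloc(&Tslab, (size_t)batch * m * sizeof(double));

    std::vector<double*> hA(batch), hT(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = Aslab + (size_t)i * m * m;
        hT[i] = Tslab + (size_t)i * m;        // Householder scalars per matrix
    }
    double **dA, **dT;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMalloc(&dT, batch * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dT, hT.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    int info = 0;                             // host-side status flag
    // Each A_i is overwritten with its QR factorization in Householder form.
    cublasDgeqrfBatched(h, m, m, dA, m, dT, &info, batch);
    cublasDestroy(h);
    return 0;
}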