The purpose of this paper is to highlight the performance issues that matrix transposition algorithms for large matrices encounter with the Translation Lookaside Buffer (TLB) cache. Existing optimisation techniques, such as coalesced access and the use of shared memory, are necessary and beneficial but not sufficient to neutralise the problem. As the problem size increases, these optimisations do not exploit data locality effectively enough to counteract the detrimental effects of TLB cache misses. We propose a new optimisation technique that counteracts the performance degradation of these algorithms and seamlessly complements current optimisations. Our optimisation is based on a detailed analysis of enumeration schemes that can be applied to either individual matrix entries or blocks (sub-matrices). The key advantage of these enumeration schemes is that they do not incur matrix storage format conversion because they operate on canonical matrix layouts. In addition, several cache-efficient matrix transposition algorithms based on enumeration schemes are offered: an improved version of the in-place algorithm for square matrices, an out-of-place algorithm for rectangular matrices, and two 3D involutions. We demonstrate that the choice of the enumeration schemes and their parametrisation can have a direct and significant impact on the algorithm's memory access pattern. Our in-place version of the algorithm delivers up to 100% performance improvement over the existing optimisation techniques. Meanwhile, for the out-of-place version we observe up to 300% performance gain over NVIDIA's algorithm. We also offer improved versions of two involution transpositions for 3D matrices that can achieve a performance increase
of up to 300%. To the best of our knowledge, this is the first effective attempt to control the logical-to-physical block association through the design of enumeration schemes in the context of matrix transposition.
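To make the central idea concrete, the following is a minimal CUDA sketch, not the paper's implementation, of a shared-memory tiled transpose whose block enumeration has been remapped. The first two lines of the kernel reassign the logical block index (blockIdx) to a different physical tile along a diagonal ordering, in the spirit of the logical-to-physical block association the abstract refers to; the diagonal scheme itself is one well-known illustrative choice (it appears in NVIDIA's classic transpose example), not necessarily the scheme proposed here. The names transposeDiagonal, TILE_DIM, and BLOCK_ROWS are assumptions, as is the simplification that the matrix is square with n divisible by TILE_DIM.

```cuda
#include <cuda_runtime.h>

#define TILE_DIM   32  // tile edge; one tile is staged in shared memory
#define BLOCK_ROWS 8   // each block covers the tile in TILE_DIM/BLOCK_ROWS passes

// Out-of-place transpose of an n x n row-major matrix. Reads and writes are
// coalesced through a shared-memory tile; the +1 padding avoids shared-memory
// bank conflicts. The remapping below changes only WHICH tile a thread block
// processes, not how it processes it, so the enumeration scheme can be varied
// independently of the rest of the kernel.
__global__ void transposeDiagonal(float *out, const float *in, int n)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    // Diagonal enumeration scheme: bijectively remap the logical block
    // coordinates onto physical tile coordinates (assumes a square grid).
    int by = blockIdx.x;
    int bx = (blockIdx.x + blockIdx.y) % gridDim.x;

    // Coalesced read of one input tile into shared memory.
    int x = bx * TILE_DIM + threadIdx.x;
    int y = by * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * n + x];

    __syncthreads();

    // Coalesced write to the transposed tile position (block coords swapped).
    x = by * TILE_DIM + threadIdx.x;
    y = bx * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        out[(y + j) * n + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

Launched as `transposeDiagonal<<<dim3(n / TILE_DIM, n / TILE_DIM), dim3(TILE_DIM, BLOCK_ROWS)>>>(out, in, n)`, the kernel is a correct transpose because the remapping is a bijection over tiles. Deleting the two remapping lines (using blockIdx.x and blockIdx.y directly) recovers the conventional row-major block enumeration, which makes the memory-system effect of the enumeration scheme straightforward to measure in isolation.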