2016
DOI: 10.1109/tpds.2015.2412549
In-Place Matrix Transposition on GPUs

Abstract: Matrix transposition is an important algorithmic building block for many numeric algorithms such as FFT. With more and more algebra libraries offloading to GPUs, a high performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to limited available on-board memory capacity and high throughput. However, direct application of CPU in-place transposition algorithms lacks the amount of parallelism and locality required by GPU to achieve g…

Cited by 11 publications (8 citation statements)
References 15 publications (44 reference statements)
“…Our PIM implementation follows an efficient 3-step tiled approach [79,235] that (1) exploits spatial locality by operating on tiles of matrix elements, as opposed to single elements, and (2) balances the workload by partitioning the cycles across tasklets. To perform the three steps, we first factorize the dimensions of the 𝑀 × 𝑁 array as an 𝑀′ × 𝑚 × 𝑁′ × 𝑛 array, where 𝑀 = 𝑀′ × 𝑚 and 𝑁 = 𝑁′ × 𝑛.…”
Section: Matrix Transposition (mentioning)
confidence: 99%
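
To make the factorization concrete, the following is a minimal out-of-place sketch in C of the 4D view it implies: with M = M' × m and N = N' × n, the M × N row-major matrix is read as an M' × m × N' × n array, and the transpose amounts to swapping the (M', m) index pair with the (N', n) pair. The function name, the out-of-place buffer, and the single-threaded loops are illustrative assumptions; the cited PIM implementation performs the reordering in place, in three steps, across tasklets.

#include <stddef.h>

/* Sketch only: out-of-place reference for the 4D view behind the tiled
 * approach. A is M x N row-major with M = Mp*m and N = Np*n; the transpose
 * swaps the (Mp, m) index pair with the (Np, n) pair. */
void transpose_4d_view(const float *A, float *T,
                       size_t Mp, size_t m, size_t Np, size_t n)
{
    size_t M = Mp * m, N = Np * n;
    for (size_t i = 0; i < Mp; ++i)                /* tile row           */
        for (size_t ii = 0; ii < m; ++ii)          /* row within tile    */
            for (size_t j = 0; j < Np; ++j)        /* tile column        */
                for (size_t jj = 0; jj < n; ++jj)  /* column within tile */
                    T[(j*n + jj) * M + (i*m + ii)] =
                        A[(i*m + ii) * N + (j*n + jj)];
}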
“…Fifth, while the amount of time spent on CPU-DPU transfers and DPU-CPU transfers is relatively low compared to the time spent on DPU execution for most benchmarks, we observe that CPU-DPU transfer time is very high in TRNS. The CPU-DPU transfer of TRNS performs step 1 of the matrix transposition algorithm [79,235] by issuing 𝑀′ × 𝑚 data transfers of 𝑛 elements, as explained in Section 4.14. Since we use a small 𝑛 value in the experiment (𝑛 = 8, as indicated in Table 3), the sustained CPU-DPU bandwidth is far from the maximum CPU-DPU bandwidth (see sustained CPU-DPU bandwidth for different transfer sizes in Figure 8a).…”
Section: Key Observation 12 (mentioning)
confidence: 99%
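
The transfer pattern being described can be sketched, for the column block assigned to one DPU, roughly as below. The helper copy_to_dpu() and the one-column-block-per-DPU assignment are hypothetical placeholders (the statement does not name the actual UPMEM host call); the point is only that 𝑀′ × 𝑚 small transfers of 𝑛 elements each keep the sustained CPU-DPU bandwidth low when 𝑛 is small.

#include <stddef.h>

/* Hypothetical host-to-DPU copy helper; stands in for whatever the real
 * host API provides. */
void copy_to_dpu(size_t dst_offset, const void *src, size_t bytes);

/* Sketch of the step-1 transfer pattern for the column block j of one DPU:
 * the host issues Mp*m transfers, each carrying only n elements of a row,
 * so with n = 8 every transfer is tiny. */
void step1_transfers_one_dpu(const float *A, size_t Mp, size_t m,
                             size_t Np, size_t n, size_t j)
{
    size_t N = Np * n;                        /* matrix width              */
    for (size_t r = 0; r < Mp * m; ++r)       /* Mp*m transfers in total   */
        copy_to_dpu(r * n * sizeof(float),    /* destination offset        */
                    &A[r * N + j * n],        /* n-element chunk of row r  */
                    n * sizeof(float));
}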
“…Sung et al. [13,14] have described a 3-stage in-place matrix transposition algorithm that is similar to the GKK algorithm, but goes directly from a CMO matrix to a block matrix in which the blocks are stored in row order and the elements in each block are stored in column order, i.e., as in the middle matrix in Figure 5. It achieves better performance on GPUs than the 4-stage GKK algorithm because it is able to make more efficient use of on-chip memory.…”
Section: Some Recent IPT Algorithms (mentioning)
confidence: 99%
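
The intermediate layout described here (blocks stored in row order, elements within each block in column order) can be sketched as an index mapping. The block dimensions bR × bC, and the assumption that they evenly divide the matrix dimensions, are illustrative choices, not details taken from the cited papers.

#include <stddef.h>

/* Sketch: flat position of element (r, c) of an M x N matrix in a layout
 * where blocks of size bR x bC are stored in row order and the elements of
 * each block are stored in column-major order. bR and bC are assumed to
 * divide M and N evenly. */
size_t block_layout_index(size_t r, size_t c,
                          size_t N, size_t bR, size_t bC)
{
    size_t BC = N / bC;                 /* number of block columns         */
    size_t bi = r / bR, ri = r % bR;    /* block row, row within block     */
    size_t bj = c / bC, ci = c % bC;    /* block column, col within block  */
    return (bi * BC + bj) * (bR * bC)   /* start of the block (row order)  */
         + ci * bR + ri;                /* column-major offset in block    */
}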
“…The low performance of the NT operation of cuBLAS may be caused by inefficient memory access to the elements of B. Another possible reason is that cuBLAS uses the slow in-place matrix transpose algorithm to reduce the memory footprint [14]. Observing this low efficiency, we are motivated to propose a method (TNN) for NT operations which computes the transpose of B first and then calls the NN function of cuBLAS to finish the calculation of A × Bᵀ on GPUs.…”
Section: Motivation (mentioning)
confidence: 99%
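
A minimal sketch of the TNN idea using the cuBLAS C API, under cuBLAS's column-major convention with A of size m × k, B of size n × k, and C of size m × n: first materialize Bᵀ into a caller-provided k × n device buffer Bt with cublasSgeam, then run a plain NN cublasSgemm. The function name, the preallocated Bt buffer, and the omitted error checking are assumptions of this sketch, not details from the cited paper.

#include <cublas_v2.h>

/* Sketch of TNN: materialize Bt = B^T with cublasSgeam, then compute
 * C = A * Bt with an NN cublasSgemm. A is m x k, B is n x k, Bt is k x n,
 * C is m x n, all column-major device buffers. */
void gemm_tnn(cublasHandle_t handle, int m, int n, int k,
              const float *A, const float *B, float *Bt, float *C)
{
    const float one = 1.0f, zero = 0.0f;

    /* Step 1: explicit transpose, Bt (k x n) = B^T. The second operand is
     * Bt itself with beta = 0 (the in-place form cuBLAS supports). */
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, k, n,
                &one, B, n, &zero, Bt, k, Bt, k);

    /* Step 2: plain NN GEMM, C (m x n) = A (m x k) * Bt (k x n). */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &one, A, m, Bt, k, &zero, C, m);
}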
“…The in-place matrix transpose algorithm does not require extra memory space. However, in-place matrix transposition can be factored as a product of disjoint cycles [21], and the number of cycles can be much lower for rectangular matrices while their lengths are not uniform, which makes parallelization difficult [14]. The state-of-the-art implementation of in-place matrix transposition achieves only 51.56 GB/s and 22.74 GB/s on a GTX 980 (with a peak memory bandwidth of 224 GB/s) and a Tesla K20 (with a peak memory bandwidth of 208 GB/s), respectively, in single precision [14].…”
Section: TNN: Transpose Before Multiply (mentioning)
confidence: 99%
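
For reference, the cycle structure discussed above can be illustrated with a minimal sequential C sketch: in an M × N row-major matrix, the element at flat index a must move to (a × M) mod (M × N − 1), and transposition rotates each disjoint cycle of that permutation once. The cycle-leader test and the single-threaded form are simplifications; the non-uniform, data-dependent cycle lengths are exactly what makes a parallel GPU implementation hard, as the statement notes.

#include <stddef.h>

/* Sketch: sequential in-place transpose of an M x N row-major matrix by
 * following the disjoint cycles of the permutation a -> (a*M) mod (M*N-1).
 * Indices 0 and M*N-1 are fixed points. Illustrative only, not a parallel
 * GPU implementation. */
void transpose_inplace_cycles(float *A, size_t M, size_t N)
{
    if (M * N < 2) return;
    size_t span = M * N - 1;
    for (size_t start = 1; start < span; ++start) {
        /* Process a cycle only from its smallest index (cycle leader). */
        size_t probe = (start * M) % span;
        while (probe > start) probe = (probe * M) % span;
        if (probe != start) continue;
        /* Rotate the cycle: each element moves to its transposed slot. */
        float carry = A[start];
        size_t cur = start;
        do {
            size_t dst = (cur * M) % span;
            float tmp = A[dst];
            A[dst] = carry;
            carry = tmp;
            cur = dst;
        } while (cur != start);
    }
}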