2016
DOI: 10.1109/tpds.2015.2412549
In-Place Matrix Transposition on GPUs

Abstract: Matrix transposition is an important algorithmic building block for many numeric algorithms such as FFT. With more and more algebra libraries offloading to GPUs, a high performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to limited available on-board memory capacity and high throughput. However, direct application of CPU in-place transposition algorithms lacks the amount of parallelism and locality required by GPU to achieve g…

Cited by 11 publications (8 citation statements)
References 15 publications (44 reference statements)
“…Our PIM implementation follows an efficient 3-step tiled approach [79,235] that (1) exploits spatial locality by operating on tiles of matrix elements, as opposed to single elements, and (2) balances the workload by partitioning the cycles across tasklets. To perform the three steps, we first factorize the dimensions of the 𝑀 × 𝑁 array as an 𝑀′ × 𝑚 × 𝑁′ × 𝑛 array, where 𝑀 = 𝑀′ × 𝑚 and 𝑁 = 𝑁′ × 𝑛.…”
Section: Matrix Transposition (mentioning)
confidence: 99%
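
To make the factorization concrete, the following is a minimal out-of-place sketch in C of the 4D view it implies: with M = M' × m and N = N' × n, the M × N row-major matrix is read as an M' × m × N' × n array, and the transpose amounts to swapping the (M', m) index pair with the (N', n) pair. The function name, the out-of-place buffer, and the single-threaded loops are illustrative assumptions; the cited PIM implementation performs the reordering in place, in three steps, across tasklets.

#include <stddef.h>

/* Sketch only: out-of-place reference for the 4D view behind the tiled
 * approach. A is M x N row-major with M = Mp*m and N = Np*n; the transpose
 * swaps the (Mp, m) index pair with the (Np, n) pair. */
void transpose_4d_view(const float *A, float *T,
                       size_t Mp, size_t m, size_t Np, size_t n)
{
    size_t M = Mp * m, N = Np * n;
    for (size_t i = 0; i < Mp; ++i)                /* tile row           */
        for (size_t ii = 0; ii < m; ++ii)          /* row within tile    */
            for (size_t j = 0; j < Np; ++j)        /* tile column        */
                for (size_t jj = 0; jj < n; ++jj)  /* column within tile */
                    T[(j*n + jj) * M + (i*m + ii)] =
                        A[(i*m + ii) * N + (j*n + jj)];
}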
“…Fifth, while the amount of time spent on CPU-DPU transfers and DPU-CPU transfers is relatively low compared to the time spent on DPU execution for most benchmarks, we observe that CPU-DPU transfer time is very high in TRNS. The CPU-DPU transfer of TRNS performs step 1 of the matrix transposition algorithm [79,235] by issuing 𝑀′ × 𝑚 data transfers of 𝑛 elements, as explained in Section 4.14. Since we use a small 𝑛 value in the experiment (𝑛 = 8, as indicated in Table 3), the sustained CPU-DPU bandwidth is far from the maximum CPU-DPU bandwidth (see sustained CPU-DPU bandwidth for different transfer sizes in Figure 8a).…”
Section: Key Observation 12 (mentioning)
confidence: 99%
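
The transfer pattern being described can be sketched, for the column block assigned to one DPU, roughly as below. The helper copy_to_dpu() and the one-column-block-per-DPU assignment are hypothetical placeholders (the statement does not name the actual UPMEM host call); the point is only that 𝑀′ × 𝑚 small transfers of 𝑛 elements each keep the sustained CPU-DPU bandwidth low when 𝑛 is small.

#include <stddef.h>

/* Hypothetical host-to-DPU copy helper; stands in for whatever the real
 * host API provides. */
void copy_to_dpu(size_t dst_offset, const void *src, size_t bytes);

/* Sketch of the step-1 transfer pattern for the column block j of one DPU:
 * the host issues Mp*m transfers, each carrying only n elements of a row,
 * so with n = 8 every transfer is tiny. */
void step1_transfers_one_dpu(const float *A, size_t Mp, size_t m,
                             size_t Np, size_t n, size_t j)
{
    size_t N = Np * n;                        /* matrix width              */
    for (size_t r = 0; r < Mp * m; ++r)       /* Mp*m transfers in total   */
        copy_to_dpu(r * n * sizeof(float),    /* destination offset        */
                    &A[r * N + j * n],        /* n-element chunk of row r  */
                    n * sizeof(float));
}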
“…Sung et al. [13,14] have described a 3-stage in-place matrix transposition algorithm that is similar to the GKK algorithm, but goes directly from a CMO matrix to a block matrix in which the blocks are stored in row order and the elements in each block are stored in column order, i.e., as in the middle matrix in Figure 5. It achieves better performance on GPUs than the 4-stage GKK algorithm because it is able to make more efficient use of on-chip memory.…”
Section: Some Recent IPT Algorithms (mentioning)
confidence: 99%
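
The intermediate layout described here (blocks stored in row order, elements within each block in column order) can be sketched as an index mapping. The block dimensions bR × bC, and the assumption that they evenly divide the matrix dimensions, are illustrative choices, not details taken from the cited papers.

#include <stddef.h>

/* Sketch: flat position of element (r, c) of an M x N matrix in a layout
 * where blocks of size bR x bC are stored in row order and the elements of
 * each block are stored in column-major order. bR and bC are assumed to
 * divide M and N evenly. */
size_t block_layout_index(size_t r, size_t c,
                          size_t N, size_t bR, size_t bC)
{
    size_t BC = N / bC;                 /* number of block columns         */
    size_t bi = r / bR, ri = r % bR;    /* block row, row within block     */
    size_t bj = c / bC, ci = c % bC;    /* block column, col within block  */
    return (bi * BC + bj) * (bR * bC)   /* start of the block (row order)  */
         + ci * bR + ri;                /* column-major offset in block    */
}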
“…The low performance of the NT operation of cuBLAS may be caused by inefficient memory access to the elements of B. Another possible reason is that cuBLAS uses the slow in-place matrix transpose algorithm to reduce the memory footprint [14]. Observing this low efficiency, we are motivated to propose a method (TNN) for NT operations which computes the transpose of B first and then calls the NN function of cuBLAS to finish the calculation of A × Bᵀ on GPUs.…”
Section: Motivation (mentioning)
confidence: 99%
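
A minimal sketch of the TNN idea using the cuBLAS C API, under cuBLAS's column-major convention with A of size m × k, B of size n × k, and C of size m × n: first materialize Bᵀ into a caller-provided k × n device buffer Bt with cublasSgeam, then run a plain NN cublasSgemm. The function name, the preallocated Bt buffer, and the omitted error checking are assumptions of this sketch, not details from the cited paper.

#include <cublas_v2.h>

/* Sketch of TNN: materialize Bt = B^T with cublasSgeam, then compute
 * C = A * Bt with an NN cublasSgemm. A is m x k, B is n x k, Bt is k x n,
 * C is m x n, all column-major device buffers. */
void gemm_tnn(cublasHandle_t handle, int m, int n, int k,
              const float *A, const float *B, float *Bt, float *C)
{
    const float one = 1.0f, zero = 0.0f;

    /* Step 1: explicit transpose, Bt (k x n) = B^T. The second operand is
     * Bt itself with beta = 0 (the in-place form cuBLAS supports). */
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, k, n,
                &one, B, n, &zero, Bt, k, Bt, k);

    /* Step 2: plain NN GEMM, C (m x n) = A (m x k) * Bt (k x n). */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &one, A, m, Bt, k, &zero, C, m);
}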
“…The in-place matrix transpose algorithm does not require extra memory space. However, in-place matrix transposition can be factored as a product of disjoint cycles [21], and the number of cycles can be much lower for rectangular matrices while their lengths are not uniform, which makes parallelization difficult [14]. The state-of-the-art implementation of in-place matrix transposition achieves only 51.56 GB/s and 22.74 GB/s on a GTX 980 (with a peak memory bandwidth of 224 GB/s) and a Tesla K20 (with a peak memory bandwidth of 208 GB/s), respectively, in single precision [14].…”
Section: TNN: Transpose Before Multiply (mentioning)
confidence: 99%
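
For reference, the cycle structure discussed above can be illustrated with a minimal sequential C sketch: in an M × N row-major matrix, the element at flat index a must move to (a × M) mod (M × N − 1), and transposition rotates each disjoint cycle of that permutation once. The cycle-leader test and the single-threaded form are simplifications; the non-uniform, data-dependent cycle lengths are exactly what makes a parallel GPU implementation hard, as the statement notes.

#include <stddef.h>

/* Sketch: sequential in-place transpose of an M x N row-major matrix by
 * following the disjoint cycles of the permutation a -> (a*M) mod (M*N-1).
 * Indices 0 and M*N-1 are fixed points. Illustrative only, not a parallel
 * GPU implementation. */
void transpose_inplace_cycles(float *A, size_t M, size_t N)
{
    if (M * N < 2) return;
    size_t span = M * N - 1;
    for (size_t start = 1; start < span; ++start) {
        /* Process a cycle only from its smallest index (cycle leader). */
        size_t probe = (start * M) % span;
        while (probe > start) probe = (probe * M) % span;
        if (probe != start) continue;
        /* Rotate the cycle: each element moves to its transposed slot. */
        float carry = A[start];
        size_t cur = start;
        do {
            size_t dst = (cur * M) % span;
            float tmp = A[dst];
            A[dst] = carry;
            carry = tmp;
            cur = dst;
        } while (cur != start);
    }
}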