Parallel Processing of Matrix Multiplication in a CPU and GPU Heterogeneous Environment

Ohshima, Satoshi; Kise, Kenji; Katagiri, Takahiro; Yuba, Toshitsugu

doi:10.1007/978-3-540-71351-7_24

Cited by 37 publications

(20 citation statements)

References 9 publications

“…Therefore, it is important to find an effective method to make full use of all the available computational resources of both the CPU and GPU. Recently, some approaches [3,4,5,6,7] have been developed to perform a specific task using both multi-core CPU and GPU simultaneously, instead of the CPU or GPU alone. In this paper, we present a way to distribute the workload into both the CPU and GPU, with a performance prediction model (i.e., a static strategy) including characteristics of feature extraction from the video stream data.…”

mentioning

confidence: 99%

CPU-GPU hybrid computing for feature extraction from video stream

Lee, Kim, Park, et al. 2014

IEICE Electron. Express

In this paper, we propose a way to distribute the video analytics workload into both the CPU and GPU, with a performance prediction model including characteristics of feature extraction from the video stream data. That is, we estimate the total execution time of a CPU-GPU hybrid computing system with the performance prediction model, and determine the optimal workload ratio and how to use the CPU cores for the given workload. Based on experimental results, we confirm that our proposed method can improve the speedups of three typical workload distributions: CPU-only, GPU-only, or CPU-GPU hybrid computing with a 50:50 workload ratio.
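The idea of choosing a workload ratio from a predicted execution time can be illustrated with a minimal sketch. It is not the model from the cited paper: it assumes each device processes work at a constant rate, that the GPU pays a fixed transfer overhead, and that the hybrid run finishes when the slower device finishes. The names `predicted_time`, `balanced_ratio`, `cpu_rate`, `gpu_rate`, and `gpu_overhead` are hypothetical.

```python
def predicted_time(ratio, work, cpu_rate, gpu_rate, gpu_overhead=0.0):
    """Predicted hybrid run time when `ratio` of the work goes to the GPU.

    Assumed linear model (not taken from the cited paper): each device
    processes work at a constant rate, and the GPU pays a fixed overhead
    whenever it receives any work at all.
    """
    gpu_time = ratio * work / gpu_rate + (gpu_overhead if ratio > 0 else 0.0)
    cpu_time = (1.0 - ratio) * work / cpu_rate
    # Both devices run concurrently, so the slower one sets the total time.
    return max(gpu_time, cpu_time)


def balanced_ratio(work, cpu_rate, gpu_rate, gpu_overhead=0.0):
    """GPU share at which both devices are predicted to finish together."""
    cpu_only = work / cpu_rate   # time if the CPU did everything
    gpu_full = work / gpu_rate   # GPU compute time for everything
    ratio = (cpu_only - gpu_overhead) / (cpu_only + gpu_full)
    # Clamp: if the overhead exceeds the CPU-only time, use the CPU alone.
    return min(1.0, max(0.0, ratio))


if __name__ == "__main__":
    work, cpu_rate, gpu_rate = 100.0, 1.0, 3.0
    best = balanced_ratio(work, cpu_rate, gpu_rate)
    for label, r in [("CPU-only", 0.0), ("GPU-only", 1.0),
                     ("50:50", 0.5), ("balanced", best)]:
        print(label, round(predicted_time(r, work, cpu_rate, gpu_rate), 2))
```

Under these assumptions the balanced ratio is never slower than the CPU-only, GPU-only, or 50:50 splits, which are the three baselines the abstract compares against. A real model would replace the constant rates with measured, workload-dependent ones.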


“…An optimum split of the matrix would keep the time consumed by the GPU and CPU balanced [23,33]. The multi-device (GPU and CPU) computations are overlapped and the data transfers between GPU and CPU are performed asynchronously in order to achieve the maximum performance.…”

Section: Auto-tuning a Multi-device Matrix Multiplication

mentioning

confidence: 99%

Auto-tuning techniques for linear algebra routines on hybrid platforms

Bernabé, Cuenca, García, et al. 2015

Journal of Computational Science

“…Ohshima et al examined CPU and GPU parallel matrix-matrix multiplications on a single node [2], a procedure that improves the local dgemm performance. There are several frameworks and libraries to exploit the power of CPU and GPU, such as StarPU [3] and MAGMA [4].…”

Section: Related Studies

mentioning

confidence: 99%

Computation-Communication Overlap of Linpack on a GPU-Accelerated PC Cluster

Ohmura, Miyoshi, Irie, et al. 2011

IEICE Trans. Inf. & Syst.

SUMMARY: In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and an NVIDIA Tesla C1060 GPU card, we implement a CPU-GPU parallel double-precision general matrix-matrix multiplication (dgemm) operation, and achieve a performance improvement of 34% compared with the GPU-only case and 64% compared with the CPU-only case. For an entire 16-node cluster, each node of which is the same as the above and is connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark, and achieve a performance improvement of 28% compared with the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. Our overlap GPU acceleration solution uses overlaps in which the main inter-node communication and data transfer to the GPU device memory are overlapped with the main computation task on the CPU cores. These overlaps use multi-core processors, which almost all of today's high-performance computers use. In particular, as well as using a CPU core for communication tasks, we also simultaneously use other CPU cores and the GPU for computation tasks. In order to enable overlap between inter-node communication and computation tasks, we eliminate their close dependence by breaking the main computation task into smaller tasks and rescheduling. Based on a scheme in which part of the CPU computation power is simultaneously used for tasks other than computation tasks, we experimentally find the optimal computation ratio for CPUs; this ratio differs from the case of parallel dgemm operation of one node.
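The CPU-GPU parallel matrix multiplication mentioned in this abstract can be pictured as a row-wise split of the left-hand matrix, with each device computing its own block of the result. The sketch below is an illustration only, not the authors' implementation: two Python threads stand in for the two devices, the matrices are plain nested lists, and `matmul_rows`, `split_matmul`, and `gpu_ratio` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B (nested lists)."""
    b_cols = list(zip(*b))  # columns of B, computed once per block
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a_rows]


def split_matmul(a, b, gpu_ratio):
    """Compute A @ B by giving the first `gpu_ratio` share of A's rows to one
    worker and the remaining rows to another.

    Both workers are ordinary threads here; in a real system one block would
    be sent to the GPU and the other kept on the CPU cores.
    """
    split = round(len(a) * gpu_ratio)
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_part = pool.submit(matmul_rows, a[:split], b)  # stand-in for GPU
        cpu_part = pool.submit(matmul_rows, a[split:], b)  # stand-in for CPU
        # Row blocks are independent, so the results simply concatenate.
        return gpu_part.result() + cpu_part.result()


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6], [7, 8]]
    b = [[0, 1], [1, 0]]
    print(split_matmul(a, b, gpu_ratio=0.75))
```

Because the row blocks share no output elements, the only tuning parameter in this picture is the split ratio, which is why the cited work looks for the computation ratio experimentally.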
