SC16: International Conference for High Performance Computing, Networking, Storage and Analysis 2016
DOI: 10.1109/sc.2016.51
dCUDA: Hardware Supported Overlap of Computation and Communication

Cited by 21 publications (12 citation statements) · References 23 publications
“…The CPU and GPU communicate through a non-coherent PCIe bus model. This is representative of most previous works that have attempted intra-kernel networking using helper threads on the host [12,16,37]. dGPU also serves as the baseline for all results that report normalized energy consumption or speedups.…”
Section: Methods
confidence: 99%
“…For example, DCGN [37] quotes latencies of 330µs and Gravel [31] uses a 125µs timeout to flush pending messages. Even recent works on powerful modern hardware, such as dCUDA [12], only achieve latencies of approximately 20µs in the best case.…”
Section: High Latencies
confidence: 99%
“…SnuCL [17] and SnuCL-D [15] enable OpenCL applications to run in a distributed manner without any modification. dCUDA [14] automatically overlaps on-node computation and inter-node communication with hardware support and device-side remote memory access operations. It combines the MPI and CUDA programming models into a single CUDA kernel.…”
Section: Related Work
confidence: 99%
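The programming model the citation above describes — device-side remote memory access and notified transfers issued from inside a single CUDA kernel — can be sketched roughly as follows. This is pseudocode, not the paper's verbatim API: the names `dcuda_put_notify` and `dcuda_wait_notifications`, the window handle `win`, and the neighbor ranks are assumptions modeled on the MPI-like notified-access operations described in the dCUDA paper.

```cuda
// Hedged sketch of a dCUDA-style halo exchange inside a stencil kernel.
// Each thread block acts as a "rank"; communication is issued from the
// device, and the runtime overlaps it with compute by oversubscribing
// blocks so that other ranks keep the SMs busy while transfers are in
// flight. All identifiers below are illustrative assumptions.
__global__ void stencil_step(double *win_data, int n,
                             int left, int right, int iters, int tag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int iter = 0; iter < iters; ++iter) {
        // 1. Push this rank's boundary cell to the left neighbor and
        //    attach a notification -- no host round-trip required.
        if (i == 0)
            dcuda_put_notify(win, left, /*offset=*/n, /*count=*/1,
                             &win_data[0], tag);

        // 2. Update interior points while the transfer is in flight;
        //    overlap comes from other oversubscribed ranks, not from
        //    explicit streams managed on the host.
        if (i > 0 && i < n - 1)
            win_data[i] = 0.5 * (win_data[i - 1] + win_data[i + 1]);

        // 3. Block only until the matching notification from the right
        //    neighbor arrives, then proceed to the next iteration.
        dcuda_wait_notifications(win, right, tag, /*count=*/1);
    }
}
```

The key point, consistent with the quoted description, is that MPI-style communication calls appear inline in device code, so a single kernel expresses both the computation and the inter-node communication schedule.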
“…As a recent study, CuMAS [29] offers automatic overlapping of data transfers and kernel executions, but it focuses on scheduling multiple CUDA applications, rather than scheduling of a single application's data transfers. dCUDA [30] is a runtime system that overlaps computation with inter-node communication on a multi-GPU environment but it relies on the programmer to implement the CUDA kernels. Daino [31], a compiler-based framework for executing Adaptive Mesh Refinement (AMR) applications on GPUs, requires user directives but its runtime hides many details of data movement.…”
Section: Related Work
confidence: 99%