SC16: International Conference for High Performance Computing, Networking, Storage and Analysis 2016
DOI: 10.1109/sc.2016.51
dCUDA: Hardware Supported Overlap of Computation and Communication

Cited by 21 publications (12 citation statements) · References 23 publications
“…The CPU and GPU communicate through a non-coherent PCIe bus model. This is representative of most previous works that have attempted intra-kernel networking using helper threads on the host [12,16,37]. dGPU also serves as the baseline for all results that report normalized energy consumption or speedups.…”
Section: Methods
confidence: 99%
“…For example, DCGN [37] quotes latencies of 330µs and Gravel [31] uses a 125µs timeout to flush pending messages. Even recent works on powerful modern hardware, such as dCUDA [12], only achieve latencies of approximately 20µs in the best case.…”
Section: High Latencies
confidence: 99%
“…SnuCL [17] and SnuCL-D [15] enable OpenCL applications to run in a distributed manner without any modification. dCUDA [14] automatically overlaps on-node computation and inter-node communication with hardware support and device-side remote memory access operations. It combines the MPI and CUDA programming models into a single CUDA kernel.…”
Section: Related Work
confidence: 99%
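The programming model the citation above describes — device-side remote memory access and notified transfers issued from inside a single CUDA kernel — can be sketched roughly as follows. This is pseudocode, not the paper's verbatim API: the names `dcuda_put_notify` and `dcuda_wait_notifications`, the window handle `win`, and the neighbor ranks are assumptions modeled on the MPI-like notified-access operations described in the dCUDA paper.

```cuda
// Hedged sketch of a dCUDA-style halo exchange inside a stencil kernel.
// Each thread block acts as a "rank"; communication is issued from the
// device, and the runtime overlaps it with compute by oversubscribing
// blocks so that other ranks keep the SMs busy while transfers are in
// flight. All identifiers below are illustrative assumptions.
__global__ void stencil_step(double *win_data, int n,
                             int left, int right, int iters, int tag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int iter = 0; iter < iters; ++iter) {
        // 1. Push this rank's boundary cell to the left neighbor and
        //    attach a notification -- no host round-trip required.
        if (i == 0)
            dcuda_put_notify(win, left, /*offset=*/n, /*count=*/1,
                             &win_data[0], tag);

        // 2. Update interior points while the transfer is in flight;
        //    overlap comes from other oversubscribed ranks, not from
        //    explicit streams managed on the host.
        if (i > 0 && i < n - 1)
            win_data[i] = 0.5 * (win_data[i - 1] + win_data[i + 1]);

        // 3. Block only until the matching notification from the right
        //    neighbor arrives, then proceed to the next iteration.
        dcuda_wait_notifications(win, right, tag, /*count=*/1);
    }
}
```

The key point, consistent with the quoted description, is that MPI-style communication calls appear inline in device code, so a single kernel expresses both the computation and the inter-node communication schedule.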
“…As a recent study, CuMAS [29] offers automatic overlapping of data transfers and kernel executions, but it focuses on scheduling multiple CUDA applications, rather than scheduling of a single application's data transfers. dCUDA [30] is a runtime system that overlaps computation with inter-node communication on a multi-GPU environment but it relies on the programmer to implement the CUDA kernels. Daino [31], a compiler-based framework for executing Adaptive Mesh Refinement (AMR) applications on GPUs, requires user directives but its runtime hides many details of data movement.…”
Section: Related Work
confidence: 99%