2021
DOI: 10.1007/978-3-030-85665-6_27
Efficient GPU Computation Using Task Graph Parallelism

Cited by 20 publications (2 citation statements)
References 24 publications
“…The parallel computing community has a number of algorithms including static mapping [41], dynamic work-stealing [20], [21], asymptotic profiling [42], and other system-defined strategies [5], [8], [10], [16]. Vendor-specific features such as CUDA Graph [2], [43] and SYCL [9] offer asynchronous graph scheduling for task parallelism, but their implementation details are unknown. On the other hand, automatic GPU placement has been studied in the machine learning community [44], [45].…”

Section: B. VLSI Placement
confidence: 99%
“…Legion [3,9] is also a task-based runtime designed for distributed machines with heterogeneous nodes. More recently, CUDA Graph [14] lets developers write or capture GPU operations and organizes them into graphs to reduce the kernel launch overhead of CUDA. OpenMP [4,18] target constructs have been introduced in the specification version 4.0.…”

Section: Related Work
confidence: 99%
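The statement above describes the mechanism CUDA Graph uses to cut launch overhead: a sequence of kernel launches is captured into a graph once, instantiated, and then replayed with a single launch call per iteration. A minimal sketch of that pattern, using the standard stream-capture API (the `scale` kernel and the loop count are illustrative placeholders, not from the paper):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: multiplies each element of x by a.
__global__ void scale(float* x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

int main() {
  const int n = 1 << 20;
  float* d;
  cudaMalloc(&d, n * sizeof(float));

  cudaStream_t s;
  cudaStreamCreate(&s);

  // Capture a sequence of kernel launches into a graph instead of
  // submitting them to the driver one by one every iteration.
  cudaGraph_t graph;
  cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
  scale<<<(n + 255) / 256, 256, 0, s>>>(d, 2.0f, n);
  scale<<<(n + 255) / 256, 256, 0, s>>>(d, 0.5f, n);
  cudaStreamEndCapture(s, &graph);

  // Instantiate once; replay many times with a single launch call,
  // amortizing the per-kernel launch overhead across the whole graph.
  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  for (int iter = 0; iter < 100; ++iter)
    cudaGraphLaunch(exec, s);
  cudaStreamSynchronize(s);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(s);
  cudaFree(d);
  return 0;
}
```

The benefit grows with the number of kernels per iteration: each `cudaGraphLaunch` replaces what would otherwise be one driver round-trip per kernel.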