2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT) 2019
DOI: 10.1109/pdcat46702.2019.00034
Tasking in Accelerators: Performance Evaluation

Abstract: In this work, we analyze the implications and results of implementing dynamic parallelism, concurrent kernels, and CUDA Graphs to solve task-oriented problems. As a benchmark we propose three different methods for solving the DGEMM operation on tiled matrices, which is arguably the most popular benchmark for performance analysis. For the algorithms that we study, we present significant differences in terms of data dependencies, synchronization, and granularity. The main contribution of this work is determining which of…
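The tiled DGEMM benchmark named in the abstract can be sketched as follows. This is a minimal CPU stand-in, not the authors' CUDA implementation: the matrix size `N`, tile size `T`, and function names are illustrative assumptions. The point is the task structure — each `(ti, tj, tk)` tile-triple is one unit of work, and units sharing the same output tile `C(ti, tj)` must serialize on the accumulation.

```cpp
#include <cassert>
#include <vector>

// Illustrative sizes, not from the paper.
constexpr int N = 8;  // matrix dimension
constexpr int T = 4;  // tile dimension (must divide N)

using Matrix = std::vector<double>;  // row-major N x N

// One "task": C(ti, tj) += A(ti, tk) * B(tk, tj) over a T x T tile.
void dgemm_tile(const Matrix& A, const Matrix& B, Matrix& C,
                int ti, int tj, int tk) {
    for (int i = ti * T; i < (ti + 1) * T; ++i)
        for (int j = tj * T; j < (tj + 1) * T; ++j)
            for (int k = tk * T; k < (tk + 1) * T; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// Sequential driver: tasks with the same (ti, tj) accumulate into the
// same output tile, so a parallel schedule must not run them concurrently.
void tiled_dgemm(const Matrix& A, const Matrix& B, Matrix& C) {
    const int nt = N / T;
    for (int ti = 0; ti < nt; ++ti)
        for (int tj = 0; tj < nt; ++tj)
            for (int tk = 0; tk < nt; ++tk)
                dgemm_tile(A, B, C, ti, tj, tk);
}
```

The three methods the paper compares (dynamic parallelism, concurrent kernels, CUDA Graphs) differ in how these per-tile tasks are dispatched and synchronized on the GPU.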

Cited by 8 publications (5 citation statements). References 14 publications.
“…Using Dynamic Parallelism, programmers can invoke kernels from inside the device without switching context back to the CPU. However, launching kernels from other kernels carries a large associated computational cost [1].…”
Section: Background, 2.1 CUDA (mentioning)
confidence: 99%
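The overhead argument in this quotation can be made concrete with a toy cost model. This is a plain C++ analogy with made-up numbers, not a GPU measurement: a fixed per-launch cost stands in for the kernel-launch overhead, which a single batched launch pays once but nested per-tile child launches (dynamic parallelism) pay once per tile.

```cpp
#include <cassert>

// Toy cost model (illustrative units, not measured on a GPU).

// One batched launch covering all tiles pays the launch cost once.
long batched(long tiles, long launch_cost, long per_tile_work) {
    return launch_cost + tiles * per_tile_work;
}

// A parent kernel launching one child kernel per tile pays the
// launch cost per tile, as the citation statement points out.
long nested(long tiles, long launch_cost, long per_tile_work) {
    return tiles * (launch_cost + per_tile_work);
}
```

The gap between the two grows linearly with the number of tasks, which is why launch overhead matters for fine-grained tasking.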
“…Although GPU capacity has increased significantly, the scalability of algorithms and applications still faces important challenges [1]. One important problem regarding scalability is the hardware resource assignment.…”
Section: Related Work (mentioning)
confidence: 99%
“…It is undeniable that GPU capabilities have been increasing significantly in terms of performance and memory capacity. However, some applications are facing problems in terms of scalability, and some algorithms seem to limit the amount of work that one GPU can perform at a single time [1]. This is mainly due to the assignment of hardware resources and the occupancy of the device, which makes it difficult to benefit from the whole GPU capacity.…”
Section: Introduction (mentioning)
confidence: 99%
“…For these reasons, these models have gained broad acceptance. Some representative task-based models are Intel Threading Building Blocks [27], CUDA Graphs [23], OpenMP [13] and OmpSs [7]. The former two are hardware-centric models that expose architectural features in the language, requiring a considerable level of expertise from programmers to achieve productivity, while also hindering portability.…”
Section: Introduction (mentioning)
confidence: 99%
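The task-graph idea common to the models listed in this quotation — record a DAG of work items with explicit dependencies once, then launch the whole graph — can be sketched with a host-side stand-in. This is a conceptual sketch in plain C++, not the CUDA Graphs API: real CUDA Graphs record GPU kernels and memory operations via `cudaGraph*` calls, and the node names below are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Minimal task graph: nodes are recorded with their dependencies,
// then launch() runs every node after all of its dependencies.
struct TaskGraph {
    struct Node {
        std::function<void()> work;
        std::vector<int> deps;  // indices of nodes that must run first
    };
    std::vector<Node> nodes;

    int add(std::function<void()> work, std::vector<int> deps = {}) {
        nodes.push_back({std::move(work), std::move(deps)});
        return static_cast<int>(nodes.size()) - 1;
    }

    // Execute in dependency order via a memoized depth-first walk.
    void launch() {
        std::vector<bool> done(nodes.size(), false);
        std::function<void(int)> run = [&](int i) {
            if (done[i]) return;
            done[i] = true;
            for (int d : nodes[i].deps) run(d);
            nodes[i].work();
        };
        for (int i = 0; i < static_cast<int>(nodes.size()); ++i) run(i);
    }
};
```

The record-once, launch-many pattern is what lets graph-based runtimes amortize per-task launch overhead, at the price of the lower-level programming model the quotation describes.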