2020
DOI: 10.1147/jrd.2019.2962428
OpenMP 4.5 compiler optimization for GPU offloading

Cited by 16 publications (4 citation statements)
References 1 publication
“…A wavefront executes on a single SIMD over four cycles [11]: each cycle issues one instruction to a batch of 16 lanes, so four batches cover all 64 lanes of the wavefront. A CU has four such vector units, so the total throughput is 64 single-precision operations per cycle [12]. The hardware architecture of the compute unit is shown in Figure 5.…”
Section: DCU Architecture
confidence: 99%
“…The DCU has 16 GB of global memory and 64 KB of shared memory within each CU. The experimental tests were carried out on a single node with a single DCU card [11].…”
Section: Test Environment
confidence: 99%
“…ExaHyPE, an exascale hyperbolic PDE engine [30], used a pragma-based GPU parallelization approach for object-oriented code and documented lessons learned. Other related work includes demonstrations of GPU support for OpenMP offloading features in the Flang/Clang compilers [3,25], a proof-of-concept implementation of offloading for FPGA-based accelerators [14,26], and an interprocedural static-analysis heuristic applied at runtime to select optimal grid sizes for offloaded target teams constructs [27], among others. There are publicly available benchmark suites to evaluate heterogeneous application performance, e.g.…”
Section: Related Work
confidence: 99%
“…The other compilers do not generate excess data movement for su3-v0 because of compiler optimization passes; e.g., the XL compiler uses interprocedural static analysis to determine that all threads in a team execute the same code [27]. The second mini-app, gppnaive, sums data contributions over 4 nested loops.…”
Section: Performance Issues Across Compilers
confidence: 99%