2022
DOI: 10.1007/978-3-030-95953-1_4
Concurrent Execution of Deferred OpenMP Target Tasks with Hidden Helper Threads

Cited by 17 publications (6 citation statements) | References 5 publications
“…In recent work we implemented loop transformation constructs introduced in OpenMP 5.1 [70,71], asynchronous offloading for OpenMP [132], efficient lowering of idiomatic OpenMP code to GPUs (under review), OpenMP-aware compiler optimizations with informative and actionable remarks for users (under review), a portable OpenMP device (=GPU) runtime written in OpenMP 5.1 (including atomic support) [133], a virtual GPU as a debugging-friendly offloading target on the host [134], and improved diagnostics and execution information [135,136]. We redid the OpenMP GPU code generation in LLVM/Clang [137] to improve performance and correctness.…”
Section: Recent Progress
confidence: 99%
“…Such a mechanism would allow the threads to dispatch many target regions concurrently, even letting a single OpenMP thread manage an "infinite" number of target regions, thus resolving the problem not only for OMPC but for all target devices. In fact, this limitation has already been pointed out by the libomptarget developers [33], but has not been entirely fixed yet.…”
Section: Future Work
confidence: 99%

The OpenMP Cluster Programming Model. Yviquel, Pereira, Francesquini et al., 2022 (Preprint)

“…In [25], several approaches are presented to overlap GPU operations with computations using OpenMP target constructs. They proposed to run asynchronous target tasks on dedicated threads, which are preempted by blocking operations.…”
Section: Related Work
confidence: 99%
“…The completion of GPU operations implies synchronizations that end up blocking threads. Hence, the LLVM OpenMP runtime executes asynchronous target tasks on dedicated Hidden Helper Threads (HHT) [25] implemented as kernel threads. Thus, the operating system can preempt threads blocking on GPU operations, and standard OpenMP threads can be rescheduled onto physical cores to progress other tasks in parallel.…”
Section: OpenMP Target In MPC
confidence: 99%