Abstract: Achieving high scalability with dynamically adaptive algorithms in high-performance computing (HPC) is a non-trivial task. The invasive paradigm using compute migration represents an efficient alternative to classical data migration approaches for such algorithms in HPC. We present a core-distribution scheduler which realizes the migration of computational power by distributing cores according to the requirements specified by one or more parallel program instances. We validate our approach with different be…
“…We see that we can, at little loss of efficiency, for many setups reduce the number of used cores. For codes deploying multiple MPI ranks per node, other ranks then can grab these freed cores [9].…”
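As a rough illustration of the core-distribution idea, and not the scheduler presented in the paper, the following C++ sketch hands out a node's cores to parallel program instances in proportion to the demand each instance reports; cores an instance does not claim stay free, so other ranks on the node can grab them. All names (CoreRequest, distribute_cores) and the numbers in main are invented for this example.

// Toy core-distribution sketch: grant each instance at least one core, then
// split the remaining cores proportionally to the reported demand. Cores
// that remain ungranted are free to be grabbed by other ranks on the node.
// Assumes total_cores >= number of instances.
#include <algorithm>
#include <iostream>
#include <vector>

struct CoreRequest {
    int instance_id;
    int desired_cores;  // cores this program instance asks for in the current phase
};

std::vector<int> distribute_cores(const std::vector<CoreRequest>& reqs, int total_cores) {
    std::vector<int> granted(reqs.size(), 1);  // one core each as a baseline
    int remaining = total_cores - static_cast<int>(reqs.size());
    int total_demand = 0;
    for (const auto& r : reqs) total_demand += std::max(0, r.desired_cores - 1);
    for (std::size_t i = 0; i < reqs.size() && total_demand > 0; ++i) {
        int extra = (std::max(0, reqs[i].desired_cores - 1) * remaining) / total_demand;
        granted[i] += std::min(extra, reqs[i].desired_cores - granted[i]);
    }
    return granted;
}

int main() {
    std::vector<CoreRequest> reqs = {{0, 8}, {1, 2}, {2, 4}};
    std::vector<int> granted = distribute_cores(reqs, 12);
    for (std::size_t i = 0; i < reqs.size(); ++i)
        std::cout << "instance " << reqs[i].instance_id << " gets " << granted[i] << " cores\n";
    return 0;
}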
With the advent of manycore systems, shared memory parallelisation has gained importance in high performance computing. Once a code is decomposed into tasks or parallel regions, it becomes crucial to identify reasonable grain sizes, i.e. minimum problem sizes per task that make the algorithm expose high concurrency at low overhead. Many papers do not detail what reasonable task sizes are, and consider their findings craftsmanship not worth discussion. We have implemented an autotuning algorithm, a machine learning approach, for a project developing a hyperbolic equation system solver. Autotuning is important here, as the grid and task workload are multifaceted and change frequently during runtime. In this paper, we summarise our lessons learned. We infer tweaks and idioms for general autotuning algorithms, and we clarify that such an approach does not free users completely from grain size awareness.
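To make the grain-size question concrete, here is a minimal sketch that replaces the machine-learning tuner described in the abstract with a plain exhaustive search: the same task-based kernel is timed under a handful of candidate grain sizes and the fastest one is kept. The kernel process_chunk, the problem size, and the candidate set are purely illustrative; a real tuner would re-evaluate its choice as the grid and task workload change at runtime.

// Minimal grain-size search: time the kernel split into tasks of `grain`
// elements and remember the grain size with the shortest wall time.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

static void process_chunk(std::vector<double>& data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) data[i] = data[i] * 1.0000001 + 1.0;  // stand-in work
}

static double timed_run(std::vector<double>& data, std::size_t grain) {
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t begin = 0; begin < data.size(); begin += grain)
        process_chunk(data, begin, std::min(begin + grain, data.size()));
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::vector<double> data(1 << 22, 1.0);
    const std::size_t candidates[] = {1u << 8, 1u << 10, 1u << 12, 1u << 14, 1u << 16};
    std::size_t best_grain = 0;
    double best_time = 1e30;
    for (std::size_t grain : candidates) {
        double t = timed_run(data, grain);
        std::cout << "grain " << grain << ": " << t << " s\n";
        if (t < best_time) { best_time = t; best_grain = grain; }
    }
    std::cout << "selected grain size: " << best_grain << "\n";
    return 0;
}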
The current static job scheduling on supercomputers for MPI-based applications is well known to be a limiting factor for the exploitation of a system's top performance in terms of application throughput. Hence, allowing fully flexible and dynamically varying job sizes would provide multiple advantages compared to the current approach, e.g., by prioritizing jobs dynamically and optimizing resource usage by transferring resources economically. A critical step in achieving dynamic resource management with MPI on supercomputers is the development of sound and robust interfaces between MPI applications and the runtime system. Our approach extends the concept of MPI Sessions, newly introduced with MPI 4.0, by adding features to support varying computing resources via the MPI process set abstraction. We then show how these features can be used, as a proof of concept, to request (active) and cope with (passive) varying resources from an application's perspective. To validate our approach, we develop libmpidynres, a C library providing an emulated MPI Sessions environment on top of existing MPI implementations without MPI Sessions support, which we then use to integrate our proposed extensions to the interface specification. Using this proof-of-concept environment, we show how an MPI Sessions-enabled application can use process sets to handle dynamically varying resources.
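For orientation, the following snippet shows the standard MPI 4.0 Sessions bootstrap that such extensions build on: a group and then a communicator are derived from the built-in process set "mpi://WORLD" instead of MPI_COMM_WORLD. The dynamically varying process sets proposed in the paper, and the libmpidynres emulation layer, would expose additional set names and calls that are not reproduced here.

// Plain MPI 4.0 Sessions bootstrap (no dynamic resources involved):
// initialise a session, turn the built-in "mpi://WORLD" process set into a
// group, and create a communicator from that group.
#include <mpi.h>
#include <cstdio>

int main() {
    MPI_Session session;
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

    MPI_Group group;
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);

    MPI_Comm comm;
    MPI_Comm_create_from_group(group, "example.sessions.bootstrap",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);

    int rank = -1, size = -1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    std::printf("rank %d of %d, built from a process set\n", rank, size);

    MPI_Comm_free(&comm);
    MPI_Group_free(&group);
    MPI_Session_finalize(&session);
    return 0;
}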
“…If a job exceeds the specified time limit then usually the job is cancelled from the job scheduling system. A different interesting approach to manage resources is the field of invasive computing [32], where a job can request and release resources dynamically while it is running. This helps to share resources while executing many jobs in parallel.…”
Section: Idling With Standard Scheduling Techniques
To foster predictive simulations, a variety of methods have recently been developed to efficiently tackle uncertainty quantification (UQ) in complex, computationally intensive problems. Many of these methods are non-intrusive and, thus, result in a (large) number of embarrassingly parallel black-box evaluations of the underlying simulation codes. While the focus of development is typically on the number of black-box evaluations, which represents the bulk of the computational workload, an additional level of potential performance gains exists. In many scenarios, uncertain input leads not only to uncertain outputs, but also to a varying and thus stochastic runtime of the simulation codes. For scheduling the individual black-box runs, this information is typically not taken into account, resulting in non-negligible idling times on parallel systems. In this contribution, we compare a variety of scheduling strategies for non-intrusive UQ scenarios using the non-intrusive polynomial chaos approach. In particular, we propose to construct a surrogate model for the runtime of the application using the identical UQ methodology as for the original problem. Using this model to predict the runtimes of subsequent black-box runs allows for (heuristic) optimization of the scheduling. The method has been tested for the forward quantification of uncertainty on academic models and on a pedestrian simulation in the context of evacuation scenarios. This approach allows for speed-up factors of about two for the total runtime and can be generalised to a large variety of applications that incorporate parameter-dependent runtime.
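As an illustration of how predicted runtimes can drive the scheduling, the sketch below uses a longest-processing-time-first heuristic, which is one possible choice and not necessarily the strategy evaluated in the paper: jobs are sorted by their surrogate-predicted runtime and assigned greedily to the currently least-loaded worker. The surrogate model itself is not shown; Job, schedule_lpt, and the sample numbers are illustrative.

// Longest-processing-time-first assignment of black-box runs to workers,
// driven by predicted runtimes.
#include <algorithm>
#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

struct Job {
    int id;
    double predicted_runtime;  // e.g. obtained from a runtime surrogate model
};

// Returns, for each worker, the list of job ids assigned to it.
std::vector<std::vector<int>> schedule_lpt(std::vector<Job> jobs, int num_workers) {
    std::sort(jobs.begin(), jobs.end(),
              [](const Job& a, const Job& b) { return a.predicted_runtime > b.predicted_runtime; });
    using Load = std::pair<double, int>;  // (accumulated load, worker index)
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> least_loaded;
    for (int w = 0; w < num_workers; ++w) least_loaded.push({0.0, w});
    std::vector<std::vector<int>> assignment(num_workers);
    for (const Job& j : jobs) {
        Load top = least_loaded.top();
        least_loaded.pop();
        assignment[top.second].push_back(j.id);
        least_loaded.push({top.first + j.predicted_runtime, top.second});
    }
    return assignment;
}

int main() {
    std::vector<Job> jobs = {{0, 3.2}, {1, 0.8}, {2, 2.5}, {3, 1.1}, {4, 4.0}};
    std::vector<std::vector<int>> plan = schedule_lpt(jobs, 2);
    for (std::size_t w = 0; w < plan.size(); ++w) {
        std::cout << "worker " << w << ":";
        for (int id : plan[w]) std::cout << " job" << id;
        std::cout << "\n";
    }
    return 0;
}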