Design and analysis of scheduling strategies for multi-CPU and multi-GPU architectures

Lima, João Vicente Ferreira; Gautier, Thierry; Danjean, Vincent; Raffin, Bruno; Maillard, Nicolas

doi:10.1016/j.parco.2015.03.001

Cited by 19 publications

(14 citation statements)

References 27 publications


“…It therefore suggests to constrain the dsyrk and dtrsm tasks to run exclusively on GPUs. Performance gains when constraining some tasks to GPUs were already reported by Lima et al. [51]. However, their results were achieved using scheduler hints provided by programmer annotations. In our case, the suggestion of when and which tasks to constrain to GPUs is inferred from the solution of the ABE without relying on programmer's knowledge about task's architecture affinity.…”

Section: Initial Motivation (mentioning)

confidence: 93%

A visual performance analysis framework for task‐based parallel applications running on hybrid clusters

Pinto

Schnorr

Stanisic

et al. 2018

Concurrency and Computation

Self Cite


Programming paradigms in High-Performance Computing have been shifting toward task-based models that are capable of adapting readily to heterogeneous and scalable supercomputers. The performance of task-based applications heavily depends on the runtime scheduling heuristics and on its ability to exploit computing and communication resources. Unfortunately, the traditional performance analysis strategies are unfit to fully understand task-based runtime systems and applications: they expect a regular behavior with communication and computation phases, while task-based applications demonstrate no clear phases. Moreover, the finer granularity of task-based applications typically induces a stochastic behavior that leads to irregular structures that are difficult to analyze. Furthermore, the combination of application structure, scheduler, and hardware information is generally essential to understand performance issues. This paper presents a flexible framework that enables one to combine several sources of information and to create custom visualization panels that help understand and pinpoint performance problems incurred by bad scheduling decisions in task-based applications. Three case studies using StarPU-MPI, a task-based multi-node runtime system, are detailed to show how our framework can be used to study the performance of the well-known Cholesky factorization. Performance improvements include a better task partitioning among the multi-(GPU, core) to get closer to theoretical lower bounds, improved MPI pipelining in multi-(node, core, GPU) to reduce the slow start, and changes in the runtime system to increase MPI bandwidth, with gains of up to 13% in the total makespan.


“…In this paper, we address an intermediate setting, where tasks are independent, but share input data, and we analyze both makespan and communication performance. More recently, a study comparing different schedulers have been carried out in the context of dense linear algebra factorizations on heterogeneous systems [22]. Although, this study is closely related to the work we present in the present paper, it doesn't tackle neither the matrix product, nor the static (resp.…”

Section: Related Work (mentioning)

confidence: 97%

Comparison of Static and Runtime Resource Allocation Strategies for Matrix Multiplication

Beaumont

Eyraud-Dubois

Guermouche

et al. 2015

2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)


The tremendous increase in the size and heterogeneity of supercomputers makes it very difficult to predict the performance of a scheduling algorithm. In this context, relying on purely static scheduling and resource allocation strategies, that make scheduling and allocation decisions based on the dependency graph and the platform description, is expected to lead to large and unpredictable makespans whenever the behavior of the platform does not match the predictions. For this reason, the common practice in most runtime libraries is to rely on purely dynamic scheduling strategies, that make short-sighted scheduling decisions at runtime based on the estimations of the duration of the different tasks on the different available resources and on the state of the machine. In this paper, we consider the special case of Matrix Multiplication, for which a number of static allocation algorithms to minimize the amount of communications have been proposed. Through a set of extensive simulations, we analyze the behavior of static, dynamic, and hybrid strategies, and we assess the possible benefits of introducing more static knowledge and allocation decisions in runtime libraries.
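The contrast this abstract draws between static and dynamic allocation can be illustrated with a toy model. The sketch below is not the method of the cited paper: it assumes independent unit-cost tasks, ignores communication entirely, and uses hypothetical function names and speed values. It only shows how a static split based on a wrong speed prediction can yield a longer makespan than a greedy runtime rule that reacts to actual speeds.

```python
def static_makespan(n_tasks, predicted_speeds, actual_speeds):
    """Split tasks proportionally to predicted speeds, before execution starts."""
    total = sum(predicted_speeds)
    shares = [int(n_tasks * s / total) for s in predicted_speeds]
    # Hand any leftover tasks to the resource predicted to be fastest.
    fastest = predicted_speeds.index(max(predicted_speeds))
    shares[fastest] += n_tasks - sum(shares)
    # Each resource then runs its share at its actual speed.
    return max(k / s for k, s in zip(shares, actual_speeds))


def dynamic_makespan(n_tasks, actual_speeds):
    """Greedy runtime rule: each task goes to the resource that would finish it earliest."""
    finish = [0.0] * len(actual_speeds)
    for _ in range(n_tasks):
        best = min(range(len(finish)),
                   key=lambda r: finish[r] + 1.0 / actual_speeds[r])
        finish[best] += 1.0 / actual_speeds[best]
    return max(finish)


# Hypothetical platform: one CPU (speed 1) and one GPU predicted at speed 4
# that actually delivers speed 2.
static_result = static_makespan(100, [1.0, 4.0], [1.0, 2.0])
dynamic_result = dynamic_makespan(100, [1.0, 2.0])
print(static_result, dynamic_result)
```

In this toy setting the static split overloads the resource whose speed was overestimated, while the greedy rule rebalances as it goes. A hybrid strategy, as discussed in the abstract, would sit between the two.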


“…We implemented extensions in the OpenMP runtime developed in our team, LIBKOMP [5,3], which is based on the XKAAPI [1,9] runtime system. XKAAPI is a task-based runtime system, using workstealing as a general scheduling strategy.…”

Section: Extension Of the Task Scheduler To Support Affinity (mentioning)

confidence: 99%

“…The way XKAAPI enables ready tasks and steals them. The scheduling framework in XKAAPI [1,9] relies on virtual functions for selecting a victim and selecting a place to push a ready task. When a processor becomes idle, the runtime system calls a function to browse the topology to find a locality domain, and steal a task from its task queue.…”

Section: Extension Of the Task Scheduler To Support Affinity (mentioning)

confidence: 99%

Description, Implementation and Evaluation of an Affinity Clause for Task Directives

Virouleau

Roussel

Broquedis

et al. 2016

OpenMP: Memory, Devices, and Tasks

Self Cite


OpenMP 4.0 introduced dependent tasks, which give the programmer a way to express fine grain parallelism. Using appropriate OS support (such as NUMA libraries), the runtime can rely on the information in the depend clause to dynamically map the tasks to the architecture topology. Controlling data locality is one of the key factors to reach a high level of performance when targeting NUMA architectures. On this topic, OpenMP does not provide a lot of flexibility to the programmer yet, which lets the runtime decide where a task should be executed. In this paper, we present a class of applications which would benefit from having such a control and flexibility over tasks and data placement. We also propose our own interpretation of the new affinity clause for the task directive, which is being discussed by the OpenMP Architecture Review Board. This clause enables the programmer to give hints to the runtime about tasks placement during the program execution, which can be used to control the data mapping on the architecture. In our proposal, the programmer can express affinity between a task and the following resources: a thread, a NUMA node, and a data. We then present an implementation of this proposal in the Clang-3.8 compiler, and an implementation of the corresponding extensions in our OpenMP runtime LIBKOMP. Finally, we present a preliminary evaluation of this work running two task-based OpenMP kernels on a 192-core NUMA architecture, that shows noticeable improvements both in terms of performance and scalability.
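The citation statements for this entry describe a scheduler built around two overridable hooks: one that selects a victim to steal from, and one that selects where to push a ready task. A minimal sketch of that structure follows. It is not the XKAAPI or LIBKOMP API; every class, method, and parameter name here is hypothetical, and Python method overriding stands in for virtual functions.

```python
from collections import deque


class LocalityDomain:
    """Hypothetical locality domain (e.g. a NUMA node) with its own task queue."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()


class Scheduler:
    """Base scheduler; subclasses may override the two selection hooks."""

    def __init__(self, domains):
        self.domains = domains

    def select_victim(self, thief):
        # Default policy: first other domain that still has ready tasks.
        for domain in self.domains:
            if domain is not thief and domain.queue:
                return domain
        return None

    def select_push_place(self, task, affinity=None):
        # Default policy: honour an affinity hint, else pick the shortest queue.
        if affinity is not None:
            return self.domains[affinity]
        return min(self.domains, key=lambda d: len(d.queue))

    def push(self, task, affinity=None):
        self.select_push_place(task, affinity).queue.append(task)

    def next_task(self, domain):
        # The owner takes its newest task; an idle domain steals the victim's oldest.
        if domain.queue:
            return domain.queue.pop()
        victim = self.select_victim(domain)
        return victim.queue.popleft() if victim else None


# Three tasks are pinned to domain 0; domain 1 is idle and must steal.
domains = [LocalityDomain("numa0"), LocalityDomain("numa1")]
sched = Scheduler(domains)
for task in ["t0", "t1", "t2"]:
    sched.push(task, affinity=0)
stolen = sched.next_task(domains[1])
own = sched.next_task(domains[0])
```

A topology-aware policy would override `select_victim` to browse nearby domains first; the default here simply scans the list in order.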
