Porting the PLASMA Numerical Library to the OpenMP Standard

YarKhan, Asim; Kurzak, Jakub; Łuszczek, Piotr; Dongarra, Jack

doi:10.1007/s10766-016-0441-6

Cited by 31 publications

(28 citation statements)

References 31 publications


“…However, the PLASMA library is now being ported from the QUARK task scheduler to the OpenMP task scheduler, which may change PLASMA's performance slightly, but the stable version is still based on QUARK. On the other hand, in the work of YarKhan et al. (2017), we can see in Fig. 15 that the QUARK-based PLASMA implementation and its OpenMP version achieve almost identical performance, both somewhat worse than the MKL (on 20 cores of the Haswell processor, which is similar to our environment).…”

Section: Related Work (mentioning)

confidence: 60%

The Parallel Tiled WZ Factorization Algorithm for Multicore Architectures

Bylina 2019

International Journal of Applied Mathematics and Computer Science


The aim of this paper is to investigate dense linear algebra algorithms on shared memory multicore architectures. The design and implementation of a parallel tiled WZ factorization algorithm that can fully exploit such architectures are presented. Three parallel implementations of the algorithm are studied. The first relies only on multithreaded BLAS (basic linear algebra subprograms) operations. The second, in addition to BLAS operations, employs the OpenMP standard to exploit loop-level parallelism. The third, in addition to BLAS operations, employs the OpenMP task directive with the depend clause. We report the computational performance and the speedup of the parallel tiled WZ factorization algorithm on shared memory multicore architectures for dense square diagonally dominant matrices. Then we compare our parallel implementations with the corresponding LU factorization from a vendor-implemented LAPACK library. We also analyze the numerical accuracy. Two of our implementations achieve close to the maximal theoretical speedup implied by Amdahl's law.
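For reference, the speedup bound that the abstract attributes to Amdahl's law can be written as follows. The symbols are notation assumed here, not taken from the cited paper: f is the fraction of the run time that can be parallelized and p is the number of cores.

```latex
S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}},
\qquad
\lim_{p \to \infty} S(p) = \frac{1}{1 - f}
```

As an illustration with assumed values, f = 0.95 and p = 20 give S = 1 / (0.05 + 0.0475) ≈ 10.3, so even a small serial fraction keeps the achievable speedup well below the core count.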


“…Although in several cases the tasking model has replaced nested parallelism to exploit irregular applications [3,39], the latter still outperforms the former in some cases. This is, for example, the case of imbalanced loops, where dynamic scheduling or tasking may suffer from poor cache behavior and low data reuse due to the inability to bind tasks to cores [8].…”

Section: Nested Parallelism in HPC (mentioning)

confidence: 99%

“…Both utilize the Cholesky decomposition to capture the mean and covariance of the system state. Overall, the GPS-aided SINU is a real-time application that can exploit two levels of parallelism: in the outer level, the computation of the two functionalities (i.e., computing position, velocity and orientation, and estimating errors) can be performed in parallel; in the inner level, the computation of the Cholesky decomposition used in the Kalman Filter [39] can be further parallelized. The use of nested parallel regions can however prevent the scheduler from fulfilling priorities or ensuring work-conserving executions.…”

Section: GPS-aided SINU (mentioning)

confidence: 99%

The Cooperative Parallel: A Discussion About Run-Time Schedulers for Nested Parallelism

Royuela, Serrano, García-Gasulla et al. 2019

OpenMP: Conquering the Full Hardware Spectrum


Nested parallelism is a well-known parallelization strategy to exploit irregular parallelism in HPC applications. This strategy also fits in critical real-time embedded systems, composed of a set of concurrent functionalities. In this case, nested parallelism can be used to further exploit the parallelism of each functionality. However, current run-time implementations of nested parallelism can produce inefficiencies and load imbalance. Moreover, in critical real-time embedded systems, it may lead to incorrect executions due to, for instance, a work non-conserving scheduler. In both cases, the reason is that the teams of OpenMP threads are a black-box for the scheduler, i.e., the scheduler that assigns OpenMP threads and tasks to the set of available computing resources is agnostic to the internal execution of each team. This paper proposes a new run-time scheduler that considers dynamic information of the OpenMP threads and tasks running within several concurrent teams, i.e., concurrent parallel regions. This information may include the existence of OpenMP threads waiting in a barrier and the priority of tasks ready to execute. By making the concurrent parallel regions to cooperate, the shared computing resources can be better controlled and a work conserving and priority driven scheduler can be guaranteed.


“…OpenMP 4.5 further extended the tasking capabilities. For example, OpenMP 4.5 added task priorities that are critical for obtaining high performance using some of our PLASMA routines [22]. These OpenMP standards are supported by popular compilers, including the GNU Compiler Collection (GCC) and the Intel C Compiler (ICC).…”

Section: OpenMP Standard (mentioning)

confidence: 99%

Symmetric Indefinite Linear Solver Using OpenMP Task on Multicore Architectures

Yamazaki, Kurzak, Wu et al. 2018

IEEE Trans. Parallel Distrib. Syst.

Self Cite


Recently, the Open Multi-Processing (OpenMP) standard has incorporated task-based programming, where a function call with input and output data is treated as a task. At run time, OpenMP's superscalar scheduler tracks the data dependencies among the tasks and executes the tasks as their dependencies are resolved. On a shared-memory architecture with multiple cores, the independent tasks are executed on different cores in parallel, thereby enabling parallel execution of a seemingly sequential code. With the emergence of many-core architectures, this type of programming paradigm is gaining attention, not only because of its simplicity but also because it breaks the artificial synchronization points of the program and improves its thread-level parallelization. In this paper, we use these new OpenMP features to develop a portable high-performance implementation of a dense symmetric indefinite linear solver. Obtaining high performance from this kind of solver is a challenge because the symmetric pivoting, which is required to maintain numerical stability, leads to data dependencies that prevent us from using some common performance-improving techniques. To fully utilize a large number of cores through tasking, while conforming to the OpenMP standard, we describe several techniques. Our performance results on current many-core architectures (including Intel's Broadwell, Intel's Knights Landing, IBM's Power8, and Arm's ARMv8) demonstrate the portable and superior performance of our implementation compared with the Linear Algebra PACKage (LAPACK). The resulting solver is now available as a part of the PLASMA software package.
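The task model described in this abstract, where function calls become tasks and the runtime orders them by their input and output data, can be illustrated with a minimal C sketch. It is not code from PLASMA or from the cited paper: the helper functions `produce`, `scale`, and `add` and the wrapper `run_pipeline` are hypothetical names chosen for illustration, and only the standard OpenMP `task` directive with `depend` clauses (OpenMP 4.0 and later) is assumed.

```c
/* Hypothetical helpers standing in for computational kernels. */
static void produce(int *x, int v) { *x = v; }
static void scale(const int *in, int *out, int f) { *out = *in * f; }
static void add(const int *a, const int *b, int *out) { *out = *a + *b; }

/* Submits four tasks in sequential program order; the OpenMP runtime
 * derives the execution order from the depend clauses. */
int run_pipeline(void)
{
    int a = 0, b = 0, c = 0, d = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Writes a: the two scale tasks below must wait for it. */
        #pragma omp task depend(out: a)
        produce(&a, 3);

        /* These two tasks only read a and write different outputs,
         * so they are independent and may run on different cores. */
        #pragma omp task depend(in: a) depend(out: b)
        scale(&a, &b, 2);

        #pragma omp task depend(in: a) depend(out: c)
        scale(&a, &c, 5);

        /* Reads b and c: runs only after both scale tasks finish. */
        #pragma omp task depend(in: b, c) depend(out: d)
        add(&b, &c, &d);
    } /* implicit barrier: all tasks have completed past this point */

    return d; /* 3*2 + 3*5 = 21 */
}
```

Built with an OpenMP-enabled compiler (for example `-fopenmp` with GCC), the two `scale` tasks may execute concurrently. Without OpenMP support the pragmas are ignored and the function runs sequentially, producing the same result, which is the "seemingly sequential code" property the abstract refers to.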
