2016 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/hpec.2016.7761591
LU, QR, and Cholesky factorizations: Programming model, performance analysis and optimization techniques for the Intel Knights Landing Xeon Phi

Cited by 12 publications (7 citation statements)
References 18 publications
“…Besides extending ideas from the batched linear algebra routines, manycore algorithms can also be built on ideas from the hybrid linear algebra algorithms. This was demonstrated for the case of KNL processors in [17]. The difficult-to-parallelize tasks are the panel factorizations (see Section 3), and these are the tasks offloaded for execution to the CPUs in the hybrid algorithms.…”
Section: Related Work
confidence: 94%
“…The difficult-to-parallelize tasks are the panel factorizations (see Section 3), and these are the tasks offloaded for execution to the CPUs in the hybrid algorithms. As the KNL is self-hosted (i.e., there is no additional CPU host), a virtual CPU abstraction was created from a subset of the KNL cores that enabled hybrid algorithms to run efficiently on homogeneous manycore processors [17]. The panel factorizations can be done in parallel with the trailing matrix updates in factorizations like QR, LU, and Cholesky (see Section 3), which is used in the hybrid algorithms to overlap CPU work and CPU-to-GPU communications with GPU work on the trailing matrix updates.…”
Section: Related Work
confidence: 99%
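The excerpt above describes the structure the hybrid algorithms exploit: each step factors a narrow, hard-to-parallelize panel and then performs a compute-rich trailing-matrix update, and the two can be overlapped. As an illustration only (not the paper's implementation), the following minimal NumPy sketch of a right-looking blocked Cholesky factorization shows where that panel/trailing split occurs; in the hybrid scheme the panel step would run on the CPU (or the "virtual CPU" subset of KNL cores) while the device updates the trailing matrix.

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    """Right-looking blocked Cholesky (lower triangular), sequential sketch.

    Shows the panel-factorization / trailing-update structure referenced
    above; block size `nb` is an illustrative tuning parameter.
    """
    n = A.shape[0]
    L = np.tril(A.copy())  # work only on the lower triangle
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Panel factorization: the small, latency-bound step that the
        # hybrid algorithms offload to the CPU side.
        L[k:k+kb, k:k+kb] = np.linalg.cholesky(L[k:k+kb, k:k+kb])
        if k + kb < n:
            # Triangular solve completing the panel: L21 = A21 * inv(L11^T).
            L[k+kb:, k:k+kb] = np.linalg.solve(
                L[k:k+kb, k:k+kb], L[k+kb:, k:k+kb].T).T
            # Trailing-matrix update (GEMM/SYRK-rich, maps well to the
            # manycore device): A22 -= L21 * L21^T, lower triangle only.
            L[k+kb:, k+kb:] -= np.tril(
                L[k+kb:, k:k+kb] @ L[k+kb:, k:k+kb].T)
    return L
```

Because the trailing update of step k is independent of the panel of step k+1 once the first `nb` columns of the update are done, a lookahead scheduler can run the next panel concurrently with the bulk of the update, which is the overlap the citation statement describes.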
“…We compare in Figure 8 the best configurations among the ones presented previously with the native implementation of the Cholesky factorization from the Intel MKL for the Intel KNL platform and with the PLASMA library. It is important to note that we were not able to compare our approach with the MAGMA library for Intel KNL architectures (see [25] for more details) because the corresponding software package is not yet available. For the sake of clarity, we only report the results obtained with the best setup for each library.…”
Section: Experimental Evaluation on the Intel KNL Platform
confidence: 99%
“…Their early results, as shown in their paper, achieved around 80% parallel efficiency in scalability experiments with the Caffe application. Haidar et al (2016) have studied the scalability aspects of algorithms such as lower-upper (LU), QR, and Cholesky factorisations. They proposed a programming model to efficiently utilise manycore machines such as the KNL.…”
Section: Related Work
confidence: 99%