Productive Parallel Linear Algebra Programming with Unstructured Topology Adaption

Gottschling, Peter; Hoefler, Torsten

doi:10.1109/ccgrid.2012.51

Cited by 3 publications

(6 citation statements)

References 18 publications


“…Finally, each destination GPU shares its received border with the other side of its GPU-pair via NVLink (ln. 14-16). Once the inter-GPU communication completes, each GPU processes its remaining part of the sub-image (i.e., sync zone) (ln.…”

Section: Summary of the Design

mentioning

confidence: 99%

Topology-aware optimizations for multi-GPU ptychographic image reconstruction

Biçer

Kettimuthu

et al. 2021

Proceedings of the ACM International Conference on Supercomputing


Ptychography is an advanced high-resolution X-ray imaging technique that can generate extremely large datasets. Ptychographic reconstruction transforms reciprocal space experimental data to high-resolution 2D real-space images. GPUs have been used extensively to meet the computational requirements of the reconstruction. Generic multi-GPU reconstruction solutions use common communication topologies, such as P2P graph and ring, that are provided by MPI and NCCL libraries, to establish inter-GPU communications. However, these common topologies assume homogeneous physical links between GPUs, resulting in sub-optimal performance on heterogeneous configurations that are composed of both high- (e.g., NVLink) and low-speed (e.g., PCIe) interconnects. This mismatch between application-level communication topology and physical interconnection can cause data transfer congestion, inefficient memory access, and under-utilization of network resources. Here we present topology-aware designs and optimizations to address the aforementioned mismatch and boost end-to-end application performance. We introduce topology-aware data splitting, propose a novel communication topology, and incorporate asynchronous data movement and computation. We evaluate our design and optimizations using real and artificial datasets and compare its performance with that of the direct P2P and NCCL-based approaches. The results show that our optimizations always outperform the counterparts and achieve up to 5.13× and 1.63× communication and end-to-end application speedups, respectively.


“…The state type is represented by thrust::device_vector<double>: typedef thrust::device_vector<double> state_type; The X, Y, and Z components of the state are held in the continuous partitions of the vector. operator() uses the standard technique of packing the state components into a zip iterator and passes the composite sequence to the thrust::for_each algorithm together with the provided device function object: struct lorenz_functor; dxdt.begin(), dxdt.begin() + N, dxdt.begin() + 2 * N ) ), thrust::make_zip_iterator( thrust::make_tuple( R.end(), x.begin() + N, x.begin() + 2 * N, x.end(), dxdt.begin() + N, dxdt.begin() + 2 * N, dxdt.end() ) ), lorenz_functor() );…”

mentioning

confidence: 99%

“…The only difference here is that values of neighboring vector elements are needed. In order to access these values, we use Thrust's permutation iterator, so that operator() of the system function object becomes thrust::make_permutation_iterator( x.begin(), prev.begin() ), thrust::make_permutation_iterator( x.begin(), next.begin() ), omega.begin(), dxdt.begin() ) ), thrust::make_permutation_iterator( x.begin(), prev.end() ), thrust::make_permutation_iterator( x.begin(), next.end() ), omega.end(), dxdt.end() ) ), phase_oscillators_functor() );…”

mentioning

confidence: 99%
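The neighbor access described in the statement above amounts to reading `x` through index arrays. The following host-only C++ sketch illustrates that idea without Thrust. It is not the code from the cited paper: the struct name `phase_oscillators`, the coupling term, and the boundary handling (the first and last oscillator use themselves as the missing neighbor) are assumptions made for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using state_type = std::vector<double>;

// Chain of coupled phase oscillators. Neighbor values are read through the
// index arrays prev and next, which is what a permutation iterator does:
// element i of the permuted view is x[prev[i]] or x[next[i]].
struct phase_oscillators {
    std::vector<double> omega;            // natural frequencies
    std::vector<std::size_t> prev, next;  // neighbor index tables

    explicit phase_oscillators(std::vector<double> w)
        : omega(std::move(w)), prev(omega.size()), next(omega.size()) {
        const std::size_t n = omega.size();
        for (std::size_t i = 0; i < n; ++i) {
            // Assumed boundary rule: end points refer to themselves.
            prev[i] = (i == 0) ? 0 : i - 1;
            next[i] = (i + 1 == n) ? i : i + 1;
        }
    }

    // Same call signature as an odeint system function: (x, dxdt, t).
    void operator()(const state_type& x, state_type& dxdt, double /*t*/) const {
        for (std::size_t i = 0; i < omega.size(); ++i) {
            const double left  = x[prev[i]];
            const double right = x[next[i]];
            // Assumed nearest-neighbor sine coupling.
            dxdt[i] = omega[i] + std::sin(right - x[i]) + std::sin(x[i] - left);
        }
    }
};
```

Building the index tables once in the constructor keeps the right-hand side free of boundary branches, which is the main appeal of the index-table approach on a GPU as well.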

Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries

Demidov

Ahnert

Rupp

et al. 2013

SIAM J. Sci. Comput.

Self Cite


Abstract. We present a comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL. The comparison focuses on the solution of ordinary differential equations and is based on odeint, a framework for the solution of systems of ordinary differential equations. Odeint is designed in a very flexible way and may be easily adapted for effective use of libraries such as MTL4, VexCL, or ViennaCL, using CUDA or OpenCL technologies. We found that CUDA and OpenCL work equally well for problems of large sizes, while OpenCL has higher overhead for smaller problems. Furthermore, we show that modern high-level libraries allow to effectively use the computational resources of many-core GPUs or multi-core CPUs without much knowledge of the underlying technologies.

Key words. GPGPU, OpenCL, CUDA, C++, Boost.odeint, MTL4, VexCL, ViennaCL

AMS subject classifications. 34-04, 65-04, 65Y05, 65Y10, 97N80

1. Introduction. Recently, general purpose computing on graphics processing units (GPGPU) has acquired considerable momentum in the scientific community. This is confirmed both by increasing numbers of GPGPU-related publications and GPU-based supercomputers in the TOP500 list. Major programming frameworks are NVIDIA CUDA and OpenCL. The former is a proprietary parallel computing architecture developed by NVIDIA for general purpose computing on NVIDIA graphics adapters, and the latter is an open, royalty-free standard for cross-platform, parallel programming of modern processors and GPUs maintained by the Khronos group. By nature, the two frameworks have their distinctive pros and cons. CUDA has a more mature programming environment with a larger set of scientific libraries, but is available for NVIDIA hardware only. OpenCL is supported on a wide range of hardware, but its native API requires a much larger amount of boilerplate code from the developer. Another problem with OpenCL is that it is generally difficult to achieve performance portability across different hardware architectures.

Both technologies are able to provide scientists with the vast computational resources of modern GPUs at the price of a steep learning curve. Programmers need to familiarize themselves with a new programming language and, more importantly, with a new programming paradigm. However, the entry barrier may be lowered with the help of specialized libraries. The CUDA Toolkit includes several such libraries (BLAS implementations, Fast Fourier Transform, Thrust and others). OpenCL lacks standard libraries, but there are a number of third-party projects aimed at developing both CUDA and OpenCL programs. This paper presents a comparison of several modern C++ libraries aimed at ease of GPGPU development. We look at both convenience and performance of the li-


“…Algebraic MultiGrid (AMG) [35,37] is a robust preconditioner for elliptic problems. It is appreciated for its extensibility qualities with M-matrix systems: the number of iterations required to converge only depends minimally on the problem size and can be entirely size-independent.…”

Section: Algebraic Multigrid

mentioning

confidence: 99%

“…The Parallel Matrix Template Library v4 (PMTL4) [10,35] provides linear algebra operations on distributed data as a C++ template library. Available data types are distributed vector and sparse and dense matrix types as well as abstractions to conveniently handle distribution and migration.…”

Section: PMTL4

mentioning

confidence: 99%

Survey on Efficient Linear Solvers for Porous Media Flow Models on Recent Hardware Architectures

Anciaux–Sedrakian

Gottschling

Gratien

et al. 2014

Oil Gas Sci. Technol. – Rev. IFP Energies nouvelles

Self Cite


Résumé (translated from French) - Review of linear solver algorithms used in reservoir simulation that are efficient on modern hardware architectures - In recent years, high-performance computing vendors have increasingly turned to architectures based on multicore compute units, possibly accelerated with GPGPU (General Purpose Processing on Graphics Processing Units) cards. Such architectures, which offer a large number of compute units, could be of great interest for the simulation of multiphase flow in porous media, used for example in applications such as CO2 geological sequestration or enhanced oil recovery simulation in reservoirs. It is nevertheless necessary to check whether the algorithms in current software are suited to run efficiently on these new technologies. Solving large sparse linear systems is often the most expensive part of porous media flow simulators. Indeed, these systems are often ill-conditioned because the geological data are often highly heterogeneous and anisotropic. For these reasons, linear solvers are crucial to the performance of these simulators. In this article, we give an overview of the various linear solver and preconditioner algorithms used in our applications. We analyze their numerical efficiency and their performance on different hardware configurations. We propose a new approach, based on hybrid programming, that performs well on heterogeneous architectures based on multicore processors or GPGPU accelerators. This approach is validated in the implementation of a BiCGStab preconditioned with ILU(0), BSSOR, polynomial, or CPR-AMG algorithms. Performance tests were then carried out on various porous media flow case studies using large meshes.

Abstract - Survey on Efficient Linear Solvers for Porous Media Flow Models on Recent Hardware Architectures - In the past few years, High Performance Computing (HPC) technologies led to General Purpose Processing on Graphics Processing Units (GPGPU) and many-core architectures. These emerging technologies offer massive processing units and are interesting for porous media flow simulators that may be used for CO2 geological sequestration or Enhanced Oil Recovery (EOR) simulation. However, the crucial point is "are current algorithms and software able to use these new technologies efficiently?" The resolution of large sparse linear systems, almost ill-conditioned, constitutes the most CPU-consuming part of such simulators. This paper proposes a survey on various solver and preconditioner
