Algorithmic performance studies on graphics processing units

Schenk, Olaf; Christen, Matthias; Burkhart, Helmar

doi:10.1016/j.jpdc.2008.05.008

Cited by 54 publications

(16 citation statements)

References 15 publications

“…While GPUs can partially hide the off-loading overhead with asynchronous data transfer (i.e., double-buffering), this mechanism currently works only for page-locked memory and incurs additional programming overhead [20]. To amortize the off-loading overhead, GPUs require higher computational intensity than other processors [6,28,16]. However, the Tesla C1060's on-board memory is much larger (4 GB) than the Harpertown or Barcelona's cache memory (12 or 2 MB) or the Cell/B.E.…”

Section: Start-up Overhead (mentioning)

confidence: 99%

Understanding the design trade-offs among current multicore systems for numerical computations

Kang

Bader

Vuduc

2009

2009 IEEE International Symposium on Parallel & Distributed Processing

“…On current processors, with several levels of cache memory, it is possible to carefully orchestrate the memory accesses for these type of operations achieving high performance. A few studies on modern GPUs [8], [7], [9] show how, for this type of operations, these hardware accelerators can deliver up to 10× speed-ups compared with highly tuned implementations on a general-purpose processor, even taking into account the overhead introduced by the data transfers through the PCI-Express bus.…”

Section: A Flame Methodology: Algorithmic Variants (mentioning)

confidence: 99%

Fast development of dense linear algebra codes on graphics processors

Zafont

Martín

Igual

et al. 2009

2009 IEEE International Symposium on Parallel & Distributed Processing

Abstract: We present an application programming interface (API) for the C programming language that facilitates the development of dense linear algebra algorithms on graphics processors applying the FLAME methodology. The interface, built on top of the NVIDIA CUBLAS library, implements all the computational functionality of the FLAME/C interface. In addition, the API includes data transfer routines to explicitly handle communication between the CPU and GPU memory spaces. The flexibility and simplicity of use of this tool are illustrated using a complex operation of dense linear algebra: the Cholesky factorization. For this operation, we implement and evaluate all existing variants on an NVIDIA G80 processor.

“…Only implicit schemes are considered as system solving strategy, with performance comparison of various linear system solvers (both direct and iterative). Other extensive studies on the performances of different linear solvers have been carried out for example in [16,17,18,19]. On the other hand, only few GPU implementations of FEM in explicit dynamics are available in the literature.…”

Section: Introduction (mentioning)

confidence: 99%

An explicit dynamics GPU structural solver for thin shell finite elements

Bartezzaghi

Cremonesi

Parolini

et al. 2015

Computers & Structures

With the availability of user-oriented software tools and dedicated architectures, such as the parallel computing platform and programming model CUDA (Compute Unified Device Architecture) released by NVIDIA, one of the main producers of graphics cards, and of improved, high-performance GPU (Graphics Processing Unit) boards, GPGPU (General Purpose programming on GPU) is attracting increasing interest in the engineering community for the development of analysis tools suitable for validation/verification and virtual reality applications. Because of their inherently explicit and decoupled structure, explicit dynamics finite element formulations appear to be particularly attractive for implementation on hybrid CPU/GPU or pure GPU architectures. This work addresses an optimized, double-precision GPU implementation of an explicit dynamics finite element solver for elastic shell problems in small strains and large displacements and rotations, using unstructured meshes. The conceptual difference between a GPU implementation directly adapted from a standard CPU approach and a new optimized formulation, specifically conceived for GPUs, is discussed and comparatively assessed. It is shown that a speedup factor of about 5 can be achieved by an optimized algorithm reformulation and careful memory management. A speedup of more than 40 is achieved with respect to state-of-the-art commercial codes running on CPU, obtaining real-time simulations in some cases, on commodity hardware. When a latest-generation GPU board is used, it is shown that a problem with more than 16 million degrees of freedom can be solved in just a few hours of computing time, opening the way to virtualization approaches for real large-scale engineering problems.
