We present several algorithms to compute the solution of a linear system of equations on a GPU, as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We also show how mixed-precision iterative refinement can be used to regain full accuracy in the solution of linear systems. Experimental results on a G80 using CUBLAS, the implementation of BLAS for NVIDIA GPUs with unified architecture, are given to illustrate the performance of the different algorithms and techniques proposed.
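The refinement scheme this abstract refers to can be sketched compactly. Below is a minimal CPU-only illustration, assuming the classic mixed-precision scheme (factorize and solve in single precision, accumulate residuals and corrections in double); the paper itself runs the expensive single-precision steps on the GPU through CUBLAS, and all names here are illustrative rather than the authors' implementation.

```cpp
// Minimal CPU-only sketch of mixed-precision iterative refinement for A*x = b.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Solve A*x = b in single precision via Gaussian elimination with partial
// pivoting (stand-in for the GPU factorization and triangular solves).
static std::vector<float> solve_sp(std::vector<float> A, std::vector<float> b) {
    const int n = static_cast<int>(b.size());
    for (int k = 0; k < n; ++k) {
        int p = k;                                    // partial pivoting
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(A[i * n + k]) > std::fabs(A[p * n + k])) p = i;
        for (int j = 0; j < n; ++j) std::swap(A[k * n + j], A[p * n + j]);
        std::swap(b[k], b[p]);
        for (int i = k + 1; i < n; ++i) {             // eliminate column k
            const float m = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    std::vector<float> x(n);
    for (int i = n - 1; i >= 0; --i) {                // back substitution
        float s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];
    }
    return x;
}

int main() {
    const int n = 2;
    const std::vector<double> A = {4.0, 1.0, 1.0, 3.0};   // row-major
    const std::vector<double> b = {1.0, 2.0};

    // Single-precision copies feed the cheap solver.
    const std::vector<float> Af(A.begin(), A.end());
    std::vector<float> xf = solve_sp(Af, std::vector<float>(b.begin(), b.end()));
    std::vector<double> x(xf.begin(), xf.end());          // initial SP solution

    // Refinement loop: residual in double, correction in single.
    for (int it = 0; it < 10; ++it) {
        std::vector<double> r(n);
        double rnorm = 0.0;
        for (int i = 0; i < n; ++i) {
            r[i] = b[i];
            for (int j = 0; j < n; ++j) r[i] -= A[i * n + j] * x[j];
            rnorm = std::max(rnorm, std::fabs(r[i]));
        }
        if (rnorm < 1e-14) break;                         // double accuracy reached
        // Correction solve in single precision (a real implementation would
        // reuse the LU factors instead of refactorizing each iteration).
        std::vector<float> d = solve_sp(Af, std::vector<float>(r.begin(), r.end()));
        for (int i = 0; i < n; ++i) x[i] += d[i];         // x <- x + correction
    }
    std::printf("x = (%.15f, %.15f)\n", x[0], x[1]);
    return 0;
}
```

The appeal of the scheme is that the O(n^3) factorization runs entirely in the faster, lower-precision arithmetic, while the cheap O(n^2) residual computation in double precision is enough to recover full accuracy for reasonably conditioned systems.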
Current parallel programming frameworks aid developers to a great extent in implementing applications that exploit parallel hardware resources. Nevertheless, developers require additional expertise to properly use and tune them to operate efficiently on specific parallel platforms. On the other hand, porting applications between different parallel programming models and platforms is not straightforward and demands considerable effort and specific knowledge. Apart from that, the lack of high-level parallel pattern abstractions in those frameworks further increases the complexity of developing parallel applications. To pave the way in this direction, this paper proposes GRPPI, a generic and reusable parallel pattern interface for both stream-processing and data-intensive C++ applications. GRPPI accommodates a layer between developers and existing parallel programming frameworks targeting multi-core processors, such as C++ threads, OpenMP and Intel TBB, and accelerators, such as CUDA Thrust. Furthermore, thanks to its high-level C++ application programming interface and pattern composability features, GRPPI allows users to easily expose parallelism in sequential applications via standalone patterns or pattern compositions. We evaluate this interface using an image processing use case and demonstrate its benefits from the usability, flexibility, and performance points of view. Furthermore, we analyze the impact of using stream and data pattern compositions on CPUs, GPUs and heterogeneous configurations.

An approach to relieve developers from this burden is the use of pattern-based parallel programming frameworks, such as SkePU, [2] FastFlow, [3] or Intel TBB. [4] In this sense, design patterns provide a way to encapsulate (using a building-blocks approach) algorithmic aspects, allowing users to implement more robust, readable, and portable solutions at a high level of abstraction. Essentially, these patterns instantiate parallelism while hiding away the complexity of the underlying concurrency mechanisms, e.g., thread management, synchronization, or data sharing. Examples of applications from multiple domains (e.g., financial, medical, and mathematical) that improve their performance through parallel design patterns can be widely found in the literature. [5][6][7] Nevertheless, although all these skeleton frameworks aim to simplify the development of parallel applications, there is no unified standard. [8] Therefore, users must understand several different frameworks, not only to decide which fits their purposes best, but also to use them properly. Not to mention the effort of migrating applications among frameworks, which is likewise an arduous task. To mitigate this situation, this paper presents GRPPI, a generic and reusable high-level C++ parallel pattern interface that comprises both stream and data-parallel patterns. In general, the goal of GRPPI is to simplify the development of parallel applications by offering a single, unified interface on top of these existing frameworks.
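As a concrete illustration of the pattern-based style described above, here is a hedged sketch of a data-parallel map in the GRPPI idiom. The policy type parallel_execution_native and the iterator-based map signature follow the GRPPI publications, but the header path and API details may vary between releases, so treat this as an approximation rather than a verified drop-in program.

```cpp
// Hedged sketch of a data-parallel "map" in the GRPPI idiom.
#include <iterator>
#include <vector>
#include "grppi/grppi.h"   // assumed umbrella header; path may differ by release

int main() {
    std::vector<double> in(1024, 2.0), out(in.size());

    // Native C++-threads back end; swapping in an OpenMP or TBB policy
    // would change only this declaration, which is the portability point
    // the paper makes.
    grppi::parallel_execution_native ex;

    // Apply the transformer element by element, in parallel.
    grppi::map(ex, std::begin(in), std::end(in), std::begin(out),
               [](double v) { return v * v; });
    return 0;
}
```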
As sequencing technologies progress, the amount of data produced grows exponentially, shifting the bottleneck of discovery towards the data analysis phase. In particular, currently available mapping solutions for RNA-seq leave room for improvement in terms of sensitivity and performance, hindering the efficient analysis of transcriptomes by massive sequencing. Here, we present an innovative approach that combines re-engineering, optimization and parallelization. This solution yields a significant increase in mapping sensitivity over a wide range of read lengths and substantially shorter runtimes compared with current RNA-seq mapping methods.
Energy efficiency is a major concern in modern high-performance computing. Still, few studies provide deep insight into the power consumption of scientific applications. Especially for algorithms running on hybrid platforms equipped with hardware accelerators, such as graphics processors, a detailed energy analysis is essential to identify the most costly parts and to evaluate possible improvement strategies. In this paper we analyze the computational and power performance of iterative linear solvers applied to sparse systems arising in several scientific applications. We also study the gains yielded by dynamic voltage/frequency scaling (DVFS), and illustrate that this technique alone cannot reduce the energy cost of iterative linear solvers to a considerable degree. We then apply techniques that set the (multi-core processor in the) host system to a low-consuming state for the time that the GPU is executing. Our experiments conclusively reveal how the combination of these two techniques delivers a notable reduction in energy consumption without a noticeable impact on computational performance.
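The host-idling technique described above can be demonstrated with the standard CUDA runtime API. The sketch below uses cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync), a real runtime flag that makes synchronization calls block (sleep) instead of spin-waiting, so the host cores can drop into a low-power state while the GPU works; the kernel is a placeholder, not the sparse iterative solvers studied in the paper.

```cpp
// Hedged sketch of the host-idling technique: with blocking synchronization,
// the CPU thread sleeps instead of spin-waiting while the GPU computes, so
// the multi-core host can enter a low-power state.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;                  // stand-in for real solver work
}

int main() {
    // Must run before the CUDA context is created: request that all
    // synchronization calls block (yield the CPU) rather than spin-wait.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    const int n = 1 << 20;
    float* d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);

    // The host thread now sleeps until the GPU finishes; the OS power
    // governor (optionally combined with DVFS) can lower the CPU's state.
    cudaDeviceSynchronize();

    cudaFree(d_x);
    std::printf("done\n");
    return 0;
}
```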