Yuri Torres scite author profile

Automatic data distribution is a key feature for obtaining efficient implementations from abstract and portable parallel codes. We present a highly efficient and extensible runtime library that integrates techniques for automatic data partition and mapping. It uses a novel approach to define an abstract interface and a plug-in system that encapsulates different types of regular and irregular techniques, helping to generate codes that are independent of the exact mapping functions selected. Currently, it supports hierarchical tiling of arrays with dense and stride domains, which allows the implementation of both data and task parallelism using an SPMD model. It automatically computes appropriate domain partitions for a selected virtual topology, mapping them to available processors with static or dynamic load-balancing techniques. Our library also allows the construction of reusable communication patterns that efficiently exploit MPI communication capabilities. The use of our library greatly reduces the complexity of data distribution and communication, hiding the details of the underlying architecture. The library can also be used as an abstract layer for building generic tiling operations. Our experimental results show that the library achieves performance similar to that of carefully implemented manual versions for several well-known parallel kernels and benchmarks in distributed and multicore systems, while substantially reducing programming effort.
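The idea of computing domain partitions for a virtual topology can be illustrated with a minimal sketch of a regular block partition. This is not the library's interface: the function names `block_partition` and `grid_partition` are hypothetical, and the sketch assumes a dense 2D domain mapped onto a 2D process grid with static, balanced blocks.

```python
def block_partition(n, parts):
    """Split range(n) into `parts` contiguous half-open blocks.

    Block sizes differ by at most one; the first `n % parts` blocks
    receive the extra element. (Hypothetical helper, for illustration.)
    """
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(n, parts)
    blocks = []
    start = 0
    for rank in range(parts):
        size = base + (1 if rank < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks


def grid_partition(shape, topology):
    """Tile a 2D domain of `shape` over a 2D virtual process grid.

    Returns a dict mapping each grid coordinate (r, c) to its tile,
    expressed as ((row_lo, row_hi), (col_lo, col_hi)).
    """
    rows = block_partition(shape[0], topology[0])
    cols = block_partition(shape[1], topology[1])
    return {
        (r, c): (rows[r], cols[c])
        for r in range(topology[0])
        for c in range(topology[1])
    }


if __name__ == "__main__":
    for coord, tile in sorted(grid_partition((10, 7), (3, 2)).items()):
        print(coord, tile)
```

Because each process only needs its own grid coordinate to look up its tile, code written against such a function does not depend on how the blocks were computed, which is the kind of independence from the mapping function that the abstract describes.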

A new GPU-based approach to the Shortest Path problem

Ortega-Arranz, Torres, Llanos, et al. 2013

The Single-Source Shortest Path (SSSP) problem arises in many different fields. In this paper we present a GPU-based version of the Crauser et al. SSSP algorithm. Our work significantly speeds up the computation of the SSSP, not only with respect to the CPU-based version, but also with respect to another state-of-the-art GPU implementation based on Dijkstra, due to Martín et al. Both GPU implementations have been evaluated using the latest NVIDIA architecture (Kepler). Our experimental results show that the new GPU-Crauser algorithm leads to speed-ups from 13× to 220× with respect to the CPU version and a performance gain of up to 17% with respect to the GPU-Martín algorithm.
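To illustrate why a Crauser-style algorithm suits parallel hardware, the following sequential sketch settles a whole frontier of vertices per phase instead of one vertex at a time. It assumes the out-criterion as it is usually stated for Crauser et al.: a vertex can be settled when its tentative distance does not exceed the minimum, over all reached but unsettled vertices, of the tentative distance plus the cheapest outgoing edge weight. The function name and graph representation are hypothetical, and this is not the authors' GPU code.

```python
import math


def sssp_out_criterion(graph, source):
    """Phase-based SSSP for non-negative edge weights.

    graph: dict mapping every vertex to a list of (neighbor, weight).
    Returns (dist, phases), where dist maps each vertex to its distance.
    """
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    # Cheapest outgoing edge per vertex; inf if there are no outgoing edges.
    min_out = {
        v: min((w for _, w in edges), default=math.inf)
        for v, edges in graph.items()
    }
    unsettled = set(graph)
    phases = 0
    while True:
        # Vertices that have been reached but are not yet settled.
        queued = [v for v in unsettled if dist[v] < math.inf]
        if not queued:
            break
        # Out-criterion threshold for this phase.
        threshold = min(dist[v] + min_out[v] for v in queued)
        frontier = [v for v in queued if dist[v] <= threshold]
        unsettled.difference_update(frontier)
        # On a GPU, each frontier vertex could be relaxed by its own thread.
        for u in frontier:
            for v, w in graph[u]:
                if v in unsettled and dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        phases += 1
    return dist, phases


if __name__ == "__main__":
    example = {
        "a": [("b", 1), ("c", 4)],
        "b": [("c", 2), ("d", 6)],
        "c": [("d", 3)],
        "d": [],
    }
    print(sssp_out_criterion(example, "a"))
```

The vertex with the smallest tentative distance always satisfies the threshold, so every phase settles at least one vertex and the loop terminates.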

uBench: exposing the impact of CUDA block geometry in terms of performance

2013

Understanding the impact of CUDA tuning techniques for Fermi

Torres, González-Escribano, Llanos 2011

Optimizing an APSP implementation for NVIDIA GPUs using kernel characterization criteria

et al. 2014

Using Fermi Architecture Knowledge to Speed up CUDA and OpenCL Programs

Torres, González-Escribano, Llanos 2012

NVIDIA graphics processing units (GPUs) are playing an important role as general-purpose programming devices. Implementing parallel codes that exploit the GPU hardware architecture is a task for experienced programmers. The choice of threadblock size and shape is one of the most important user decisions when a parallel problem is coded, because the threadblock configuration has a significant impact on the global performance of the program. While in the CUDA parallel programming model it is always necessary to specify the threadblock size and shape, the OpenCL standard also offers an automatic mechanism to make this delicate decision. In this paper we present a study of these criteria for the Fermi architecture, introducing a general approach for threadblock choice, and showing that there is considerable room for improvement in the OpenCL automatic strategy.
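One reason the threadblock size matters can be seen from a rough occupancy bound. The sketch below is not the paper's method: it assumes the per-multiprocessor limits commonly documented for compute capability 2.x (warps of 32 threads, at most 1024 threads per block, 48 resident warps and 8 resident blocks per multiprocessor), ignores register and shared-memory pressure, and uses hypothetical function names.

```python
import math

# Assumed limits for compute capability 2.x (Fermi); verify against the
# CUDA documentation for the target device before relying on them.
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8


def occupancy_bound(block_x, block_y):
    """Upper bound on the fraction of warp slots a block shape can fill."""
    threads = block_x * block_y
    if not 0 < threads <= MAX_THREADS_PER_BLOCK:
        raise ValueError("invalid threadblock size")
    warps_per_block = math.ceil(threads / WARP_SIZE)
    # Resident blocks are capped by both the block limit and the warp limit.
    resident_blocks = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)
    return resident_blocks * warps_per_block / MAX_WARPS_PER_SM


def grid_for(domain_x, domain_y, block_x, block_y):
    """Number of blocks per dimension needed to cover a 2D domain."""
    return (math.ceil(domain_x / block_x), math.ceil(domain_y / block_y))


if __name__ == "__main__":
    for shape in [(8, 8), (32, 6), (32, 32)]:
        print(shape, round(occupancy_bound(*shape), 3), grid_for(1000, 1000, *shape))
```

Under these assumptions, small blocks hit the resident-block limit and very large blocks hit the warp limit, so intermediate sizes fill the multiprocessor best. The bound depends only on the number of threads; the block shape additionally affects memory access patterns, which this sketch does not model.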

Comprehensive Evaluation of a New GPU-based Approach to the Shortest Path Problem

Ortega-Arranz, Torres, González-Escribano, et al. 2015, Int J Parallel Prog

The Single-Source Shortest Path (SSSP) problem arises in many different fields. In this paper, we present a GPU SSSP algorithm implementation. Our work significantly speeds up the computation of the SSSP, not only with respect to a CPU-based version, but also with respect to other state-of-the-art GPU implementations based on Dijkstra. Both GPU implementations have been evaluated using the latest NVIDIA architectures. The graphs chosen as input sets vary in nature, size, and fan-out degree, in order to evaluate the behavior of the algorithms for different data classes. Additionally, we have enhanced our GPU algorithm implementation using two optimization techniques: choosing an appropriate threadblock size, and modifying the L1 cache memory configuration of NVIDIA devices. These optimizations lead to performance improvements of up to 23% with respect to the non-optimized versions. In addition, we have compared several NVIDIA boards in order to determine which one is better for each class of graphs, depending on their features. Finally, we compare our results with an optimized sequential implementation of Dijkstra's algorithm included in the reference Boost library, obtaining an improvement ratio of up to 19× for some graph families, using less memory space.

Efficient heterogeneous programming with FPGAs using the Controller model

et al. 2021

An Extensible System for Multilevel Automatic Data Partition and Mapping
